Multiple cgroups per vnode -- realistic use cases?

I don’t have a good answer for you, but I think it is an interesting question. Here are some random musings about this:

  • I assume you mean multiple cgroups under the control of PBS? Otherwise, run two containers and you have multiple cgroups in a vnode.
  • To date, we have not shared nodes, but we are considering it. One of the things we worry about is isolation between users, and you could try cgrouping each user job in a shared vnode. We have done something like this on big login nodes to keep one user from monopolizing them.
  • I am not sure your example is that unrealistic. We have run across codes that assumed there was only one GPU and it was theirs so they all addressed GPU0.
  • We have a couple of use cases we are going to experiment with. In our case, we are going to use the PBS cgroups hook to create several vnodes, and we think the alignment and granularity are good enough, but one could imagine doing this with cgroups within a vnode:
    • We have nodes that have one 32-core Milan processor and four A100 GPUs. So far, we have always done whole-node allocations, but many jobs can’t take advantage of four GPUs. The Rome and Milan processors have a BIOS setting called NPS (NUMA nodes Per Socket). If we set that to 4 and set the PBS cgroups hook to do vnode per NUMA node, then we get four vnodes, each with eight cores and one A100. This has the potential to avoid wasting lots of resources. Again, this isn’t multiple cgroups in a vnode, but if your processor didn’t have the NPS capability, you might want to do something similar.
    • Similarly, we have a small testbed with two 32-core Rome processors and two A100s. Right now, we have cgroups configured so that it looks like two vnodes, each with one 32-core processor and one A100. However, we are testing sharing on those nodes, and one of the groups we are working with still has a lot of CPU-only codes. We are considering setting NPS=4, which will make it look like eight vnodes. Each vnode will have 8 cores and 1/8th of the RAM, but two of them will also have an A100. Then we can route CPU-only jobs to the vnodes with only CPUs.
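
To make the arithmetic in the two examples above concrete, here is a rough Python sketch of how the vnode layout falls out of the socket count, cores per socket, GPU count, and NPS setting. This is not PBS code: `Vnode` and `split_node` are hypothetical names made up for illustration, and the GPU placement (spreading GPUs evenly across NUMA domains) is an assumption. On real hardware it depends on which NUMA domain each GPU's PCIe slot hangs off.

```python
from dataclasses import dataclass


@dataclass
class Vnode:
    index: int
    cores: int
    mem_fraction: float  # share of the node's RAM
    gpus: int


def split_node(sockets, cores_per_socket, gpus, nps):
    """Model one vnode per NUMA domain (hypothetical helper, not a PBS API)."""
    n_vnodes = sockets * nps
    cores = cores_per_socket // nps
    # Assumption: GPUs are spread evenly over the NUMA domains.
    gpu_counts = [0] * n_vnodes
    for g in range(gpus):
        gpu_counts[g * n_vnodes // gpus] += 1
    return [Vnode(i, cores, 1 / n_vnodes, gpu_counts[i]) for i in range(n_vnodes)]


# One 32-core Milan, four A100s, NPS=4: four vnodes, 8 cores and 1 GPU each.
milan = split_node(sockets=1, cores_per_socket=32, gpus=4, nps=4)

# Two 32-core Romes, two A100s, NPS=4: eight vnodes, two of them with a GPU.
rome = split_node(sockets=2, cores_per_socket=32, gpus=2, nps=4)
cpu_only = [v.index for v in rome if v.gpus == 0]
print(cpu_only)  # the vnodes that CPU-only jobs could be routed to
```

With the second layout, six of the eight vnodes come out CPU-only, which is the pool you would steer the CPU-only codes toward.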

I am not sure if that helps, but I am curious what others might think.