Multiple cgroups per vnode -- realistic use cases?

How realistic is this as a use case? Are there any realistic, even remotely common, scenarios that would call for multiple cgroups per vnode? Is it worth pursuing?

The only viable use case I see for this is supporting an mpirun/mpiexec where the target code/script/binary/program does not [and cannot otherwise be made to] coordinate resource utilization between multiple instances of itself on the same vnode.

For [an admittedly unrealistic] example, assume you have the usual cgroups/ngpus setup and a user selects 4:ncpus=1:ngpus=1, no placement, which all fits on a single vnode. The user runs ‘mpirun -np 4’ into that and the usual mpi/pbs/cgroups/ngpus integration is working. This should give them a job on one vnode in 1 cgroup that owns 4 cpus and 4 gpus, on which run 4 mpi ranks. Just for argument’s sake, let’s say the user’s code counts available gpus by running ‘nvidia-smi’, doesn’t communicate what that reports to other ranks, and can’t be modified. (Just go with it, I did say it was unrealistic.) So all 4 mpi ranks report the same 4 GPU IDs. Only if each rank runs on its own vnode will each mpi rank of the user code report exactly 1 unique GPU ID. Are there realistic cases like this? Are they at all common?
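
If the launch can be wrapped at all (the question assumes it cannot), the usual workaround is to pin each rank to one GPU before the user binary starts. A minimal sketch, assuming Open MPI, which exports OMPI_COMM_WORLD_LOCAL_RANK; the wrapper name and launch line are illustrative:

```shell
#!/bin/sh
# wrapper.sh (hypothetical): restrict each local MPI rank to its own GPU,
# so code that blindly counts devices with nvidia-smi sees exactly one.
# Other launchers export a different local-rank variable than Open MPI does.
rank="${OMPI_COMM_WORLD_LOCAL_RANK:-0}"
export CUDA_VISIBLE_DEVICES="$rank"
echo "local rank $rank sees GPU $CUDA_VISIBLE_DEVICES"
# exec "$@"    # then hand off to the real binary, launched as e.g.:
               #   mpirun -np 4 ./wrapper.sh ./user_code
```

The point of the question is exactly the case where no such wrapper can be inserted, which is when per-rank cgroups (or per-rank vnodes) would be the only lever left.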

I don’t have a good answer for you, but I think it is an interesting question. Here are some random musings about this:

  • I assume you mean multiple cgroups under the control of PBS? Otherwise, run two containers and you have multiple cgroups in a vnode.
  • To date, we have not shared nodes, but we are considering it. One of the things we worry about is isolation between users; cgrouping each user’s job within a shared vnode is one way to get that. We have done something like this on big login nodes to keep one user from monopolizing them.
  • I am not sure your example is that unrealistic. We have run across codes that assumed there was only one GPU and that it was theirs, so they all addressed GPU0.
  • We have a couple of use cases we are going to experiment with. In our case, we are going to use the PBS cgroups hook to create several vnodes, and we think the alignment and granularity are good enough, but one could imagine doing this with cgroups within a vnode:
    • We have nodes that have one 32 core Milan processor and four A100 GPUs. So far, we always do whole-node allocations, but many jobs can’t take advantage of four GPUs. The Rome and Milan processors have a BIOS setting called NPS (NUMA nodes Per Socket). If we set that to 4 and set the PBS cgroups hook to do one vnode per NUMA node, then we get four vnodes, each with eight cores and one A100. This has the potential to avoid wasting lots of resources. Again, this isn’t multiple cgroups in a vnode, but if your processor didn’t have the NPS capability, you might want to do something similar.
    • Similarly, we have a small testbed with (2) 32 core Rome processors and (2) A100s. Right now, we have cgroups configured so that it looks like two vnodes, each with one 32 core proc and one A100. However, we are testing sharing on those nodes, and one of the groups we are working with still has a lot of CPU-only codes. We are considering setting NPS=4, which will make it look like eight vnodes. Each vnode will have 8 cores and 1/8th of the RAM, but two of them will also have an A100. Then we can route CPU-only jobs to the vnodes with only CPUs.
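
The NPS splits in the two bullets above are easy to sanity-check. A throwaway sketch of the arithmetic, with node shapes taken from the descriptions above rather than from any real inventory:

```shell
#!/bin/sh
# Sanity-check the NPS=4 vnode splits described above.
nps=4

# Milan node: one 32-core socket, four A100s, one vnode per NUMA node
cores=32; gpus=4
echo "Milan: $nps vnodes, $((cores / nps)) cores and $((gpus / nps)) A100 each"

# Rome testbed: two 32-core sockets, two A100s, NPS=4 on both sockets
sockets=2; cores=32; gpus=2
echo "Rome:  $((sockets * nps)) vnodes, $((cores / nps)) cores each, $gpus of them with an A100"
```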

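On the login-node confinement mentioned above: with cgroup v2 that amounts to a per-user group with cpu and memory caps, which the PBS cgroups hook does per job rather than per user. A hypothetical sketch that only prints the commands, since the real writes need root and a mounted cgroup2 hierarchy; the user name, path, and limits are made up:

```shell
#!/bin/sh
# Print (not apply) illustrative cgroup v2 confinement for one login user.
user=alice                              # hypothetical user
cg="/sys/fs/cgroup/login/$user"         # illustrative cgroup path
cpu_max="400000 100000"                 # cpu.max: 400ms quota per 100ms period = 4 CPUs
mem_max=$((8 * 1024 * 1024 * 1024))     # memory.max: 8 GiB in bytes
echo "mkdir -p $cg"
echo "echo '$cpu_max' > $cg/cpu.max"
echo "echo $mem_max > $cg/memory.max"
echo "echo \$LOGIN_SHELL_PID > $cg/cgroup.procs"   # move the user's shell in
```
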
I am not sure if that helps, but I am curious what others might think.