Reserving cores for GPU jobs

We have several nodes with 128 CPU cores and 2 GPU cards each, and we are trying to determine the best way to utilize these resources. If we isolate the whole node to a GPU queue and require jobs to use a GPU, all 128 CPU cores sit idle whenever no GPU jobs are running, and that capacity is wasted. However, if we allow CPU-only jobs on these nodes, all the cores get consumed, and GPU jobs submitted afterward have to wait a very long time to run.

Therefore, we are wondering how we can configure the scheduler such that a certain number of cores on each GPU node are set aside for GPU computations (say 8 cores per GPU) while the remaining CPU cores can be used for general work. So far, we have considered the following options:

  • Splitting each node into two vnodes: one with the GPU cards and a portion of the cores, the other with the remaining cores. We’re still working on this, but it’s a little ugly in the config (a rough sketch of the idea follows this list).
  • Allowing CPU jobs running on GPU nodes to be preempted. This would work just fine, but users avoid the preemptible queue we already have for fear of their jobs being killed, so I don’t think this would improve utilization much.
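
For reference, here is roughly what the vnode split might look like as a Version 2 vnode definition file, inserted on the MoM with `pbs_mom -s insert`. This is only a sketch: the hostname gpunode01, the 16/112 core split, and putting both GPUs on the first vnode are all hypothetical, and your site may expose GPUs under a resource name other than ngpus.

```
$configversion 2
gpunode01: resources_available.ncpus = 0
gpunode01[0]: resources_available.ncpus = 16
gpunode01[0]: resources_available.ngpus = 2
gpunode01[1]: resources_available.ncpus = 112
gpunode01[1]: resources_available.ngpus = 0
```

Setting ncpus to 0 on the parent keeps the natural vnode from double-counting the cores; the GPU queue could then be pointed at gpunode01[0] while gpunode01[1] stays open for general work.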

Are there any queue- or node-level directives that basically say “this queue is only allowed to use X cores per node at a time”? I’m aware of max_run_res.ncpus = [o:PBS_ALL=X], but that limits the total across all nodes in the queue, so a few nodes could still be completely consumed as long as the overall total stays under X cores. What I’m really asking for is something similar, say a hypothetical [o:PBS_EACH=X], where X is the maximum number of running CPUs on each node in the queue rather than the total across all of them.
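
For context, the total-across-the-queue limit I mentioned is set with something like this (the queue name workq and the value 512 are placeholders):

```
qmgr -c 'set queue workq max_run_res.ncpus = "[o:PBS_ALL=512]"'
```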

Is this possible? If not, are there any other ways you can think of to do what we’re trying to do?

Take a look at my reply here. If your processor has multiple NUMA nodes, you might be able to configure the PBS cgroups hook for “vnode per NUMA node” and end up with something along the lines of what you want. There are plenty of reasons it might not work (you might have only one NUMA node, both GPUs could hang off the same NUMA node, which is probably not what you want, etc.), but you might get something useful out of the available configuration; a sketch of the relevant setting is below. Hope that helps.
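
A minimal sketch of that setting, assuming the standard pbs_cgroups hook: in the hook’s JSON configuration file, turn on vnode_per_numa_node (the surrounding keys in the real file are omitted here):

```
{
    "vnode_per_numa_node" : true
}
```

and re-import the config, e.g.:

```
qmgr -c "import hook pbs_cgroups application/x-config default pbs_cgroups.json"
```

Once the MoMs pick it up, each NUMA node appears as its own vnode with its local cores and GPUs, which can give you a GPU vnode and a CPU-only vnode without hand-writing vnode definitions.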