CPU + GPU jobs on nodes

Hi, I am fairly new to OpenPBS. I have tried to follow the documentation, and I think I have CPU jobs working on the non-GPU nodes.

What I can’t wrap my head around is GPU jobs.

I have multiple nodes, each with dual CPUs and 4 GPUs.

Expected behaviour: a user submits a job to a queue, a node with free resources is found, and the job starts. If no free resources are found, the job waits in the queue.

What actually happens: a user submits a GPU job, one node (with any number of GPUs) is used, and every other job is marked as Held even though there are free resources, be it on that node or any other node.
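
For context, jobs here are submitted roughly like this (the exact resource request varies per job; job.sh is just a placeholder):

qsub -q gpu2080 -l select=1:ncpus=4:ngpus=1:mem=16gb -l walltime=02:00:00 job.sh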

Sorry for the long post, but I need to resolve this issue and cannot find where the problem is.

Below is what my config looks like; maybe I am doing this all wrong.

If any other info is needed, I will be happy to share it.

What I did / My config

Queues

create queue enp5
set queue enp5 queue_type = Execution
set queue enp5 resources_max.walltime = 240:00:00
set queue enp5 resources_default.mem = 8gb
set queue enp5 resources_default.nice = -5
set queue enp5 resources_default.walltime = 01:00:00
set queue enp5 default_chunk.Qlist = enp5
set queue enp5 enabled = True
set queue enp5 started = True
create queue gpu2080
set queue gpu2080 queue_type = Execution
set queue gpu2080 resources_max.ngpus = 4
set queue gpu2080 resources_min.ngpus = 1
set queue gpu2080 resources_default.walltime = 168:00:00
set queue gpu2080 default_chunk.Qlist = gpu2080
set queue gpu2080 enabled = True
set queue gpu2080 started = True
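
To sanity-check what the server actually stored for these queues, I dump them back with something like:

qmgr -c "print queue enp5"
qmgr -c "print queue gpu2080"
qstat -Qf gpu2080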

Master node

Export of cgroups

qmgr -c "export hook pbs_cgroups application/x-config default" >pbs_cgroups.json

Edit cgroups

nano pbs_cgroups.json

There I edited the devices section:

"devices" : {
	"enabled" : true,
	"allow" : [
		"c *:* rwm",
		["nvidiactl", "rwm", "*"]
	]
}
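
Since a syntax error here makes the import fail, I validate the edited file before importing. Assuming the exported config is plain JSON (mine was), a quick check is:

python3 -m json.tool pbs_cgroups.json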

Import cgroups

qmgr -c "import hook pbs_cgroups application/x-config default pbs_cgroups.json"
qmgr -c "set hook pbs_cgroups enabled=True"

Compute nodes

cd /var/spool/pbs/mom_priv/config.d/
nano node12_vnodes

with the following contents:

$configversion 2

node12: resources_available.ncpus = 0
node12: resources_available.mem   = 0

node12[0]: resources_available.ncpus = 18
node12[0]: resources_available.mem   = 48800mb
node12[0]: resources_available.ngpus = 1
node12[0]: resources_available.gpu_id = gpu0
node12[0]: sharing = default_excl
node12[0]: resources_available.Qlist = enp5
node12[0]: resources_available.Qlist += gpu2080

node12[1]: resources_available.ncpus = 18
node12[1]: resources_available.mem   = 48800mb
node12[1]: resources_available.ngpus = 1
node12[1]: resources_available.gpu_id = gpu1
node12[1]: sharing = default_excl
node12[1]: resources_available.Qlist = enp5
node12[1]: resources_available.Qlist += gpu2080

node12[2]: resources_available.ncpus = 18
node12[2]: resources_available.mem   = 48800mb
node12[2]: resources_available.ngpus = 1
node12[2]: resources_available.gpu_id = gpu2
node12[2]: sharing = default_excl
node12[2]: resources_available.Qlist = enp5
node12[2]: resources_available.Qlist += gpu2080

node12[3]: resources_available.ncpus = 18
node12[3]: resources_available.mem   = 48800mb
node12[3]: resources_available.ngpus = 1
node12[3]: resources_available.gpu_id = gpu3
node12[3]: sharing = default_excl
node12[3]: resources_available.Qlist = enp5
node12[3]: resources_available.Qlist += gpu2080

/opt/pbs/sbin/pbs_mom -s insert node12_vnodes node12_vnodes

Restart MoM

/etc/init.d/pbs stop
/etc/init.d/pbs start
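
After the restart, I check that the server actually picked up the new vnodes with something like this (the grep is only there to trim the output):

pbsnodes -av | grep -E "ngpus|Qlist|state"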

If the job is in the “H” (Held) state, possible reasons include:

  • The user’s home directory is not mounted or not available on the compute nodes
  • The job was held manually (qsub -h or qhold)
  • The job depends on other job(s)
  • Check the server and MoM logs; tracejob <jobid> would show which node the job was intended to run on (see the example commands below).
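
For example, something along these lines usually shows why a particular job is held (the job ID is a placeholder):

qstat -f <jobid> | grep -E "comment|Hold_Types|depend"
tracejob <jobid>
grep <jobid> /var/spool/pbs/mom_logs/$(date +%Y%m%d)    # run on the compute node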