Hi, I am fairly new to OpenPBS. I have tried to follow the documentation, and I think I have CPU jobs working on the non-GPU nodes.
What I can’t wrap my head around are GPU jobs.
I have multiple nodes with dual CPU and 4 GPUs.
Expected behaviour: a user submits a job to a queue, a node with free resources is found, and the job starts. If no free resources are found, the job waits in the queue.
What actually happens: a user submits a GPU job, one node with any number of GPUs gets used, and every other job is marked as Held even though there are free resources, whether on that one node or on any other node.
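A typical GPU submission that triggers this looks something like this (example values, the real jobs vary):
qsub -q gpu2080 -l select=1:ncpus=4:ngpus=1:mem=16gb -l walltime=24:00:00 job.sh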
Sorry for the long post, but I need to resolve this issue and cannot find where the problem is.
Below is what my config looks like; maybe I am doing this all wrong.
If any other info is needed, I will be happy to share it.
What I did / My config
Queues
create queue enp5
set queue enp5 queue_type = Execution
set queue enp5 resources_max.walltime = 240:00:00
set queue enp5 resources_default.mem = 8gb
set queue enp5 resources_default.nice = -5
set queue enp5 resources_default.walltime = 01:00:00
set queue enp5 default_chunk.Qlist = enp5
set queue enp5 enabled = True
set queue enp5 started = True
create queue gpu2080
set queue gpu2080 queue_type = Execution
set queue gpu2080 resources_max.ngpus = 4
set queue gpu2080 resources_min.ngpus = 1
set queue gpu2080 resources_default.walltime = 168:00:00
set queue gpu2080 default_chunk.Qlist = gpu2080
set queue gpu2080 enabled = True
set queue gpu2080 started = True
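(For completeness: I assume ngpus and Qlist have to exist as custom resources before any of this works. I created them roughly like this and added both to the "resources:" line in sched_config; please tell me if that part is wrong.)
qmgr -c "create resource ngpus type=long, flag=nh"
qmgr -c "create resource Qlist type=string_array, flag=h"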
Master node
Export of cgroups
qmgr -c "export hook pbs_cgroups application/x-config default" >pbs_cgroups.json
Edit cgroups
nano pbs_cgroups.json
There I edited the devices section to:
"devices" : {
    "enabled" : true,
    "allow" : [
        "c *:* rwm",
        ["nvidiactl", "rwm", "*"]
    ]
}
Import cgroups
qmgr -c "import hook pbs_cgroups application/x-config default pbs_cgroups.json"
qmgr -c "set hook pbs_cgroups enabled=True"
Compute nodes
cd /var/spool/pbs/mom_priv/config.d/
nano node12_vnodes
$configversion 2
node12: resources_available.ncpus = 0
node12: resources_available.mem = 0
node12[0]: resources_available.ncpus = 18
node12[0]: resources_available.mem = 48800mb
node12[0]: resources_available.ngpus = 1
node12[0]: resources_available.gpu_id = gpu0
node12[0]: sharing = default_excl
node12[0]: resources_available.Qlist = enp5
node12[0]: resources_available.Qlist += gpu2080
node12[1]: resources_available.ncpus = 18
node12[1]: resources_available.mem = 48800mb
node12[1]: resources_available.ngpus = 1
node12[1]: resources_available.gpu_id = gpu1
node12[1]: sharing = default_excl
node12[1]: resources_available.Qlist = enp5
node12[1]: resources_available.Qlist += gpu2080
node12[2]: resources_available.ncpus = 18
node12[2]: resources_available.mem = 48800mb
node12[2]: resources_available.ngpus = 1
node12[2]: resources_available.gpu_id = gpu2
node12[2]: sharing = default_excl
node12[2]: resources_available.Qlist = enp5
node12[2]: resources_available.Qlist += gpu2080
node12[3]: resources_available.ncpus = 18
node12[3]: resources_available.mem = 48800mb
node12[3]: resources_available.ngpus = 1
node12[3]: resources_available.gpu_id = gpu3
node12[3]: sharing = default_excl
node12[3]: resources_available.Qlist = enp5
node12[3]: resources_available.Qlist += gpu2080
/opt/pbs/sbin/pbs_mom -s insert node12_vnodes node12_vnodes
Restart MoM
/etc/init.d/pbs stop
/etc/init.d/pbs start
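After the restart I check that the vnodes and their GPU resources show up on the server, for example:
pbsnodes -av | grep -E "ngpus|Qlist"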