I’ve set up Advanced Scheduling for GPUs as recommended by the 18.2 Admin Guide, and the scheduling of the GPUs seems to be working. However, the environment variable CUDA_VISIBLE_DEVICES is not being set, which I think it should be according to section 16.5.1 of the Admin Guide.
I’ve also set up a hook of the following form that is set to run on the exit
Thanks for your questions. The message you’re seeing about libmemacct is benign. If you were on an Altix system where libmemacct.so is present, you wouldn’t see it. It may safely be ignored.
The cgroup hook sets CUDA_VISIBLE_DEVICES in the job’s environment. Your configuration looks correct, but some versions of nvidia-smi report information about the device IDs differently. There is another thread where this was discussed here: GPU Access Limited by CGroup
Take a look at the pbs_mom logs in /var/spool/pbs/mom_logs and see if there is anything helpful there. If not, you can increase the verbosity of the logs by adding a line to /var/spool/pbs/mom_priv/config that looks like this: $logevent 0xffff
You will need to restart pbs_mom so that it rereads its configuration. Try running another job and see if the logs provide any clues. Feel free to post excerpts here if you need additional help.
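If the logs get long after raising the verbosity, a small filter can help narrow things down. The following is only a sketch: the helper name `find_log_lines` and the keyword list are my own choices, not anything shipped with PBS Pro, and the log directory is the default path mentioned above.

```python
import glob
import os


def find_log_lines(log_dir, keywords):
    """Return (filename, line) pairs from log_dir that mention any keyword.

    Matching is case-insensitive. Both the function name and the keywords
    are illustrative; adjust them to whatever you are hunting for.
    """
    wanted = [k.lower() for k in keywords]
    hits = []
    for path in sorted(glob.glob(os.path.join(log_dir, "*"))):
        if not os.path.isfile(path):
            continue
        # errors="replace" keeps odd bytes in a log from aborting the scan
        with open(path, errors="replace") as fh:
            for line in fh:
                lowered = line.lower()
                if any(k in lowered for k in wanted):
                    hits.append((os.path.basename(path), line.rstrip("\n")))
    return hits


if __name__ == "__main__":
    # Default mom log location from the post above; change it if your
    # PBS_HOME is somewhere else.
    keywords = ["CUDA_VISIBLE_DEVICES", "cgroup", "hook"]
    for name, line in find_log_lines("/var/spool/pbs/mom_logs", keywords):
        print(f"{name}: {line}")
```

Run it on the execution host as a user who can read the mom logs, after the test job has finished.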
Glad to help. While you found the directory where the hooks are stored, you should not modify those files. You must use qmgr to import/export hooks and manage their configuration so that the changes get propagated to your execution hosts where pbs_mom runs. See chapter 3 of the PBS Pro admin guide located here: https://www.pbsworks.com/SupportGT.aspx?d=PBS-Professional,-Documentation
There should be a pbs_cgroups hook already present…
# qmgr -c "list hook"
Hook pbs_cgroups
    type = site
    enabled = false
    event = execjob_begin,execjob_epilogue,execjob_end,execjob_launch,
        execjob_attach,
        exechost_periodic,
        exechost_startup
    user = pbsadmin
    alarm = 90
    freq = 120
    order = 100
    debug = false
    fail_action = offline_vnodes
The cgroups hook configuration file you listed in your initial post led me to believe you were using the cgroups hook supplied with the PBS Pro package. I made a poor assumption. In this case, if all you want to do is set CUDA_VISIBLE_DEVICES, the cgroups hook may be overkill. You may still want to take a look at how it gets set in the cgroups hook. Search for CUDA_VISIBLE_DEVICES and you’ll see the commands that add it to the environment for the job. If you do choose to use the cgroups hook, it is documented in chapter 16 of the admin guide.
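For the do-it-yourself route, a stripped-down hook might look roughly like the sketch below. Treat it as an illustration under assumptions rather than what the cgroups hook actually does: the helper `set_cuda_visible_devices` is a name I made up, the hook is assumed to be attached to the execjob_launch event (where the job environment is exposed as `pbs.event().env`), and `assigned_gpus` is a hard-coded placeholder because working out which devices belong to the job is site-specific.

```python
try:
    import pbs  # only importable inside the PBS hook interpreter
except ImportError:
    pbs = None  # lets the helper below be tried outside of PBS


def set_cuda_visible_devices(env, gpu_indices):
    """Write a comma-separated device list into env and return it.

    `env` is any dict-like object; inside a hook it would be the
    environment exposed by the execjob_launch event.
    """
    value = ",".join(str(i) for i in gpu_indices)
    env["CUDA_VISIBLE_DEVICES"] = value
    return value


if pbs is not None:
    event = pbs.event()
    if event.type == pbs.EXECJOB_LAUNCH:
        # ASSUMPTION: placeholder only. A real hook has to work out
        # which GPUs were assigned to this job on this host.
        assigned_gpus = [0]
        set_cuda_visible_devices(event.env, assigned_gpus)
    event.accept()
```

Keeping the string building in a plain function means it can be checked on a workstation before the hook is imported with qmgr.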
You should be able to take the current cgroups hook from master and run it on a 14.x installation, but you will need to import it as though you wrote it yourself. And don’t forget to import the hook configuration file as well.
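The import steps could be scripted along these lines. This is a sketch, not a tested procedure for 14.x: the hook name `site_cgroups` and the file names are placeholders, and the qmgr syntax is the `create hook` / `import hook` form described in the hooks documentation, so please check it against the guide for your version before running anything.

```python
import subprocess


def hook_import_commands(hook_name, script_path, config_path):
    """Build the qmgr invocations that create a site hook and import
    its Python script and its configuration file.

    All three arguments are supplied by the caller; nothing here is
    specific to the cgroups hook.
    """
    return [
        ["qmgr", "-c", f"create hook {hook_name}"],
        ["qmgr", "-c",
         f"import hook {hook_name} application/x-python default {script_path}"],
        ["qmgr", "-c",
         f"import hook {hook_name} application/x-config default {config_path}"],
    ]


def run_commands(commands, dry_run=True):
    """Print each qmgr directive, or execute the commands when dry_run is False."""
    for argv in commands:
        if dry_run:
            print(f'{argv[0]} {argv[1]} "{argv[2]}"')
        else:
            subprocess.run(argv, check=True)


if __name__ == "__main__":
    # Placeholder names: adjust to wherever you saved the files from master.
    cmds = hook_import_commands("site_cgroups", "pbs_cgroups.PY", "pbs_cgroups.CF")
    run_commands(cmds, dry_run=True)
```

With `dry_run=True` nothing is changed, so you can review the commands first. The hook’s events and its enabled flag still need to be set afterwards with qmgr `set hook`.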