Hi,
I have installed a PBS server and enabled cgroups v1 on Rocky Linux 8.8. The system seemed to work fine for a while, but then some nodes started to show no GPUs when a session is opened. Restarting the MoM temporarily solves the problem.
I enabled the ngpus resource on the scheduler. The relevant sections of the hook configuration are:
    "discover_gpus" : true,
    "manage_rlimit_as" : false,
    "nvidia-smi" : "/usr/bin/nvidia-smi",
    "cpuset" : {
        "enabled" : true,
        "exclude_cpus" : [],
        "exclude_hosts" : [],
        "exclude_vntypes" : [],
        "mem_fences" : false,
        "mem_hardwall" : false,
        "memory_spread_page" : false
    },
    "devices" : {
        "enabled" : true,
        "exclude_hosts" : [],
        "exclude_vntypes" : [],
        "allow" : [
            "b *:* rwm",
            "c *:* m",
            "c 195:* m",
            "c 136:* rwm",
            ["knem","rwm"],
            ["fuse","rwm"],
            ["net/tun","rwm"],
            ["tty","rwm"],
            ["ptmx","rwm"],
            ["console","rwm"],
            ["null","rwm"],
            ["zero","rwm"],
            ["full","rwm"],
            ["random","rwm"],
            ["urandom","rwm"],
            ["cpu/0/cpuid","rwm","*"],
            ["nvidia-modeset", "rwm"],
            ["nvidia-uvm", "rwm"],
            ["nvidia-uvm-tools", "rwm"],
            ["nvidiactl", "rwm"],
            ["xpmem", "rwm"]
I had to add xpmem because the nvidia-sdk compiler requires it. I have the following questions:
- I enabled cpuset after a few trials in which I was unable to isolate GPUs; why is that?
- Are the settings correct?
- Could xpmem be the culprit?
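To help narrow this down, here is a minimal sketch (not part of the hook, and not something I have validated on every node) for checking which NVIDIA devices a session is actually allowed to use under cgroups v1. It assumes the standard v1 layout, where /proc/self/cgroup names the devices cgroup and its devices.list file holds one rule per line in the same "type major:minor access" form as the allow entries above. It also assumes that the GPU device nodes use character major 195 and that minors 254 and 255 are nvidia-modeset and nvidiactl. All function names are hypothetical.

```python
import re
from typing import List, NamedTuple, Optional


class DeviceRule(NamedTuple):
    kind: str    # 'a' (all), 'b' (block) or 'c' (character)
    major: str   # device major number, or '*'
    minor: str   # device minor number, or '*'
    access: str  # any combination of r (read), w (write), m (mknod)


# One rule per line, e.g. "c 195:0 rwm" or "c 195:* m".
RULE_RE = re.compile(r"^([abc])\s+(\*|\d+):(\*|\d+)\s+([rwm]+)$")

NVIDIA_MAJOR = "195"             # assumption: major used by /dev/nvidia*
CONTROL_MINORS = {"254", "255"}  # assumption: nvidia-modeset, nvidiactl


def parse_rules(text: str) -> List[DeviceRule]:
    """Parse the text of a devices.list file, skipping unrecognised lines."""
    rules = []
    for line in text.splitlines():
        match = RULE_RE.match(line.strip())
        if match:
            rules.append(DeviceRule(*match.groups()))
    return rules


def gpu_minors_with_rw(rules: List[DeviceRule]) -> List[str]:
    """Return minors of NVIDIA character devices that are readable and writable.

    A rule such as "c 195:* m" grants mknod only, so it does not count.
    """
    minors = []
    for rule in rules:
        if rule.kind != "c" or rule.major != NVIDIA_MAJOR:
            continue
        if rule.minor in CONTROL_MINORS:
            continue
        if "r" in rule.access and "w" in rule.access:
            minors.append(rule.minor)
    return minors


def devices_cgroup_path(proc_cgroup_text: str) -> Optional[str]:
    """Find the devices cgroup path in the text of /proc/<pid>/cgroup (v1)."""
    for line in proc_cgroup_text.splitlines():
        parts = line.split(":", 2)
        if len(parts) == 3 and "devices" in parts[1].split(","):
            return parts[2]
    return None


def report_current_process() -> List[str]:
    """Read the devices cgroup of the calling process and list usable GPU minors."""
    with open("/proc/self/cgroup") as handle:
        path = devices_cgroup_path(handle.read())
    if path is None:
        return []
    # Assumption: the v1 devices controller is mounted at /sys/fs/cgroup/devices.
    with open("/sys/fs/cgroup/devices" + path + "/devices.list") as handle:
        return gpu_minors_with_rw(parse_rules(handle.read()))
```

Calling report_current_process() from inside a session on a healthy node and on a node showing no GPUs should make it clear whether the per-GPU entries are missing from the devices cgroup, or whether the devices are allowed and the problem lies elsewhere.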
Thank you in advance for your help.
Best regards