Disappearing GPUs

tuthmose · June 13, 2023, 5:43pm

Hi,

I have installed a PBS server and enabled the use of cgroups v1 on Rocky Linux 8.8. The system seemed to work fine for a while, but then some nodes started to show no GPUs when a session is opened. Restarting the MoM temporarily solves the problem.

I enabled the ngpus resource on the scheduler; the relevant sections of the hook configuration are:

    "discover_gpus"         : true,
    "manage_rlimit_as"      : false,
    "nvidia-smi"            : "/usr/bin/nvidia-smi",

        "cpuset" : {
            "enabled"            : true,
            "exclude_cpus"       : [],
            "exclude_hosts"      : [],
            "exclude_vntypes"    : [],
            "mem_fences"         : false,
            "mem_hardwall"       : false,
            "memory_spread_page" : false
        },
        "devices" : {
            "enabled"            : true,
            "exclude_hosts"      : [],
            "exclude_vntypes"    : [],
            "allow"              : [
                "b *:* rwm",
                "c *:* m",
                "c 195:* m",
                "c 136:* rwm",
                ["knem","rwm"],
                ["fuse","rwm"],
                ["net/tun","rwm"],
                ["tty","rwm"],
                ["ptmx","rwm"],
                ["console","rwm"],
                ["null","rwm"],
                ["zero","rwm"],
                ["full","rwm"],
                ["random","rwm"],
                ["urandom","rwm"],
                ["cpu/0/cpuid","rwm","*"],
                ["nvidia-modeset", "rwm"],
                ["nvidia-uvm", "rwm"],
                ["nvidia-uvm-tools", "rwm"],
                ["nvidiactl", "rwm"],
                ["xpmem", "rwm"]

I had to add xpmem because the nvidia-sdk compiler requires it. I have the following questions:

I enabled cpuset after a few trials in which I was unable to isolate GPUs; why is that?
Are the settings correct?
Could xpmem be the culprit?

thank you in advance for your help.

best regards

adarsh · June 14, 2023, 12:08pm

Your settings are correct.
xpmem might be the culprit (if everything worked after adding it), as it was not in the allowed list of devices.

tuthmose · June 14, 2023, 12:24pm

Hi adarsh,

thank you for your reply. I have removed xpmem and will see whether the system is now stable. While digging into the problem I noticed that some nodes do not have the nvidia-uvm-tools and nvidia-uvm devices, which should be present. I am trying to see whether the two things correlate and will update.
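
A quick way to spot nodes with missing device files is to check /dev for each NVIDIA name in the hook's allow list. The following is a minimal sketch and not part of the PBS cgroup hook; the function name `missing_char_devices` and the `EXPECTED` list are hypothetical, and the device names are simply copied from the configuration quoted earlier in this thread.

```python
import os
import stat

# Device names copied from the "allow" list in the hook configuration above.
EXPECTED = ["nvidiactl", "nvidia-modeset", "nvidia-uvm", "nvidia-uvm-tools"]

def missing_char_devices(names, dev_dir="/dev"):
    """Return the names that are absent from dev_dir or are not character devices."""
    missing = []
    for name in names:
        path = os.path.join(dev_dir, name)
        try:
            mode = os.stat(path).st_mode
        except OSError:
            # The device file does not exist (or cannot be inspected).
            missing.append(name)
            continue
        if not stat.S_ISCHR(mode):
            missing.append(name)
    return missing

if __name__ == "__main__":
    # Run on each execution node; an empty list means every expected device is present.
    print(missing_char_devices(EXPECTED))
```

Running this on every node and comparing the results should show whether the nodes that lose their GPUs are the same ones that lack some of these device files.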

Also, I wonder if setting cpuset=enabled with "vnode_per_numa_node" : true may be an issue; what are your thoughts on it?

adarsh · June 14, 2023, 1:02pm

For GPU isolation, "vnode_per_numa_node" : true is correct.
cpuset=enabled is also correct.
They would not cause any issues.

tuthmose · June 14, 2023, 4:25pm

Hi,

thank you. I have now forced the creation of all the devices on all nodes, and will see whether this was (part of) the problem.

tuthmose · June 15, 2023, 9:14am

Hi Adarsh,

apparently, setting the correct devices on all nodes and eliminating xpmem did not solve the problem. The system worked for a while and then the problem arose once again.
If I look at a node with the problem, I see that the corresponding vnodes have no GPUs available, while they should have two.
Any ideas?
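
To catch the moment a vnode loses its GPUs, the reported `ngpus` value can be compared against the expected count. The sketch below is only an illustration: it assumes the plain-text layout printed by `pbsnodes -av` (vnode name at the start of a line, indented `attribute = value` lines below it), and the function names and the `SAMPLE` text are hypothetical, not real output from this cluster.

```python
def ngpus_by_vnode(pbsnodes_text):
    """Map each vnode name to its resources_available.ngpus value (0 if not reported)."""
    result = {}
    current = None
    for line in pbsnodes_text.splitlines():
        if not line.strip():
            continue
        if not line[0].isspace():
            # Assumption: an unindented line starts a new vnode record.
            current = line.strip()
            result[current] = 0
        elif current is not None and "=" in line:
            key, _, value = line.partition("=")
            if key.strip() == "resources_available.ngpus":
                result[current] = int(value.strip())
    return result

def vnodes_missing_gpus(pbsnodes_text, expected):
    """Return the vnodes whose reported GPU count is below the expected count."""
    counts = ngpus_by_vnode(pbsnodes_text)
    return sorted(name for name, n in counts.items() if n < expected.get(name, 0))

# Hypothetical sample, for illustration only.
SAMPLE = """node01[0]
     Mom = node01
     resources_available.ngpus = 2

node01[1]
     Mom = node01
     resources_available.ncpus = 16
"""

if __name__ == "__main__":
    print(vnodes_missing_gpus(SAMPLE, {"node01[0]": 2, "node01[1]": 2}))
```

Feeding it the real command output at regular intervals would give a timestamp for when each vnode dropped to zero GPUs, which can then be matched against the MoM logs.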

tuthmose · July 4, 2023, 12:29pm

Hi,

apologies for the late reply, but I thought this could be useful to others. To sum up: we had a cluster (with users running jobs, so restarting the PBS server or the like is a last resort) in which GPUs “disappeared” from vnodes after some hours, even though they were present when the nodes were created. Restarting the MoM solved the problem, but only for the same few hours. Looking into the problem, I thought it could be a cgroup definition issue and tried:

disabling xpmem (it seemed to be a requirement for a specific application)
disabling cpuset (which is not strictly necessary, as most users allocate one or two vnodes)
executing mknod to be sure that the same set of nvidia devices was present on all nodes

none of these things solved the problem initially; in the end, the cluster became stable. What I did was fix a typo in sched_priv/config (missing quotes in the definition of a licensed resource), check all the nodes again (since the problem did not arise simultaneously and there were users running jobs) and disable the infiniband/ucm0 resources, which were not present.
I do not think these last changes really had any impact, so the problem was probably in one of the settings above. I’ll set up a testing environment as soon as possible, but it seemed polite to tell the story so far.
If you have any thoughts or questions, please share.
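
A typo such as the unpaired quote in sched_priv/config mentioned above can be spotted mechanically. This is a minimal sketch under stated assumptions: the function name `unbalanced_quote_lines` is hypothetical, the sample text and the resource name `app_lic` are invented for illustration, and the check only catches an odd number of double quotes on a line (it will not notice a line where both quotes are missing).

```python
def unbalanced_quote_lines(config_text):
    """Return (line_number, line) pairs whose count of double quotes is odd."""
    problems = []
    for number, line in enumerate(config_text.splitlines(), start=1):
        stripped = line.strip()
        # Skip blank lines and comment lines.
        if not stripped or stripped.startswith("#"):
            continue
        if stripped.count('"') % 2 == 1:
            problems.append((number, stripped))
    return problems

# Hypothetical excerpt; "app_lic" is an invented resource name.
SAMPLE_CONFIG = """# scheduler resources
resources: "ncpus, mem, ngpus, app_lic
round_robin: False all
"""

if __name__ == "__main__":
    for number, line in unbalanced_quote_lines(SAMPLE_CONFIG):
        print(f"line {number}: {line}")
```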

Best regards

tuthmose · August 11, 2023, 12:53pm

Hi Adarsh,

just to let you know that, taking advantage of a technical stop, I restored cpuset and xpmem in the CG configuration and the system remained stable. It was definitely a problem with the nodes and mknod.
