Disappearing GPUs

Hi,

I have installed a PBS server and enabled the use of cgroups v1 on Rocky Linux 8.8. The system seemed to work fine for a while, but then some nodes started to show no GPUs when a session is opened. Restarting the MoM temporarily solves the problem.

I enabled the ngpus resource on the scheduler; the relevant sections of the hook configuration are:

    "discover_gpus"         : true,
    "manage_rlimit_as"      : false,
    "nvidia-smi"            : "/usr/bin/nvidia-smi",
        "cpuset" : {
            "enabled"            : true,
            "exclude_cpus"       : [],
            "exclude_hosts"      : [],
            "exclude_vntypes"    : [],
            "mem_fences"         : false,
            "mem_hardwall"       : false,
            "memory_spread_page" : false
        },
        "devices" : {
            "enabled"            : true,
            "exclude_hosts"      : [],
            "exclude_vntypes"    : [],
            "allow"              : [
                "b *:* rwm",
                "c *:* m",
                "c 195:* m",
                "c 136:* rwm",
                ["knem","rwm"],
                ["fuse","rwm"],
                ["net/tun","rwm"],
                ["tty","rwm"],
                ["ptmx","rwm"],
                ["console","rwm"],
                ["null","rwm"],
                ["zero","rwm"],
                ["full","rwm"],
                ["random","rwm"],
                ["urandom","rwm"],
                ["cpu/0/cpuid","rwm","*"],
                ["nvidia-modeset", "rwm"],
                ["nvidia-uvm", "rwm"],
                ["nvidia-uvm-tools", "rwm"],
                ["nvidiactl", "rwm"],
                ["xpmem", "rwm"]

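To check whether this devices section actually isolates the GPUs, I submit a small test job and look at what it can see. The cgroup path in the second command is a guess based on our cgroups v1 layout (the jobid is a placeholder), so adjust it to wherever the hook creates job cgroups on your nodes.

    # Submit a one-GPU test job that simply lists the GPUs visible to it
    qsub -l select=1:ngpus=1 -- /usr/bin/nvidia-smi -L

    # On the execution host, inspect the devices whitelist of the job's cgroup
    # (path is an assumption for a cgroups v1 setup; adjust to your layout)
    cat /sys/fs/cgroup/devices/pbspro/<jobid>/devices.list
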
I had to add xpmem because the nvidia-sdk compilers require it. I have the following questions:

  • I only managed to isolate the GPUs after enabling cpuset, following a few trials in which isolation did not work without it; why is that?
  • Are the settings correct?
  • Could xpmem be the culprit?

Thank you in advance for your help.

Best regards

Your settings are correct.
xpmem might be the culprit (if the problems only started after you added it), as it was not in the default allowed list of devices.

Hi Adarsh,

Thank you for your reply. I have removed xpmem and will see whether the system is now stable. While digging into the problem I noticed that some nodes are missing the nvidia-uvm and nvidia-uvm-tools devices, which should be present. I am trying to see whether the two things correlate and will update.
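
For reference, the check I am running on each node is just plain shell; nothing here is PBS-specific.

    # List the NVIDIA device nodes that are currently present
    ls -l /dev/nvidia*

    # nvidia-uvm and nvidia-uvm-tools only exist once the nvidia_uvm module is loaded
    lsmod | grep nvidia_uvm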

Also, I wonder whether enabling cpuset together with `"vnode_per_numa_node" : true` may be an issue; what are your thoughts on that?

For GPU isolation, `"vnode_per_numa_node" : true` is correct.
Enabling cpuset is also correct.
They would not cause any issues.

Hi,

Thank you. I have now forced the creation of all the device nodes on all nodes and will see whether this was (part of) the problem.
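
Concretely, "forcing the creation" means a small script along the lines of what NVIDIA documents for creating the UVM device nodes by hand. The minor numbers below are what I found on our nodes, so double-check them before reusing this.

    #!/bin/bash
    # Load the UVM module so its major number appears in /proc/devices
    /sbin/modprobe nvidia-uvm

    # The major number is assigned dynamically, so look it up
    major=$(grep nvidia-uvm /proc/devices | awk '{print $1}')

    # Create the character device nodes only if they are missing
    [ -e /dev/nvidia-uvm ]       || mknod -m 666 /dev/nvidia-uvm       c "$major" 0
    [ -e /dev/nvidia-uvm-tools ] || mknod -m 666 /dev/nvidia-uvm-tools c "$major" 1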

Hi Adarsh,

Apparently, setting up the correct device nodes on all nodes and removing xpmem did not solve the problem. The system worked for a while and then the problem arose once again.
If I look at an affected node, I see that the corresponding vnodes have no GPUs available, while they should have two.
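
For what it is worth, this is what I look at when I say the vnodes have no GPUs; both commands are standard PBS/NVIDIA tools.

    # GPU counts per vnode as the PBS server sees them
    pbsnodes -av | grep -E 'resources_available\.ngpus|resources_assigned\.ngpus'

    # GPUs as the node itself sees them (outside any job cgroup)
    nvidia-smi -L
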
Any ideas?

Hi,

Apologies for the late reply, but I thought an update could be useful to others. To sum up: we have this cluster (with users running jobs, so restarting the PBS server or anything of that sort is a last resort) in which GPUs “disappear” from the vnodes after a few hours, even though they are present when the vnodes are created. Restarting the MoM solves the problem, but only for another few hours. Looking into the problem, I thought it could be a cgroups configuration issue and tried:

  • disabling xpmem (it was a requirement for only one specific application)
  • disabling cpuset (which is not strictly necessary here, as most users allocate one or two whole vnodes)
  • running mknod (as sketched in an earlier post) to make sure the same set of NVIDIA device nodes was present on all nodes

None of these things solved the problem initially; in the end, the cluster became stable. What I did was fix a typo in sched_priv/config (missing quotes in the definition of a licensed resource), check all the nodes again (since the problem did not appear on all of them simultaneously and there were users running), and remove the infiniband/ucm0 device entries, which were not present on the nodes.
I do not think these last things really had any impact, so the problem was probably in one of the settings above. As soon as possible I’ll set up a testing environment, but it seemed polite to tell the story so far.
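
In case it helps anyone, the fix to sched_priv/config was simply adding the missing quotes. The example line below is only an illustration with a placeholder resource name and path, not our real definition.

    # Resource definitions in sched_priv/config must be quoted, e.g. a dynamic
    # (license) resource; "my_license" and the script path are placeholders:
    #   server_dyn_res: "my_license !/path/to/license_check_script"
    # A quick way to review the relevant lines (adjust the path to your PBS_HOME):
    grep -nE 'dyn_res|^resources:' /var/spool/pbs/sched_priv/config
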
If you have any thoughts or questions, please share.

Best regards

Hi Adarsh,

Just to let you know that, taking advantage of a scheduled maintenance stop, I restored cpuset and xpmem in the cgroups configuration and the system remained stable. It was definitely a problem with the missing device nodes and mknod.
