GPU installed on our HPC's cnode02 not active and not showing up

Dear Team,

We have set up an HPC cluster with one master node and two compute nodes. Only one compute node, cnode02, has an NVIDIA GPU (NVIDIA A100 80GB) installed. However, the card is not reflected in the output of the "pbsnodes -aSj" command, where the ngpus column shows 0/0 for cnode02. Please find the output below:

[root@nitcshpcmn ~]# pbsnodes -aSj
                                                             mem   ncpus   nmics   ngpus
vnode           state            njobs   run   susp          f/t     f/t     f/t     f/t jobs
cnode01         free                 0     0      0  251gb/251gb 104/104     0/0     0/0 --
cnode02         free                 0     0      0  251gb/251gb 104/104     0/0     0/0 --

However, nvidia-smi on the compute node does show the graphics card details:

[root@Masternode ~]# ssh cnode02
[root@cnode02 ~]# nvidia-smi
Thu Feb 15 17:24:09 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.54.03              Driver Version: 535.54.03    CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A100 80GB PCIe          Off | 00000000:CA:00.0 Off |                    0 |
| N/A   47C    P0              65W / 300W |      4MiB / 81920MiB |     24%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+
[root@cnode02 ~]#

Kindly guide me on how to resolve this issue.

Thanks!

Please follow the discussion below:

Note:

  • You need to configure the GPU card as a custom resource; PBS does not pick it up automatically.
  • Unlike CPU and memory, the GPU card is not monitored by PBS. You can, however, use a MoM periodic hook to update GPU statistics through custom resources attached to the compute nodes.
  • Check the cgroups configuration for GPUs.
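
As a rough illustration of the first note, the sketch below prints (and optionally runs) the qmgr commands for basic GPU scheduling with a consumable resource. It assumes the conventional resource name ngpus, the node name cnode02 and the single GPU shown in the post, and a default PBS_HOME layout. The run and configure_gpu_resource helpers are made up for this example. Please check each step against the referenced guide for your PBS version before running it for real.

```shell
#!/bin/sh
# Sketch: register the GPU on cnode02 as a consumable "ngpus" resource.
# DRY_RUN=1 (the default) only prints each command; run with DRY_RUN=0
# as root on the PBS server host to execute them.
NODE="${NODE:-cnode02}"   # node name taken from the post
NGPUS="${NGPUS:-1}"       # nvidia-smi on cnode02 lists one A100

# Hypothetical helper: print the command in dry-run mode, else execute it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        printf '+ %s\n' "$*"   # shell quoting is flattened in this printout
    else
        "$@"
    fi
}

configure_gpu_resource() {
    # 1. Create a host-level consumable resource
    #    (skip this if your server already defines ngpus).
    run qmgr -c "create resource ngpus type=long, flag=nh"

    # 2. Manual step, not automated here.
    printf '%s\n' "# manual: add ngpus to the resources: line in PBS_HOME/sched_priv/sched_config, then send SIGHUP to pbs_sched"

    # 3. Tell PBS how many GPUs the node has.
    run qmgr -c "set node $NODE resources_available.ngpus=$NGPUS"

    # 4. Re-check the node summary.
    run pbsnodes -aSj
}

configure_gpu_resource
```

Once the steps have been applied, the ngpus column for cnode02 is expected to change from 0/0 to 1/1, and a job can request the card with something like "qsub -l select=1:ncpus=1:ngpus=1 job.sh" (job.sh is a placeholder).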

Ref: https://help.altair.com/2022.1.0/PBS%20Professional/PBS2022.1.pdf , section 5.14.7 Using GPUs
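
For the note about GPU statistics, a periodic hook needs some source of numbers. One possible source is the CSV query mode of nvidia-smi; the sketch below turns one such line into key=value pairs that a hook could copy into custom resources. The function name parse_gpu_stats and the key names gpu_util, gpu_mem_used_mib and gpu_mem_total_mib are hypothetical, and the hook itself is not shown.

```shell
#!/bin/sh
# Sketch: convert nvidia-smi CSV output into key=value pairs.
# Expected input, one line per GPU, e.g. "24, 4, 81920":
#   utilization.gpu (%), memory.used (MiB), memory.total (MiB)
parse_gpu_stats() {
    # Drop the spaces nvidia-smi puts after each comma, then split on commas.
    tr -d ' ' | while IFS=, read -r util used total; do
        printf 'gpu_util=%s gpu_mem_used_mib=%s gpu_mem_total_mib=%s\n' \
            "$util" "$used" "$total"
    done
}

# Only query the card when nvidia-smi is actually installed on this host.
if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi --query-gpu=utilization.gpu,memory.used,memory.total \
               --format=csv,noheader,nounits | parse_gpu_stats
fi
```

With the values shown in the nvidia-smi output above (24% utilisation, 4MiB of 81920MiB used), this would print "gpu_util=24 gpu_mem_used_mib=4 gpu_mem_total_mib=81920".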