Advanced GPU Scheduling

wakaka · July 2, 2025, 7:13am

Advanced scheduling requires defining a child vnode for each GPU.
I have two nodes, each with 18 CPUs and 4 GPUs. Is it necessary to create vnodes to use the resource limits of advanced GPU scheduling? I just want to treat each node as a whole without subdividing it.

adarsh · July 2, 2025, 4:28pm

There is no need to create vnodes to schedule jobs onto a host with multiple GPU cards.
Users can submit jobs that use specific GPUs by setting the CUDA_VISIBLE_DEVICES variable.
If you would like GPU device isolation, you would need to use the cgroups hook; it takes care of mapping the correct CUDA_VISIBLE_DEVICES value and/or the UUID of the GPU card.
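Without the cgroups hook, choosing GPUs through CUDA_VISIBLE_DEVICES is purely cooperative: the job sets the variable for its own processes, and nothing stops it from touching other cards. A minimal sketch of how a job script might do that, assuming a Python wrapper; the helper name `gpu_env` and the GPU indices 0 and 2 are hypothetical, and the child command is just a stand-in that echoes the value it received.

```python
import os
import subprocess
import sys


def gpu_env(gpu_ids, base_env=None):
    """Return a copy of the environment that exposes only the given GPU indices."""
    env = dict(os.environ if base_env is None else base_env)
    # CUDA renumbers the listed devices as 0..n-1 inside the child process.
    env["CUDA_VISIBLE_DEVICES"] = ",".join(str(i) for i in gpu_ids)
    return env


if __name__ == "__main__":
    # Hypothetical choice: this job should only use GPUs 0 and 2.
    env = gpu_env([0, 2])
    # Stand-in for the real GPU program; it only prints what it was given.
    subprocess.run(
        [sys.executable, "-c", "import os; print(os.environ['CUDA_VISIBLE_DEVICES'])"],
        env=env,
        check=True,
    )
```

With the cgroups hook enabled, the hook sets this variable for the job itself, so a wrapper like this is only relevant for the no-isolation setup described above.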

wakaka · July 10, 2025, 9:14am

1. Why does pbs_cgroups automatically create vnodes for me when I use it, and also reduce the available memory and CPUs for this node? Are vnodes necessary when using cgroups?
2. When I use -l select=1:ngpus=1, the job also takes 1 CPU. How can I use only a GPU without a CPU?

adarsh · July 10, 2025, 5:42pm

Please check the documentation on vnode_per_numa_node and allow_zero_cpus.
These settings should help you achieve what you need.
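A small sketch for checking both settings in a local copy of pbs_cgroups.json and for building a GPU-only chunk. Two assumptions to verify against your hook version: that vnode_per_numa_node sits at the top level of the file and allow_zero_cpus under the cpuset subsystem, and that a GPU-only request is expressed by asking for ncpus=0 in the chunk once allow_zero_cpus is enabled. The embedded config, the helper names and `job.sh` are hypothetical.

```python
import json

# Hypothetical excerpt of a pbs_cgroups.json; key placement is an assumption.
EXAMPLE_CONFIG = json.loads("""
{
  "vnode_per_numa_node": false,
  "cgroup": {
    "cpuset": {"enabled": true, "allow_zero_cpus": true}
  }
}
""")


def check_cgroups_config(cfg):
    """Report the two settings discussed above; None means 'not set in this file'."""
    cpuset = cfg.get("cgroup", {}).get("cpuset", {})
    return {
        "vnode_per_numa_node": cfg.get("vnode_per_numa_node"),
        "allow_zero_cpus": cpuset.get("allow_zero_cpus"),
    }


def gpu_only_select(ngpus=1):
    """Build a chunk that asks for GPUs and explicitly zero CPUs (assumed syntax)."""
    return f"select=1:ncpus=0:ngpus={ngpus}"


if __name__ == "__main__":
    print(check_cgroups_config(EXAMPLE_CONFIG))
    print("qsub -l " + gpu_only_select(1) + " job.sh")
```

Returning None for missing keys, rather than guessing a default, keeps the check honest: the hook's own defaults apply in that case, and they may differ between releases.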

wakaka · July 11, 2025, 9:57am

Thanks for your reply, Mr. adarsh. I set vnode_per_numa_node to false. It no longer creates vnodes automatically, but it still reduces my memory and CPUs. I originally had 4gb of memory and 4 CPUs. After enabling pbs_cgroups, the node had only 3gb of memory and 2 CPUs available. How can I solve this?

adarsh · July 11, 2025, 7:14pm

Thank you @wakaka

Please share your pbs_cgroups.json file.

Did you remove the vnodes and the natural node after setting vnode_per_numa_node to false, and then add the same node again? That should have created only one natural node.
Please also share the actual memory and CPUs available on the node,
and the output of pbsnodes -av.

wakaka · July 14, 2025, 1:13am

This is my pbs_cgroups.json:

Yes, I have tried it: I removed the node first and added it again, but it doesn't work.

adarsh · July 14, 2025, 7:53pm

Thank you for sharing your cgroup configuration.

The memory subsystem is set to false, so I am not sure how this can happen.
However, note the reserved_amount: 1GB setting. This reserved memory decreases the resources_available.mem that MoM advertises to the server as being available for each vnode, and also reduces the amount of memory the cgroups hook will assign to jobs.
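The memory drop wakaka reported is consistent with this: 4gb of physical memory minus a reserved_amount of 1GB leaves 3gb advertised. A simplified sketch of that arithmetic, assuming the hook subtracts reserved_amount directly and ignoring any other reservation options it may have; the helper names are hypothetical.

```python
import re

# Binary multiples, matching how PBS interprets kb/mb/gb size suffixes.
_UNITS = {"b": 1, "kb": 1024, "mb": 1024**2, "gb": 1024**3, "tb": 1024**4}


def to_bytes(size):
    """Parse sizes like '1GB', '4gb' or '512MB' into bytes."""
    m = re.fullmatch(r"\s*(\d+)\s*([kmgt]?b)\s*", size, re.IGNORECASE)
    if m is None:
        raise ValueError(f"unrecognized size: {size!r}")
    return int(m.group(1)) * _UNITS[m.group(2).lower()]


def advertised_mem(physical, reserved_amount):
    """Memory MoM would advertise after subtracting reserved_amount (simplified)."""
    return max(to_bytes(physical) - to_bytes(reserved_amount), 0)


if __name__ == "__main__":
    print(advertised_mem("4gb", "1GB") // 1024**3, "gb")  # 3 gb
```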

Also, I am not sure how ncpus got decreased.

It is probably better to disable the cgroups hook, delete the node and add it again. Make sure the node then displays the correct mem and ncpus. After that, enable the cgroups hook and check your node's configuration again.
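The procedure above maps onto a short sequence of qmgr and pbsnodes commands. A sketch that prints them by default (dry run) and only executes them when asked; `node01` is a hypothetical node name, and it is assumed the node has no running jobs when it is deleted. The second pbsnodes check may need to wait until MoM has picked up the re-enabled hook and reported again.

```python
import shlex
import subprocess


def reset_node_commands(node):
    """Command sequence for the disable / re-add / re-enable procedure."""
    return [
        ["qmgr", "-c", "set hook pbs_cgroups enabled = false"],
        ["qmgr", "-c", f"delete node {node}"],
        ["qmgr", "-c", f"create node {node}"],
        ["pbsnodes", "-av"],  # expect the full mem/ncpus here, hook disabled
        ["qmgr", "-c", "set hook pbs_cgroups enabled = true"],
        ["pbsnodes", "-av"],  # compare once the hook is active again
    ]


def run(commands, dry_run=True):
    for argv in commands:
        print(shlex.join(argv))  # shell-quoted so it can be copied and pasted
        if not dry_run:
            subprocess.run(argv, check=True)


if __name__ == "__main__":
    run(reset_node_commands("node01"))  # prints only; pass dry_run=False to execute
```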

wakaka · July 15, 2025, 6:51am

Thank you very much for the pointers, Mr. adarsh. The memory problem was caused by the reserved_amount setting, and the CPU count problem was caused by hyperthreading.
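The CPU drop fits a node whose 4 logical CPUs are 2 physical cores with 2 hardware threads each, if the cgroups hook counts only physical cores. The hook's cpuset subsystem has an option controlling whether hyperthreads count as CPUs (`use_hyperthreads` in common versions; the name and default are worth verifying for your release). A sketch that counts logical CPUs versus physical cores from /proc/cpuinfo-style text; the sample input is hypothetical and mirrors the 4 → 2 case above.

```python
import os

# Hypothetical 4-thread node: 2 physical cores, 2 hyperthreads each.
SAMPLE_CPUINFO = """\
processor\t: 0
physical id\t: 0
core id\t\t: 0

processor\t: 1
physical id\t: 0
core id\t\t: 1

processor\t: 2
physical id\t: 0
core id\t\t: 0

processor\t: 3
physical id\t: 0
core id\t\t: 1
"""


def count_cpus(cpuinfo_text):
    """Return (logical_cpus, physical_cores) from /proc/cpuinfo-style text."""
    logical = 0
    cores = set()
    for block in cpuinfo_text.strip().split("\n\n"):
        fields = {}
        for line in block.splitlines():
            key, sep, value = line.partition(":")
            if sep:
                fields[key.strip()] = value.strip()
        if "processor" not in fields:
            continue
        logical += 1
        # Hyperthreads share a (physical id, core id) pair; if those fields are
        # missing, fall back to treating each processor as its own core.
        cores.add((fields.get("physical id", "0"), fields.get("core id", fields["processor"])))
    return logical, len(cores)


if __name__ == "__main__":
    print(count_cpus(SAMPLE_CPUINFO))  # (4, 2)
    if os.path.exists("/proc/cpuinfo"):
        with open("/proc/cpuinfo") as f:
            print(count_cpus(f.read()))
```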
