Advanced scheduling requires defining a child vnode for each GPU.
I have two nodes, each with 18 CPUs and 4 GPUs. Is it necessary to create vnodes to use the advanced GPU scheduling resource call limitations? I just want to treat the node as a whole without subdividing it.
There is no need to create vnodes to schedule jobs onto a host with multiple GPU cards.
The user can submit jobs and pick specific GPUs by using the CUDA_VISIBLE_DEVICES variable.
If you would like GPU device isolation, then you would need to use the cgroups hook, and it will take care of mapping the correct CUDA_VISIBLE_DEVICES value and/or UUID of the GPU card.
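To check from inside a job which GPUs it has been given, the payload can read CUDA_VISIBLE_DEVICES itself. The following is only a minimal sketch; the helper name `visible_gpus` is made up for illustration, and the only assumption is that the variable holds a comma-separated list of GPU indices or UUIDs.

```python
import os


def visible_gpus(env=None):
    """Return the GPU identifiers listed in CUDA_VISIBLE_DEVICES, in order.

    An empty list means the variable is unset or empty in the given
    environment; entries may be indices ("0") or UUIDs ("GPU-...").
    """
    env = os.environ if env is None else env
    raw = env.get("CUDA_VISIBLE_DEVICES", "")
    return [item.strip() for item in raw.split(",") if item.strip()]


if __name__ == "__main__":
    gpus = visible_gpus()
    print(f"{len(gpus)} GPU(s) listed for this job: {gpus}")
```

Without the cgroups hook, the variable is whatever the user exported before submitting; with the hook, the value is set for the job, as described above.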
1. Why does pbs_cgroups automatically create vnodes for me when I use it, and also reduce the available memory and CPUs on this node? Are vnodes necessary when using cgroups?
2. When I use -l select=1:ngpus=1, the job also uses 1 CPU at the same time. How can I use only a GPU without a CPU?
Please check the documentation on vnode_per_numa_node and allow_zero_cpus.
These settings should help you achieve your requirement.
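As a rough illustration of the two settings, here is a sketch that patches an exported hook configuration. The JSON excerpt is hypothetical and heavily shortened, and the placement of allow_zero_cpus under cgroup.cpuset is an assumption on my part, so please compare it against your own exported pbs_cgroups.json before relying on it. The function name `treat_node_as_whole` is made up.

```python
import json

# Hypothetical, shortened excerpt of an exported pbs_cgroups.json.
# A real file contains many more keys; key placement here is an assumption.
EXPORTED = """
{
  "vnode_per_numa_node": true,
  "cgroup": {
    "cpuset": {"enabled": true, "allow_zero_cpus": false}
  }
}
"""


def treat_node_as_whole(config):
    """Return a copy with per-NUMA vnodes turned off and zero-CPU jobs allowed."""
    updated = json.loads(json.dumps(config))  # deep copy via JSON round-trip
    updated["vnode_per_numa_node"] = False
    cpuset = updated.setdefault("cgroup", {}).setdefault("cpuset", {})
    cpuset["allow_zero_cpus"] = True
    return updated


config = treat_node_as_whole(json.loads(EXPORTED))
print(json.dumps(config, indent=2))
```

The patched file would then need to be imported back into the hook; the original dictionary is left untouched so the two versions can be compared.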
Thanks for your reply, Mr. adarsh. I set vnode_per_numa_node to false, and it no longer creates vnodes automatically, but it still reduced my memory and CPUs. I originally had 4 GB of memory and 4 CPUs. After enabling pbs_cgroups, the available memory of the node was only 3 GB and the available CPUs only 2. How can I solve this?
Thank you @wakaka
Please share your pbs_cgroups.json file.
- Did you remove the vnodes and the natural node after setting vnode_per_numa_node to false,
- and then add the same node again? That would have created only one natural node.
- Also share the actual mem and CPUs available on the node
- and the pbsnodes -av output.
Thank you for sharing your cgroup configuration.
- The memory subsystem is set to false,
- so I am not sure how the memory reduction can happen.
reserved_amount: 1GB. This reserved memory decreases the resources_available.mem that MoM advertises to the server as being available for each vnode, and also reduces the amount of memory the cgroups hook will assign to jobs.
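The arithmetic behind the numbers reported in this thread is simple. This is only a sketch that treats the reservation as a fixed amount subtracted from physical memory; the function name is made up, and any percentage-based reservation the hook may also support is ignored here.

```python
def advertised_mem_gb(physical_gb, reserved_gb):
    """Memory left for jobs after a fixed reservation is subtracted (never negative)."""
    return max(physical_gb - reserved_gb, 0)


# Values from this thread: 4 GB physical memory, 1 GB reserved.
print(advertised_mem_gb(4, 1))  # 3
```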
Also, I am not sure how ncpus got decreased.
It is probably better to disable the cgroups hook, delete the node, and add the node again. Make sure it displays the correct mem and CPUs. Then enable the cgroups hook and check the configuration of your node.
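The reset sequence above could be scripted roughly as follows. This sketch only builds and prints the commands rather than running them; the node name `node01` and the helper `reset_node_commands` are placeholders, and the commands would need to be run with PBS manager privileges.

```python
import shlex


def reset_node_commands(node, hook="pbs_cgroups"):
    """Build the command sequence: disable hook, re-create node, check, re-enable, check."""
    return [
        ["qmgr", "-c", f"set hook {hook} enabled = false"],
        ["qmgr", "-c", f"delete node {node}"],
        ["qmgr", "-c", f"create node {node}"],
        ["pbsnodes", "-av"],  # confirm mem and ncpus look right without the hook
        ["qmgr", "-c", f"set hook {hook} enabled = true"],
        ["pbsnodes", "-av"],  # compare against the earlier output
    ]


for cmd in reset_node_commands("node01"):  # "node01" is a placeholder host name
    print(shlex.join(cmd))
```

Printing first makes it easy to review the sequence; each list could be passed to subprocess.run once it looks right.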
Thank you very much for your reminder, Mr. adarsh. The memory problem was due to the reserved_amount setting, and the CPU count problem was due to hyperthreading.
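For anyone hitting the same CPU count, the hyperthreading effect can be sketched like this. It assumes the hook counts one CPU per physical core when hyperthreads are not used, and that the host has 2 hardware threads per core; the function and parameter names are made up for illustration, and the hook setting that controls this should be checked in the documentation.

```python
def schedulable_cpus(logical_cpus, threads_per_core, use_hyperthreads):
    """CPUs offered to jobs: every hardware thread, or one per physical core."""
    if use_hyperthreads:
        return logical_cpus
    return logical_cpus // threads_per_core


# Values from this thread: 4 logical CPUs, assumed 2 threads per core.
print(schedulable_cpus(4, 2, use_hyperthreads=False))  # 2
```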