OK, I know the documentation describes two approaches:
Basic scheduling: you declare an ngpus resource, but it is of limited use, since each job exclusively uses a complete node no matter how many GPUs are requested. A dedicated queue would work almost the same way.
Advanced scheduling: you declare one vnode per GPU, but in this case, is it possible to qsub -l select=1:ncpus=1:ngpus=2?
In my cluster I have nodes with 4 GPUs. I want several individual jobs to each be able to book one GPU on a given node, but also to let a job book the entire node easily. Is this possible?
*** Edit ***
What I want is to be able to do qsub -l select=1:ncpus=1:ngpus=2, or to run qsub -l select=1:ncpus=1:ngpus=1 twice on the same node.
I would recommend that you look into the cgroup hook. If I recall correctly, it identifies nodes with GPUs and can set the resources on the server. It also makes sure that jobs that request GPUs have access to them, and that jobs that only request cores cannot access the GPUs.
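As a rough sketch (the hook ships as pbs_cgroups in 18/19, and option names may differ between versions), enabling it and adjusting its configuration looks something like:

    qmgr -c "set hook pbs_cgroups enabled = true"
    qmgr -c "export hook pbs_cgroups application/x-config default" > pbs_cgroups.json
    # edit pbs_cgroups.json (e.g. the devices section), then re-import it
    qmgr -c "import hook pbs_cgroups application/x-config default pbs_cgroups.json"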
We've had problems in the past trying to use cgroups with PBS. We're still on PBS 13, and with the hook we had and CentOS 7, things did not work properly.
Isn't there some hook to set CUDA_VISIBLE_DEVICES or something like that if we don't want to use cgroups?
If you have 2 GPU cards on the system, create a version 2 (v2) vnode configuration file that creates 2 vnodes, and assign CUDA_VISIBLE_DEVICES=0 to vnode[0] and CUDA_VISIBLE_DEVICES=1 to vnode[1] using a runjob hook.
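For illustration, a minimal v2 vnode definition for a host with 2 GPUs might look like the following (the hostname gpunode01 and the ncpus values are made up; it would be loaded on the MOM with something like pbs_mom -s insert gpu_vnodes <file>):

    $configversion 2
    gpunode01[0]: resources_available.ncpus = 8
    gpunode01[0]: resources_available.ngpus = 1
    gpunode01[1]: resources_available.ncpus = 8
    gpunode01[1]: resources_available.ngpus = 1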
You need to write an execjob_launch hook (to set the correct CUDA_VISIBLE_DEVICES environment variable by querying the underlying GPU cards) and an execjob_end hook (to clean up the value assigned to CUDA_VISIBLE_DEVICES).
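As a minimal sketch of the execjob_launch side (assuming vnodes are named <host>[n] and that the vnode index n maps one-to-one to the GPU device index; the hook name cuda_env is made up):

    import pbs
    import re

    e = pbs.event()
    try:
        # Collect GPU indices from the vnodes assigned to the job on this host,
        # assuming each vnode name ends in "[<n>]" and n is the GPU device number.
        gpu_ids = []
        for vname in e.vnode_list.keys():
            m = re.search(r"\[(\d+)\]$", vname)
            if m:
                gpu_ids.append(m.group(1))
        if gpu_ids:
            e.env["CUDA_VISIBLE_DEVICES"] = ",".join(sorted(gpu_ids, key=int))
        e.accept()
    except Exception as err:
        pbs.logmsg(pbs.LOG_DEBUG, "cuda_env hook: %s" % err)
        e.accept()

It would be installed with something like:

    qmgr -c "create hook cuda_env"
    qmgr -c "set hook cuda_env event = execjob_launch"
    qmgr -c "import hook cuda_env application/x-python default cuda_env.py"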
It may or may not work, depending on which version of the cgroups hook is used with version 13. The latest 19.1 cgroups hook, for example, adds handling for the new execjob_resize event, which will cause errors when used with version 18 or earlier, where that event does not exist in the server. Further, PBS Pro 13.x used Python 2.5, while later releases use 2.7, and there are some Python 2.7-isms in more recent revisions of the cgroups hook.
While these are not insurmountable problems, there may be other unknown issues; this combination is not something that is tested. Your best bet is to upgrade to 18/19 to get a tested integration.
I'm also trying to configure a single-node GPU cluster with 16 GPUs for advanced GPU scheduling; at the moment I'm reading through the Big Book.
I've summarized my understanding below. Before I start the configuration, I would like to verify that I'm doing the right thing. Please correct me if I'm wrong:
4. Run jobs
CUDA_VISIBLE_DEVICES will be set automatically by the PBS server, and multiple users should be able to run jobs as long as the GPU request can be satisfied.
In section 3.2, is it correct that the GPUs must be numbered from 0 to 15?