I have 2 GPUs (a Titan and a V100) in my system, with PBS 19 and CentOS 7 installed.
So I have created 2 vnodes: one vnode with the V100, which is part of gpuq, and another vnode which is part of cpuq. But when we run a job in gpuq, the job uses both GPUs.
So I have set CUDA_VISIBLE_DEVICES=1 in the job submission script, and everything works fine.
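For reference, the workaround is just a line near the top of the job script like this (the #PBS directives and application name are only illustrative):

#!/bin/bash
#PBS -q gpuq
#PBS -l select=1:ncpus=1

# expose only device 1 (the V100 on our node) to the application
export CUDA_VISIBLE_DEVICES=1

./my_gpu_app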
How can we move this configuration into the vnode creation itself, so that users do not have to set the CUDA option in their scripts?
But after updating the parameters accordingly, when we try to run the job we still observe that it tries to use both GPUs.
pbsnodes -v c08[1]
c08[1]
Mom = c08
Port = 15002
pbs_version = 19.1.3
ntype = PBS
state = free
pcpus = 8
resources_available.arch = linux
resources_available.gpu_id = gpu1
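For completeness, the gpuq vnode was created with a version 2 vnode definition file roughly like the one below (trimmed to the relevant lines; gpu_id is our custom string resource) and loaded with pbs_mom -s insert:

$configversion 2
c08[1]: resources_available.ncpus = 8
c08[1]: resources_available.gpu_id = gpu1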
In the job log we find the following:
On host c08 2 GPUs selected for this run.
Mapping of GPU IDs to the 6 GPU tasks in the 6 ranks on this node:
PP:0,PP:0,PP:0,PP:1,PP:1,PP:1
Can you please have a look and let us know how we can restrict access to gpu0?
Could you please share the job script that you are trying to run?
With the above configuration you would need a bit more customisation in the Mom hooks: find out which gpu_id (i.e. which GPU device, as shown by nvidia-smi) the job has landed on, and then set the environment variable CUDA_VISIBLE_DEVICES to 0 or 1 based on the ngpus request in the qsub select statement.
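As a starting point, here is an untested sketch of an execjob_launch Mom hook. It assumes ngpus is a consumable custom resource that ends up in the job's Resource_List, and it simply hard-codes device 1 (the V100) rather than mapping gpu_id dynamically; the hook name and file name are placeholders.

# set_cuda_dev.py - execjob_launch hook (sketch, not tested)
import pbs

e = pbs.event()
job = e.job

# ngpus is assumed to be a custom consumable resource summed into the
# job-wide Resource_List; it is None if the job did not request GPUs
ngpus = job.Resource_List["ngpus"]

if ngpus is not None and int(ngpus) > 0:
    # expose only device 1 (the V100 in this example) to the job
    e.env["CUDA_VISIBLE_DEVICES"] = "1"

e.accept()

It would be installed with something like qmgr -c "create hook set_cuda_dev event=execjob_launch" followed by qmgr -c "import hook set_cuda_dev application/x-python default set_cuda_dev.py".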
Also, instead of setting the environment variable each time, can we set it according to the queue, i.e. use device 1 for gpuq and nothing for cpuq?
Regarding the cgroup hook, we do not see any option like "nvidia-smi" in the file. Could you share a sample for this if possible?
If we implement the hooks now, do we need to delete the vnodes again?
You are directly requesting that specific node in your job script.
It is better to let PBS choose that specific GPU resource via custom resources.
Also, I do not see you requesting CPU, GPU, or memory resources from that system, which means the job can use as much as it wants.
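For example, a request that tells PBS exactly what the job may consume would look something like the line below (the values are illustrative, and ngpus assumes a consumable custom resource defined on the server):

qsub -l select=1:ncpus=4:ngpus=1:mem=8gb job.sh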
To be clear on my understanding:
You have one compute node with two GPU cards:
Titan
V100
By any chance, do you know which CUDA_VISIBLE_DEVICES setting selects the V100 and which selects the Titan?
I think we do not have to create vnodes; instead we can use the machine as one natural node and, based on the user's request, set the environment variable (Variable_List) in a hook (a rough sketch is at the end of this reply).
qsub -l select=1:ncpus=1:ngpus=1 # if this is the request, the queuejob hook will see that a GPU is being requested and will set CUDA_VISIBLE_DEVICES to 0 or 1 (based on the answer to question 1 above)
qsub -l select=1:ncpus=1 # if this is the request, the queuejob hook will not set any CUDA_VISIBLE_DEVICES environment variable
If you can share the above details, then it is easy to handle this requirement.
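Here is a rough, untested sketch of such a queuejob hook. It only parses the select specification for an ngpus chunk; the device number is a placeholder until we know which index belongs to the V100, and the file name is arbitrary.

# cuda_queuejob.py - queuejob hook (sketch, not tested)
import pbs
import re

e = pbs.event()
job = e.job

sel = job.Resource_List["select"]
sel_str = str(sel) if sel is not None else ""

# look for an ngpus=<n> chunk in the select specification
match = re.search(r"ngpus=(\d+)", sel_str)

if match and int(match.group(1)) > 0:
    # placeholder: use whichever index nvidia-smi reports for the V100
    job.Variable_List["CUDA_VISIBLE_DEVICES"] = "1"

e.accept()

If you also want this tied to the queue (device 1 for gpuq, nothing for cpuq), the same hook could additionally check the job's destination queue before setting the variable.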
When I run nvidia-smi, the Titan is 0 and the V100 is 1.
As suggested, I ran my job using qsub -l select=1:ncpus=1:ngpus=1, but it still takes both GPUs, as shown below:
On host c08 2 GPUs selected for this run.
Mapping of GPU IDs to the 6 GPU tasks in the 6 ranks on this node:
PP:0,PP:0,PP:0,PP:1,PP:1,PP:1
But when CUDA_VISIBLE_DEVICES is used, the job uses only one GPU:
On host c08 1 GPU selected for this run.
Mapping of GPU IDs to the 6 GPU tasks in the 6 ranks on this node:
PP:0,PP:0,PP:0,PP:0,PP:0,PP:0
Can you please share a sample cgroup hook that is working fine?