Hello Everyone,
I have run into a difficult situation and hope someone can help me out.
I have 48 GPUs spread across 6 compute nodes (each node has 8 GPUs). Once I use qsub to submit 48 GPU jobs, any jobs submitted later are queued. However, while the 48 GPU jobs are running, my total CPU usage is only 20%. I don’t want to waste my CPU resources.
Is there some way to solve this awkward situation? While the GPU jobs are running, I would like to still be able to qsub CPU jobs at the same time to maximize CPU usage.
[A]: You can oversubscribe your CPU resources, i.e. advertise more CPUs than are actually available on the node, using the command below:
qmgr -c "set node NODENAME resources_available.ncpus=48"
48 is just an arbitrary count; please use a count that will maximise CPU utilisation.
Have you considered using a queuejob hook to manipulate the job submission?
Test to see if you can submit a job with ncpus=0, i.e., qsub -l select=1:ncpus=0:ngpus=4 jobscript
I am assuming that you are using ngpus for requesting the GPU resource…
If that works, then you could have a queuejob hook parse the select statement and if it detects ngpus, then zero out the ncpus.
Or, if you have hybrid jobs (CPU and GPU), you could introduce a custom resource that describes the job type (e.g. jtype) with valid values of cpu, gpu and hybrid. If the jtype is gpu, then zero out the ncpus.
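A rough sketch of the first queuejob hook idea (detect ngpus in the select statement, then zero out ncpus) might look like the following. It assumes that ncpus=0 is accepted on your system, as suggested in the test above, and that select statements are plain "+"-separated chunks with no quoted values. The helper name zero_ncpus_for_gpu_chunks is made up for illustration. The pbs module only exists inside the PBS hook interpreter, so the import is guarded and the string rewriting can be tried on its own first.

```python
def zero_ncpus_for_gpu_chunks(select):
    """Return the select string with ncpus=0 on every chunk that requests GPUs.

    Hypothetical helper, not part of PBS. Assumes chunks are separated by "+"
    and resources by ":", with no quoted values containing those characters.
    """
    new_chunks = []
    for chunk in select.split("+"):
        parts = chunk.split(":")
        # A chunk counts as a GPU chunk if it asks for a non-zero ngpus.
        requests_gpu = any(
            p.startswith("ngpus=") and p.split("=", 1)[1] not in ("", "0")
            for p in parts
        )
        if requests_gpu:
            # Drop any existing ncpus request and set it to 0 explicitly
            # (assumption: otherwise a default ncpus may be applied per chunk).
            parts = [p for p in parts if not p.startswith("ncpus=")]
            parts.append("ncpus=0")
        new_chunks.append(":".join(parts))
    return "+".join(new_chunks)


try:
    import pbs  # only importable when run as a PBS hook
except ImportError:
    pbs = None

if pbs is not None:
    e = pbs.event()
    job = e.job
    sel = job.Resource_List["select"]
    if sel is not None:
        # Rewrite the select statement before the job is queued.
        job.Resource_List["select"] = pbs.select(
            zero_ncpus_for_gpu_chunks(str(sel))
        )
    e.accept()
```

With this, select=1:ncpus=4:ngpus=4 would be rewritten to 1:ngpus=4:ncpus=0, while a CPU-only chunk such as 2:ncpus=8 is left alone. The jtype variant would follow the same pattern, except that the hook would check the custom resource instead of looking for ngpus.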