Low GPU utilization with PBS job

I am running a MATLAB script for deep learning (a convolutional neural network) on the uni HPC cluster as a PBS job, using a compiled standalone application. Currently I am facing a real problem and I don't know the reason(s) behind it:
I ran the same script (on the same dataset) a couple of times before using the GPU and it was pretty fast, but now it seems to be running on the CPU because it is so slow and takes a very long time. For instance, one earlier run completed 90 epochs in 40 hours, whereas now it takes 48 hours to finish only 2 epochs. The IT guy told me that my job is running on a GPU node but that the GPU utilization is very low, zero most of the time. One more thing: I haven't changed anything in the wrapper shell script, nor any CPU-versus-GPU choice in the preferred execution environment of the original MATLAB script. I am quite new to PBS and have very limited access to the cluster (a user with just enough privileges to run my scripts). The cluster is running CentOS 6.
Here is a sample of my wrapper script:

#!/bin/bash
#PBS -l nodes=1:ppn=1,walltime=190:00:00
#PBS -q gpu
#PBS -m abe

export CUDA_VISIBLE_DEVICES=$(cat /tmp/$PBS_JOBID/gpu)

/export/home/2164104a/run_withflip2.sh /export/home/2164104a/MATLAB2017b_Compiler_Runtime/v93/

I would be very grateful if anyone who has come across such a problem before could give me some advice.

cheers

Welcome to the community. As for your question, can you provide more information on the jobs that ran fast? Was it the exact same script? Did the job run fast on a CPU node, and is it now running slow on a GPU node?

Thank you a lot for your response. The same script that used to run quite fast is now running far too slowly (as if it were running on a CPU, not a GPU). So in summary, it is the exact same script on the exact same GPU node.
Please let me know if you need any further information.

Here is what I would check, in rough order from what I think is most likely to least likely:

  1. PBS always sets OMP_NUM_THREADS, and some applications behave differently, possibly slower, when it is set. Try unsetting it in the job script before the application is launched to see if this makes a difference (see the sketch after this list).

  2. Check the kernel limits on the processes (STACK, MEMLOCK, etc.) both inside the job environment and in the environment where it performs better.

  3. “Other” environment variables can have an effect as well, but beyond OMP_NUM_THREADS and TMPDIR it is really application specific. Usually in that case something is missing from the job environment that the application depends on. You can try submitting with qsub -V from the environment where performance is good as a quick test to see if this matters and narrow it down from there if it does.

  4. I have seen application behavior problems because of what $TMPDIR is set to inside of the job. Again, compare $TMPDIR in both environments and see if it matters.

  5. I have never actually seen the mom polling tank an application's performance, but one could adjust the mom's $min_check_poll and $max_check_poll so that polling is less frequent.
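
If it helps, here is a rough sketch of what I mean for points 1 through 4. This is just an illustration (the output file names are made up); add it to the wrapper before the application is launched, run the same commands by hand in the environment where performance was good, and diff the results:

# Sketch: capture the job environment for later comparison (file names are examples)

# 2./3./4. Record the kernel limits and the full job environment (including TMPDIR)
# as seen from inside the job
ulimit -a > "$HOME/job_${PBS_JOBID}_ulimits.txt"
env | sort > "$HOME/job_${PBS_JOBID}_env.txt"
echo "TMPDIR inside the job is: $TMPDIR"

# 1. Then unset the thread count PBS injects, in case the application reacts badly to it
unset OMP_NUM_THREADS

For point 3, the quick test is simply qsub -V (with whatever your submission script is called) from a shell where the script runs fast, so the whole submission environment is exported into the job.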

Disclaimer: I don’t remember specifically ever looking into anything like this when a GPU was involved.
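
That said, since the reported GPU utilization is near zero, it may also be worth confirming that the job can actually see a GPU at all before MATLAB starts. Something along these lines in the wrapper (assuming nvidia-smi is on the PATH of the GPU node; the log file name is just an example) would show whether CUDA_VISIBLE_DEVICES is populated and let you watch utilization while the job runs:

# Confirm the job sees the GPU it was assigned
echo "CUDA_VISIBLE_DEVICES is set to: '$CUDA_VISIBLE_DEVICES'"

# One-off snapshot of the device state from inside the job
nvidia-smi

# Optionally, log utilization and memory use every 60 seconds in the background
nvidia-smi --query-gpu=timestamp,utilization.gpu,memory.used --format=csv -l 60 \
    > "$HOME/job_${PBS_JOBID}_gpu.log" &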

HTH.

Thanks a lot for your reply @scc :ok_hand: