Hello community!
I have a problem when running my jobs as a job array: subjobs take noticeably longer than the same jobs submitted individually. For instance, there is one job which, submitted as a single job, takes 43 seconds:
It looks like your program is using more than one CPU (multi-threaded). In your first example, you used 00:11:04 of CPU time in 00:00:43 walltime, which comes out to about 16 CPUs worth.
However, you requested only ncpus=1. The result is that PBS overcommitted the CPUs when it started multiple array subjobs at once. See if you get the expected results by setting ncpus=16 on your qsub request.
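The 16-CPU figure comes from dividing CPU time by walltime. A quick sketch of that arithmetic, using the times from the example above:

```shell
# CPU time 00:11:04 = 664 s; walltime 00:00:43 = 43 s
cputime=$((11 * 60 + 4))
walltime=43
# Effective parallelism = CPU time / walltime
cpus=$(awk -v c="$cputime" -v w="$walltime" 'BEGIN { printf "%.1f", c / w }')
echo "$cpus"   # about 15.4, i.e. roughly 16 CPUs worth
```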
Hi Adarsh, dtalcott,
I made tests without using PBS.
A single run of the job in question took 34 sec.
With 9 other subjobs running in parallel, it took 2 min 42 sec.
With 5 other subjobs running in parallel, it took 2 min 33 sec.
dtalcott, yes, you must be right. I did not mention that I also ran one 10-subjob array test requesting 20 cpus, and there the job in question took about 1 min.
Using 16 cpus:
The program I run is a MATLAB application, built for the MATLAB Runtime (RTE), running inside a Singularity container.
So, the call is:
$ singularity run container params ...
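As a side note (not from the thread): one way to confirm that the containerized application really is multi-threaded is to check its thread count in /proc while it runs. The background-job lines are a hypothetical sketch; for illustration the snippet inspects the current shell instead:

```shell
# Start the container in the background and capture its PID, e.g.:
#   singularity run container params ... &
#   pid=$!
# For illustration, inspect the current (single-threaded) shell instead:
pid=$$
# The Threads field in /proc/<pid>/status counts OS threads in the process
awk '/^Threads/ { print "threads:", $2 }' "/proc/$pid/status"
```

If the reported thread count is well above 1 under load, the job needs more than ncpus=1.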
But I still don’t understand the overcommitting you described:
“The result is that PBS overcommitted the CPUs when starting multiple array subjobs at once.”
Could it be that PBS allocated as many CPUs as needed when running the single job (ignoring my ncpus=1 request), so that in 43 seconds of walltime the CPU time came to 11 min 4 sec?
But when running the array job, PBS did not do this allocation by itself?
Not exactly. Unless you are running on a cpuset host or using the cgroups hook, PBS does not restrict which CPUs your job can use. It assumes you are telling the truth with ncpus. So, when your job says it needs just one CPU, PBS figures it can run as many array subjobs at once as you have CPUs. The jobs actually need more than one CPU, so the operating system has to share the CPUs among the jobs, slowing everything down.
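To make the arithmetic concrete (a sketch; the 16-CPU node size is an assumption, not stated in the thread): if 10 subjobs each declare ncpus=1 but actually spawn around 16 threads, the node is oversubscribed roughly tenfold, which is why every subjob slows down:

```shell
node_cpus=16        # assumed node size
subjobs=10          # array subjobs started at once
threads_per_job=16  # observed effective parallelism per subjob
demanded=$((subjobs * threads_per_job))
factor=$((demanded / node_cpus))
echo "demand: $demanded threads on $node_cpus CPUs (~${factor}x oversubscribed)"
```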
Note: You can have PBS check for excessive CPU use while the job is running with the $enforce cpuaverage MoM configuration option. But that just kills the offending job, wasting the work done so far.
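For reference, that check is enabled in MoM's configuration file (typically PBS_HOME/mom_priv/config). A minimal sketch; the tuning values shown are the documented defaults, not a recommendation:

```
# PBS_HOME/mom_priv/config
$enforce cpuaverage
# Optional tuning (values shown are the documented defaults):
$enforce average_percent_over 50
$enforce average_cpufactor 1.025
$enforce average_trialperiod 120
```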
Also, the timing numbers still don’t come out right. It looks like, even with ncpus=16, your jobs are trying to use more CPUs than requested. Could you try something like the following on your qsub:
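The original suggestion is not preserved in this excerpt. A plausible sketch, assuming the intent was to match ncpus to the real thread count and use exclusive placement so concurrently running subjobs cannot share a node's CPUs (the resource values are illustrative):

```
#PBS -J 1-10
#PBS -l select=1:ncpus=16
#PBS -l place=excl

singularity run container params ...
```

With place=excl, each subjob gets a node to itself, so any remaining slowdown would point at the application rather than CPU sharing.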