In version 19.1.1, it appears that array jobs no longer enter an execution queue unless that queue’s limits are high enough to accommodate all of the subjobs in the array from the outset. This was definitely not the case in version 18.1.3. Has something changed in how array jobs are handled in version 19? I’ve looked through the manuals but can’t find any clues there. Is there perhaps a new queue configuration setting that I need to set for array jobs?
For example:
An array job consisting of 100 subjobs, each requesting 2 cpu and 1 gb mem, enters the routing queue “workq”. The “workq” queue is configured to route all jobs requesting 2 cpu and 1 gb mem to an exec queue called “solo”. Limits on the “solo” queue allow each user to run no more than 50 jobs at any given time in this queue. In version 18.1.3, 50 subjobs from the array job would enter the “solo” queue and begin running. As they completed, additional subjobs from the array would trickle in from the routing queue until all subjobs were complete. At no time would the user have more than 50 jobs running in the “solo” queue. In version 19.1.1, the entire array job remains queued in the routing queue and never makes it to the execution queue. I can force the array job to start by manually issuing a move command, e.g. “qmove jobid[] solo”, at which point 50 subjobs start without issue and the others trickle in as expected. Alternatively, I can increase the limits on the “solo” queue to allow at least 100 jobs per user, and then the array job starts immediately without the need for a “qmove”.
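For reference, here is a rough sketch of that setup in qmgr terms. The queue names come from the example above, but the routing attributes, resource bounds, and limit values are a simplified approximation rather than a copy of our actual configuration:

create queue workq queue_type = route
set queue workq route_destinations = solo
set queue workq enabled = true
set queue workq started = true
create queue solo queue_type = execution
set queue solo resources_max.ncpus = 2
set queue solo resources_max.mem = 1gb
set queue solo max_run = [u:PBS_GENERIC=50]
set queue solo enabled = true
set queue solo started = true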
This might be helpful: https://github.com/altairengineering/pbspro-private/pull/289
Initially, array jobs were considered as one job (when an array job started, you would see many subjobs effectively bypassing the limits); now each of the subjobs is considered a standard job, hence the behaviour you are seeing.
Thanks for your quick reply. If I click on that link I get a 404. However, I’m not sure I follow what you are writing. It seems that now, under version 19, each subjob is no longer considered a standard job, which is why the subjobs will not enter the queue if their combined total exceeds the queue’s max run job limits. In other words, in version 19 all subjobs appear to be combined and considered as one big job (from the queueing point of view). Is there a way to revert to having each subjob considered on its own, so that users can submit an array of 1000 subjobs yet have only 50 subjobs running at any one time (or whatever the queue’s max run limits of the day happen to be)?
Thank you. Your understanding as you have described it above is correct.
I am not sure whether we can revert to the old behaviour with some workaround.
You could consider applying other limits, such as max_run_res.ncpus, to limit the running jobs instead of using max_run.
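For instance, since each subjob in your example requests 2 cpus, something along these lines (the value 100 is only illustrative) would cap a user at roughly 50 concurrently running 2-cpu subjobs by resource consumption rather than by job count:

set queue solo max_run_res.ncpus = [u:PBS_GENERIC=100]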
I’m using both. Are you suggesting I should remove max_run and max_queued? Below are the relevant settings for the queue. I keep “queued” and “run” at the same levels to keep jobs queued up in the routing queue. I am allowing 500 jobs to run in this queue at the same time per user.
set queue c1_solo max_queued = [u:PBS_GENERIC=500]
set queue c1_solo max_queued_res.mem = [u:PBS_GENERIC=4000gb]
set queue c1_solo max_queued_res.ncpus = [u:PBS_GENERIC=500]
set queue c1_solo max_run = [u:PBS_GENERIC=500]
set queue c1_solo max_run_res.mem = [u:PBS_GENERIC=4000gb]
set queue c1_solo max_run_res.ncpus = [u:PBS_GENERIC=500]
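For what it’s worth, assuming the subjobs still request 2 cpu and 1 gb each as in my earlier example, the max_run_res.ncpus limit of 500 should be the binding constraint at about 250 concurrently running subjobs per user (500 cpus / 2 cpus per subjob), well before max_run = 500 jobs or max_run_res.mem = 4000gb (4000 subjobs at 1 gb each) would come into play.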