Nodes with suspended job seen as free by PBS


I am having difficulties with job suspension behavior.

The express queue jobs are suspending the regular jobs as intended. The problem arises when a large multi-node regular job is suspended by a small express job: the nodes where the regular job is now suspended are seen as free by PBS (except the node where the express job is running), so other regular jobs have no trouble occupying them. It can even happen that the original large job never gets enough resources back to resume.

Perhaps an alternative symptom of this problem:
pbsnodes -vSja
returns 0 for number of suspended jobs on each node even when qstat -a shows suspended jobs. This fact might be the reason why PBS is putting jobs on those nodes.

I have been extensively searching through the administrator’s guide, but found nothing helpful so far.

Our PBS version is 19.1.3

Any help would be greatly appreciated.


What you need to do is make sure your low-priority job is resumed when the high-priority job is over. To do this, run qmgr -c 's sched sched_preempt_enforce_resumption=true'
This adds all preempted jobs to the calendar, so they will resume when the high-priority job finishes. It is not enabled by default because it can hurt the scheduler's performance: every preempted job is added to the calendar, and that can be slow.

It won’t be bad for a handful of jobs, but if you preempt hundreds or thousands of jobs, the scheduler will slow down to a snail’s pace.
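
For reference, here is a minimal sketch of the same setting written with the full-word qmgr syntax and plain ASCII quotes, plus a way to look at the scheduler attributes afterwards. The two function names are only illustrative helpers, not part of PBS. It assumes you have manager privileges and are changing the default scheduler.

```shell
# Illustrative helpers (names are made up for this example, not PBS commands).

# Ask the scheduler to add preempted jobs to the calendar so they get resumed.
# Same as the abbreviated form: qmgr -c 's sched sched_preempt_enforce_resumption=true'
enable_enforced_resumption() {
    qmgr -c "set sched sched_preempt_enforce_resumption = true"
}

# Print the scheduler attributes so the new value can be checked by eye.
show_sched_settings() {
    qmgr -c "list sched"
}
```

Calling enable_enforced_resumption followed by show_sched_settings should list sched_preempt_enforce_resumption among the scheduler attributes. Setting it back to false undoes the change if the scheduling cycle becomes too slow.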

Thank you!
This helped a lot and it got me to the desired behavior.

It still puzzles me though, that the suspended job count on nodes with suspended jobs is zero (pbsnodes -vSja). But I can live with that.
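
In case it helps anyone cross-checking the same thing, below is a rough sketch of how to see which hosts hold suspended jobs without relying on the pbsnodes count. It assumes that suspended jobs are reported in state S and that qstat -f prints the usual "Job Id", job_state and exec_host lines. The function names are hypothetical helpers.

```shell
# Illustrative helpers (names are made up for this example, not PBS commands).

# List the IDs of all jobs currently in state S (suspended).
list_suspended_jobs() {
    qselect -s S
}

# For each suspended job, show its ID, state, and the hosts it was running on.
# Note: qstat -f may wrap a long exec_host value onto continuation lines;
# this simple filter only prints the first line of it.
show_suspended_job_hosts() {
    for job in $(list_suspended_jobs); do
        qstat -f "$job" | grep -E 'Job Id|job_state|exec_host'
    done
}
```

Running show_suspended_job_hosts and comparing the exec_host values against the pbsnodes output shows which nodes are reported as free while still holding a suspended job.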