I am having difficulties with job suspension behavior.
Express queue jobs suspend regular jobs as intended. The problem arises when a large multi-node regular job is suspended by a small express job: PBS treats the nodes holding the suspended job as free (except the ones where the express job is running), so other regular jobs are scheduled onto them without restriction. It can even happen that the original large job never gets enough resources back to resume.
Perhaps an alternative symptom of this problem:
returns 0 for the number of suspended jobs on each node, even when qstat -a shows suspended jobs. This might be the reason why PBS is placing jobs on those nodes.
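To cross-check the per-node count independently, something like the following sketch could tally suspended jobs per host from the text of qstat -f. It assumes the usual "job_state = S" and "exec_host = node/index*ncpus+..." fields and tab-indented continuation lines; the function name and the sample text are made up for illustration and are not output from our cluster.

```python
import re
from collections import Counter


def suspended_jobs_per_host(qstat_full_text):
    """Count suspended jobs (job_state = S) per host in `qstat -f` style text."""
    # Long attribute values are assumed to wrap onto lines starting with a tab.
    text = qstat_full_text.replace("\n\t", "")
    counts = Counter()
    # Each job record is assumed to start with a "Job Id:" line.
    for record in re.split(r"^Job Id:", text, flags=re.M)[1:]:
        state = re.search(r"^\s*job_state = (\S+)", record, flags=re.M)
        hosts = re.search(r"^\s*exec_host = (\S+)", record, flags=re.M)
        if not state or state.group(1) != "S" or not hosts:
            continue
        # exec_host is assumed to look like "node01/0*16+node02/0*16".
        # Count each job once per host, even if it holds several chunks there.
        for host in {chunk.split("/")[0] for chunk in hosts.group(1).split("+")}:
            counts[host] += 1
    return counts


# Hypothetical sample, shaped like `qstat -f` output.
SAMPLE = """Job Id: 101.server
    job_state = S
    exec_host = node01/0*16+node02/0*16
Job Id: 102.server
    job_state = R
    exec_host = node01/1*4
"""

if __name__ == "__main__":
    for host, n in sorted(suspended_jobs_per_host(SAMPLE).items()):
        print(host, n)
```

On the real system the input would be the captured output of qstat -f rather than SAMPLE. If this count is non-zero for a node while the node-level report shows 0, that would support the suspicion that suspended jobs are not being accounted for on the nodes.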
I have searched the administrator’s guide extensively, but have found nothing helpful so far.
Our PBS version is 19.1.3.
Any help would be greatly appreciated.