In certain cases the scheduler can see that a job will never run: for example, when the job requests more ncpus per node than any node in the system has, or more nodes than exist, etc. The number of nodes in a placement set (pcat, aka node group) may also be the cause of this.
In these cases, the scheduler sets the job comment to include something along the lines of:
comment = Can Never Run: can't fit in the largest placement set, and can't span psets
comment = Not Running: Insufficient amount of resource:
but the job itself just hangs around in the queue indefinitely.
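For reference, that comment can be pulled out of the full job display programmatically. A small illustration (the job below is a fabricated sample standing in for real `qstat -f <jobid>` output, and the `comment = ` attribute layout is an assumption about how qstat formats it):

```shell
#!/bin/sh
# Fake `qstat -f` output for illustration; on a live system this would be
# the real output of `qstat -f <jobid>`.
qstat_f_sample() {
    printf '%s\n' \
        'Job Id: 1234.pbsserver' \
        '    Job_Name = myjob' \
        "    comment = Can Never Run: can't fit in the largest placement set" \
        '    queue = workq'
}

# Strip the attribute prefix to get just the comment text.
qstat_f_sample | sed -n 's/^ *comment = //p'
```

Note that qstat wraps long attribute values across lines, so a very long comment may be truncated by this extraction; that is still enough for matching on the leading "Can Never Run" text.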
On the other hand, if I impose a hard max on, say, the number of nodes (server or queue side) with e.g.
qmgr -c "set server resources_max.nodect = 3"
then qsub will fail (upon job submission) and the job will never enter the queue.
I assume that this is by design. Obviously, a very optimistic user could hope that somebody will install more nodes, or more cores per node, etc. But in our case I might as well have the job be deleted automagically (with some comment and/or log message).
I have not found a setting to do this, i.e. to have the scheduler decide that the job will never run unless more resources are defined, and act on it. The Admin Guide only states that "the job stays queued" (AG-124 §4.6.2), but does not say whether anything can be done about it.
Does anybody know whether that is possible (or whether it is definitely not)?
If it is not possible, then I'll work around it: inspect the comments of queued jobs and explicitly delete jobs that carry such comments. But it would be better if a server/scheduler setting could handle it.
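A minimal sketch of that workaround, assuming standard PBS client commands (qselect, qstat, qdel); the "Can Never Run" prefix match is my own choice based on the comments quoted above, not anything PBS prescribes:

```shell
#!/bin/sh
# Decide whether a scheduler comment marks a job as never runnable.
# The "Can Never Run" prefix is an assumption taken from the comments
# quoted above; adjust the pattern to your scheduler's wording.
never_runnable() {
    case "$1" in
        "Can Never Run"*) return 0 ;;
        *)                return 1 ;;
    esac
}

# Against a live PBS server, the cleanup loop would look roughly like:
#   for jobid in $(qselect -s Q); do
#       comment=$(qstat -f "$jobid" | sed -n 's/^ *comment = //p' | head -n 1)
#       if never_runnable "$comment"; then
#           echo "deleting $jobid: $comment"
#           qdel "$jobid"
#       fi
#   done
```

Run periodically (e.g. from cron), this would clear out never-runnable jobs, though a server-side setting would obviously be cleaner than polling job comments.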