Not in the least. Hopefully, the following blob does not throw you off either.
By default, strict_ordering is set (in the scheduler's `sched_config`) to `strict_ordering: false ALL`.
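For reference, this is the line I mean in the scheduler's `sched_config` (the exact path may differ per install; as far as I know, `ALL` applies the setting to both prime and non-prime time):

```
# in PBS_HOME/sched_priv/sched_config
strict_ordering: true ALL
```

The scheduler re-reads `sched_config` on a SIGHUP, so the change does not require a restart (again, as far as I know).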
I was not aware that this was necessary. So thank you very much - this is a big step for me.
Very good point.
In the test scenario (with `strict_ordering: false ALL`), the third job is allowed to run even with a specified walltime much larger than the first job's. Thus, it runs not because of backfilling, but simply because the scheduler deems it feasible.
I have tried to set `strict_ordering: true ALL`, and that has the effect that the second job gets the `estimated.exec_vnode` attribute. So that is good.
Unfortunately, the scheduler seems to make some kind of hash of it, such that the third job runs anyway. The problem (presently) is that all jobs explicitly ask for one "infiniband" per chunk, and exactly one of these is defined per node (infiniband is an integer resource). However, the scheduler apparently thinks that in the future (at `estimated.start_time`) it may use two of these from each node, thus allocating a four-node job (`-l select=4:infiniband=1`) to only two nodes. The scheduler therefore believes that there is plenty of room left to run the third job (as the second job does not need all the nodes, but just half of them).
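To make the mis-accounting concrete, here is a toy placement check (pure Python, not PBS internals; the helper and names are my own invention) showing that honouring the per-node `infiniband=1` limit forces the four chunks of `-l select=4:infiniband=1` onto four distinct nodes:

```python
def place_chunks(nodes, chunks):
    """Greedily assign each chunk to a node with enough of every
    requested resource remaining.
    nodes:  dict of node name -> dict of available resources
    chunks: list of dicts of requested resources (one per chunk)
    Returns the list of chosen node names, or None if placement fails."""
    avail = {name: dict(res) for name, res in nodes.items()}  # work on a copy
    placement = []
    for chunk in chunks:
        for name, res in avail.items():
            if all(res.get(r, 0) >= amount for r, amount in chunk.items()):
                for r, amount in chunk.items():
                    res[r] -= amount          # consume the resource
                placement.append(name)
                break
        else:
            return None                       # no node can host this chunk
    return placement

# Four nodes, each with exactly one consumable "infiniband".
nodes = {f"node{i}": {"infiniband": 1, "ncpus": 12} for i in range(1, 5)}
# -l select=4:infiniband=1 -> four chunks, each wanting one infiniband.
print(place_chunks(nodes, [{"infiniband": 1}] * 4))
# → ['node1', 'node2', 'node3', 'node4']  (one chunk per node, as it should be)
```

If the future estimate instead allowed two of the resource per node, the same four chunks would "fit" on two nodes, which matches the behaviour I see.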
Even if I change the select statement to the old-school `-l nodes=N`, the scheduler lets the (now long) third job run, as it thinks it can get away with allocating just one cpu per node. This will eventually fail, as the nodes are exclusive to each job: once a job gets a node, no other job can use it until that job is done. In the present case, the third job then locks the nodes, and the scheduler does not realize this until later, when the first job finishes but the second job still cannot run. At this point, `estimated.start_time` is not updated, so it is in the past.
With `strict_ordering` enabled, "top jobs" show up in the sched log for the first time for me, so backfilling is now definitely enabled. But it still does not do a very good job. I would deem this a possible bug in the scheduler, but I would very much like some input from one of you before I start a separate thread to report it.
If I help the scheduler a little bit more, then I can make backfilling work. But the fix depends on our specific cluster layout: we have two flavours of nodes, `io` and `compute`, each with `ncpus=12`. Jobs default to the `io` nodes, but can explicitly ask for `compute`. If I tack `ncpus=12` onto my select statement, then the scheduler realizes that only a single task can go on each node, and then backfilling works. At that point, the `infiniband=1` specification becomes redundant, but if I install nodes with, say, 24 cores, then this hack/fix will no longer work.
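Why the hack works, and why it breaks at 24 cores, can be sketched with a toy per-node packing count (a hypothetical helper of my own, not scheduler code). Counting all resources gives one chunk per node either way; an estimate that effectively ignores the consumable `infiniband` still gives one chunk per 12-core node, but two per 24-core node:

```python
def chunks_per_node(node, chunk, resources):
    """Max copies of `chunk` that fit on one node, considering
    only the listed resources."""
    return min(node[r] // chunk[r] for r in resources)

node12 = {"ncpus": 12, "infiniband": 1}   # current nodes
node24 = {"ncpus": 24, "infiniband": 1}   # hypothetical future nodes
chunk = {"ncpus": 12, "infiniband": 1}    # -l select=N:ncpus=12:infiniband=1

# Correct accounting over all resources:
print(chunks_per_node(node12, chunk, ["ncpus", "infiniband"]))  # → 1
# Accounting that ignores infiniband (what the future estimate seems to do):
print(chunks_per_node(node12, chunk, ["ncpus"]))                # → 1  (hack works)
print(chunks_per_node(node24, chunk, ["ncpus"]))                # → 2  (hack breaks)
```

In other words, `ncpus=12` only saves me because it happens to coincide with the node size; it papers over the infiniband accounting rather than fixing it.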