Not in the least. Hopefully, the following blob does not throw you off either.
Actually not: strict_ordering is set (in sched_config) to strict_ordering: false ALL.
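For context, the relevant part of my sched_config looks roughly like this (the path and exact spacing are from memory, so treat this as a sketch rather than a verbatim dump); as far as I know the scheduler re-reads this file on a restart or a SIGHUP:

    # $PBS_HOME/sched_priv/sched_config (excerpt, approximate)
    strict_ordering: false ALL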
I was not aware that this was necessary. So thank you very much - this is a big step for me.
Very good point.
In the test scenario (with strict_ordering: false ALL), the third job is allowed to run even with a specified wall time much larger than the first job's. So it is allowed to run not because of backfilling, but simply because the scheduler deems it feasible.
I have tried to set strict_ordering: true ALL, and that has the effect that the second job gets the estimated.start_time and estimated.exec_vnode attributes. So that is good.
Unfortunately, the scheduler seems to make some kind of hash of it, such that the third job runs anyway. The problem (presently) is that all jobs explicitly ask for one "infiniband" per chunk, and exactly one of these is defined per node (infiniband is an integer resource). However, the scheduler apparently thinks that in the future (at estimated.start_time) it can take two of these from each node, and so it allocates a four-node job (-lselect=4:infiniband=1) to just two nodes. The scheduler therefore believes there is plenty of room left to run the third job (as the second job does not need all the nodes, just half of them).
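To be explicit about the setup, this is roughly how I believe the resource is defined and requested on our side (node and script names are placeholders, and on older PBS versions the resource definition may live in the resourcedef file instead of qmgr):

    # define infiniband as a host-level consumable integer resource
    qmgr -c "create resource infiniband type=long, flag=nh"
    # each node advertises exactly one of them
    qmgr -c "set node node01 resources_available.infiniband = 1"
    # the jobs request one per chunk
    qsub -l select=4:infiniband=1 job.sh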
Even if I change the select statement to the old-school "-l nodes=N", the scheduler lets the (now long) third job run, as it thinks it can get away with allocating just one cpu per node. This will eventually fail, as the nodes are exclusive to each job: once a job gets a node, no other job can use it until that job is done. In the present case the third job then locks the node, but the scheduler does not realize this until later, when the first job finishes and the third job still cannot run. At that point the estimated.start_time is not updated, so it is in the past. (edit: deleted sentence)
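As far as I understand, the old-style request is translated into one-cpu chunks, which is exactly why the scheduler thinks a single cpu per node is enough; roughly:

    # old-style request
    qsub -l nodes=4 job.sh
    # is treated more or less like
    qsub -l select=4:ncpus=1 job.sh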
With strict_ordering enabled, "top jobs" show up in the sched log for the first time for me, so backfilling is definitely active now. But it still does not do a very good job. I would deem this a possible bug in the scheduler, but I would very much like some input from one of you before I start a separate thread to report it.
If I help the scheduler a little bit more, then I can make backfilling work. But the fix depends on our specific cluster layout: we have two flavours of nodes, compute with ncpus=20 and io with ncpus=12. Jobs default to the io nodes, but can explicitly ask for compute. If I tack ncpus=12 onto my select statement, then the scheduler realizes that only a single task can go on each node, and then backfilling works. At that point the infiniband=1 specification becomes redundant, but if I install nodes with, say, 24 cores, then this hack/fix will no longer work.
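Concretely, the request that makes backfilling behave for me looks something like this (the hard-coded 12 matches the io nodes, which is exactly why it is fragile):

    # works today, but breaks as soon as a node has more than 12 cores
    qsub -l select=4:ncpus=12:infiniband=1 job.sh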
Thanks,
/Bjarne