Qmgr -c 'set server est_start_time_freq ...' fails

buchmann · August 15, 2016, 9:25am

Hi,

According to the Admin Guide (v13), the server should be able to compute “estimated start time” for jobs. However, on my v14.1 installation, qmgr complains when I try to set (or even unset) this feature.

This command is copied verbatim from the guide, but it still fails:

[root]# qmgr -c 'set server est_start_time_freq = 3:00:00'
qmgr: Syntax error
set server est_start_time_freq = 3:00:00

Even unsetting it (which should restore the default) fails:

[root@bifrost1 ~]# qmgr -c 'unset server est_start_time_freq'
qmgr: Syntax error
unset server est_start_time_freq

So I wonder whether this config parameter is available in 14.1?

Has anybody tried to actually use this? Am I doing something obviously wrong here?

Thanks,

/Bjarne

PS: Presently, I want to turn on this feature to check whether a job has any chance to run (within finite time), so that I can delete it otherwise. See the other thread, “Scheduler and jobs that Can Never Run”.

billnitzberg · August 15, 2016, 7:07pm

Hi Bjarne,

Alas, this is one of the very few capabilities that are not part of PBS Pro 14.1, and which we have no plans to release as open source. The team had to do a lot of refactoring/restructuring of the code (transforming from commercial 13.1 to open source 14.1), and this capability didn’t make the cut.

The 14.1 version will still calculate and report estimated start times (and locations) for top jobs, but only for top jobs. What’s not in 14.1 is the separate pbs_est daemon that calculated estimated start times for non-top jobs. Depending on what you are trying to use this information for, perhaps this is good enough.

(FYI, there is a list of capabilities not in 14.1 that are documented in 13.1 on the contributors portal at https://pbspro.atlassian.net/wiki/display/PBSPro/User+Documentation).

Best regards,

bill

buchmann · August 16, 2016, 5:44am

Can you elaborate on whether the feature will be removed completely from PBSpro, or whether it will remain available in some “less free” version in the future? I.e., is it a restricted feature or simply a discontinued one?

With the present product, it seems that I have no way to determine if a job will run “in finite time”. Presently, I cannot (easily) differentiate between a job queued because other jobs presently utilize resources - or queued because the requested resources are not available on the cluster at all (say, a job has requested a node with more cores than available on any of the nodes). In both cases, the job comment is set to

comment = Not Running: Insufficient amount of resource: <RESOURCE>

and I have found no other reliable way to differentiate those jobs. See also thread Scheduler and jobs that Can Never Run

How can I actually get readings for these? I would have expected to find them eg in

qstat -f <JOBID>

or in the scheduler log.

Can I tweak how many “top jobs” estimated start times will be computed for? If yes, then how? The admin guide (AG-133) mentions the use of backfill_depth, but the ref guide (RG-336) denotes backfill_depth as obsolete.
(I assume that the times and locations will update on every scheduler iteration).

An estimation of start time is not crucial for us at this time, but it is important for us to know that a job will eventually run. Does PbsPro have another way to single out jobs, which cannot run given the presently configured resources? (Excluding nodes explicitly set OFFLINE - as that is necessary in order to let jobs default to a particular node group - see

If I cannot get a handle on the start time, I would like to get an answer to the following question (for each job in the queue): If there were no other jobs in the queue, would the present job then be able to run?

Best,

/Bjarne

agrawalravi90 · August 16, 2016, 6:20pm

Hi Bjarne,

"Presently, I cannot (easily) differentiate between a job queued because other jobs presently utilize resources - or queued because the requested resources are not available on the cluster at all (say, a job has requested a node with more cores than available on any of the nodes). In both cases, the job comment is set to

comment = Not Running: Insufficient amount of resource:
and I have found no other reliable way to differentiate those jobs"

Not sure if this is an easy way, but one reliable way to know why a job didn’t run is to look at the scheduler logs. There will be a log message saying “Can Never Run” for jobs which can’t run because the cluster, even if it were not running any jobs at all, doesn’t have enough resources to run them.

bhroam · August 16, 2016, 7:36pm

Hey Bjarne,
The scheduler does its best to determine when a job can never run. It currently works on the complex as a whole and not on a node by node basis. If a job requests more resources than the complex has, the scheduler will figure out the job can never run. While it is possible to determine if a job requested more resources than is available on any node, it would also be slow. It would require looping over all of the nodes. We have customers who have well over ten thousand nodes. Slowing down the normal operation of the scheduler to check for something that happens once in a while is not something we chose to implement.

As for backfill_depth, I am unaware of it being obsolete. I will check with our docs team about it. It is the method of increasing the number of top jobs. Please note that the more top jobs you have, the slower the scheduler will run.

Bhroam

agurban · August 16, 2016, 9:57pm

Hi Bjarne,

The Ref Guide should have said that the backfill scheduler configuration option is deprecated, not the backfill_depth server attribute. Thank you for finding that. I have filed a doc bug for that.

Regards,
-Anne (PBS docs)

buchmann · August 17, 2016, 8:36am

Hi Ravi,

I have tried to look at the scheduler log (with log_filter=0, so everything is logged), and as far as I can see, this is not a reliable way to do it. Unfortunately.

I have two placement sets (node groups) with 2+4 nodes, and jobs are not allowed to span them. If I explicitly (-l nodes=<N>) ask for more nodes than are available in the largest set, then the sched log reads:

<TIME>;0400;pbs_sched;Job;<JOBID>;Placement set nodetype=io is too small: Not enough free nodes available
<TIME>;0400;pbs_sched;Job;<JOBID>;Placement set nodetype=compute is too small: Not enough free nodes available
<TIME>;0040;pbs_sched;Job;<JOBID>;Can't fit in the largest placement set, and can't span placement sets
<TIME>;0040;pbs_sched;Job;<JOBID>;Job will never run with the resources currently configured in the complex

The job gets the comment

comment = Can Never Run: can't fit in the largest placement set, and can't span psets

However, if I ask for more nodes than are configured in total, then “can never run” appears in neither the log nor the job comment.
Sched log:

<TIME>;0080;pbs_sched;Job;<JOBID>;Considering job to run
<TIME>;0040;pbs_sched;Job;<JOBID>;Not enough free nodes available

Job comment:

comment = Not Running: Not enough free nodes available

If I use -l select=<N>:.. then everything gets more complicated, but I have never seen a “Can Never Run” in this case. It is always some kind of Insufficient amount of resource, even if there is only the one job in the queue.
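
For anyone who wants to automate this check, here is a minimal sketch in Python. It assumes the semicolon-separated scheduler log layout seen in the excerpts above (time;event code;daemon;object type;object id;message) and matches only the two phrasings quoted in this thread; the function name and marker list are made up for illustration, and other PBS versions may word the messages differently.

```python
from typing import Iterable, Set

# Phrases quoted in this thread for jobs the scheduler has given up on.
# Assumption: a case-insensitive substring match on these is good enough.
NEVER_RUN_MARKERS = ("will never run", "can never run")


def jobs_marked_never_run(log_lines: Iterable[str]) -> Set[str]:
    """Return the ids of jobs the scheduler logged as unable to ever run.

    Assumed record layout (taken from the log excerpts in this thread):
    time;event code;daemon;object type;object id;message
    """
    job_ids = set()
    for line in log_lines:
        # The message may itself contain semicolons, so split at most 5 times.
        fields = line.rstrip("\n").split(";", 5)
        if len(fields) < 6:
            continue  # not a record in the expected layout
        _time, _code, _daemon, obj_type, obj_id, message = fields
        text = message.lower()
        if obj_type == "Job" and any(marker in text for marker in NEVER_RUN_MARKERS):
            job_ids.add(obj_id)
    return job_ids
```

Feeding it the lines of a scheduler log file (for example via open(...) on whichever log file your site writes) gives the set of job ids to inspect. As noted above, a job that asks for more nodes than exist in total is not flagged this way, so an empty result does not prove that every job can run.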

/Bjarne

buchmann · August 17, 2016, 8:38am

Hi Anne,
You are welcome. Good to see that things propagate back. Thank you for the note.

EDIT:
In the admin guide (v13, §4.8.3.4, page AG-134), it is noted that:

However, it seems that neither the server nor the scheduler accepts the backfill as a parameter.

[root@bifrost1 ~]# qmgr -c 'set server backfill = True'
qmgr: Syntax error
set server backfill = True
                    ^
[root ~]# qmgr -c 'set sched backfill = True'
qmgr: Syntax error
set sched backfill = True
                   ^

Is there a pointer to how I make sure that backfill is running? Presently, I use

set server backfill_depth = 10

but maybe more needs to be added?

/Bjarne

buchmann · August 17, 2016, 8:51am

Hi Bhroam,

Firstly, thanks for answering me once again.

That is understood, and it is why I am presently hoping to get access to the estimated start time of the “top jobs”.

Sure, that is OK. We are at a cluster with a reasonably low number of both nodes and jobs, so I do not expect problems. If things get hairy, then I will limit the total number of jobs in the execution queue, and add a routing queue to hold jobs, which will run later. But typically, we expect only a few handfuls of jobs on the system at any one time.

Can anybody tell me how/if I can read the estimated start time (and estimated vnodes) for “top jobs” in the queue. And possibly if “my” job is a “top job”? (Alternatively a list of the “top jobs”).

Thanks,

/Bjarne

bhroam · August 17, 2016, 8:49pm

Hey Bjarne,
First off:

‘backfill’ is a sched_config option. It’s not set via qmgr. You need to edit the sched_config file and then HUP the scheduler.

Before I answer your question, I’d like to point out a danger in deleting jobs like you want to do. Sometimes the scheduler can determine that jobs can never run because of transient details. One that pops to mind is nodes being down. If enough nodes are down, the scheduler can fail to calculate where to run the job. If you go about deleting these jobs, it is possible that you could drain your entire system of jobs. That being said, it is an unlikely situation. Once a job gets an estimate, it will keep that estimate until it changes. If the scheduler fails to calculate the job’s estimate, it will retain the old one.

There are a couple of ways to read the estimated start time. First, qstat -T will order the jobs by the estimated start time (and print it in a short form). Second, you can do a qstat -f and look for the estimated.start_time attribute. You can also look for estimated.exec_vnode for where the job will run at that time.

There is a log message that is printed if a job is a top job. It says “Job is a top job and will run at …” Other than that, you need to look for the existence of a start time estimate.
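
A minimal sketch of the qstat -f route in Python, for scripts that already collect that output. It assumes the usual layout of "Job Id: <id>" header lines, indented "name = value" attribute lines, and tab-indented continuation lines for wrapped values; parse_qstat_f and start_estimate are names invented here, not part of PBS.

```python
from typing import Dict, Optional, Tuple


def parse_qstat_f(text: str) -> Dict[str, Dict[str, str]]:
    """Parse `qstat -f` style output into {job_id: {attribute: value}}.

    Assumed layout: a "Job Id: ..." header per job, then "name = value"
    lines, with long values wrapped onto lines that start with a tab.
    """
    jobs: Dict[str, Dict[str, str]] = {}
    current = None
    last_key = None
    for raw in text.splitlines():
        line = raw.strip()
        if raw.startswith("Job Id:"):
            current = jobs.setdefault(raw.split(":", 1)[1].strip(), {})
            last_key = None
        elif current is None or not line:
            continue  # text before the first job, or a blank separator line
        elif raw.startswith("\t") and last_key is not None:
            current[last_key] += line  # continuation of a wrapped value
        elif " = " in line:
            key, value = line.split(" = ", 1)
            current[key] = value
            last_key = key
    return jobs


def start_estimate(attrs: Dict[str, str]) -> Tuple[Optional[str], Optional[str]]:
    """Return (estimated.start_time, estimated.exec_vnode); None where absent."""
    return attrs.get("estimated.start_time"), attrs.get("estimated.exec_vnode")
```

A job whose start_estimate comes back as (None, None) simply has no estimate recorded, which by the description above means it has not been treated as a top job so far.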

buchmann · August 18, 2016, 5:47am

Thank you for the continued support. Still problems at this end, though.

I have tried to add the following to sched_config:

backfill:   true

and reloaded config (pkill -HUP pbs_sched). The log then stated:

...;pbs_sched;Fil;sched_config;Obsolete config name backfill, instead use server's backfill_depth=0

I assume that this means “no cigar”.

EDIT:
I note that the default sched_config (“factory file”) does not have a section describing a backfill parameter. As backfill seems like a very important part of the system, I would have expected a section about it.
/EDIT

In our case that happens to be quite OK. We have no real-life lusers, but only operational jobs, that must run. If enough nodes are down the jobs should be deleted. (The underlying scripts running the jobs will detect this and mail the operator.) Worst case scenario for us actually happens to be jobs stuck in queue unnoticed.

This would be perfect for me, as I gather all the info from qstat -f <JOBID> automatically anyway. As I presently do not see any jobs with estimated.start_time I still assume that it is not computed - probably because backfilling is obviously not turned on(?)

Presently, I use set server backfill_depth = 10, but I don’t know what else to do to enable it.

EDIT2:
Actually, the present configuration seems to perform some backfilling. In testing, I submit a job requiring all nodes except one, and subsequently a second job requiring all nodes. The first job runs, while the second is queued. A third, much shorter, job requiring just the single node left by the first job is then allowed to run, bypassing the second job. Thus, it appears that backfilling is working. But I do not see any mention of backfill in the scheduler or server logs.
Nor do I see any estimated attributes in qstat -f.
/EDIT2

/Bjarne

arungrover · August 18, 2016, 11:04am

Well, setting backfill_depth to a positive non-zero value should turn backfilling on.

In the test scenario that you quoted, do you have strict_ordering enabled? Unless you have that enabled, your second job will not become a “top job” and backfilling will not happen. Also, setting strict_ordering is one way of making a job a top job; there can be other ways of doing the same thing.
The reason that you see your third job running is that when the scheduler figured the second job could not run, it moved on to the next job which it wants to run.

To test backfilling, I’d suggest you make the walltime of your third job the same as or greater than that of the first job. If backfilling happens, then your third job will not run, because its node solution will clash with the estimated node solution of your second job (the top job). But if you modify the walltime of your third job (the filler job) in such a way that it ends before the first job is estimated to end, then you will see this filler job getting backfilled and it will start running.
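
The test above boils down to a timing comparison. As a toy illustration in Python (this is not the scheduler's actual algorithm, and the function and parameter names are made up here), a filler job that needs nodes reserved for the top job can only be backfilled if it finishes by the time the top job is estimated to start:

```python
from datetime import datetime, timedelta


def filler_may_backfill(now: datetime,
                        filler_walltime: timedelta,
                        top_job_start: datetime,
                        shares_nodes: bool) -> bool:
    """Toy model of the backfill test described above.

    shares_nodes: True if the filler would occupy nodes that the top job's
    estimated node solution also needs.
    """
    if not shares_nodes:
        return True  # no clash with the top job's reservation at all
    # The filler must be done before the top job is due to start.
    return now + filler_walltime <= top_job_start
```

With the top job estimated to start in one hour, a 30-minute filler on the same nodes passes this check and a 2-hour filler does not, which is the behaviour the suggested experiment is meant to reveal.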

I hope I’ve not managed to confuse you

buchmann · August 18, 2016, 11:47am

Not in the least. Hopefully, the following blob does not throw you off either.

Actually not. strict_ordering is set (in sched_config) to
strict_ordering: false ALL
I was not aware that this was necessary. So thank you very much - this is a big step for me.

Very good point.
In the test scenario (with strict_ordering: false ALL), the third job is allowed to run even with a specified wall time much larger than that of the first job. Thus, it is allowed to run not because of backfilling, but just because the scheduler deems it feasible.

I have tried to set strict_ordering: true ALL, and that has the effect that the second job gets the estimated.start_time and estimated.exec_vnode attributes. So that is good.

Unfortunately, the scheduler seems to make some kind of hash of it, such that the third job runs anyway. The problem (presently) is that all jobs explicitly ask for one “infiniband” per unit, and there is defined exactly one of these per node (infiniband is an integer resource). However, the scheduler apparently thinks that in the future (at estimated.start_time) it may use two of these from each node, thus allocating a four-node job (-lselect=4:infiniband=1) to two nodes. Thus, the scheduler believes that there is plenty of room left to run the third job (as the second job does not need all the nodes, but just half of them).

Even if I change the select statement to the old-school “-l nodes=N”, the scheduler lets the (now long) third job run, as it thinks it can get away with just allocating one cpu per node. This will eventually fail, as the nodes are exclusive to each job. So, if a job gets the node, then no other job can use it until the first job is done. In the present case, the third job then locks the node, but the scheduler does not realize this until later when the first job finishes, but the third still cannot run. At this point, the estimated.start_time is not updated - so it is in the past. (edit: deleted sentence)
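
A stale estimate like this can be spotted mechanically. A minimal Python sketch, assuming that qstat -f prints estimated.start_time in the ctime-style form (e.g. Thu Aug 18 14:00:00 2016) and that the script runs in an English/C locale; estimate_is_stale is a name invented for illustration:

```python
from datetime import datetime

# Assumption: ctime-style timestamps, e.g. "Thu Aug 18 14:00:00 2016".
QSTAT_TIME_FORMAT = "%a %b %d %H:%M:%S %Y"


def estimate_is_stale(estimated_start: str, now: datetime) -> bool:
    """True if the estimated start time already lies in the past."""
    return datetime.strptime(estimated_start, QSTAT_TIME_FORMAT) < now
```

Calling estimate_is_stale(value, datetime.now()) for each queued job that still carries an estimate would flag the situation described above, where the estimate was not updated after it expired.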

With strict_ordering enabled, “top jobs” show up in the sched log for the first time for me, so backfill is now enabled for sure. But it still does not do a very good job. I would deem this a possible bug in the scheduler, but I would very much like some input from one of you before I start a separate thread to report it.

If I help the scheduler a little bit more, then I can make backfilling work. But the fix depends on our specific cluster layout: We have two flavours of nodes: compute with ncpus=20 and io with ncpus=12. Jobs default to the io nodes, but can explicitly ask for compute. If I tag an ncpus=12 onto my select statement, then the scheduler realizes that only a single task can go on each node, and then backfilling works. At that point, the infiniband=1 specification becomes redundant, but if I install nodes with, say, 24 cores, then this hack/fix will no longer work.

Thanks,

/Bjarne

arungrover · August 18, 2016, 12:24pm

With this select specification, can you try using -lplace=scatter or vscatter, depending on whether you are running with vnodes? This will make the first job scatter across nodes/vnodes instead of packing chunks onto one node, which will probably prevent your third job from running.
A few more questions about Infiniband:
Is Infiniband a node level consumable resource? Is this resource also added to “resources” option in scheduler config? What is the value of infiniband on each node?

buchmann · August 18, 2016, 12:55pm

I’ll try to do that and report back ASAP. Thanks. (We use just “natural vnodes”, ie. one vnode per physical node).

"Is Infiniband a node level consumable resource?"
I believe the answer is yes. It is defined as long (int).

"Is this resource also added to “resources” option in scheduler config?"
Yes:

[root@server]# grep ^res /var/spool/pbs/sched_priv/sched_config
resources: "ncpus, mem, arch, host, vnode, netwins, aoe, nodetype, compute, infiniband"

"What is the value of infiniband on each node?"
1 (one).

If I ask for this: “-lselect=1:infiniband=2” then the job is queued but never runs.

/Bjarne

buchmann · August 18, 2016, 2:27pm

Yes, this did the trick. With this option, the scheduler realizes that it will need N nodes, and makes the right decision on the start time - holding back the third job (unless the third job is short enough to run along-side the first job - in which case the third job is backfilled).
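
For reference, the kind of submission that worked can be assembled like this. A small Python sketch; qsub_args is a hypothetical helper, while the -l select=... and -l place=scatter options are the ones discussed in this thread. The job script path is left for the caller to append.

```python
from typing import Dict, List, Optional


def qsub_args(chunks: int,
              per_chunk: Dict[str, object],
              place: Optional[str] = "scatter") -> List[str]:
    """Build a qsub argument list with a select spec and optional placement.

    chunks:    number of chunks to request (the N in select=N:...)
    per_chunk: resources requested by each chunk, e.g. {"infiniband": 1}
    place:     placement policy; "scatter" puts each chunk on its own node
    """
    spec = ":".join([str(chunks)] + [f"{name}={value}" for name, value in per_chunk.items()])
    args = ["qsub", "-l", f"select={spec}"]
    if place:
        args += ["-l", f"place={place}"]
    return args
```

The resulting list can be handed to subprocess.run together with the script path; passing place=None reproduces the earlier request without the placement option.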

Kudos all around.

Thanks,

/Bjarne
