Understanding Top Jobs and Calendaring (jobs will run in the past)

Hi,

I am trying to understand how the scheduler uses top jobs and calendaring to place jobs.

On Friday (29/11/19) we turned on strict_ordering in the sched_config and increased the backfill_depth (was originally 1).

We started getting messages about top jobs; however, some of these jobs were “calendared” to run in the past. In addition, some newly submitted jobs were also being calendared in the past.

I think what may have happened is that we had some two-week-old jobs in the queue which were not being run, and these may have been top jobs at some point in the past.

PBSAdminGuide18.2.pdf contains:

4.9.40.1.ii What Changing Calculation Speed Affects
Changing this attribute takes effect on the next scheduling cycle. If you change this attribute, top jobs are recalculated in the next scheduling cycle.

Once an ASAP reservation is made, it is fixed. If you change opt_backfill_fuzzy later, the reservation start time does not change, even if it becomes degraded. PBS finds new vnodes for degraded reservations, but does not change the start times.

In this section, I am unsure if the comment about ASAP reservation applies to Top Jobs.

Here are some log entries (please note the dates):

11/29/2019 11:12:03;0080;pbs_sched;Req;;Starting Scheduling Cycle
11/29/2019 11:12:05;0080;pbs_sched;Job;130654;Job is a top job and will run at Thu Dec  5 08:41:52 2019
11/29/2019 11:12:06;0080;pbs_sched;Job;130798;Job is a top job and will run at Fri Dec 13 09:22:06 2019
11/29/2019 11:12:06;0080;pbs_sched;Job;132415;Job is a top job and will run at Thu Dec 19 08:41:52 2019
11/29/2019 11:12:08;0080;pbs_sched;Job;134593;Job is a top job and will run at Sat Nov 16 20:53:18 2019
11/29/2019 11:12:09;0080;pbs_sched;Job;134594;Job is a top job and will run at Sun Nov 17 04:28:38 2019
11/29/2019 11:12:09;0080;pbs_sched;Job;134595;Job is a top job and will run at Sun Nov 17 06:30:39 2019
11/29/2019 11:12:10;0080;pbs_sched;Job;134859;Job is a top job and will run at Sat Nov 30 13:36:43 2019
11/29/2019 11:12:10;0080;pbs_sched;Job;134860;Job is a top job and will run at Sat Nov 30 16:56:16 2019
11/29/2019 11:12:11;0080;pbs_sched;Job;134861;Job is a top job and will run at Sat Nov 30 18:54:47 2019
11/29/2019 11:12:12;0080;pbs_sched;Job;134862;Job is a top job and will run at Sat Nov 30 20:18:07 2019
11/29/2019 11:12:13;0080;pbs_sched;Job;134863;Job is a top job and will run at Sat Nov 30 21:15:58 2019
11/29/2019 11:12:15;0080;pbs_sched;Job;134864;Job is a top job and will run at Sat Nov 30 23:13:13 2019
11/29/2019 11:12:15;0080;pbs_sched;Job;134865;Job is a top job and will run at Sat Nov 30 23:43:02 2019
11/29/2019 11:12:16;0080;pbs_sched;Job;135968;Job is a top job and will run at Sat Nov 30 23:43:02 2019
11/29/2019 11:12:17;0080;pbs_sched;Job;135969;Job is a top job and will run at Sun Dec  1 00:47:06 2019
11/29/2019 11:12:18;0080;pbs_sched;Job;135980;Job is a top job and will run at Sun Dec  1 01:19:47 2019
11/29/2019 11:12:18;0080;pbs_sched;Job;135982;Job is a top job and will run at Mon Dec  2 14:41:19 2019
11/29/2019 11:12:19;0080;pbs_sched;Job;135986;Job is a top job and will run at Mon Dec  2 15:54:10 2019
11/29/2019 11:12:21;0080;pbs_sched;Job;135987;Job is a top job and will run at Sat Dec  7 13:36:43 2019
11/29/2019 11:12:21;0080;pbs_sched;Job;135988;Job is a top job and will run at Sat Dec  7 18:54:47 2019
11/29/2019 11:12:22;0080;pbs_sched;Job;136069;Job is a top job and will run at Sat Nov 16 11:41:48 2019
11/29/2019 11:12:22;0080;pbs_sched;Job;136070;Job is a top job and will run at Sat Nov 16 16:27:38 2019
11/29/2019 11:12:22;0080;pbs_sched;Job;136071;Job is a top job and will run at Sat Nov 16 19:53:35 2019
11/29/2019 11:12:23;0080;pbs_sched;Job;136072;Job is a top job and will run at Sat Nov 16 21:03:18 2019
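To pick out the affected jobs from a longer scheduler log, the entries above can be filtered with a short script. This is only a minimal sketch: it assumes the log lines look exactly like the ones quoted here, and the names `TOP_JOB_RE`, `find_past_estimates` and `SAMPLE` are made up for illustration. It compares the estimated start time in each "top job" message with the timestamp of the log line itself.

```python
import re
from datetime import datetime

# Matches the "top job" lines quoted above; the field layout is assumed
# from those lines only, not from any PBS log format specification.
TOP_JOB_RE = re.compile(
    r"^(?P<logged>\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2});[^;]*;pbs_sched;Job;"
    r"(?P<job>[^;]+);Job is a top job and will run at (?P<est>.+)$"
)


def find_past_estimates(lines):
    """Return (job_id, logged_at, estimated_start) for estimates in the past."""
    past = []
    for line in lines:
        match = TOP_JOB_RE.match(line.strip())
        if match is None:
            continue  # not a "top job" message
        logged = datetime.strptime(match.group("logged"), "%m/%d/%Y %H:%M:%S")
        # Collapse the padding in dates such as "Dec  5" before parsing.
        # %a and %b assume an English (C) locale.
        est_text = " ".join(match.group("est").split())
        estimated = datetime.strptime(est_text, "%a %b %d %H:%M:%S %Y")
        if estimated < logged:
            past.append((match.group("job"), logged, estimated))
    return past


# Three of the lines from the log above.
SAMPLE = [
    "11/29/2019 11:12:05;0080;pbs_sched;Job;130654;Job is a top job and will run at Thu Dec  5 08:41:52 2019",
    "11/29/2019 11:12:08;0080;pbs_sched;Job;134593;Job is a top job and will run at Sat Nov 16 20:53:18 2019",
    "11/29/2019 11:12:10;0080;pbs_sched;Job;134859;Job is a top job and will run at Sat Nov 30 13:36:43 2019",
]

for job_id, logged_at, estimated_start in find_past_estimates(SAMPLE):
    print(f"job {job_id}: logged {logged_at}, estimated start {estimated_start}")
```

For the three sample lines, only job 134593 is reported, since its estimate of 16 November is earlier than the 29 November log timestamp.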

My question: is the time for a top job recalculated every scheduling cycle, or is it remembered between cycles?

If it is remembered, is it possible to wipe the “estimated.start_time”, so it can be recalculated?

Does “opt_backfill_fuzzy” have an effect on any of this?

We are currently running 18.1.2

Thanks,
Ashley

I looked into this further today. I suspect we may have hit the issue that is addressed in this patch:

It’s recalculated every scheduling cycle.

opt_backfill_fuzzy can reduce the time the scheduler takes to calendar a job. It basically hops over a few events between each attempt at “can this job run now?”. So it can affect when the job is estimated to run, although it should only push the estimate further into the future, never into the past (nothing should cause that, of course). The behavior you are seeing was probably the bug that you found the patch for.
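As a rough illustration of why hopping over events can only delay an estimate, here is a toy model. It is not the PBS scheduler's actual algorithm: the function `first_fit_time`, the `stride` parameter and the event data are all invented for this sketch. Each event is a time at which resources are released, and the job is "calendared" at the first checked event where enough CPUs are free.

```python
def first_fit_time(event_times, free_after, needed, stride=1):
    """Toy model: return the first *checked* event time with enough free CPUs.

    stride=1 checks every event (exact); stride>1 skips events in between
    (a stand-in for a fuzzy setting). The last event is always checked.
    """
    indices = list(range(0, len(event_times), stride))
    last = len(event_times) - 1
    if last >= 0 and indices[-1] != last:
        indices.append(last)
    for i in indices:
        if free_after[i] >= needed:
            return event_times[i]
    return None  # the job does not fit at any checked event


# Hypothetical calendar: time of each event and CPUs free after it.
event_times = [10, 20, 30, 40, 50]
free_after = [2, 4, 8, 8, 16]

exact = first_fit_time(event_times, free_after, needed=8, stride=1)
fuzzy = first_fit_time(event_times, free_after, needed=8, stride=3)
print(exact, fuzzy)
```

Because the fuzzy run checks a subset of the events the exact run checks, its answer can be the same or later, but never earlier. In this example the exact search stops at time 30 and the strided one at time 40.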

Thanks. That is what I thought should be happening, based on past versions of PBS.

I will patch our servers and see if that helps.

Upgrading to 18.1.3 seemed to fix the problem with the dates in the past.
We are getting more consistent behaviour from the scheduler now.

Great to hear that. Then it must have been the same bug.