Primetime/holidays questions

Dear Wizards,

I have a few questions regarding how to use (non-)primetime settings and the holidays file.

As far as I can read, the config file PBS_HOME/sched_priv/holidays must contain the present year info in a line like

YEAR 2016

Apparently, there can only be one YEAR per file (and thus per pbs_sched config). So how do site admins in practice deal with year roll-over? Sit tight around midnight on Jan 1st and reconfigure - or copy/move/symlink files with the at command (or cron)? All these options seem unattractive for operational/stability reasons.

Also, the actual format of the holidays seems somewhat lacking (this would be a devel issue), as both the day-of-year and the calendar date must be specified for each day. One of them ought to be enough.

From an operational perspective, the file should be in a format that allows it to always contain information for at least some days (or a month) into the future. This could be done by either allowing several YEAR blocks, or specifying the holidays as e.g.

YYYY-MM-DD Comment

such that multi-year info can be compiled into a single file. But presumably, some experienced site admins have already thought of a solution that does not require recoding of the scheduler.
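To make the proposal concrete, here is a minimal sketch of how such a multi-year file could be read. It is not something the scheduler supports; the function name and the '#' comment convention are my own assumptions.

```python
from collections import defaultdict
from datetime import date


def parse_multi_year(text):
    """Parse 'YYYY-MM-DD Comment' lines into {year: [(date, comment), ...]}."""
    by_year = defaultdict(list)
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and '#' comments (an assumed convention)
        stamp, _, comment = line.partition(" ")
        day = date.fromisoformat(stamp)  # raises ValueError on a malformed date
        by_year[day.year].append((day, comment.strip()))
    for entries in by_year.values():
        entries.sort()  # keep each year's holidays in calendar order
    return dict(by_year)


sample = """\
2016-12-26 Boxing Day
2016-12-25 Christmas Day
2017-01-01 New Year's Day
"""
holidays = parse_multi_year(sample)
```

Grouping by year is only there to show that one file could feed several YEAR blocks.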

Actually, we do not expect to use primetime/non-primetime at our clusters. Thus, as the admin guide (v13.1 §4.8.35.2 page AG-260):

So, I assumed that if I removed the holidays file, then all was well, and I would not use primetime/non-primetime. That seems to be the case, but the scheduler complains (logs) about it:

...;0001;pbs_sched;Svr;pbs_sched;No such file or directory (2) in parse_holidays, Error opening file holidays
...;0040;pbs_sched;Fil;holidays;Warning: cannot open holidays file; assuming 24hr primetime
...;0004;pbs_sched;Fil;holidays;The holiday file is out of date; please update it.

If I add an empty file, then the scheduler complains (presumably because YEAR is missing) that

...;0004;pbs_sched;Fil;holidays;The holiday file is out of date; please update it.
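For anyone who wants to separate these messages from the rest of the scheduler log, here is a small sketch. It assumes the six semicolon-separated fields visible in the lines above; the field names are my own labels, not official ones.

```python
def parse_sched_log_line(line):
    """Split one log line into its six semicolon-separated fields."""
    # Limit the split: the message itself may contain ';'.
    parts = line.rstrip("\n").split(";", 5)
    if len(parts) != 6:
        return None
    keys = ("timestamp", "event", "program", "obj_type", "obj_name", "message")
    return dict(zip(keys, parts))


def is_holidays_noise(record):
    """True for pbs_sched messages that concern the holidays file."""
    return (
        record is not None
        and record["program"] == "pbs_sched"
        and ("holidays" in record["obj_name"] or "holidays" in record["message"])
    )
```

Both functions only look at text, so they can be tried on a copy of the log without touching the scheduler.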

So, how do I turn off primetime/non-primetime without the scheduler complaining?
I can create an “almost empty” holidays file with just the content YEAR 2016, but then the scheduler will likely complain once we get a few seconds into 2017. (Plus I need to update a redundant file).
Edit: Actually, this last option seems to make every day primetime, but “ending at midnight”. So it is likely a bad idea. Presently, I will remove the holidays file altogether (or alternatively leave it empty), and live with the scheduler complaining.

Note that this is likely only a logging/warning issue - I expect that the scheduler actually behaves as expected.

Best,

/Bjarne

Hey Bjarne,

Thanks a lot for the research and all the good suggestions!

There actually has been an RFE (internally) to fix the holidays file for some time now. I personally worked a bit on it and did try to get rid of all the pain points that you mentioned. With the transition to Open Source, the RFE sort of got left behind and needs to be re-prioritized. I think it is now targeted for the 15.0 release. As soon as a dev picks it up again, you should see some activity on the forums about it. I’ll try to convey your observations to the team when this starts getting worked on again.

Thanks again!
Ravi

It will still work with the existing content though (which for real holidays will unfortunately be incorrect).

There is actually a way to specify year 0, which means “this file is valid for all years”. Sadly, an error message is still spat out in the scheduler log even though it all works.

Of course real holidays do change from year to year (which basically means that for most holidays you have less than a year to update the corresponding holiday after it has passed in the current year).

You can edit the file and send the scheduler a HUP when you update the file, or even kill and restart the scheduler (since it is stateless). By the way, you can force the server to give a new scheduler an immediate kick using qmgr -c "s s scheduling=true".
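A minimal sketch of the HUP step, in case it needs scripting. Where the scheduler's PID is recorded differs between installations, so the pid_file argument is a placeholder you would have to point at the right file yourself.

```python
import os
import signal


def hup_scheduler(pid_file, kill=os.kill):
    """Send SIGHUP to the process whose PID is stored in pid_file.

    pid_file is hypothetical: check where your installation records the
    scheduler PID. The kill argument exists only so the function can be
    exercised without signalling a real process.
    """
    with open(pid_file) as fh:
        pid = int(fh.read().split()[0])
    kill(pid, signal.SIGHUP)
    return pid
```

Passing a dummy kill function lets you check the PID parsing on a test file before pointing it at the real scheduler.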

Yes, it’s not perfect if you would like accurate simulations for jobs longer than a year (and if you have jobs that are a couple of months you’d be forced to update the holidays sooner than you might otherwise) but that’s not been a very common use case people have asked me about.

Hi Alexis and Ravi,

OK, good to know.

Yes, I am aware of that, but it has to be done right after midnight on Jan 1st. On our site, nobody will actually be available to do that, so we would have to set it up with cron or at - and we cannot test it in advance.
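For the record, this is roughly what such a cron job would have to do. It is a sketch under my own assumptions: per-year files named holidays.<year> in some staging directory are a convention I made up. Making the date a parameter at least lets the file-selection part be tried before New Year.

```python
import os
import shutil
from datetime import date


def install_holidays(src_dir, target, today=None):
    """Copy src_dir/holidays.<year> over target, replacing it atomically."""
    today = today or date.today()
    source = os.path.join(src_dir, f"holidays.{today.year}")
    if not os.path.isfile(source):
        raise FileNotFoundError(
            f"no holidays file prepared for {today.year}: {source}"
        )
    tmp = target + ".tmp"
    shutil.copyfile(source, tmp)
    os.replace(tmp, target)  # atomic when tmp and target share a filesystem
    return source
```

The scheduler would still need a HUP afterwards, and how it reacts at the year boundary remains untested by this.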

It is actually much worse than that. If both Dec 31st and Jan 1st happen to be holidays (they are in several countries), then there is a problem even for very short jobs. If I use a 2016 file to specify Dec 31st, then the scheduler will be unaware that the following day (Jan 1st 2017) is a holiday until I “rotate” the holidays config file. But that cannot happen until Jan 1st - when the holiday should already be in effect.

I see the present config method as broken in (at least) three ways:

  1. The scheduler cannot “see past” New Year, and as we get closer to it, scheduling will be based on assumptions that may be wrong.
  2. System administration has to be executed at a particular time (Jan 1st just after midnight), which is really not a “nice feature” IMHO. It is prone to errors.
  3. Dates have to be specified twice on the same line - both in terms of day-of-year and with a “named date”. One of them should suffice. Even the line for a fixed holiday like Christmas day will vary between leap years and non-leap years. This also is error prone.
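The leap-year point in the last item is easy to check with a few lines of standard-library Python (the helper name is mine):

```python
from datetime import date


def day_of_year(year, month, day):
    """Return the 1-based day-of-year for a calendar date."""
    return date(year, month, day).timetuple().tm_yday


print(day_of_year(2016, 12, 25))  # 360, since 2016 is a leap year
print(day_of_year(2017, 12, 25))  # 359
```

So the same calendar date needs a different day-of-year number in the file depending on the year.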

I would suggest to restructure the way holidays are configured to something as simple as:

That way, it is possible to specify the dates as far ahead as the admin wants. It should be possible for the scheduler to spit out warnings if, say, present time gets past the latest configured holiday.

I am sure quite a few sys-admins would appreciate that.

Thank you both for the comments!

Best
/Bjarne

“The scheduler cannot “see past” New Year,”

Actually if I remember correctly the scheduler assumes the next year is like the previous one, with the “year” just there to print a reminder in the logs to update the file if it thinks it’s still “last year’s file”.

Even if that’s not the case, that’s definitely the behaviour if you set YEAR to 0.

That means you can take holidays that have just passed this year and already fill in next year’s dates for them. You then don’t have to rotate things on Jan 1, and will only have trouble with jobs that are very far into the future (i.e. almost a year).
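A sketch of that rolling update, assuming you keep your own list of real holiday dates across years. The function name and the twelve-month window are my choices for illustration, and the dates are sample data.

```python
from datetime import date


def rolling_holidays(all_dates, today):
    """Return (day_of_year, date) pairs for holidays in the coming 12 months."""
    try:
        horizon = today.replace(year=today.year + 1)
    except ValueError:  # today is Feb 29 and next year has no such day
        horizon = date(today.year + 1, 3, 1)
    upcoming = [d for d in all_dates if today <= d < horizon]
    # Sort by day-of-year, the order in which a single-year file lists them.
    return sorted((d.timetuple().tm_yday, d) for d in upcoming)


known = [
    date(2016, 3, 25),   # Good Friday 2016, already passed on the date below
    date(2016, 12, 25),
    date(2017, 1, 1),
    date(2017, 4, 14),   # Good Friday 2017 replaces the 2016 entry
    date(2017, 12, 25),  # outside the 12-month window
]
entries = rolling_holidays(known, today=date(2016, 6, 1))
```

With these inputs the result holds Jan 1 2017, Apr 14 2017 and Dec 25 2016, in day-of-year order. Near a leap-year boundary two dates could map to neighbouring day numbers, which this sketch does not try to resolve.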

Hi Alexis,

Thank you for the notes.

Quite so.
If what you state is correct (you do not seem 100% certain at this point), then it does solve the most pressing issues (#1+#2). However, it will be really hard to test, it does raise several counter-intuitive config options, and it is all undocumented, right? Plus, we know that the scheduler will spit out various warnings related to the YEAR - even if we follow these guidelines.

If possible, I would still support Ravi’s idea of getting it right sometime not too far in the future.

However, your post does make it possible to “survive the day” until then.

Many thanks!

/Bjarne

No argument from me – which is why there’s been an RFE to refactor that code for a very long time. It’s just that, given that some clever plumbing manages to make you survive, that RFE always ended up at too low a priority compared to other work, I think. But you’re not the only one who has pressed for this, so it is currently scheduled for 15.0.

I would have to add, though, that many sites are electing to turn into “7x24” shops that essentially have no holidays and only one kind of time (with an empty holiday file and “none all” for all days.)

That is because supporting different scheduler policies in prime and non-prime time and having lots of transitions has a cost when you want to increase the backfill depth, especially if you have jobs with an estimated start time far into the future: in simulation, every prime to non-prime transition (or vice versa) is an event that needs to be simulated to see if it has an impact on whether a job can be scheduled or not.

One site managed to insert a 5000 year walltime job into the system and managed to stop scheduling since the top job needed its resources and was going through 5000 years of prime-non-prime-prime-… transitions. The scheduler never got through those 5000 years before the scheduling cycle length alarm was triggered…
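To give a feel for the numbers, here is a back-of-the-envelope estimate. It assumes a primetime window on five days per week with one start and one end each, and it ignores holidays and partial weeks. How the scheduler actually enumerates these events is not shown here, so treat it as an order of magnitude only.

```python
def estimate_transitions(walltime_days, prime_days_per_week=5):
    """Rough count of prime/non-prime boundaries crossed during a job."""
    full_weeks = walltime_days // 7
    # One primetime start and one primetime end per prime day.
    return full_weeks * prime_days_per_week * 2


print(estimate_transitions(30))          # a one-month job: 40
print(estimate_transitions(5000 * 365))  # the 5000-year job: 2607140
```

Under those assumptions, that is about 2.6 million events to walk through for a single top job in every cycle.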

It’s a very good feature for some kinds of workloads (with people submitting short jobs in prime time and waiting for the results), but it can be a nuisance if you need strict ordering and some jobs can be delayed for quite some time…

I can easily see how that is the case.

Many thanks for sharing the tricks.

/Bjarne