Reservations in PBS pro (vs other schedulers)

Dear Wizards,

I am fishing for hints/advice on PBS pro configuration for a cluster used for “operational production” (high-prio jobs run a set number of times per day).

Our old cluster runs a different scheduler, and we want to move to PBS pro along a “path of least resistance”: without bending PBS pro too much, but also without having to rewrite scripts, job guides etc. more than we have to.

My present query is related to how reservations work in PBS pro - vs how we are “used to” reservations working.
We have an operational user (“prod”), which automatically runs all the high-prio jobs. Other users submit jobs for eg. testing, development and hindcasting.

We enforce a 1.5h maximum wall time on all jobs - and since we have full control of users, no jobs will ask to get the walltime extended during run time. An “operational reservation” can be submitted more than 2 hours before it is used. In principle, this does not actually have to be implemented as a “reservation”. Anything that will keep other users’ jobs away from the “reserved” nodes will do. Maybe hooks are the way to go - I don’t know PBS pro well enough to make that decision. Preferably, jobs from user prod should be able to run on any nodes - even spanning to nodes that are not explicitly part of the “reservation”. It does not really matter how much other jobs/users are delayed in that process.

Preferably, our “job control system”, which actually submits operational jobs to the queue/scheduler, should not need to know anything about reservation queues and similar things.
On the “old system” all jobs are submitted to a single queue, i.e. there are no “reservation queues”.

Our node usage is always job-exclusive: the MoMs are set up with “sharing = force_exclhost”. To help the scheduler, jobs are also submitted with -l place=exclhost (typically even scatter:exclhost).

Thus, basically I am looking for ways to have PBS pro “reserve” some resources/nodes in time - without creating “boundaries” for the jobs in time or space (spanning nodes).

I cannot line up all the jobs prior to the start of the “operational compute cycle”, as some jobs will depend on various external parameters - also stuff, which is unknown at the onset of the compute cycle. Also, it is quite normal, that there may be a few minutes at a time, where the queue is seemingly empty of operational jobs, and it is important that the scheduler does not start low-priority jobs at these times. It is also normal, that some operational jobs may start “ahead of time” - as soon as they are ready to go, rather than waiting for the onset of the reservation.

If I use what seems to be the standard reservation in PBS pro (pbs_rsub), then apparently I must submit (or move) jobs to the reservation queue in order to use the reservation. Also, jobs cannot start before the reservation onset - and continue to run while the reservation becomes active. Although this is a smaller issue for us, jobs also seemingly cannot span reserved and non-reserved hosts.

Are there any commonly used / well-known ways to reserve resources in a “transparent to the jobs” way? A good way to do this would help me a lot in porting to PBS pro.

Please let me know if I ask the wrong questions, or if the questions do not make sense.

Thanks!

Bjarne

Hello Bjarne,

Forgive me if I’m oversimplifying your environment, but it sounds like your main requirement is to have your production jobs run immediately. This may be accomplished by utilizing express queues (see admin guide). I would suggest creating an express queue and applying an ACL that only allows submission from user “prod”. The scheduler will preempt other jobs to get your production work running ASAP. If express queues are not an option, please help us better understand your requirements.

Thanks,

Mike

Hi Mike,

That is indeed the case. However, we also run lower-priority jobs, which may claim the resources for up to, say, 60 or 90 minutes at a time. These are jobs that we cannot easily checkpoint, and restarting them after preemption would require quite a bit of coding on our part.

An express queue will help, but if I understand it right, it still will not solve my problem entirely.

Right. Preempting is an option, but I would rather avoid it if possible. There are several reasons for this - both related to the high-prio jobs - and the jobs being preempted:

  • The queue may be empty for short periods quite often, so lower-priority jobs could potentially start and be preempted multiple times, which will require quite a bit of cleaning implementation on our side (this is related to how we actually run the jobs).

  • Some high-prio jobs may “start early” and will not really be high-priority until we get to the “reservation window”. These jobs typically use only a small set of the nodes and run for a short period, but with an express queue they could disrupt a larger low-prio job that is almost done (thus losing overall CPU cycles).

I’ll be happy to elaborate if you think it makes a difference.

It might be an idea to implement an extra queue, but only use it as a real express queue during the “reservation time”. Also, hooks could prohibit jobs from the “main” queue from starting if the job resource request “overlaps” with the “reservation” requirement (time & nodes). Would that make any kind of sense?

For sure there may be finer details, which I do not understand yet, so please correct/suggest.

Thanks so much for your help!

Bjarne

Hello Bjarne,

If you would rather not use preemption then perhaps, as you’ve surmised, using a runjob hook could get you what you want. The hook could reject any low priority job(s) that want to run on the set of “reserved” nodes w/ a walltime that would overlap the “reserved” period. Any operational jobs would be allowed to run on those nodes at the reserved time, or earlier, and would also be allowed to use any extra nodes if available. Because of the way PBS schedules jobs, only one job per scheduling cycle will request a specific node, so you don’t need to worry about job after job trying to run on a reserved node within the same cycle. Obviously you would need to enforce that all jobs be submitted w/ a walltime request. A potential drawback to this approach would be that it could be rather inefficient, as each scheduling cycle the scheduler may try to run jobs on any empty reserved nodes only to be rejected. Also, the scheduler may try to run the same low priority job each scheduling cycle on the same nodes, as they will remain available until the operational job starts. That job may have to wait quite a while to actually start running.
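As a rough illustration of the check such a runjob hook would make, here is a minimal Python sketch of the accept/reject decision only. The names `Window` and `may_start` are hypothetical, times are plain epoch seconds, and the wiring to the actual PBS hook event (reading the job's nodes and walltime, then accepting or rejecting) is deliberately left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Window:
    """Hypothetical description of one 'reserved' period (not a PBS object)."""
    start: int         # epoch seconds when the operational window opens
    end: int           # epoch seconds when the operational window closes
    nodes: frozenset   # names of the nodes kept free for operational jobs


def may_start(is_operational, job_nodes, now, walltime, window):
    """Return True if a job may start now on job_nodes.

    Operational jobs are never blocked. A low-priority job is blocked only
    if it touches a reserved node AND its run interval [now, now + walltime)
    overlaps the reserved interval [window.start, window.end).
    """
    if is_operational:
        return True
    if not (set(job_nodes) & window.nodes):
        return True  # job stays entirely on non-reserved nodes
    job_end = now + walltime
    overlaps = now < window.end and job_end > window.start
    return not overlaps


if __name__ == "__main__":
    w = Window(start=1000, end=2000, nodes=frozenset({"n1", "n2"}))
    print(may_start(False, ["n1"], 0, 900, w))    # finishes before the window
    print(may_start(False, ["n1"], 500, 600, w))  # would run into the window
```

Because every job carries a walltime request, the decision needs nothing beyond the current time, the job's node list and the window definition.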

Sincerely,
Sam Goosen

Hi Sam,

Thank you for the suggestions and recommendations.

That is very much along the lines I am thinking at the moment: I am still aiming to solve the issue with a runjob hook being a major part of the solution.

I now think about defining three queues:

  1. A standard queue for all kinds of low-prio jobs

  2. A “production queue” for the high-prio jobs. This queue will always have higher priority - even if no other part of the solution is in effect

  3. A “holding queue” (enabled but not started) to hold jobs deferred from the runjob hook. If possible, the runjob hook should be able to move low-prio jobs to this queue, so the scheduler does not have to deal with them in every scheduling cycle.

The runjob hook - as I envision it presently - will both bar the low-prio jobs from starting and move the deferred jobs to the holding queue, so the scheduler does not have to process all the jobs at each scheduling cycle. Hopefully, this means that high-prio jobs may run quite quickly.

Maybe I should even have a queuejob hook, which may defer jobs (to the holding queue) upon submission, if it is obvious that the job must use reserved resources.
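One way to make "obvious" concrete at submission time is sketched below in plain Python. It assumes a single reservation window, whole-node requests, and hypothetical queue names "standard" and "holding"; the function name `target_queue` is also just an illustration and is not part of any PBS API. A job is deferred only when it needs more nodes than exist outside the reserved set and even an immediate start would run into the window.

```python
def target_queue(nodes_requested, walltime, now,
                 total_nodes, reserved_nodes, window_start, window_end):
    """Pick a queue name for a newly submitted low-priority job.

    All times are epoch seconds. The job is sent to the holding queue only
    if both conditions hold:
      * it cannot fit on the non-reserved nodes alone, and
      * starting at the earliest possible time (now) already overlaps the
        window, so any later start before window_end would overlap as well.
    """
    free_outside = total_nodes - reserved_nodes
    needs_reserved = nodes_requested > free_outside
    overlaps = now < window_end and now + walltime > window_start
    return "holding" if needs_reserved and overlaps else "standard"


if __name__ == "__main__":
    # 10-node cluster with 6 reserved nodes, window from t=1000 to t=2000.
    print(target_queue(8, 600, 500, 10, 6, 1000, 2000))
    print(target_queue(3, 600, 500, 10, 6, 1000, 2000))
```

Jobs that might still fit on non-reserved nodes stay in the standard queue and are left to the runjob-time check.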

I hope that this overall idea may actually work. There are still a number of smaller things, which I need to resolve before I actually try to code the solution:

  • Figure out if a runjob hook can (is allowed to) move a job to a different queue

  • Make some kind of “fake resource”, which I can reserve/release (pbs_rsub/pbs_rdel) to signal the reservation period to the hooks

  • Make a mechanism to move held jobs back from holding to standard queue when the reservation ends. If the reservation is actively deleted, then I can write a script to do it (no additional hook necessary), but if the reservation ends normally (time runs out), then there ought to be some automated mechanism in place. I do not see a possibility to have a hook running at “reservation end” - symmetric to a resvsub hook(?) Also, I see no periodic hooks to run server-side (not on execution hosts), such as “at every scheduling cycle”, which could check if the reservation is still in place.
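For the last point, one possible fallback is an external script run periodically (for example from cron) that moves held jobs back once the window has passed. The sketch below only builds the `qmove` command lines and does not execute them; the function name `release_commands`, the queue name "standard" and the job IDs are hypothetical, and how the list of held jobs and the window end time are obtained is left open.

```python
def release_commands(held_job_ids, now, window_end, dest_queue="standard"):
    """Build one qmove command per held job once the window has ended.

    Returns an empty list while the window is still open. Each command is an
    argument list of the form ["qmove", <destination queue>, <job id>], so a
    failure for one job would not affect the others when they are executed.
    """
    if now < window_end:
        return []
    return [["qmove", dest_queue, job_id] for job_id in held_job_ids]


if __name__ == "__main__":
    # Hypothetical job IDs; print the commands instead of running them.
    for cmd in release_commands(["101.srv", "102.srv"], 2000, 2000):
        print(" ".join(cmd))
```

The same function would also cover the case where the reservation is deleted early, by passing the deletion time as `window_end`.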

Once again, thank you for the ideas!

Bjarne