Allowing to schedule node maintenance with a possibility to run new jobs until the maintenance begins

Thank you @vchlum! I believe the EDD should also cover at least the following:

  • Is there a way to delete and query the upcoming maintenance events (add pbs_mdel and pbs_mstat?)? I see that this was discussed above in the forum but not resolved.

  • I have a similar question about altering an existing maintenance, but that is less important since delete and re-create would achieve the same thing (assuming scheduling is disabled while doing the delete/re-create).

  • I am reading the current EDD statement “Overlapping running jobs are ignored and it is up to the admins to deal with these jobs” as only relating to the server confirming the maintenance on the specified nodes where jobs may already be running. If an admin does nothing with these jobs in the time leading up to maintenance start, what happens to an “overlapping running job” when the maintenance actually begins on the vnodes where the job is running (killed, requeued, nothing, etc.)? If this is part of what the admin is supposed to deal with the EDD should be explicit about this.

  • What state does a vnode enter once the maintenance begins? We have a “maintenance” state, is there any reason not to use this (@bhroam?)?

  • Can we consider setting a vnode’s comment attribute to note the upcoming maintenance event upon confirmation? Probably not needed if the existing “resv” attribute will show the maintenance name, or maybe we make a new attribute for this?

  • We will probably need to either add a new job comment to say it is not running because it conflicts with node maintenance, or modify the existing comment regarding reservations/top jobs to include this.

  • The EDD currently says “admins and managers”. I don’t think “admin” is a rigidly defined term in PBS Pro. Was this supposed to say “Operators and Managers”?

  • What about adding an option for a “complex-wide maintenance” including all nodes so that all of the hosts/vnodes do not need to be listed?

  • What about the possibility of submitting jobs to the maintenance? An admin may want to run some test jobs to make sure things work before ending the maintenance. (This plus the above would make this feature completely replace dedicated time, I think).

  • If these maintenances are as similar to existing reservations as we are discussing here then I believe the EDD should be explicit about what reservation related hook events do or do not trigger for a maintenance.

Hey @scc
The idea behind this RFE is to use normal reservations that are confirmed in a non-normal way. Today we can’t use normal reservations because of the confirmation rules the scheduler follows. This way would have a different entity confirm the reservation.

If this is the way we go about it, then pbs_rstat would list the reservations and pbs_rdel would be used to delete them. Is your concern that since we have a pbs_msub, we should have a special command to stat and delete them? What if we put the pbs_msub code into pbs_rsub with a new option? This way you are using the standard pbs_r commands for all reservations.

For your second point, if we go the reservation route, pbs_ralter is available to modify the maintenance reservation times. This isn’t a full alter, but as you say you could delete and recreate them.

For your 3rd point, I read the point as ‘PBS will do nothing to jobs that overlap a maintenance reservation’. I think this is the right way to handle it because we already have two precedents. If you create dedicated time while there are running jobs and those running jobs will spill over into dedicated time, we do nothing. The second one is if a job’s exceeds its soft walltime and it runs into dedicated time, we do nothing. It is up to the admin to deal with these jobs as they see fit.

For your 4th point, I personally like the idea of leaving these reservations as normal reservations. This means the state of the node would go into resv-exclusive. Otherwise we would have special case code for these reservations, and I’d rather avoid that. If we just leave these alone as normal reservations, there is nothing saying you couldn’t create a standing maintenance reservation. The new command would have to be smart enough to confirm one though.

For your 5th point, all reservations show up on the node’s resv attribute

@vchlum a couple of thoughts:

  1. You’ll need to also create resv_nodes for the reservation. That is required to confirm it. You might want to add that to the document. If you want to support standing maintenance reservations, they are confirmed differently. You need to provide the resv_nodes for each occurrence.
  2. To handle nodes that are sharing = ignore_excl, you will probably want to request all the resources on then node as well as requesting place=exclhost
  3. You say the server confirms these reservations. Did you want to do that or have the command do it like @dtalcott is doing it?
  4. I suggested above that you might put this code into pbs_rsub. This has the benefits of not having the discrepancy of not having a pbs_mstat, pbs_mdel, or pbs_malter (or worse, writing them).
  5. There are some pluses of marking these reservations special. You could put these nodes in the maintenance state. If you do this, be careful though. The existing admin-suspend functionality will move nodes out of this state. You’ll have to deal with that. You could also possibly change the preceding letter from ‘R’ to ‘M’ . You could also change the job comment like @scc suggested. The only major downside I can see is that there is a beauty of leaving these as normal reservations. I think it is a cleaner implementation and it is more future proofed since anything that modifies reservations will also modify these.
    Bhroam

Thank you for comments @scc and @bhroam. Some of the comments are in the EDD draft, some will follow. I like the idea to treat the maintenance as a reservation with non-standard confirmation more and more. @scc, you have several good ideas (set comment, use maintenance state). These things request special treatment and they make the maintenance more special. I like the idea to keep it as similar to reservations as possible since the reasons @bhroam mentioned, but I would slightly prefer to use ‘M’ instead of ‘R’ for the maintenance prefix. The different prefix will provide an easy way of how to distinguish the maintenance (not only) in pbs_rstat. It would indicate that the ‘reservation’ was confirmed in a different way. But it is not necessary.

  • I would go with modifying the existing comment for conflicts.

  • I thought about ‘complex-wide maintenance’ too, but I think the scheduler would be needed for the complex select.

  • The reservation hooks should work but I am afraid there will be a trouble with the future hook event ‘on confirmation’ (PP-1289).

Vasek

Thanks for the comments and updates from both of you.

I like just adding a new option to pbs_rsub, it cuts down on confusion about other “missing” pbs_m* commands or the extra work to add them.

I like changing the prefix to M, it could cut down confusion if a maintenance reservation overrides someone else’s existing “normal” reservation.

Thanks @bhroam for listing precedence for doing nothing to these jobs. I am fine with following that. The only possible upshot of requeuing them if their walltime shows they will not finish before the maintenance is those resources could possibly be given over to backfilling shorter jobs that could finish, but if this is a real opportunity we want to pursue later we could revisit.

I think having a quick way to do a complex wide maintenance reservation is important. @bhroam, do you have any thoughts on how this could be done given the current thoughts on how this might all actually work?

Any thoughts from either of you on my last 2 bullets above? The EDD is the EXTERNAL design, so even if we know that internally this will be largely implemented via existing reservations where jobs can of course be submitted to them we should still call out any desired/intended functionality (like test jobs), and hook event triggers are always good to note explicitly to avoid confusion.

For the hook events, at first thought I don’t think we’d want/need them to trigger. Most of the use cases surrounding reservation hook events are to possibly deny the reservation at submit time or to integrate with an allocation management type system for billing, neither of which really apply here (though I am less sure about the allocation management case). If we do let PBS fire the hook events off then a well written hook at a site using this feature could detect the type of reservation and simply accept() (assuming we expose this to the hook, which would belong in the EDD, so some content is needed either way).

Again, thanks!

I like the idea of complex wide reservations. The command could definitely create one by doing a pbs_statvnode() and crafting the select/resv_nodes properly. This would be heavier on the scheduler though. Right now dedicated time is super cheap for the scheduler since it knows it is an all or nothing sort of deal. If we do complex wide reservations, I’d like to get rid of dedicated time in favor of them. Maybe we could come up with some special “vnode” name that means the whole system (e.g. ‘(PBS_ALL)’). If the scheduler saw a reservation with resv_nodes of that, it could convert them to something like dedicated time

As for hooks, I’d assume that site hooks are already properly written. I’m assuming that if they had a hook that rejected reservations, they’d allow the admins to submit anything they want. The maintenance reservations will be coming from the same admins.

I am also happy with using the ‘M’ prefix. We already have ‘R’ and ‘S’, so it isn’t like all reservations start with ‘R’. It does mean we’ll have to have state on the reservation that marks them as such. If this is the only reason we use that, then I think it is fine.

As for the comment changing, keep in mind that partial system reservations won’t necessarily give you the comment about conflicting with a reservation. You get a message from one node. Another note is that this message only happens for the normal code path. The node bucket code path gives a much more generic message.

Bhroam

Thanks. I like the udea of something like PBS_ALL.

Good point about the hook writers (usually) being the same admins who are making the reservation. Triggering the hooks is fine.

Yes, for the job comment, if this IS a reservation then the existing comment is fine (when it comes up, as you allude to there may be many reasons a job can’t run but only one comment).

Thinking more about the test jobs thing I harped on above and overall how much detail needs to go into this EDD, if it were slightly more strongly worded to say this IS an instance of what is already defined as a PBS reservation (rather than creating a new thing called “the maintenance” as the document is currently written) then it could just call out any explicit differences from reservations then I would be happy. This saves not only calling out something like job submission to it but also the details of the existing reservation attributes, which would clearly be a waste of time.

Really nice work – thanks @vchlum and all. I really like it!

I would suggest that handling “complex wide reservations” specially is a fine extension that should be a separate, follow-on contribution. It’s not needed now and it’s a natural extension (that fits with the current design in a backward compatible way). It might be better to get feedback on this, then extend if desired. Plus that makes it simpler = easier to review and less likely to have bugs (and unforeseen consequences).

There will inevitably be times when admins want to specify something like PBS_ALL except A, B, and C. We could explore the use of sets to make this more flexible.

As for PBS_ALL, I can tell from our experiences… It is very unlikely that we would like to put the whole system in maintenance. Our PBS operates several clusters and I would rather appreciate the possibility to put the whole cluster in maintenance instead of the whole system. Something like ‘-l select=24:cluster=<clustername> -l place=exclhost:scatter’ I can imagine that PBS_ALL could be useful for someone, but from my point of view, it is not a must-have feature.

@scc mentioned that he would like just adding a new option to pbs_rsub. Is it settled that it will be a new command? I think this is quite important for how the maintenance feature will be treated. Since we want to confirm the maintenance in the command, I think it should be a new command, but maybe it could have some other name (??) in order not being confusing about missing pbs_m*.

Vasek

If you want to submit a reservation like -lselect=24:cluster=foo, you will have to reimplement some of the node searching algorithm from the scheduler in this new command. If you just take a list of nodes, then it is easy. Once you start needing to do resource matching, then it gets harder. As long as you don’t do consumable resource matching, it shouldn’t be too hard.

In the example you are taking 24 nodes which aren’t named. I’m assuming this is the entire cluster? Maybe don’t take a number and implement a select like language which you say ‘cluster=foo’ and it just takes the entire foo cluster down. I don’t see the use case for saying less than a cluster without specifying the nodes. You won’t know which nodes you will get.

I guess I don’t really care what command it goes into. My thought was that if you stat it with pbs_rstat, delete it with pbs_rdel, and alter it with pbs_ralter, why not submit it with pbs_rsub like other reservations?

Bhroam

Yes, the example was for the entire cluster. I am fine with just enumerating all the nodes, and I just wanted to say that PBS_ALL is not a must-have feature in my opinion.

Sorry, @bhroam I probably overlooked your sentence:

What if we put the pbs_msub code into pbs_rsub with a new option? This way you are using the standard pbs_r commands for all reservations.

…and I got a bit confused:-) I don’t mind this way either. Trying to think up with the option for pbs_rsub, it could be maybe ‘-h <list of hosts>’ or ‘-v <list of vnodes>’, and if ‘-h’ or ‘-v’ is present then ‘-l’ is forbidden.

Vasek

In terms of enumerating the nodes, I suggest we make it a bit easier for the administrator with one of the two options:

  1. Allow hostname wildcards on the command line, or
  2. Allow the list of nodes to be read from stdin.

This was if there are a large number of nodes we could avoid making multiple calls to pbs_msub (or whatever we end up calling it).

Thanks,

Mike

How about mirroring the -H option to qrun as an additional option to pbs_rsub? It already has several nice parallels, e.g.,

  • In order to execute qrun, you must have PBS Operator or Manager privilege.
  • Do NOT use this option unless you know exactly what you are doing.
  • With the -H option, all scheduling policies are bypassed …

We could even restrict the syntax to the subset of what qrun accepts that is exactly the list of nodes (and report an error otherwise). When/if additional use cases arise, it can be easily expanded at that time.

See the PBS Professional 18.2 Reference Guide, RG-177:

The -H option is already taken for ‘<auth_host_list>’ in pbs_rsub.

@mkaro Are hostname wildcard strong enough? Maybe we should allow using regular expressions for the node selection. What do you think?

Vasek

Re: hostname wildcards, etc. – I don’t think it makes sense to do something unique for this enhancement. This may be a great use case, but I’d rather see it apply generally, across all places in PBS Pro, rather than make one (unique) exception here.

My feeling is that this should be saved for a later enhancement.

Bummer about -H. In any case, it would not be good to invent yet another spec language – we should re-use something that already exists, perhaps in a restricted form.

Maybe use a – double dash argument, e.g., --force-reservation?

@vchlum - Regular expressions would be an improvement over wildcards. However, I’m not sure we want to add regex code to the command itself if we can simply pipe a list to stdin. Admins can use [e]grep to prune the list as appropriate. It would be nice to have a -1 (dash one) parameter for pbsnodes, similar to the ls command to list one node per line. So you could do something like…

pbsnodes -1 | grep node000[0-9] | pbs_msub -c 'Node down for maintenance'

I think this is consistent with the *nix way of doing things.

I would like to push this forward and start to work on this. So I added an ‘alternative two’ to the design doc, which is the option to pbs_rsub, and I would like to pick one with your help. I am OK with both solutions. Since ‘-f’ is not taken, I think it is OK to use it as an option to pbs_rsub. Of course, we can use double dash with the full word. I also added the possibility to read the list of nodes from stdin.

Vasek

I like alternate 2 myself. I don’t think we need to provide both vnode and host support. I’m of two minds of which to provide. If we model this after qrun -H, then we provide vnode support. On the other hand, I don’t really see a use case to take down anything but a full host for maintenance. Maybe we should only allow the admin to list hosts?

I’m a little confused about how you said the select statements will be host=foo+host=bar, but the resv_nodes will include all of the ncpus of the host. Shouldn’t both of them contain ncpus?

Bhroam

I think the host support is more sensible. We really don’t want to maintain anything else than the full host. The select statement is changed and it contains the ncpus now.

Vasek