Allow scheduling node maintenance with the possibility to run new jobs until the maintenance begins

Hi,

I would like to suggest a new node attribute. The reason for the new attribute is to be able to schedule node maintenance with better node utilization.

Please see the EDD for more details and let me know what you think.

Vasek

Hey @vchlum
Thank you for writing up the EDD. A couple of questions

  1. Can’t you achieve this behavior by submitting a reservation on the requested nodes for the maintenance period?
  2. What if you want to schedule multiple maintenance windows close together? With only one attribute, it will make it difficult to do. This lends itself to reservations again. Multiple reservations can be submitted for any time window.
  3. How will this affect calendaring? dedicated time and reservations have a start and an end. This allows the scheduler to know when the node will be back up. It can schedule jobs on nodes after the maintenance. From the sounds of it, the maintenance has no end time. Does this mean the scheduler can’t plan to use this node until it comes back from maintenance? If this is the case, large jobs can get hurt. If enough nodes go down for maintenance that large jobs can’t run, the scheduler will give up on them. This means no resources will be saved for them. We’ll pick back up when the maintenance window is removed from the node, but we start over. If multiple maintenance windows happen in short order, large jobs may never run.

Bhroam

Thank you @bhroam for the comments.

  1. We use reservations for scheduled maintenance now. The main issues with reservations are the following:

    • With a running job on a node, the reservation is not confirmed if the expected end time of the job exceeds the start time of the reservation. To get the reservation confirmed, it is always necessary to find the latest end time of all the jobs running on the nodes to be maintained (see the sketch after this list), and the maintenance cannot start before the last job ends. Sometimes, though, the maintenance needs to be scheduled earlier than that. This is also not script- or Puppet-friendly.

    • The bigger issue with reservations is that if I want to schedule maintenance for a whole cluster (let’s say 10 nodes) but one node is temporarily down for some other reason, the reservation will not be confirmed. I can create the reservation for only 9 nodes, wait for the last node to come up, and then create a second reservation for that node. If several nodes of the cluster are down, scheduling such maintenance gets very complicated.

  2. Yes, that is true; only one maintenance window can easily be scheduled with the proposed attribute.

  3. Thank you for pointing this out. I take large jobs very seriously. The idea was that the maintenance has no end time, but how about also adding an attribute ‘available_after’…
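
To illustrate point 1: today we essentially have to compute the latest end time of every job running on the nodes to be maintained before the reservation can be confirmed. A minimal sketch of that bookkeeping follows; the job dictionaries are purely illustrative, not a PBS API.

import time

# Minimal sketch: find the earliest time a maintenance reservation could be
# confirmed, given the jobs currently running on the nodes to be maintained.
# The job dicts ('nodes', 'stime', 'walltime' in seconds) are illustrative only.
def earliest_maintenance_start(jobs, maintained_nodes, slack=60):
    latest_end = int(time.time())
    for job in jobs:
        if not set(job['nodes']) & set(maintained_nodes):
            continue  # this job does not touch the maintained nodes
        latest_end = max(latest_end, job['stime'] + job['walltime'])
    return latest_end + slack

# One job on node01 with a 2-hour walltime pushes the earliest confirmable
# start of the maintenance about two hours out.
jobs = [{'nodes': ['node01'], 'stime': int(time.time()), 'walltime': 7200}]
print(earliest_maintenance_start(jobs, ['node01', 'node02']))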

Anyway, your questions lead me to this: what would you think of improving reservations themselves? How about adding a new ‘force’ option to pbs_rsub, so that the reservation is confirmed even if the nodes are currently unavailable, as long as they have sufficient resources? The reservation would simply be degraded from the beginning. Running jobs would also be ignored with such a ‘force’.

Vasek

1a is a hard situation. The job told PBS that it will take a certain amount of time, and you want to take a node away from it before it is over. What should PBS do? While it is not the easiest solution, you can do a qalter -lwalltime and shorten the job. This will allow you to submit the reservation at the right time.

2a is tricky. You are not the first person to complain about PBS not allowing you to submit a reservation on a down node for maintenance purposes. Your idea of a reservation with force is interesting, but complicated. What would force mean? Do you want to allow the scheduler to confirm a reservation on a node a job is running on? Or is it only for down or offline nodes? If it is the former, you will want a relatively complicated node search where you first grab as many free resources as possible, and then make a second pass to pick up as few used resources as possible. If it is just down or offline nodes, it is easier, but still not straightforward. You would still probably want to try to satisfy the node solution with up and online nodes before you tried any nodes that are offline or down.
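
To make that ordering concrete, here is a simplified, single-pass sketch of the preference (free and up nodes first, down or offline nodes only as a last resort). The node dictionaries and field names are assumptions for illustration, not the scheduler’s real data model, and the real search would be more involved.

# Illustrative only: order candidate nodes so free/up nodes are used first and
# down/offline nodes are touched only when the request cannot otherwise be met.
def order_candidates(nodes):
    def key(node):
        state_rank = {'free': 0, 'job-busy': 1}.get(node['state'], 2)
        return (state_rank, -node['free_ncpus'])
    return sorted(nodes, key=key)

def pick_nodes(nodes, ncpus_needed):
    picked, have = [], 0
    for node in order_candidates(nodes):
        picked.append(node['name'])
        have += node['free_ncpus']
        if have >= ncpus_needed:
            return picked
    return None  # cannot be satisfied even with down/offline nodes

nodes = [
    {'name': 'node01', 'state': 'free', 'free_ncpus': 16},
    {'name': 'node02', 'state': 'down', 'free_ncpus': 16},
    {'name': 'node03', 'state': 'offline', 'free_ncpus': 16},
]
print(pick_nodes(nodes, 32))  # ['node01', 'node02']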

I guess a third option is to only allow force to work with chunks that have host/vnode in it. This seems rather limiting.

I really think something like reservations is the right solution here. Just having an attribute (or pair of attributes) for the next maintenance doesn’t allow for multiple maintenance windows which are close together. Also you have to do either a complex qmgr command or many qmgr commands where one pbs_rsub will do.

When you want to ignore the scheduler for a job, you do a qsub -H. Maybe something similar here for pbs_rsub? You would have to be smart enough to degrade the reservation if any nodes are down or offline.

Bhroam

I really like this idea (assuming it is restricted to PBSPro managers or maybe operators) – it would solve the maintenance issue nicely.

Two things to think about here:

  1. What should happen if any overlapping running jobs are not done by the time the reservation starts… Since the reservation was made by a Manager/Operator, an easy choice would be to start the reservation and let the Manager/Operator handle the potential oversubscription manually, e.g., by waiting for the job(s) to finish, suspending them, or killing them.

  2. What should happen if there are overlapping reservations already in the system? Again, one could decide that it’s the Manager/Operator’s responsibility to address and go ahead and oversubscribe the resources, but I’m not sure that PBS Pro would support this in the current code paths. Note that it may be problematic to ask a Manager to delete a future occurrence of a standing/recurring reservation, so more thought might be needed for this case.

Thx!

@billnitzberg Yes, the ‘force’ option would be only for managers or operators.

  1. I think the manager/operator should decide what should happen with overlapping running jobs; the forced reservation does not need to handle this. We can simply let the overlapping jobs run, and once the reservation begins the manager will decide what to do.

  2. Since we would let the overlapping jobs run, I would say the correct way is to oversubscribe the previous reservation and leave it untouched, but there is also the question of whether one ‘forced’ reservation should oversubscribe another ‘forced’ reservation. This seems rather complicated for the scheduler.

@bhroam I think it is not necessary to prefer up nodes over down ones when nodes are selected with ‘force’. Since it is only for managers/operators, we can assume they know what they are doing.

One more thought: how about adding something different? I mean not a ‘reservation’ but a real ‘maintenance’ object: reservation-like, but it would not allow jobs to be submitted into it (is that a problem? sometimes it might be useful to submit a job to a node in maintenance), and it would oversubscribe everything. Would that be easier for the scheduler?

Vasek

@vchlum
The maintenance object is an interesting idea. Unlike a reservation, it would not need to go to the scheduler for confirmation. We could treat it like a qrun -H where you give a +'d list of nodes. The scheduler would need to be aware of them so as not to run jobs that would cross into them. If we provided the ability to say ‘all’, it could obsolete dedicated time.

The way we run jobs in dedicated time is that there is a set of queues that all have the same prefix (‘ded’ by default). Any job in a dedicated time queue can’t run unless we are in dedicated time. We could do something similar, or just create a queue like we do with a reservation. The only reason I went with a dedicated time prefix was so I didn’t have to add a queue name to the sched_config file. Queues can be added and deleted willy-nilly by qmgr; I didn’t want lingering queue names in a file.

The scheduler will need to be smart enough to understand when there are overlapping reservations. While the scheduler would deny reservations that were attempted over a maintenance window, any reservations that are already confirmed would still overlap.

The question comes back to reservations or these new maintenance objects. They are very similar in nature. I don’t want to create the same feature in PBS twice, but are they similar enough that overloading reservations for maintenance is good?

On one hand, there is a whole lot of machinery surrounding reservations that we wouldn’t need for maintenance objects. Maintenance objects don’t need to be confirmed or degraded/reconfirmed. On the other hand, a maintenance object is a set of resources blocked out for a certain period of time. One that only a certain set of users can run work in. That is basically the definition of a reservation.

My opinion is that reservations and maintenance objects are a little too similar to make both. What do you think?

Bhroam

@vchlum
I thought I’d poke this conversation to see if you had more thoughts on the topic. It kind of died down over the holidays. Now that we’re all back, it’s a good time to pick it back up.

Thanks,
Bhroam

FWIW, We have been using the equivalent of forced reservations for years. That is, we create the reservation with a desired node list and then immediately confirm it ourselves with the same node list, before the scheduler sees it. (We have a patch so the server doesn’t get upset by multiple confirmations.)

Just recently, we started a process that might result in conflicting reservations. So far, that hasn’t happened. Could be interesting when it finally happens :slight_smile:

Sorry for the delay @bhroam. I had long holidays this year:).

Although it would be very nice not to have to go to the scheduler for confirmation, I agree that reservations and maintenance objects would be too similar. Based on what was written, I believe the suggested ‘force’ option to pbs_rsub is the suitable approach, and the improved reservation could be used for maintenance.

Hey @vchlum,
I’ve been thinking about this more with your use cases in mind. There might be a hybrid approach that gets us everything we want and is easier to implement.

The issue with forcing a reservation is it is more flexible than is required for maintenance. For maintenance you want to take whole nodes down. If we do a reservation with force, we’d have to take care of cases where we want to allocate smaller requests. It also has issues with down nodes that the server has no resources for yet.

What if we create a new command that just takes a list of hosts for maintenance. Internally the server will create a reservation and internally confirm it. All of the current reservation code will be used to start and stop it. The scheduler doesn’t need to be modified at all since it will see these as confirmed reservations.

The only two hitches are multi-vnoded machines and degraded reservations. The first can be solved by the server. It knows how many vnodes are on the machines and can craft a select statement for the reservation. The other can be handled either by crafting a select properly that requires the nodes requested, or by adding a bit saying to ignore these reservations when confirming/reconfirming reservations.
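
For the first hitch, the select statement the server crafts could look roughly like the sketch below. The helper name and the host-to-vnode mapping are hypothetical, not existing PBS code; the point is just that pinning every vnode of the requested hosts keeps the reservation on those exact machines.

# Hypothetical sketch: build a reservation select statement that pins every
# vnode of the requested hosts so the maintenance covers the whole machines.
def maintenance_select(host_vnodes):
    # host_vnodes maps a host name to the list of its vnode names
    chunks = []
    for host, vnodes in host_vnodes.items():
        if len(vnodes) <= 1:
            chunks.append("1:host=%s" % host)
        else:
            chunks.extend("1:vnode=%s" % v for v in vnodes)
    return "+".join(chunks)

print(maintenance_select({'node01': ['node01'],
                          'node02': ['node02[0]', 'node02[1]']}))
# -> 1:host=node01+1:vnode=node02[0]+1:vnode=node02[1]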

This saves us from reimplementing something very close to reservations, but it also saves us from implementing something more flexible than we need at this time.

Bhroam

A “force” option for reservations is a really good option – it extends an existing interface in a natural way, and captures the use cases for maintenance nicely (including multiple maintenance windows, separate maintenance on different parts of the system, etc).

Hey @dtalcott – perhaps you could share your code with @vchlum as a possible starting point for a (slightly more) general version of the enhancement?

As far as the various minor issues that need to be addressed… some can be left for the manager (delete or empty other overlapping reservations, kill jobs that are still running when the reservation starts or is made, etc.), some may be able to be ignored (nodes without resources reported to the server are probably a very rare exception and the manager can just set them offline to handle the maintenance, etc.). And, for maintenance, one only cares about whole nodes (I think), so one could require “node exclusive” when using the force option to simplify the implementation (while still handling the use case well).

@bhroam @billnitzberg
The hybrid approach seems like a good idea. I tried to think of a case where I would ever want to force a reservation without the exclusive node option, and I do not see such a use case. I think multi-vnoded nodes could be simply solved by adding -l place=exclhost to the “reservation” by default. It also makes sense for the server to confirm the maintenance internally and immediately.

With the hybrid approach and the new command (let’s say pbs_msub), new commands analogous to pbs_rstat and pbs_rdel should also be added, but I think these new commands (pbs_mstat and pbs_mdel) could be too similar to their reservation counterparts. Do we want such similarity? I don’t like the idea of adding pbs_msub without also adding pbs_mstat and pbs_mdel, because that could be misleading even for managers.

What if every reservation created by pbs_rsub with the ‘force’ option were internally and immediately confirmed by the server without waiting for the scheduler (so we would not need to modify the scheduler either)? I think it could work… or maybe some obscure forced reservation could break something?

Definitely, some issues can be left to the managers. In particular, I think it is not appropriate to automatically affect overlapping running jobs. I believe we can assume that the manager knows what he/she is doing in such cases.

@dtalcott Sounds interesting, and I would definitely have a look at the patch if you are willing to share it.

Vasek

The code is straightforward. Using whatever logic is appropriate for your use case, build a list of nodes and resources you want to reserve (e.g. ncpus). Then stop the scheduler (so it doesn’t sneak in and reject the reservation before you can confirm it [unlikely, but just in case]). Make the reservation, then confirm it. Then restart the scheduler if you stopped it. I’ll paste the pertinent code below.

Note that this code is currently running against a PBS 13 server and scheduler. Will be upgrading to PBS 18 soon. Will need mods to deal with multi-sched. Also, note that the pbs_confirmresv() call is not normally available to python, but a tweak to the swig input file takes care of that.

def make_confirm_resv(conn, attrd, nodes, ninfo, start):
    """Make and confirm a reservation.

    Make a reservation from a dictionary of attribute values.

    Args:
        conn = Connection to server
        attrd = Dictionary of attributes
        nodes = List of desired nodes
        ninfo = Dict giving extra info for nodes
        start = Epoch time for reservation to start
    Returns:
        Name of reservation on success, else None
    """
    if conn == None or attrd == None or nodes == None or ninfo == None:
        logging.warn("Bad call to make_confirm_resv")
        return None
    start = int(start)
    # We need to stop scheduling so we can confirm the reservation ourselves.
    bu = BatchUtils()
    atl = bu.list_to_attrl(['scheduling'])
    t = pbs_statserver(conn, atl, None)
    saved_sched = t[0]['scheduling']
    if saved_sched != 'False':
        atl = bu.dict_to_attropl({'scheduling': 'False'})
        t = pbs_manager(conn, MGR_CMD_SET, MGR_OBJ_SERVER, '', atl, '')
        if check_pbs(t, conn, "stopping scheduling"):
            return None
    # Copy the attribute list, adding a select attribute for desired nodes.
    nattrd = attrd.copy()
    nattrd['Resource_List.select'] = nodes_to_select(nodes, ninfo)
    rname = nattrd['Reserve_Name']
    # Make sure the start time is in the future
    earliest = int(time.time() + 10)
    if start < earliest:
        start = earliest
    nattrd['reserve_start'] = str(start)
    # Create the reservation
    logging.debug("Making reservation %s at %d" % (rname, start))
    resvid = make_resv(conn, nattrd)
    if resvid != None:
        # Now confirm the reservation
        confirmlist = nodes_to_confirm_string(nodes, ninfo)
        rc = pbs_confirmresv(conn, resvid, confirmlist, start, 'PBS_RESV_CONFIRM_SUCCESS')
        check_pbs(rc, conn, "confirm reservation %s" % resvid)
        if rc:
            resvid = None
    # Restore scheduling
    if saved_sched != 'False':
        atl = bu.dict_to_attropl({'scheduling': saved_sched})
        t = pbs_manager(conn, MGR_CMD_SET, MGR_OBJ_SERVER, '', atl, '')
        check_pbs(t, conn, "restoring scheduling")
    return resvid

def make_resv(conn, attrd):
    """Make a reservation given an attribute dictionary.

    Use the attributes to create a reservation.

    Args:
        conn = Connection to server.
        attrd = Dict of attribute name, value pairs.
    Returns:
        id of reservation on success, else None
    """
    bu = BatchUtils()
    atl = bu.dict_to_attropl(attrd)
    result = pbs_submit_resv(conn, atl, None)
    rc = pbs_Errno()
    if result == None or rc:
        rc = check_pbs(rc, conn, "submit_resv")
        return None
    if not 'CONFIRMED' in result:
        return None
    resvid = result.split()[0]
    return resvid
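
For completeness, driving the helper looks roughly like this. This is an illustrative sketch only: the connection handling, the attribute values, and the ninfo layout are assumptions about the surrounding script (which also defines nodes_to_select, nodes_to_confirm_string and check_pbs), not a documented PBS interface.

import time

# Illustrative driver: connect to the default server, describe the maintenance
# window, and let make_confirm_resv() create and self-confirm the reservation.
conn = pbs_connect(None)                    # swigified IFL call; None = default server
attrd = {
    'Reserve_Name': 'maintenance',
    'reserve_duration': str(4 * 3600),      # four-hour maintenance window
}
nodes = ['node01', 'node02']
ninfo = {n: {'ncpus': 16} for n in nodes}   # per-node info consumed by the helpers
resvid = make_confirm_resv(conn, attrd, nodes, ninfo, time.time() + 600)
print(resvid)
pbs_disconnect(conn)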

@dtalcott flipped the issue on its head. It’s like the hybrid approach, but uses reservations directly. It submits a reservation, and then has an external entity confirm it. This means neither the server nor the scheduler needs to be modified. I like it. It saves us from sending the new reservation to the scheduler to get it confirmed. The new command will do that directly. If we confirm the reservation on down nodes, it will get degraded. As long as we craft the select statement properly, it will never be moved because it requests specific hosts.

The only downside is that to do this in Python, you need to use a swigified version of the IFL library (like @dtalcott did). We do ship this for our hooks code, but I don’t believe it is directly available via pbs_python (@bayucan would have to comment on this).

@vchlum what do you think about doing this completely outside of PBS?

Bhroam

@bhroam Regarding not modifying the scheduler: does the scheduler deal with overlapping reservations correctly? My concern is especially about the calendar.

@dtalcott’s approach is interesting. I like the solution since it is not invasive. On the other hand, I feel like the feature would live outside of PBS, and I think the maintenance feature should be tightly integrated with PBS. This is just my feeling, though. I personally consider the maintenance feature to be important, but ultimately it is only as important as the community needs it to be.

@vchlum you have made good points. The scheduler will most likely run both reservations at the same time. This means if a node is available at the start of a cycle with two overlapping reservations, it will likely be used by both reservations and oversubscribed. The end result is that a job will be running on a node that is going to be taken down for maintenance.

I’m not sure how to fix this. If a reservation is in the system and confirmed and another comes in and is confirmed on top of it, we could immediately mark the original degraded and attempt to reconfirm it. We’d have to make a decision on what to do with the overlapping nodes if we couldn’t reconfirm it. We could either add a priority to reservations which say one is more important than the other, or say the more recent reservation always wins. This would only work since the only way we get overlapping reservations is with our new maintenance command. The scheduler will never overlap reservations.

In any case, there needs to be daemon code changed. I’m still not sure I mind the majority of this RFE being in a command. It’d be a supported PBS feature, it’d just be that the implementation would be part of a PBS command rather than code in the server.

Bhroam

@bhroam OK, let’s do the new PBS command in Python. I will prepare a draft of the design doc. Is it OK to use the name ‘R<id>.<hostname>’ for the maintenance? I don’t mind using it. On the other hand, we could use something like ‘M<id>.<hostname>’…

I think adding a priority is a good way to resolve the overlap problem. This priority could be opaque to users and admins; it would simply be higher for maintenance and lower for regular reservations. I think the most-recent-reservation-wins solution is not ideal, because if somebody submitted a regular reservation after and over a maintenance one, they would actually be able to run a job during the maintenance window.

Vasek

I don’t mind the maintenance reservations having the usual ‘R’ prefix. I wouldn’t mind them having something like ‘M’ either, but I’m not sure it is necessary.

If we went with the most recent reservation wins method, we wouldn’t have a problem with user reservations winning. The scheduler will not confirm an overlapping reservation. The only way you can get an overlapping reservation is with the new maintenance command. I like and dislike this idea. I like it for the simplicity. We don’t have to add any extra ‘this one wins over that one’ code to the server. When a new reservation is confirmed and overlaps, it wins. I dislike it because future bugs/features could fall into a trap where they unintentionally overlap and win over maintenance reservations.

How would we handle the overlap? My thought is to have the server check for the overlap at confirmation time and rewrite the resv_nodes for the losing reservation. It could then mark the losing reservation as degraded and trigger a scheduling cycle to reconfirm it.
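
Roughly, the confirmation-time check I have in mind would behave like the sketch below. This is plain Python over toy dictionaries purely to illustrate the logic; the real implementation would live in the server code, and the field names are not actual server structures.

# Illustrative only: when a new (maintenance) reservation is confirmed on top
# of an existing confirmed reservation, drop the shared nodes from the losing
# reservation, mark it degraded, and flag it for reconfirmation.
def resolve_overlap(new_resv, existing_resv):
    times_overlap = (new_resv['start'] < existing_resv['end'] and
                     existing_resv['start'] < new_resv['end'])
    shared = new_resv['nodes'] & existing_resv['nodes']
    if not times_overlap or not shared:
        return existing_resv                     # nothing to do
    existing_resv['nodes'] -= shared             # rewrite the loser's resv_nodes
    existing_resv['state'] = 'degraded'          # mark the losing reservation degraded
    existing_resv['needs_reconfirm'] = True      # trigger a cycle to reconfirm it
    return existing_resv

maint = {'nodes': {'node01', 'node02'}, 'start': 1000, 'end': 2000}
user = {'nodes': {'node02', 'node03'}, 'start': 1500, 'end': 2500, 'state': 'confirmed'}
print(resolve_overlap(maint, user))  # node02 is removed and the reservation is degraded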

Bhroam

@bhroam Yes, right. The most-recent-reservation-wins approach is safe, and it makes sense to mark the losing reservation as degraded as soon as possible. I like your way of handling the overlap and I do not see a better one. The EDD draft is updated.

Vasek