I have created the design proposal for making reservations work for Multi-Server. We have divided the work into two phases due to the reasons mentioned in the design document. A separate discussion will be opened up for phase-2 in the future.
Requesting the community to take a look at the design and provide feedback.
I just read this briefly and need to go back through it in more detail, but my initial comment is that I believe your assumption about most reservations fitting into a single partition is not correct for our facility. Here are some thoughts:
This is specific to us, but our mission in life is to run large jobs. We get graded by DOE on the percentage of our jobs that are specific percentages of the machine. Single jobs consuming the full machine are not uncommon. Jobs consuming 2/3 of the machine are quite common. Reservations over 1000 nodes are quite common.
You did not address maintenance reservations. Again, maybe this is us, but we place full machine reservations when we do maintenance.
Multi-server is meant to enable large scale. We are likely going to be one of the sites that takes advantage of that, but if I understand your proposal, having only phase 1 reservation support would actually prevent us from using multi-server, because we would not be able to use reservations the way we need to. Or at the very least it would be extremely inconvenient if we ended up having to create a bunch of separate reservations. You did not mention a phase 2 timeline. If that was going to come immediately after, and this was just to get some functionality out faster, maybe it works, but it sounded more like you were hoping you would never have to implement phase 2…
Off the top of my head, I like the idea you mentioned about making one server the owner of the reservation, and preferring that to be an instance that can handle the entire reservation; but in the exceptional case where it cannot, have it be the “master” and the other instances be “workers”. The master relays, coordinates, and aggregates reservation-related commands and data.
I will talk to our team and see if we can put together some more detailed feedback, but I wanted to put my concern on the radar sooner rather than later.
Before making the assumption that the most common reservation is a few hundred nodes, I’d poll the field team. They will have real world customer knowledge. @weallcock has already pointed out that this is not the case at his site.
You list job-based reservations. If you go ahead with the phased approach as you have it, I think you’ll need to not support reservations that are requested through qsub. These are jobs that are converted into a reservation after the job runs. If the scheduler chooses to run a multi-node job across servers, what will you do with it when it hits the hook? Reject the conversion?
I don’t quite understand how you are going to submit a job in a reservation to the server the reservation is on. qsub/IFL doesn’t know if a job is in a reservation or not. You say you are going to broadcast to all the servers, but how does qsub/IFL know if it needs to? Right now a reservation queue is in the form [RSM]NNNNN, but that might not always be the case. Are you going to submit all jobs to all servers? That doesn’t sound scalable.
I don’t think phase 1 can be done as you have written it. What you are suggesting is to submit a reservation to a random server, and then see if it can be confirmed on only that server. That means a reservation will need to be submitted multiple times to see if it can be confirmed. If you want to colocate a reservation to a server, you will need to check all the servers, and then possibly move the reservation from one server to another. This will be a new type of move request because you currently can’t move reservations.
As for whole system maintenance reservations, there is still scheduler-side dedicated time. This will take the whole system down for maintenance. It just isn’t as useful or convenient as maintenance reservations.
Yeah, I think we should make this part of phase 1, or a phase 1.5 to follow immediately after phase 1.
Just thinking out loud: is it possible for us to add an option to pbs_rsub which instructs the server to book all the nodes that it’s aware of? Then the user could just say pbs_rsub --all or something similar, IFL would broadcast it to all servers, each of which would book its known list of nodes, and pbs_rsub would return the ids of the reservations created. @bhroam and @suresht what do you think?
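To make the idea concrete, here is a minimal sketch of how such a broadcast might behave. Everything here is hypothetical: the `--all` option, the `Server` class, and `create_reservation()` are illustrative stand-ins, not the real PBS IFL API.

```python
# Hypothetical sketch of a "pbs_rsub --all" broadcast. Server and
# create_reservation() are stand-ins, not real PBS interfaces.

class Server:
    def __init__(self, name, nodes):
        self.name = name
        self.nodes = nodes            # nodes this instance knows about
        self.reservations = {}

    def create_reservation(self, request):
        # "--all" semantics: book every node this instance knows about.
        resv_id = "R{}.{}".format(len(self.reservations), self.name)
        self.reservations[resv_id] = list(self.nodes)
        return resv_id

def rsub_all(servers, request):
    # IFL-style broadcast: one request in, one reservation id per instance out.
    return [server.create_reservation(request) for server in servers]

servers = [Server("svr0", ["n0", "n1"]), Server("svr1", ["n2"])]
print(rsub_all(servers, {"name": "maint"}))  # prints ['R0.svr0', 'R0.svr1']
```

The user gets back one reservation id per instance, matching the “return ids of the reservations created” behavior suggested above.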
I (now) feel that this does not have to depend on how large the reservation is going to be. (a few hundred nodes or much larger, spanning partitions). I would still propose that even in the case of very large reservations, it makes sense to think of the reservation as a “sharded” entity rather than a “shared” entity.
The reason we were originally thinking about reservations as shared objects is the way they are implemented today, i.e. backed by a queue; and since queues are shared entities in the multi-server design, why not reservations too.
Setting aside whether we are able (or agree) to change that implementation to detach it from queues, we need to step back and look at reservations as different from queues. Conceptually, a reservation is just a “resource block” to me; in other words, a request for resources from a user, much like a large multi-node job (except, of course, that it can run jobs inside it, has a start/end time associated with it, etc.).
Hence, it seems to me that it aligns more towards being a “sharded” thing rather than a “shared” thing. This keeps the implementation simple. Sure, if it is large, it will need to span multiple servers, but that should work just like we shard multi-node jobs. We only let the partitions that deal with the reservation know about it (not all partitions to start with).
The point here is that even if the reservation spans multiple partitions, we still consider only one server-instance to be the actual owner of the reservation. This is the server-instance that will drive the hooks related to that reservation’s events, send out emails, etc.; a single instance doing these is easier than multiple server-instances needing to be kept in sync.
And then, if we agree to change the reservation implementation to not be backed by a single queue, then we can allow job submissions to go to any server-instance (it won’t need to be sent only to the server-instance on which the reservation queue resides).
Or, perhaps, the best thing to do is to make that change to the reservation functionality before anything else…
Try to fit in one partition first?
This is more of a scheduler optimization, which may benefit some sites but not others. The point is, if the reservation can fit entirely in one partition, then why scatter it all across (since that is more complex to manage and handle)? However, if a site like ANL finds that such a scheduler optimization does not make sense for them (since most of their reservations will span multiple partitions), then they should be able to flick a switch to turn that scheduler optimization off.
@bhroam - the idea is to assume that it is a reservation, and use a hint to try to get to the correct server-instance on the first attempt. Then, in the case that the queue is not a reservation queue, the server-instance rejects it and we iterate over the servers (the same thing we do when trying to locate jobs). The hope is that the hint works most of the time, and we pay the penalty only in a small number of cases.
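A rough sketch of that hint-then-iterate routing, assuming (as the thread notes) that today’s reservation queue names look like [RSM]NNNNN. The `Server`, `guess_owner`, and `try_submit` names are invented for illustration; they are not real IFL calls.

```python
import re

# Heuristic from the thread: reservation queue names currently look
# like [RSM]NNNNN, though that might not always be the case.
RESV_QUEUE_RE = re.compile(r"^[RSM]\d+$")

class Server:
    """Stand-in for a server instance, not real PBS internals."""
    def __init__(self, name, queues):
        self.name = name
        self.queues = set(queues)

    def try_submit(self, queue, job):
        # A real instance would reject a job for a queue it doesn't own.
        return queue in self.queues

def guess_owner(servers, queue_name):
    # Hypothetical hint, e.g. hashing the reservation id to an instance.
    return servers[hash(queue_name) % len(servers)]

def route_submit(servers, queue_name, job):
    if RESV_QUEUE_RE.match(queue_name):
        # Looks like a reservation queue: try the hinted owner first.
        hinted = guess_owner(servers, queue_name)
        candidates = [hinted] + [s for s in servers if s is not hinted]
    else:
        candidates = list(servers)
    for server in candidates:   # fall back to iterating, as with locating jobs
        if server.try_submit(queue_name, job):
            return server
    raise LookupError("no server instance owns queue %s" % queue_name)
```

Most submissions would land on the right instance on the first try; the worst case degenerates to the existing iterate-over-servers behavior.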
However, a much better approach would be to detach reservations from needing this backing reservation queue. Then, jobs submitted to a reservation would not need to land on only one server-instance; they could land on any server-instance, since the scheduler is the one that will make sense of the job’s “reservation submitted to” attribute…
@weallcock our goal is absolutely to support reservations spanning across partitions (like we would support the same with multi-node jobs). It is just a step-by-step approach. The moment we have the first phase working, we will start the spanning work (this is what we did with multi-node jobs as well: we made them work in only one partition, tested it out, and are now making them span). However, exactly when we will finish phase 2 is not yet known.
This chimes exactly with what we are thinking. (In fact, this is how our multi-node job design currently is, and I hope we can simply follow the same approach.)
In the “simplest” case, contain it in one server-instance only
In the “spanning” case, keep one server-instance as the “master/owner”; this server-instance drives hook events, sends emails, etc. Other server-instances merely poke the scheduler if any nodes owned by them but associated with this reservation go down, so that the scheduler can “reconfirm” it
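The owner/worker split above can be sketched minimally as follows. All class and method names here are illustrative assumptions, not real PBS internals.

```python
# Sketch of the proposed owner/worker split for spanning reservations.
# Scheduler, OwnerInstance, and WorkerInstance are hypothetical names.

class Scheduler:
    def __init__(self):
        self.reconfirm_requests = []

    def reconfirm(self, resv_id):
        # Re-run confirmation for a reservation whose nodes changed.
        self.reconfirm_requests.append(resv_id)

class OwnerInstance:
    """The single owner: drives hook events and sends emails."""
    def __init__(self):
        self.events = []

    def run_hook(self, event):        # e.g. a confirm or end event
        self.events.append(event)

class WorkerInstance:
    """A non-owner instance: only reports its node failures."""
    def __init__(self, scheduler):
        self.scheduler = scheduler

    def node_down(self, node, resv_id):
        # Don't drive hooks or emails here; just poke the scheduler
        # so the reservation can be reconfirmed on other nodes.
        self.scheduler.reconfirm(resv_id)
```

The point of the split is that only the owner ever needs to be kept authoritative; workers carry no reservation state beyond which of their nodes belong to it.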
I’m all for iterative development. But it would have been better if we had the whole design out. Right now, you do not convey how you are going to accomplish phase 2. If the design of phase 2 invalidates the assumptions you have made for phase 1, we might have to re-think it all over again!
For instance, you have mentioned that reservations will be attempted to fit in any one of the server partitions. Today the server does not have the capability to move reservations from one partition to another. If you build this for phase 1, and we learn in phase 2 that spanning reservations are performant and consistent enough, and decide the scheduler does not need to make this extra effort of fitting the reservation into one of the non-local partitions, that could make the effort spent on moving reservations wasteful.
This is one such scenario. I would love to see a consensus on the overall design and make sure that sharding the reservation makes sense as the first step towards that overall goal.
I think the overall strategy of how to span across partitions is already mentioned.
I do not feel that spanning is a lot of extra work. We only have to send information to other server(s) if:
The reservation needs to span
And only at the confirmation and end of a reservation, which are not high-volume events
On the contrary, in a shared approach, we would do this for:
Reservations of any size, no matter whether they fit in one partition or not
Making the information available on all partitions (create/modify via IFL on all partitions)
The simple reason I suggest we shard reservations is that there exist cases where reservations will actually fit in one partition. Consider the case of jobs converting to reservations, which some large sites use. Those jobs are typically multi-node jobs, and we are trying to fit the job in a single partition. Now when this job becomes a reservation, what do we do? If we go with a shared approach, we will want to create this reservation on all the instances anyway. Quite wasteful, when you know that the job already fit in one partition.
Besides, I think the volume of large reservations will be much smaller than that of smaller ones. Thus, in a sharded approach, we try to fit a reservation in the local partition; if it does not fit, it will span, and then we do that extra communication at start and end only.
Now, implementation-wise, I think the work is quite similar (I do not see much difference in performance anyway). In a shared approach, IFL will iterate through and create/update the reservation at ALL instances. In the sharded approach, the duplication will happen at the time the scheduler confirms the reservation. (What’s the difference? Ballpark, the work done by the servers/scheduler is the same.)
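A back-of-envelope way to see the comparison; the counts are purely illustrative and the function names are made up, but they capture the claim that the spanning case costs about the same while the fits-locally case is much cheaper when sharded.

```python
# Illustrative message counts only; no real PBS semantics.

def shared_update_count(n_instances):
    # Shared: IFL creates/updates the reservation at ALL instances,
    # whether or not it fits in one partition.
    return n_instances

def sharded_update_count(n_instances, spans):
    # Sharded: one create locally; duplication happens only when the
    # scheduler confirms a reservation that actually needs to span.
    return n_instances if spans else 1

print(shared_update_count(8))           # 8, regardless of fit
print(sharded_update_count(8, True))    # 8, only when spanning
print(sharded_update_count(8, False))   # 1, when it fits locally
```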
The advantages of the sharded approach are:
It allows for a possible optimization to fit the reservation (and the resulting jobs) in one single partition. I see this as a huge benefit. If we allow a small reservation to span randomly, then multi-node jobs inside it will also span (since they may not fit in one portion of the reservation), so it seems natural to me to extend the benefit to reservations as well.
Having one owner/master for the reservation (even in the case of reservations spanning partitions) also seems architecturally simpler to me. (Shared objects are a problem, as we know already.)
Consider the scenario where one partition goes down. If we keep a reservation concentrated, there is less risk of the reservation (and all the jobs inside it) becoming unavailable than if it spanned that partition. The point being: spread only if you need to, else stay localized (an obvious architectural benefit).
I actually feel this can happen both ways. To make it shared we have to do work as well: make changes in IFL to send create/update/delete commands to all instances. If we change the approach later, we will need to change that too. So either way, if we change approaches, we will need to make changes.
I just want to comment here as we were an early adopter of multi-scheduling. Sadly I don’t use reservations often (most commonly for maintenance work, and even then I’m just using basic reservation functionality… none of the more recent features specifically targeted at maintenance reservations).
I’ve literally gone through the exercise of reconfiguring queues, nodes, and partitions before, just to move the hardware I want to take offline with a reservation into the default scheduler (moving the hardware that was in the default scheduler out), explicitly so I could create that reservation. It’s a pain (and I’ve screwed things up while doing it too, losing fairshare history in the process). I look forward to the day I can just make a reservation on hardware that’s aligned with a non-default scheduler.
For me today, jobs never span partitions. However, my partitions can be quite large (my largest is thousands of nodes). I’ve never wanted a reservation to span a partition, but I do see where it could be beneficial for maintenance reservations.