PP-337: Multiple schedulers servicing the PBS cluster

Ahh… I apologize. For reservations there is no such thing as “running it sooner”, because reservations have their own start and end times. Thanks @bhroam for pointing this out :slight_smile:
With this in mind, I think the server can just mark the reservation as confirmed as soon as it gets the reservation nodes from a scheduler.

Hey @arungrover

I was never really fond of the current design where the user has to specify which partition they want to submit their reservation to. We abstracted partitions out by assigning them to queues and having users submit to queues. This goes in the wrong direction.

I like your idea of having the server send the reservations to all the schedulers to try and confirm it. I’m not sure which direction I like better: sending them to all the schedulers in parallel or one at a time. Sending them in parallel will cause race conditions and won’t be consistent. This will make things much harder to test. It will be much faster though. The opposite is true of sending a reservation to the schedulers one at a time. It’ll be easy to test, but slower.

@billnitzberg is correct about the initial vision for multisched, where we don’t have the admin do the partitioning. It is done by PBS when the need arises. Having PBS figure out where to confirm the reservations is consistent with this vision.

So all that said and in the agile spirit, I suggest we go forward with that.

Bhroam

I am facing a problem while trying to change the PBS server to send reservation requests to all schedulers and then have one of the schedulers confirm the reservation.
The problem is that schedulers can be configured to serve more than one partition in the complex. Now, if the server sends a reservation request to such a scheduler, with the present code the reservation may get confirmed on nodes from either one partition or a combination of partitions. This behavior does not match how jobs run in a multi-sched complex. All jobs run on nodes belonging to the same partition.

I want to know how important it is for admins who are partitioning the system to have a reservation confirmed on a set of nodes that all belong to the same partition.

If it is important for admins to place the reservation on nodes belonging to the same partition, then we will have to change the scheduler so that it confirms the reservation only on nodes that all belong to the same partition. We will also have to store this partition name on the reservation so that, if the reservation later needs reconfirmation (because of ralter or a degraded state), it gets the new nodes from the same partition.
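To make this option concrete, here is a minimal sketch of picking reservation nodes from a single partition. It is not scheduler code: `pick_nodes_in_one_partition`, the node-to-partition dict, and counting nodes instead of matching resources are all assumptions made for illustration.

```python
from collections import defaultdict


def pick_nodes_in_one_partition(free_nodes, needed):
    """Return (partition, nodes) using nodes from one partition only, or None.

    free_nodes: dict mapping node name -> partition name (hypothetical input).
    needed:     number of nodes the reservation asks for.
    """
    by_partition = defaultdict(list)
    for node, partition in free_nodes.items():
        by_partition[partition].append(node)

    # Try partitions in a fixed order so the result is deterministic.
    for partition in sorted(by_partition):
        nodes = sorted(by_partition[partition])
        if len(nodes) >= needed:
            return partition, nodes[:needed]

    # No single partition can hold the whole reservation.
    return None
```

The returned partition name is what would be stored on the reservation for later reconfirmation.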

If it is not important to have the nodes from the same partition, then we can just store the name of the scheduler that confirmed the reservation. This will again let the server know which scheduler to relay the reservation request to in case the reservation needs to be reconfirmed.

What do others think?

This is really for the PMs to answer, but I think that we could start with just restricting it to the sched, letting it schedule it however it wants across its possibly multiple partitions. Then, if users want to restrict it to be inside a partition, we can make it happen later, maybe with a sched attribute.

Interesting… my interpretation of having a multi-sched handle multiple “partition names” wasn’t that jobs did not span these, but was just a convenient way to specify one big partition by using multiple names. Does PBS actually specify that jobs will not be split across partition names when they are serviced by a single scheduler?

In its current form, PBS does specify that jobs do not span partitions, even when multiple partitions are serviced by the same scheduler.
Here is an excerpt from our admin guide -

Each multisched schedules only from the queue(s) in its partition(s), and only to the vnode(s) in its partition(s). Jobs do not span partitions, even if more than one partition is scheduled by the same multisched.

Since that is explicitly specified, it seems like it should apply to reservations too.

I agree with @billnitzberg. We should do the same thing with reservations as we do with jobs. If we didn’t, it would get really confusing for jobs in a reservation that is split across partitions. Would those jobs have to run within the nodes of only one partition, or would they span partitions on the nodes of that reservation? This isn’t a problem if the reservation is only in one partition.

As for reconfirming, I think we should once again send the request to all schedulers. There was nothing special about the partition that reservation was confirmed on. It was just the random scheduler that answered the server first. In the days before multisched, when we reconfirmed a reservation, it could get completely different nodes. There is no reason we should change this. If we kept the reservation to the same partition it was confirmed on, then it might not be able to be confirmed while other partitions sat empty.

Now this all goes out the window when we start reconfirming running reservations. Jobs might be running on the nodes that are up. It is very important for a running reservation to keep all of its nodes that are up, and we only find other nodes to replace those that are down. This means we’ll need to reconfirm it on the same partition it is currently on.
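A rough sketch of that rule for running reservations, under the assumption that nodes are interchangeable and identified by name. `reconfirm_running` and its arguments are hypothetical names, not anything in the PBS code base.

```python
def reconfirm_running(resv_nodes, down, partition, free_nodes):
    """Keep the nodes that are up and replace only the down ones.

    resv_nodes: list of node names currently assigned to the reservation.
    down:       set of node names that are down.
    partition:  partition the reservation is currently on.
    free_nodes: dict mapping free node name -> partition name.
    Returns the new node list, or None if it cannot be reconfirmed.
    """
    kept = [n for n in resv_nodes if n not in down]
    missing = len(resv_nodes) - len(kept)

    # Replacements may only come from the reservation's own partition.
    candidates = sorted(
        n for n, p in free_nodes.items()
        if p == partition and n not in down and n not in kept
    )
    if len(candidates) < missing:
        return None
    return kept + candidates[:missing]
```

A reservation that has not started yet would not need the `partition` filter, since it could be sent to all schedulers again.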

Bhroam

Do we know if users would like us to restrict jobs to partitions? If intuitively it makes more sense to tie jobs & reservations to scheds instead of partitions then maybe we should change the behavior of how jobs are scheduled instead?

I am not aware of any user request to lift the current job/partition restriction. I agree that jobs and reservations should behave the same way in this regard.

Beyond what @bhroam mentions regarding reconfirming running reservations, there would also be a problem if we had a confirmed reservation that spans partitions and an admin decided to change which multisched manages one of the partitions (not that I have data suggesting that admins DO frequently shuffle sched/partition associations).

Ok, sounds like it’ll be simpler to restrict reservations to partitions then. I do wish partition was a first class citizen so one could stat reservations/jobs associated with a partition and make changes to a partition directly, rather than going about it via queues and nodes etc.

Thanks everyone, I’ll work on making the behavior of confirming reservations similar to that of the jobs.

I am working on an implementation based on the discussion above (Oct 19th 2019 onwards) and I am facing a problem that I don’t have the right answer to. I was hoping to get some help from the community to get the problem sorted.

Based on the discussion above, I am making a change such that when a new reservation is submitted, the PBS server notifies all the available schedulers. As soon as a scheduler confirms a reservation, the server marks the reservation as confirmed and rejects any further confirmation of that reservation from any other scheduler.
As part of confirming the reservation, the scheduler also lets the server know the partition in which the reservation is confirmed, because the PBS server also needs to set the reservation queue to the same partition (for jobs to run).
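The "first confirmation wins" behavior described above could be sketched like this. The `Reservation` class, the state strings, and `handle_confirmation` are hypothetical stand-ins for illustration, not the server's real data structures.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Reservation:
    resv_id: str
    state: str = "UNCONFIRMED"          # hypothetical state names
    partition: Optional[str] = None     # set from the confirming scheduler
    confirmed_by: Optional[str] = None


def handle_confirmation(resv, sched_name, partition):
    """Accept the first scheduler's confirmation and reject all later ones."""
    if resv.state == "CONFIRMED":
        return False                    # another scheduler already won
    resv.state = "CONFIRMED"
    resv.partition = partition          # the reservation queue would get the same value
    resv.confirmed_by = sched_name
    return True
```

For example, if `sched_a` confirms on `P1` and `sched_b` answers later with `P2`, the second call returns `False` and the reservation stays on `P1`.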

While this is working, the problems I am facing are with maintenance reservations and with nodes being moved from one partition to another:

Problem 1: If a node has a running job or a confirmed reservation on it, an admin can potentially move the reservation from one partition to another. If this happens, the new scheduler to which it is moved will suddenly not recognize the partition set on the reservation and will start rejecting it. This means no new jobs in the reservation queue will run.
One way to fix this problem is to not allow admins to change the partition of a node as long as any job/reservation is present on it. This means that to move a node from one partition to another, the admin will first have to offline the node, drain all running jobs (and maybe resubmit reservations), and then move the node to the new partition.

Problem 2: In the case of a maintenance reservation there is no involvement from the scheduler. It is the admin who specifies the hosts that are going to be part of the reservation. This means that maintenance reservations can actually span partitions (and possibly schedulers). This is not that big of a problem, since there are no associated jobs to run inside maintenance reservations, but it will be the only kind of reservation without any partition associated with it.

While we are discussing this, I’d also like to mention that in the case of standing reservations, if a scheduler is servicing two or more partitions, it will not confirm a standing reservation unless all the occurrences of the reservation can run on the same partition. For example, if a scheduler is servicing partitions P1, P2, P3 and a standing reservation (with 6 occurrences) could run its first 3 occurrences on nodes in partition P1 and the next 3 on partition P2, then the scheduler will outright fail to confirm this reservation. The reason is that the server would otherwise have to figure out a way to change the partition on the reservation/queue for certain occurrences of the reservation. While this is technically possible, I don’t know if this complexity in the code is really needed or addresses any customer use case.
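One way to picture the standing reservation rule is as a set intersection: only partitions that can host every occurrence qualify. This is a sketch under the assumption that the scheduler can compute, per occurrence, the set of partitions where that occurrence fits; `partition_for_standing_resv` is a hypothetical helper.

```python
def partition_for_standing_resv(feasible_per_occurrence):
    """Return one partition that can host every occurrence, or None.

    feasible_per_occurrence: list with one set of partition names per
    occurrence, each set holding the partitions where that occurrence fits.
    """
    if not feasible_per_occurrence:
        return None
    common = set.intersection(*(set(s) for s in feasible_per_occurrence))
    # Pick the alphabetically first partition so the choice is deterministic.
    return min(common) if common else None
```

With the 6-occurrence example, the first three occurrences give `{"P1"}` and the last three give `{"P2"}`, so the intersection is empty and the reservation is not confirmed.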

I’m not sure I understand this. Why would a scheduler reject a (newly moved) reservation which is part of the partition that it owns? If we change the partition on the resv to p2, the sched which owns p2 should accept it, right?

If a particular node which had a running reservation is moved out of its partition, then wouldn’t that reservation just get degraded? It sounds similar to a node going down: its resources aren’t available to the reservation’s scheduler anymore, right? It does sound like a good idea to not allow nodes to be moved if there are running jobs/reservations on them.

In that case, I think we should not allow partition to be set on a maintenance reservation at all, reject it from pbs_rsub itself.

That sounds okay, but just a thought, how about allowing multiple partition values to be set on a reservation? Then, it can just be set once to a comma separated list of partitions where it will run, which doesn’t have to be updated by the server for each occurrence. I realize that this will make things interesting if admins explicitly ask for partitions which belong to different schedulers, but we can just reject such a reservation. Again, just a thought.
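Just to illustrate the comma-separated idea, a check along these lines could reject lists whose partitions belong to different schedulers. The partition-to-scheduler mapping and the function name are assumptions for this sketch only.

```python
def scheduler_for_partitions(partition_list, partition_to_sched):
    """Return the single scheduler owning every listed partition, or None.

    partition_list:     comma separated partition names, e.g. "P1,P2".
    partition_to_sched: dict mapping partition name -> scheduler name.
    """
    names = [p.strip() for p in partition_list.split(",") if p.strip()]
    if not names:
        return None

    scheds = set()
    for name in names:
        if name not in partition_to_sched:
            return None                 # unknown partition: reject
        scheds.add(partition_to_sched[name])

    # Reject when the partitions are spread over more than one scheduler.
    return scheds.pop() if len(scheds) == 1 else None
```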

That may not always be true. Maintenance reservations can have jobs submitted to them. The “problem” of jobs that span partitions in a maintenance reservation not being able to run is not really a problem in my view, though. The admin-only maintenance reservation already assumes the admin knows exactly what they are doing, and they are expected to deal with things like overlapping jobs. Expecting them to understand that jobs cannot span partitions, even inside a maintenance reservation that does, seems in line with that thinking.

Thanks for the work on the design!

Seems reasonable to disallow problematic behavior, especially when there is no strong use case for the behavior, and make an admin do a bit more work in these cases. If a use case warrants extending it in the future, it can be considered… in the future.

I agree with @scc – no need to do anything here.

Again, seems reasonable to start small and expand in the future (if/when it’s warranted).

@agrawalravi90: Sorry about the confusion. You are right, the scheduler should be able to serve that reservation. Based on the discussion, though, I think moving reservations is not needed for now and we can do this in the future (if needed).

Thank you all for providing comments. Based on the discussion I will do the following -

  • A node will not be allowed to move between partitions while any running (or future) job/reservation is present on it.
  • Admins create maintenance reservations, so they know whether or not the reservation spans partitions. If it does, then it is up to the admin to run jobs that span partitions (using the qrun command).
  • Standing reservations will be restricted to have all their occurrences confirmed on the same partition. If that is not possible, these reservations will be rejected by the scheduler.
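As an illustration of the first rule in the list above, the server-side check might look roughly like this. The node dict layout, `PartitionChangeError`, and `set_node_partition` are hypothetical and only sketch the intended behavior.

```python
class PartitionChangeError(Exception):
    """Raised when a node still has jobs or reservations attached to it."""


def set_node_partition(node, new_partition):
    """Change a node's partition only if nothing is running or planned on it.

    node: dict with "name", "partition", "jobs" and "resvs" keys, where
    "jobs" and "resvs" list running or future work on the node.
    """
    if node["jobs"] or node["resvs"]:
        raise PartitionChangeError(
            f"node {node['name']} still has jobs or reservations"
        )
    node["partition"] = new_partition
```

An admin would then have to offline and drain the node before the partition change is accepted.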

Hi All,

I have made some changes to interface 8 of the document. These changes are related to how reservations behave in a multi-sched environment. Please have a look.

Design page - https://pbspro.atlassian.net/wiki/spaces/PD/pages/50947131/PP-337+Multiple+schedulers+servicing+the+PBS+cluster

Thanks,
Arun

“Once a reservation is confirmed and partition is assigned to it, it can not be re-confirmed or altered in any other partition.”
I’m not sure I understood this. Did you mean to say that the partition attribute on the reservation & its queue is read-only and that it cannot be set explicitly?

“If an admin tries to assign a scheduler/queue/node partition name “pbs-default”, qmgr command throws error - “Default partition name is not allowed”.”
Why not allow this for nodes and queues? If an admin wants to change the partition on a node/queue from Px to default then this might be useful right?

Thanks for reviewing the document!

That is partially correct. I guess I should explicitly mention that the partition name on a reservation cannot be set explicitly; it is set internally by the server depending on which scheduler confirmed the reservation.

Well, if they just don’t set (or unset) the partition on a node or queue, it is automatically picked up by the default scheduler. I did not want to add another way of doing it and/or affect the way it used to work for customers.