PP-337: Multiple schedulers servicing the PBS cluster

All,

We have added a couple more error messages to Interface 5: Changes to PBS Nodes objects. Requesting the community to review the updated EDD and provide your feedback.

Thanks,
Suresh

Hi All,

Added notes on the server’s backfill_depth behavior to the Notes section of the EDD.
Requesting the community to review the updated EDD and provide your feedback.

Thanks,
Vishesh

@visheshh As far as I remember, backfill_depth will be associated with the policy object; it will not be part of the server anymore. I understand that what you have written will be the behavior until we get a policy object, but it will change once the design for PP-748 is published.

All,

I have modified the Multisched EDD to reflect the interface changes suggested during the review. Requesting the community to review the updated EDD and provide your feedback.

Thanks,
Suresh

For easy access, here is the link to the Design document. I find it hard to go to the top of the page to get the link.

The current design doesn’t talk about the pbsfs command. Each of the schedulers has its own fairshare tree. How does pbsfs know which scheduler to modify?

Right now the pbsfs command doesn’t care whether or not PBS is running. It modifies the scheduler’s usage database directly. It can do this because the sched_priv directory is well known.

I see two ways to go about doing this.

The first is to have the admin supply a scheduler name to pbsfs. pbsfs would then talk to the server to get the sched_priv path for that scheduler, which means the server must be running in order to run pbsfs.

The second is to give pbsfs the sched_priv path you want to modify. This keeps the freedom of running pbsfs without PBS running, but it isn’t nearly as user friendly.
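To make the trade-off concrete, here is a rough sketch of what the two invocations might look like. The option letters, scheduler name, and path are purely illustrative at this point, not part of the design:

```
# Approach 1 (illustrative): name the scheduler; pbsfs asks the server
# for that scheduler's sched_priv path, so the server must be running.
pbsfs -I sched_workq -s user1 0

# Approach 2 (illustrative option letter): hand pbsfs the sched_priv
# directory directly; no server needed, but the admin must know the path.
pbsfs -P /var/spool/pbs/sched_priv_workq -s user1 0
```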

Opinions?

Bhroam

I think requiring the server to be running is reasonable. We require it for many of our commands (qstat, qsub, qmgr, qalter, etc.).

I’ve added Interface 10 to the design document, covering fairshare. There is a new -I option to pbsfs to specify a scheduler. I chose -I because that’s the same option pbs_sched uses to specify a scheduler name; I figured being consistent is good.
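A minimal sketch of how that consistency might look; the scheduler name is a placeholder and the exact syntax is whatever Interface 10 ends up specifying:

```
# pbs_sched takes -I to identify which scheduler it is.
pbs_sched -I sched_workq

# The proposed pbsfs -I operates on that same scheduler's fairshare data.
pbsfs -I sched_workq            # print sched_workq's fairshare usage tree
pbsfs -I sched_workq -g user1   # show one entity's usage in that tree
```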

Please review

Bhroam

Hi @bhroam ,

How would "Configuring Entity shares” work ? Now that, we have fair share tree entity allocations on “per scheduler” basis ?
“Sorting jobs by entity shares” will no longer be “per whole PBS complex “ ?

Thanks
Latha

Interface 2 lists resource_group as one of the files in the scheduler’s sched_priv directory. This is how an admin defines the fairshare tree, and it will be per-scheduler. Nothing has changed fairshare-wise: before, there was a resource_group file in the scheduler’s sched_priv directory, and that is still true. It just so happens that now there is more than one sched_priv directory.
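For example (entity names, IDs, and share values are made up; the directory layout is whatever Interface 2 specifies), each scheduler’s sched_priv would carry its own resource_group defining that scheduler’s tree:

```
# <sched_priv for a given scheduler>/resource_group
# format: <name> <unique_id> <parent_group> <shares>
deptA   100   root    60
deptB   200   root    40
user1   101   deptA   50
user2   102   deptA   50
```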

If you think it is needed, I can add another bullet point to Interface 10 saying as much.

Yes, anything fairshare-related will no longer be per-complex; it will be per-scheduler. I tried to cover this in Interface 10, bullet 1.

Bhroam

Any other comments about Interface 10 on Fairshare?

Bhroam

Interface 10 looks good to me. If it turns out to be a problem that the server needs to be running (which I don’t think it will be), we can always add a new option to specify a specific sched’s sched_priv directory, since the server is not there to be queried.


Thanks for posting the changes. Are you recommending -I (uppercase i) or -l (lowercase L)? My suggestion is that we use an option such as -N instead of -l or -I, since those two are easily confused. But it is only a recommendation.

It’s an uppercase ‘i’. I believe it’s for id. I used it for consistency with the new option to pbs_sched. I don’t really care what option we choose, but I think it should be consistent between binaries.

If we keep the options consistent, then -N is already taken. It leaves the scheduler running in the foreground (it’s actually an option to all daemons).

Bhroam

Hi All,

A couple of minor changes have been added to the following EDD. Can you please have a look and give your comments, if any?
https://pbspro.atlassian.net/wiki/spaces/PD/pages/50947131/PP-337+Multiple+schedulers+servicing+the+PBS+cluster

Thanks,
Suresh

Looks fine, Suresh. Thanks for updating the page with accurate information.

Hi,

I am going to begin work to support submitting reservations that can be serviced by a non-default scheduler in a multi-sched setup.
There is already an interface that covers this use case (Interface 8), but I want to suggest a change to it.
The interface currently states that a new “-p” option for the pbs_rsub command will accept a partition name, which lets the server know which scheduler it has to relay the request to. While this is a fine solution, I feel it is probably better to have a “scheduler name” specified to pbs_rsub instead of a partition (see the sketch after the list below).
This approach has two potential benefits:

  • Since partitions are not first-class citizens, there is no way to list them. Schedulers, on the other hand, can be listed using qmgr.
  • Users don’t have to assume that the reservation will be confirmed on a specific partition. Since one scheduler can service multiple partitions, it is possible that the partition a user specifies in the pbs_rsub command isn’t the partition where the reservation is confirmed.
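To make the comparison concrete, here is a rough sketch; the scheduler and partition names are placeholders, and the scheduler-name flag is hypothetical (only the -p option appears in the current Interface 8 text):

```
# Schedulers can be listed; partitions cannot.
qmgr -c "list sched"

# Interface 8 as currently written: target a partition.
pbs_rsub -R 1030 -D 00:30:00 -p part_gpu

# Suggested alternative (hypothetical flag): target a scheduler and let
# it confirm the reservation on whichever of its partitions fits.
pbs_rsub -R 1030 -D 00:30:00 --scheduler sched_gpu
```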

I also think that the server should not send a pbs_rsub request that has no partition (or scheduler name) option to the default scheduler. Instead, the server should relay such a request to all schedulers sequentially (or in parallel) until it gets a confirmation from one of them. By doing this, schedulers could in future potentially borrow nodes from any of the other schedulers (without knowing who has free nodes) to run jobs by submitting a reservation. This would make multi-sched more dynamic in nature, with partitions that grow and shrink when needed.

I’d like to know what the community thinks about this interface.

I like your proposal, Arun. My only thought is: do we also want to add some limit attributes on the sched object for per-sched reservations? This might help admins control which reservations can go to which scheds, since reservations can’t be submitted to queues and cannot be controlled the way jobs can. Some sites might want to dedicate some scheds to special purposes and might not like us sending resv requests to them. Just a thought.
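Something along these lines, where the attribute name and values are purely hypothetical (no such attribute exists in the EDD today):

```
# Hypothetical per-sched limit: keep reservations off a dedicated-purpose
# scheduler while allowing a bounded number elsewhere.
qmgr -c "set sched sched_express max_reservations = 0"
qmgr -c "set sched sched_workq max_reservations = 20"
```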

I wonder if it’s possible to do this “automatically” without having the submitter specify a target scheduler at all?

Some of the original thinking around the multi-sched feature was that PBS would (eventually) do everything automatically, e.g., notice that there was a natural partition in the system and, when scale or potential speedup warranted, automatically start a new scheduler to handle it (and then automatically shut it down when it was no longer needed). I feel this is still a great goal. Another goal was to support multiple (different) scheduling policies, and for that, one would need to at least specify a partitioning of the jobs or the nodes (or both). In my opinion, the current implementation requires too much (over) specification by the admin (partitioning both the jobs and the nodes and manually creating schedulers). Explicitly designating which reservation should be serviced by which scheduler feels like an over-specification that would take the overall design in the wrong direction.

Is there a way to get (most of) the use cases without specifying a target scheduler or a target pool of nodes? I bet that in many cases there is only a single partition that would fit the reservation, so for those cases we could calculate which scheduler to use. How many real-world cases lack this property? What if we just tried the schedulers in some order (or a random order, which would be better for supporting parallel schedulers without adding additional constraints) and the first one to confirm wins?

Thanks for your reply @agrawalravi90 and @billnitzberg!

@agrawalravi90 I like your idea of schedulers having limits. It would make this similar to how routing queues work: the server would give the reservation to a scheduler that is ready to accept it. There are a few things that come to my mind, though:

  • Currently, limits are applied only to jobs; this would be a different kind of limit, one that is not applied to a job. So I’m not sure how confusing it would be for admins (maybe not that confusing if we name the attribute appropriately).
  • From the sound of it, it feels like this kind of limit should be set by the scheduler itself: some sort of publishing mechanism announcing that this scheduling complex has this many resources available. If we make admins set it, then it will again be static, and a reservation could potentially end up with a scheduler that schedules it far out in the future when other schedulers were capable of running it now.

I guess this is more of a question for the PMs (@scc?) as to whether they see it as useful.

@billnitzberg I also feel that the server should decide this automatically without making users specify the scheduler where the reservation should go. This is why, in my proposal, I said that the server should relay the reservation request to all schedulers sequentially (or in parallel) until it gets a confirmation from one scheduler. We could also tweak it so that the server selects the solution of the scheduler that can run the reservation sooner.
Maybe this is the only thing we should do for now, and not implement Interface 8 of the document. If we really get a request from the field to implement Interface 8, we can always add it later.
What do you think?