PP-337: Multiple schedulers servicing the PBS cluster

All,

We have added a couple more error messages to Interface 5: Changes to PBS Nodes objects. Requesting the community to review the updated EDD and provide your feedback.

Thanks,
Suresh

Hi All,

Added notes on the server’s backfill_depth behavior to the Notes section of the EDD.
Requesting the community to review the updated EDD and provide your feedback.

Thanks,
Vishesh

@visheshh As far as I remember, backfill_depth will be associated with the policy object; it will not be part of the server anymore. I understand that what you have written will be the behavior until we get a policy object, but it will change once the design for PP-748 is published.

All,

I have modified the Multisched EDD to reflect the interface changes suggested during the review. Requesting the community to review the updated EDD and provide your feedback.

Thanks,
Suresh

For easy access, here is the link to the Design document. I find it hard to go to the top of the page to get the link.

The current design doesn’t talk about the pbsfs command. Each of the schedulers has its own fairshare tree. How does pbsfs know which scheduler to modify?

Right now the pbsfs command doesn’t care whether or not PBS is running. It modifies the scheduler’s usage database directly. It can do this because the sched_priv directory is well known.

I see two ways to go about doing this.

The first is to have the admin supply a scheduler name to pbsfs. pbsfs would then talk to the server to get the sched_priv path for that scheduler, which means the server must be running in order to run pbsfs.

The second is to give pbsfs the sched_priv path you want to modify. This keeps the freedom of running pbsfs without PBS running, but it isn’t nearly as user friendly.
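To make the trade-off concrete, here is a rough sketch of what the two invocations might look like. The option letters, scheduler name, and path are purely illustrative at this point, not part of the design:

```
# Approach 1 (illustrative): name the scheduler; pbsfs asks the server
# for that scheduler's sched_priv path, so the server must be running.
pbsfs -I sched_workq -s user1 0

# Approach 2 (illustrative option letter): hand pbsfs the sched_priv
# directory directly; no server needed, but the admin must know the path.
pbsfs -P /var/spool/pbs/sched_priv_workq -s user1 0
```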

Opinions?

Bhroam

I think requiring the server to be running is reasonable. We require it for many of our commands (qstat, qsub, qmgr, qalter, etc.).

I’ve added Interface 10 to the design document, covering fairshare. There is a new -I option to pbsfs to specify a scheduler. I chose -I because that’s the same option pbs_sched uses to specify a scheduler name; I figured being consistent is good.
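A minimal sketch of how that consistency might look; the scheduler name is a placeholder and the exact syntax is whatever Interface 10 ends up specifying:

```
# pbs_sched takes -I to identify which scheduler it is.
pbs_sched -I sched_workq

# The proposed pbsfs -I operates on that same scheduler's fairshare data.
pbsfs -I sched_workq            # print sched_workq's fairshare usage tree
pbsfs -I sched_workq -g user1   # show one entity's usage in that tree
```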

Please review

Bhroam

Hi @bhroam ,

How would "Configuring Entity shares” work ? Now that, we have fair share tree entity allocations on “per scheduler” basis ?
“Sorting jobs by entity shares” will no longer be “per whole PBS complex “ ?

Thanks
Latha

Interface 2 lists resource_group as one of the files in the scheduler’s sched_priv directory. This is how an admin defines the fairshare tree, and it will be per-scheduler. Nothing has changed fairshare-wise: before, there was a resource_group file in the scheduler’s sched_priv directory, and that is still true. It just so happens that now there is more than one sched_priv directory.
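For example (entity names, IDs, and share values are made up; the directory layout is whatever Interface 2 specifies), each scheduler’s sched_priv would carry its own resource_group defining that scheduler’s tree:

```
# <sched_priv for a given scheduler>/resource_group
# format: <name> <unique_id> <parent_group> <shares>
deptA   100   root    60
deptB   200   root    40
user1   101   deptA   50
user2   102   deptA   50
```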

If you think it is needed, I can add another bullet point to Interface 10 saying as much.

Yes, anything fairshare-related will no longer be per-complex; it will be per-scheduler. I tried to cover this in Interface 10, bullet 1.

Bhroam

Any other comments about Interface 10 on Fairshare?

Bhroam

Interface 10 looks good to me. If it turns out to be a problem that the server needs to be running (which I don’t think it will be), we can always add a new option to specify a specific sched’s sched_priv directory, since the server is not there to be queried.


Thanks for posting the changes. Are you recommending -I (uppercase i) or -l (lowercase L)? My suggestion is that we use an option such as -N instead of -l or -I, since those two are easily confused. But it is only a recommendation.

It’s an uppercase ‘i’. I believe it’s for id. I used it for consistency with the new option to pbs_sched. I don’t really care what option we choose, but I think it should be consistent between binaries.

If we keep the options consistent, then -N is already taken. It leaves the scheduler running in the foreground (it’s actually an option to all daemons).

Bhroam

Hi All,

A couple of minor changes have been added to the following EDD. Can you please have a look and give your comments, if any?
https://pbspro.atlassian.net/wiki/spaces/PD/pages/50947131/PP-337+Multiple+schedulers+servicing+the+PBS+cluster

Thanks,
Suresh

Looks fine, Suresh. Thanks for updating the page with accurate information.

Hi,

I am going to begin work to support submitting reservations that can be serviced by a non-default scheduler in a multi-sched setup.
There is already an interface that covers this use case (Interface 8), but I want to suggest a change to it.
The interface currently states that a new “-p” option for the pbs_rsub command will accept a partition name, which lets the server know which scheduler it has to relay the request to. While this is a fine solution, I feel it is probably better to have a “scheduler name” specified to pbs_rsub instead of a partition (see the sketch after the list below).
This approach has two potential benefits:

  • Since partitions are not first-class citizens, there is no way to list them. Schedulers, on the other hand, can be listed using qmgr.
  • Users don’t have to assume that the reservation will be confirmed on a specific partition. Since one scheduler can service multiple partitions, it is possible that the partition a user specifies in the pbs_rsub command isn’t the partition where the reservation is confirmed.
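To make the comparison concrete, here is a rough sketch; the scheduler and partition names are placeholders, and the scheduler-name flag is hypothetical (only the -p option appears in the current Interface 8 text):

```
# Schedulers can be listed; partitions cannot.
qmgr -c "list sched"

# Interface 8 as currently written: target a partition.
pbs_rsub -R 1030 -D 00:30:00 -p part_gpu

# Suggested alternative (hypothetical flag): target a scheduler and let
# it confirm the reservation on whichever of its partitions fits.
pbs_rsub -R 1030 -D 00:30:00 --scheduler sched_gpu
```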

I also think that the server should not send a pbs_rsub request that has no partition (or scheduler name) option to the default scheduler. Instead, the server should relay such a request to all schedulers sequentially (or in parallel) until it gets a confirmation from one of them. By doing this, schedulers could in future potentially borrow nodes from any of the other schedulers (without knowing who has free nodes) to run jobs by submitting a reservation. This would make multi-sched more dynamic in nature, with partitions that grow and shrink when needed.

I’d like to know what the community thinks about this interface.

I like your proposal, Arun. My only thought is: do we also want to add some limit attributes on the sched object for per-sched reservations? This might help admins control which reservations can go to which scheds, since reservations can’t be submitted to queues and cannot be controlled the way jobs can. Some sites might want to dedicate some scheds to special purposes and might not like us sending resv requests to them. Just a thought.
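Something along these lines, where the attribute name and values are purely hypothetical (no such attribute exists in the EDD today):

```
# Hypothetical per-sched limit: keep reservations off a dedicated-purpose
# scheduler while allowing a bounded number elsewhere.
qmgr -c "set sched sched_express max_reservations = 0"
qmgr -c "set sched sched_workq max_reservations = 20"
```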

I wonder if it’s possible to do this “automatically” without having the submitter specify a target scheduler at all?

Some of the original thinking around the multi-sched feature was that PBS would (eventually) do everything automatically, e.g., notice that there was a natural partition in the system and, when scale or potential speedup warranted, automatically start a new scheduler to handle it (and then automatically shut it down when it was no longer needed). I feel this is still a great goal. Another goal was to support multiple (different) scheduling policies, and for that, one would need to at least specify a partitioning of the jobs or the nodes (or both). In my opinion, the current implementation requires too much (over) specification by the admin (partitioning both the jobs and the nodes and manually creating schedulers). Explicitly designating which reservation should be serviced by which scheduler feels like an over-specification that would take the overall design in the wrong direction.

Is there a way to get (most of) the use cases without specifying a target scheduler or a target pool of nodes? I bet that in many cases there is only a single partition that would fit the reservation, so for those cases we could calculate which scheduler to use. How many real-world cases lack this property? What if we just tried the schedulers in some order (or a random order, which would be better for supporting parallel schedulers without adding additional constraints) and the first one to confirm wins?

Thanks for your reply @agrawalravi90 and @billnitzberg!

@agrawalravi90 I like your idea of schedulers having limits. It would make this similar to how routing queues work: the server would give the reservation to a scheduler that is ready to accept it. There are a few things that come to my mind, though:

  • Currently, limits are applied only to jobs; this would be a different kind of limit, one that is not applied to a job. So I’m not sure how confusing it would be for admins (maybe not that confusing if we name the attribute appropriately).
  • From the sound of it, it feels like this kind of limit should be set by the scheduler itself: some sort of publishing mechanism announcing that this scheduling complex has this many resources available. If we make admins set it, then it will again be static, and a reservation could potentially end up with a scheduler that schedules it far out in the future when other schedulers were capable of running it now.

I guess this is more of a question for the PMs (@scc?) as to whether they see it as useful.

@billnitzberg I also feel that the server should decide this automatically without making users specify the scheduler where the reservation should go. This is why, in my proposal, I said that the server should relay the reservation request to all schedulers sequentially (or in parallel) until it gets a confirmation from one scheduler. We could also tweak it so that the server selects the solution of the scheduler that can run the reservation sooner.
Maybe this is the only thing we should do for now, and not implement Interface 8 of the document. If we really get a request from the field to implement Interface 8, we can always add it later.
What do you think?