PP-337: Multiple schedulers servicing the PBS cluster

@arungrover, please elaborate on PBS upgrade and sched config in a fresh install. It would be great if you documented the upgrade cases. I see that sched config migration is needed in the case of an upgrade.

Thanks for reviewing the document @varunsonkar
Please find my comments below:

I’m getting into implementation detail here: if a user tries to qrun a job without using the “-H” option, then the server should check which partition the job belongs to and trigger a scheduling cycle for that job.

I have not thought about it. I’m glad you raised it. There could be two straightforward ways that we can deal with reservations:
1 - Reservations only go to the default scheduler, which runs the default partition…
2 - Reservations can be changed to be submitted to a particular partition and get confirmed by that particular scheduler.

Well, since the scheduler is backward compatible with the old sched config file, there isn’t any necessity to create a policy object for the default scheduler. The only compelling reason I find for a default policy object is that there are a bunch of policies that have moved from the server to the policy object (like backfill_depth and job_sort_formula). I’ll add a section about the default policy object.
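As a rough illustration of why a default policy object might still be useful, a hypothetical qmgr session under this proposal could look like the sketch below. The policy object, its attributes, and the way it attaches to a scheduler are assumptions drawn from this discussion, not a finalized interface.

```
# Hypothetical sketch: server-level tunables move onto a policy object
qmgr -c "create policy default_policy"
qmgr -c "set policy default_policy backfill_depth = 5"
qmgr -c "set policy default_policy job_sort_formula = ncpus"
# Attach the policy to the default scheduler (attachment syntax is illustrative)
qmgr -c "set sched default policy = default_policy"
```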

Well, the requirement was to start the scheduler as a non-root user. The scheduler by itself comes with some special privileges that allow it to do things that require more than manager privilege. So I don’t think even adding this user as a manager would help.

I’ve tried to keep the existing behavior untouched, so one can still start or stop scheduling on the default scheduler without giving a scheduler name. But now I think I should probably make it mandatory to provide a scheduler name, because otherwise it will create confusion. I’ll add it to the design proposal.
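For example, the qmgr interaction might end up looking something like this; the named form and the scheduler name are only a sketch of the proposal, not an existing interface.

```
# Existing behavior (kept for backward compatibility): acts on the default scheduler
qmgr -c "set server scheduling = True"
# Proposed: name the scheduler explicitly so there is no ambiguity
qmgr -c "set sched sched_part1 scheduling = False"
qmgr -c "set sched sched_part1 scheduling = True"
```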

We can’t really have a formula set in multiple places, so this needs to be handled during upgrades. I’ll probably add a section about upgrades.

Thanks!

Thanks for reviewing @suresht

There isn’t a requirement to run the scheduler on a different host (other than where pbs_server is running). I kept it because there isn’t anything stopping us from doing that too; all we need to do is run a command on a remote host to start the binary.
Since there isn’t any use case right now, Interface 1 of the document mentions it as a read-only attribute having the same value as the PBS server host.

Thanks for your review comments @vinodchitrali

After reading your comment and Varun’s comments, I agree there is a need to specify what needs to be done when an upgrade happens. I’ll add that to the document.

Hi @arungrover,

Thanks for the replies.
You mentioned two approaches for reservations:

1 - Reservations only go to the default scheduler, which runs the default partition…
2 - Reservations can be changed to be submitted to a particular partition and get confirmed by that particular scheduler.

In my view, approach 2 makes more sense now.

I like the option 2 approach as well.

I like option 2 as well, but we should understand the ramifications of it. Currently jobs are submitted to a queue that is associated to a scheduler. Reservations are not submitted to queues. While we could come up with a way to submit a reservation to a scheduler, that would lead to two different ways to submit work to schedulers. We should only have one way.

The only way I can see to do this is to move the scheduler association to the request. We somehow submit a job/resv directly to a scheduler instead of having it submitted to an intermediate container that creates the association.

I’m not talking about moving a node away from a partition. You should be able to change the partition attribute (or unset it) without any issue. I’m talking about deleting a node that is associated to a partition. Currently if a node is associated to a queue (via the queue attribute) you get a node busy error. Do you want the same node busy error to happen if a node is associated to a partition?
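To make the comparison concrete, the current queue case and the open partition question look roughly like this (node and queue names are illustrative; the exact error text is whatever the server emits today):

```
# Today: a node tied to a queue cannot simply be deleted
qmgr -c "set node n1 queue = workq"
qmgr -c "delete node n1"      # rejected (the "node busy" case described above)
# Open question: should the same rejection apply to a partitioned node?
qmgr -c "set node n1 partition = part1"
qmgr -c "delete node n1"
```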

You have an error being thrown if the priv or log directory is not set properly (15211 error). I assumed that this error meant that the operation didn’t complete. Is this just an informational message? If so, you should make the document clearer. If this is an error, then it sounds like the directories need to exist before the scheduler is created.

I guess I am confused about the difference between the default scheduler’s partition (unset?) and what “none” means.

I suspect you’ll just have to deal with the situation where a job is in a queue for one scheduler with nodes of another. It’s either this, or you force the admin to stop and drain the queue before moving it from one scheduler to another. I don’t really like saying it might have unexpected behavior without providing a method of fixing it.

If a scheduler is deleted the nodes/queues associated to that partition will have their partition attribute unset. This means you won’t have nodes/queues associated to a partition that is not associated to any scheduler. From what you just said, you can create this situation by setting node/queue to a partition that isn’t set on any scheduler. I’d either force a partition to be set on a scheduler first or not unset them when a scheduler is deleted.
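In other words, nothing described so far prevents a sequence like the one below (names illustrative), which leaves a node and a queue pointing at a partition no scheduler owns — the same end state you were trying to avoid by unsetting the attribute on scheduler deletion:

```
# "part_orphan" is not set on any scheduler, yet it can be referenced directly
qmgr -c "set node n1 partition = part_orphan"
qmgr -c "set queue workq partition = part_orphan"
```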

One last thing, you should probably mention what happens on failover. I suspect the same thing happens as today. Today when the secondary takes over, it will try and contact the scheduler on the primary. If it can’t, it will start a scheduler locally to the secondary. I suspect the same thing will happen just N times.

This means the host attribute for the schedulers might differ. If some schedulers are still up but others are not, the schedulers might be split between the primary and the secondary.

Speaking of the host attribute, if at some point in the future you want to support schedulers running on alternate hosts, it will require the server to start a process on a host that isn’t local. This isn’t necessarily easy (unless we have a mom running there). You might consider just removing the host attribute. It’s read only now and you can always add it in the future if needed. This is unless you need it for the primary/secondary case.

Bhroam

I see that launching all schedulers on the same host is a problem. IMO, keeping single point of failure and horizontal scalability in mind, we should start supporting schedulers on different hosts using an agent-based model.

The policy object is created through qmgr. It’s a good idea, but I see one thing missing here:

I don’t see an IFL API for the policy object. A few other objects, like Server and Queues, fall under the same category: they are created/modified using qmgr, and they have IFL calls to stat them. Considering a unified command structure, do you want to address this issue?

I’m confused here. Reservations are submitted like a job, but they eventually act like a queue. With multiple schedulers, a queue will already have a way to get itself associated with a scheduler; in a similar fashion, we should be able to associate a reservation with a scheduler too, because reservations act like queues.
This new way of submitting reservations will not only let the partition’s scheduler confirm the reservation but also let that scheduler run the jobs from this reservation.
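Purely as a sketch of what that association could look like (neither the submission option nor the attribute below exists today; this is an assumption about how the interface might be shaped):

```
# Hypothetical: submit a reservation against a specific partition
pbs_rsub -R 1030 -E 1130 -l select=2:ncpus=4 -l partition=part1
# The scheduler that owns "part1" would confirm the reservation and later
# run the jobs submitted to the reservation queue it creates
```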

Well, in that case, do we know why a node is not allowed to be deleted if it is linked to a queue? I can see that if a node has running jobs on it then we shouldn’t allow it to be deleted, but I cannot picture a scenario where deleting an unused node would cause a problem.

I’m confused here. Your original comment was to allow creating a scheduler but not allow “scheduling=True” unless the directories are created. The document mentions that when the scheduling attribute is set to true, the server throws this error if the directories are not accessible. Since the “scheduling” attribute is not part of the scheduler object, I assumed that the scheduler object is already present when the admin tries to set the “scheduling” attribute to true.

Sorry that it is confusing. If the server is going to create a default scheduler which handles all the queues/jobs/nodes (unless partitioned), then this scheduler will have its partition attribute unset. If we are creating a scheduler object using “qmgr”, then I wanted its partition attribute to be set to something like “None”, which means it wouldn’t interfere with the default scheduler’s operations.
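A small sketch of the distinction I have in mind; the object names and the “None” placeholder are illustrative, not final syntax:

```
# Default scheduler, created by the server: partition stays unset, so it
# schedules everything that is not explicitly partitioned
qmgr -c "list sched default"
# Admin-created scheduler: partition starts as the placeholder "None" and
# does nothing until a real partition is assigned
qmgr -c "create sched sched_part1"
qmgr -c "set sched sched_part1 partition = part1"
```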

Yes, I’ll add draining the jobs to the document when I add this part.

I think it is probably best not to unset the partition on queues/nodes, because the admin might create another scheduler in the future for the same partition. Moreover, unsetting the partition on queues/nodes when a scheduler is deleted would make it look like the partition is closely tied to the scheduler object.

I’ll mention it in the document.

I’ve updated the document according to the review comments received.
Please review again.

I don’t see the schedulers being on the server host as the single point of failure. The server has always been the single point of failure. At any point in the future that we have multiple servers on multiple hosts, they can start up schedulers. I don’t see it as limiting that we require the server to be on the same host as the scheduler.

I can’t either. I just like consistency. I think we should either follow suit and error with node busy or change it so you can delete a node w/o removing its queue association first. The latter might be more work because you’d have to make sure the queue’s hasnodes attribute is correctly unset when the last node is removed.

I misread the document. I thought the 15211 message was printed when the scheduler was created, not when scheduling was set to true. I like that behavior.
Quick question: you said the scheduling attribute is not part of the scheduler object? Is that a typo? If not, where is it?

[quote="arungrover, post:32, topic:470"]
Sorry that it is confusing. If the server is going to create a default scheduler which handles all the queues/jobs/nodes (unless partitioned), then this scheduler will have its partition attribute unset. If we are creating a scheduler object using “qmgr”, then I wanted its partition attribute to be set to something like “None”, which means it wouldn’t interfere with the default scheduler’s operations.
[/quote]

If “None” means no partition, should you point this out more explicitly? That there is a keyword that can’t be associated to any queue/node and means no partition? I do agree that it should be a special keyword and the admin should not be able to associate nodes to the “None” partition

I just read in your document that you plan to not allow changes to the server’s scheduling/scheduler_iteration/etc attributes. These are stable interfaces and need to be deprecated first. If an admin sets one of them, I’d print out a message and set it on the default scheduler’s policy.
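Something along these lines (illustrative; the policy object and the forwarding behavior are only a suggestion), so the old server-level form keeps working through a deprecation period:

```
# Old, stable form: accepted, prints a deprecation notice, and is applied
# to the default scheduler's policy behind the scenes
qmgr -c "set server scheduler_iteration = 300"
# New form the admin would be steered toward
qmgr -c "set policy default_policy scheduler_iteration = 300"
```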

In the section on failover you say the secondary will create the scheduler locally. There is no reason for it to do that. The secondary reads the database and will have the schedulers. It just needs to start them (and update the host attribute).

In the nodes section you say if a node’s partition is unset, then the node is not part of any partition and it’ll be scheduled by the default scheduler. I’d rather you rephrase that as: if it is unset, it is part of the default scheduler’s partition.

In the bullet that lists the scheduling events that cause a scheduling cycle, you missed qrun.

I don’t think you should wait job_accumulation_time before starting a qrun cycle. You’re not waiting for more jobs to be accumulated. You’re only running one job.

There are several options on how to implement job_accumulation_time. One is how you described: once you get one event, you wait that time and then start a cycle. Another is that the time is the minimum amount of time between cycles. This means if you get two events close together, you’ll wait. If you get two events far apart, you’ll immediately start a cycle. I like the second method better. Also keep in mind automated tests: this attribute cannot be turned on by default, or all automated tests will slow down. You should actually say whether this attribute is set by default.
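To make the second interpretation concrete (job_accumulation_time is still only a proposed attribute, and the object it lives on is assumed here), the admin-visible behavior I’d prefer is:

```
# Proposed: minimum gap between scheduling cycles, in seconds
qmgr -c "set sched sched_part1 job_accumulation_time = 10"
# Interpretation 1: every event waits 10s before a cycle starts
# Interpretation 2 (preferred): an event starts a cycle immediately if the last
#   cycle ended more than 10s ago; otherwise it waits out the remainder
# Default should be unset (0) so automated tests do not slow down
```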

As a note, the init script will report status of the mom as well as the server. You said it’ll only report status of the server now. I don’t think that is your intent.

Bhroam

@bhroam, I agree with you. If the server and schedulers will run on the same host, then I would suggest restricting the number of schedulers and benchmarking the maximum number of schedulers that can be created.

Also, if the scheduler and server are tightly coupled on a single host, then the previous design of running the scheduler on a different host will break. That makes a case to combine the server and scheduler into one multi-threaded daemon.

Really nice work – I love the MultiSched feature. A few comments regarding v.11 of the design:

  1. Consolidating scheduling configuration into qmgr is really great. However, it is very unsatisfying to move most (but not all) of the scheduling configuration into qmgr. In v.11, the design leaves the holidays, dedicated time, and fair share group configuration in files (requiring multiple sched_priv directories). I would strongly advocate for either moving all configuration at once or none at all; see the next item.

  2. What about staging the two major parts of this feature into separate deliverables? Both (a) supporting multiple schedulers acting on different partitions in parallel and (b) moving all scheduling configuration from files to qmgr are very large changes. Completing one, then embarking on the other would greatly reduce the risks in terms of timeliness, quality, and especially, correctness.

  3. In the middle of Interface 2, a new setting “job_accumulation_time” is proposed. I see how this would be useful, but it seems completely independent of this RFE (as it would be useful with a single scheduler) – what about breaking it out as a separate RFE?

  4. Great job on interface 3 with the policy object (with the one objection that it should cover all scheduling configuration, not just some of it). One suggestion would be to move the prime and non-prime (and even dedicated time, and holidays) designations inside the policy object as an iCalendar recurrence. So, instead of using “p:XYZ” to say that XYZ applies during prime time, add a new “time_window” object (might need a better name), where one would “create new time_window p = ‘RRULE:FREQ=WEEKLY;WKST=SU;BYDAY=MO,TU,WE,TH,FR’” and then use it inside the policy object as “set policy XYZ time_window = p” or on a queue as “set queue workq time_window = p”.
    (In summary, PBS Pro already uses iCalendar recurrences, and they are full-featured enough to handle what’s needed here; we should leverage them! A rough sketch of this idea appears after this list.)

  5. In interface 6, what about making “partition” a job parameter? Generally, there has been a desire to move away from queue-focused settings to provide more flexibility in PBS Pro. By making this setting a per-job setting, one could easily support the per-queue use case (by having a queue default that flows to jobs). In fact, one could have a server default that flows to queues to jobs, and the server default could be the default scheduler. For example, per-job partitioning would allow one to have high, medium, and low priority queues and still use MultiSched to put 1-core jobs on a 1-core partition of nodes.

  6. Interface 7: suggest thinking more about this. There is great value in leveraging the operating system startup facilities to bootstrap multiple daemons and ensure they are running. There must be a reasonable way for a scheduler to authenticate, and a reasonable way to handle notifications. Forcing the server to start the schedulers will make it very hard to support broad horizontal parallelism (across many nodes)… which is a long-term goal.

  7. Interface 8 – Generally, PBS Pro should shield end-users from as much “IT complexity” as possible. This interface forces end-users to add an understanding of “site partitions” to their knowledge base. It would be much better if PBS Pro could do some automatic translation to avoid pushing this additional complexity onto end users. Are there cases where we would not be able to figure out which partition to use? Perhaps requiring a queue name instead of a partition name (as a reservation submission is akin to a resource request, and all other resource requests go to queues)?

  8. Interface 9 – should eventually be expanded to explicitly list what is supported, what is not supported, and what will cause errors to be returned.
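A rough sketch of the time_window idea from item 4 (the object, attribute names, and syntax are suggestions, not an existing interface):

```
# Define a recurring window covering weekday prime time
qmgr -c "create time_window p"
qmgr -c "set time_window p recurrence = 'RRULE:FREQ=WEEKLY;WKST=SU;BYDAY=MO,TU,WE,TH,FR'"
# Use it in place of the old p:/np: prefixes
qmgr -c "set policy XYZ time_window = p"
qmgr -c "set queue workq time_window = p"
```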

Thanks!

Good point. We are investigating how to be able to move the other configuration files into qmgr as well. The EDD is a blueprint of what we want to accomplish as part of the Multi-sched work. However, all of it may not make it into 17.1.

I think it makes sense to break this into a separate epic and start a separate design doc.

OK

After talking with Arun, I think it makes sense to move away from the terms prime/non-prime and move to time_windows. We will discuss this more as part of the new epic to move the scheduler configuration files to qmgr.

Interesting thought. I am sure the engineering team can implement it as a job attribute but I hesitate to do more than that for the initial implementation. Having partitions inside of queues introduces many challenges that are outside the scope of work for this project.

You are correct that a partition name would not make sense to users initially, but it is something the admin could explain to the power users who request reservations. Or maybe it would be better to have them use a queue name, since that is exposed to the users, and then we map it to the correct partition. I am not sure what to do if the queue has nodes assigned to it. Do we allow the scheduler to use any nodes in the partition or only the nodes assigned to the queue?

Agreed.

I have created ticket https://pbspro.atlassian.net/browse/PP-748 as the epic for moving the scheduler configuration files to qmgr. Let’s move the associated items from the EDD to a new EDD for PP-748.

PP-685 has already been created for this. I believe PP-748 is a duplicate.

@mkaro I am not sure I follow. PP-684 is an epic to create a configuration management interface for all configuration (server, comm, mom, and sched). This will require many things that are independent of the work in PP-748. PP-748 is to move the scheduler configuration files into qmgr. I believe that this will help reduce the work of PP-684 once completed