PP-337: Multiple schedulers servicing the PBS cluster

PP-685 targets an architecture for managing configuration and other data throughout all of PBS, whereas PP-748 is restricted to the scheduler. One of the subtasks of PP-685 is PP-687, relating specifically to the scheduler. I believe PP-748 and PP-687 are somewhat at odds with each other. The former describes extending existing functionality, while the latter aims to provide a consistent framework not only for the scheduler but for all PBS Pro components.

I have updated the design doc to reflect the change of separating the policy object into its own EDD. Please review and provide comments.

We should consider the use case of large jobs spanning multiple partitions. It is my understanding that sites will use scripts to resize partitions to accommodate large jobs. Might we also consider assigning multiple partitions to a single scheduler and disassociating or disabling (SCHEDULING=false) one or more schedulers from the partition group? It might be easier than redefining the partitions.
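To make that concrete, here is a rough sketch of what it might look like through the proposed qmgr interface; the scheduler and partition names are made up, and the exact attribute names and values are assumptions based on the draft EDD:

    # Hypothetical sketch: point one scheduler at a group of partitions so a
    # large job can span them, then idle the schedulers that previously
    # serviced those partitions instead of redefining the partitions.
    qmgr -c "set sched big_job_sched partition = 'part_a,part_b'"
    qmgr -c "set sched sched_a scheduling = false"
    qmgr -c "set sched sched_b scheduling = false"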

We might want to consider using threads as opposed to multiple scheduler processes if we eventually need tighter synchronization between multiple schedulers. I believe the use of separate processes will suffice so long as there is no overlap between the nodes in each partition.

We may need to force a scheduling cycle for one or more of the following events:

  • Partition added/removed from scheduler
  • Node assigned to partition
  • Node added to queue
  • Queue assigned to partition

We might want to consider running certain services as jobs themselves. The scheduler and comm services would be potential candidates. This is really a separate topic with the goal of improving horizontal scalability, but it is something we might want to keep in mind.

I believe we should add a comment attribute to the sched object. This would allow us to communicate to the admin from a periodic server hook if there was an issue.
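As a purely illustrative sketch (the sched comment attribute is only proposed at this point; the scheduler name and message below are invented), the flow from the admin's side could look like this:

    # Hypothetical: a server periodic hook (or the admin) records why a
    # scheduler was idled on the proposed sched "comment" attribute...
    qmgr -c "set sched multi_sched_1 comment = 'sched_priv not accessible, scheduler idled'"

    # ...and anyone can read it back later with a normal qmgr query.
    qmgr -c "list sched multi_sched_1 comment"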

I also feel that we should set the partition at the queue level, but have it become a job attribute. This would give us the flexibility to do job sets, as Bill mentioned above, down the road.

I like the idea of the scheduler comment. I think of it as more like the node comment, which the admin can set to explain why they marked a node offline. In the same way, they could set it to explain why a scheduler has been marked idle. In either case, I like it.

Bhroam

Hi,

This looks pretty good!

PFB some comments on the EDD:

  • I’m confused about exactly what a ‘partition’ is. I didn’t really understand how one would create a partition; does it happen when you set the ‘partitions’ attribute on the sched object? What value does having multiple partitions attached to a single sched object add? From what I understand, it seems like partitions just provide the means to assign a queue/node to a scheduler; is that their only role?
  • Interface 1: “$PBS_HOME/multi_sched_1_log (default)” and “PBS_HOME/multi_sched_1_priv”:
    I feel like this might make $PBS_HOME cluttered. Also, if somebody chooses to name their scheduler “server” (just for kicks), we’ll try to create the default scheduler priv directory as “server_priv” and things will error out because PBS couldn’t create a directory by the same name as an existing directory which has nothing to do with scheduling. Also, if somebody names their scheduler “happy_feet”, it wouldn’t be obvious that “$PBS_HOME/happy_feet_log” contains scheduler logs when they send us their logs for debugging. So, maybe create a parent directory “$PBS_HOME/schedulers” which will then have sub-directories for the individual schedulers?

  • Interface 1: “scheduler can not access it’s log or priv directory”: typo: it should be “its” and not “it’s”

  • Interface 1: When the user creates a sched object, what will its default ‘state’ be?

  • Interface 2: “By default, All new queues created will be attached to the default scheduler, until they have been assigned to a specific partition.” and Interface 1: “If no partition are specified with a given scheduler object then that scheduler will not schedule any jobs.”:
    I’m not sure that I completely understand the concept of a partition yet, but architecturally it seems like it would be better to have a “default partition” which gets looked at by the default scheduler and all the queues which don’t explicitly get assigned to a partition. Right now, it seems like all the schedulers need a partition to work except the default scheduler, thereby making the default scheduler special. I think a “default partition” will help make all schedulers be equal, including the default scheduler. What do you think?

  • Interface 6: “Queue will have a new “partition” attribute which can be used to associate a node to a particular partition.”:
    Seems like there’s a typo there; it should be “associate a queue to a particular partition” and not a node, right?

  • Interface 6: “Moving queues with running jobs from one partition to another is not recommended and may result into unexpected behavior.”:
    Why not just disallow this and error out?

  • Interface 7: "If server is unable to connect to these schedulers it will check to see if the scheduler is running, try to connect 5 times, and finally restart the scheduler."
    What if the scheduler is not responsive even after the restart or the scheduler’s host is down (considering the future where multiple schedulers can be run on different hosts)?

  • Interface 7: "If a scheduler is already running a scheduling cycle while server will just wait for the previous cycle to finish before trying to start another one."
    Considering that there are now multiple schedulers running and that load balancing/performance is one of the goals for this feature, you might consider making things multi-threaded on either the server side or the scheduler side, such that either
    the server creates a thread for each scheduler with a parent thread which looks at the job’s queue and partition to decide which child thread should handle this request, then the child threads can wait on their respective schedulers before instructing them to kick a sched cycle.
    or
    make the schedulers multi-threaded such that there’s one thread which accepts “run a sched cycle” instruction from the server so that the server can return immediately after asking the scheduler to kick a sched cycle.
    I know that there might be complications with either of these approaches, but it might be worth it. The first approach might even help make things like “query server for my universe” and other communications to the server faster as multiple threads will be able to interact with their own schedulers in parallel. What do you think?

  • What will happen if somebody does a “sudo kill -9” on one of the scheduler processes?

  • As far as I understand, right now if there’s a job that requests more resources than are available within any individual scheduler’s purview, it will be rejected, even if the resources might be available on the cluster as a whole. Are we thinking of enhancing this feature to handle such jobs in the future?

As per the current design, to create a partition one needs to group their workload and resources. To do so, the admin will assign nodes and queues to a partition (by giving them the same partition name) and then assign a scheduler to service this partition by giving the same partition name to the scheduler’s “partition” attribute.
This is going to change: @suresht is working on a change to make partition a job attribute rather than a queue attribute. Doing this will allow admins to move queues around without worrying about draining all the running jobs.
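To make the pre-change mechanism above concrete, here is a hedged sketch of how the pieces would be tied together; all object names are hypothetical and the attribute names follow the draft EDD, so treat this as illustration only:

    # Sketch of the grouping described above (subject to change once
    # partition becomes a job attribute): queue, node, and scheduler are
    # bound together by sharing the same partition name.
    qmgr -c "create sched gpu_sched"
    qmgr -c "set sched gpu_sched partition = 'gpu_part'"
    qmgr -c "set queue gpuq partition = 'gpu_part'"
    qmgr -c "set node node01 partition = 'gpu_part'"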

:slight_smile: I hope that “somebody” isn’t the admin who is setting up the system. Other users shouldn’t be able to create directories or configure scheduler objects. PBS isn’t really written keeping in mind all the goofy things users can do in an HPC complex; we expect the complex to be a protected/trusted environment. For that matter, we don’t encrypt our messages while communicating over the network, so one can easily sniff and read what is being sent, and even today a user can run qstat in a loop and just hog the server.

Thanks for pointing this out.

If the scheduler object is created and it is not scheduling, its state will be IDLE, and if it is scheduling it will be in the “SCHEDULING” state.

A configuration with multiple schedulers isn’t one that every customer would want. Our purpose is to have minimal to no impact on how PBS currently operates. If we add a default partition then every job will need to have a partition associated with it (even if there is no more than one scheduler in the whole complex). This will also affect upgrades from previous versions to the new version.

Thanks for pointing this out. This is going to change: Suresh will make partition a job attribute.

[quote=“agrawalravi90, post:46, topic:470”]
Interface 6: “Moving queues with running jobs from one partition to another is not recommended and may result into unexpected behavior.”:
Why not just disallow this and error out?
[/quote]

When partition becomes a job attribute, queues will not be part of any partition, so this will not be an issue.

You are right. Jon and Alexis raised a similar point in a different form: what if the scheduler dumps core every now and then, how often are we going to restart it? We should have a max-restart check in the server for each scheduler and then not try again. The remaining problem is how to let the admin know that something is wrong. This will be resolved if we have a scheduler “comment”, as mentioned by Bhroam and Jon in the previous replies.

These are fine implementation suggestions. Thanks for providing them. We should consider all options to make things faster than they currently are.
PBS Server isn’t really written with threading in mind; there is a plethora of globals used everywhere in the server. We can think of using synchronization techniques, but that will result in either maintenance havoc or poor concurrency.

I guess the same thing that happens today :) The scheduler will go down; the server will see the scheduler as down and will restart it to run a cycle.

I do not understand what you mean by “it will be rejected”. The job will be queued with a comment stating “can never run”. Looking at this comment, someone monitoring the system (the admin or a server periodic hook) can then decide to move the job/resources around and make the job run.

The following changes have been made to the EDD as far as the partition attribute is concerned.

  • Interface 3: Removed because partition is no longer a queue attribute
  • Interface 9: New interface for Job level attribute for partition

The above changes are yet to be finalized and reviewed.

True, but what about the clutter? If an admin starts up 100 schedulers, there will be 100+ listings inside PBS_HOME. Is there a reason why putting them all under a “schedulers” directory is a bad idea? It would prevent such clutter and has the added benefit that a directory called “gopher_1_priv” under PBS_HOME/schedulers makes it clear that it’s a scheduler directory.

So I’m guessing that it will be IDLE if one just creates a sched object without any arguments except the name. I think we should mention this in the list of default attributes under Interface 1.

That’s what I meant. Right now this is a manual process; I was wondering whether we have considered automating it, so that if somebody submits a job that needs more resources than are available to any single scheduler, it would still get scheduled without the admin having to manually move partitions around for one job.

The end goal is to automate this, but that is beyond the scope of this project. Hopefully the design is laid out well enough that we will be able to do just that in an automated fashion in a follow-on project.

Yes, there will be clutter, and having all the scheduler directories under one “scheduler” directory isn’t a bad idea at all. There are a few more things to look at here.
We do not force admins to create each scheduler’s priv and log directories under “$PBS_HOME”; they could potentially create them anywhere. If we choose to do what you mentioned, then we restrict them to the $PBS_HOME/scheduler/ directory. If we choose not to create a “scheduler” directory, then we should make tools like pbs_diag smart enough to read any non-standard directory (based on qmgr -c “p sched @active”) and collect the data.
I’d let @suresht consider your suggestion and make changes if necessary.
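For example, a diag-style tool could discover every scheduler’s directories from the qmgr output mentioned above; this is only a sketch and assumes the output uses the usual “set sched <name> <attribute> = <value>” form:

    # Rough sketch: enumerate each scheduler's priv/log locations so a tool
    # like pbs_diag could collect them, wherever the admin put them.
    qmgr -c "print sched @active" |
    awk '$4 == "sched_priv" || $4 == "sched_log" { print $6 }' |
    while read -r dir; do
        echo "collect from: $dir"   # pbs_diag would archive this directory
    done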

Hmm… now I get your point. For a half-baked scheduler config I think we should represent the state as something else. How about naming it “NEEDS CONFIGURATION”, or something similar? What do you think?

Admins can already configure where to store PBS_HOME, is it really a requirement that they also have the flexibility to put each scheduler’s priv/logs at an arbitrary location, possibly outside of PBS_HOME?

I like “NEEDS CONFIGURATION” but it’s a bit lengthy … how about “UNINITIALIZED” ? If that’s not very informative, then I’m fine with NEEDS CONFIGURATION as well.

There isn’t any requirement to do this. Admins can already configure a different PBS_HOME for the server, a different one for the scheduler, and another for the mom, so there is already a way to set up priv and log directories in an arbitrary PBS_HOME for each daemon. But I get your point; there is no compelling reason to support that much freedom.

I like “UNINITIALIZED” state.

Hey,
A few more comments:

  • What happens if two schedulers set the same priv or log directory? I can kind of see having the same priv directory, but having the same log directory sounds bad. Two schedulers would try to overwrite each other’s logs. Should it complain?
  • I’d rather not see us own the scheduler comment. I’d rather see it as a node’s comment where the admin can set it. If a functioning scheduler always has the comment READY_TO_USE, then the admin can’t use it. We’d always be stomping on it if something bad happened. I believe the node comment is used by us and the admin. If the admin has set it, we won’t overwrite it. If they haven’t set it, we can use it ourselves.
  • I thought job_accumulation_time was being moved off to another RFE. I didn’t think it was going to be implemented by multi-sched.
  • In interface 2 you say that if a scheduler has no partition set, it won’t schedule any jobs. You also say that all queues that are not part of any partition are part of the default scheduler. Does this mean that the default scheduler doesn’t get a partition set? That would mean that any queue that isn’t part of any other partition is owned by it? Are we trying to make the default scheduler just another scheduler? If so, this breaks that. Maybe all queues should start out being part of the ‘default’ partition and the ‘default’ partition be assigned to the default scheduler. This way there isn’t any behind the scenes defaults going on. It does mean that we’ll have to reset a queue to the default partition when its partition is unset.
  • In interface 4 you say that you won’t allow scheduling or scheduler_iteration to be set on the server object. These are stable interfaces that haven’t been deprecated yet. You can deprecate them and print a message, but you need to make sure they still work. I’d suggest that if someone sets them on the server object, they be set on the default scheduler.
  • Interface 7: You talk about the pbs_server being stopped with the -s option. That’s an option to qterm. You should probably say ‘stopped with qterm -s’
  • Interface 9: Currently attributes can’t be defaulted at the queue level. Only resources can be defaulted. Will you add the ability to default attributes?
  • Also interface 9: What happens if someone sets the partition attribute on a job and the job is qmoved? If it is defaulted, it should be fine. It’ll pick up the new default. If it is set manually, it won’t pick up a default. It will remain set. What happens if it is qmoved to a different server that doesn’t have that partition?
  • How does peer scheduling work with multiple schedulers? On the surface it sounds like it’d work fine. You’d map a remote queue to a local queue in the scheduler’s sched_config. What happens if that remote job has a different partition set? Does that matter if you’re doing peer scheduling?

Bhroam

Yes it should complain

The comment field is for the site to set and not the scheduler. I have made the necessary change

The default scheduler should be the main scheduler. If the partition is unset on a queue or node, the default scheduler will manage these resources

Good catch

Added the appropriate change

No. The intent is to set this up in such a way that down the road we can have two schedulers scheduling jobs from the same queue onto different node partitions. For example, one scheduler could handle all jobs in workq requesting one node or less on a subset of nodes, while another scheduler manages all of the multi-node jobs.
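As a loose illustration of that future split (the names are hypothetical and this assumes the proposed job-level partition attribute from Interface 9), it might end up looking something like:

    # Two schedulers servicing the same queue but different node partitions.
    qmgr -c "set sched small_sched partition = 'small_nodes'"
    qmgr -c "set sched big_sched partition = 'big_nodes'"
    # Nodes are split between the partitions; workq itself is tied to neither,
    # and each job's partition would then be set at the job level.
    qmgr -c "set node node001 partition = 'small_nodes'"
    qmgr -c "set node node002 partition = 'big_nodes'"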

I would expect peer scheduling to work fine if we use a job attribute and not a resource. Am I correct in my understanding?

Your new message: “scheduler can not set its log dir to the same dir as or priv directory” reads a little funny. I’d say “scheduler can not set its log or priv directory to the same dir as ”

Interface 6: I’m still not sure I 100% understand how the partition on the queue and job work. Are you saying that the partition attribute on the queue is a special attribute that sets the partition attribute on the job? If that is true, I think we’ve gone from paving the way to the future to making the current implementation more complex than it needs to be with unprecedented caveats. You say that we recommend setting the attribute as a job-wide attribute at the queue level. You can’t set attribute defaults at the queue level. Unless you want to creep the scope of this project to implement that, I think we should rethink the job attribute.

Defaults don’t get applied until the job is moved. This is right before the job is run. The partition value will be queried as it is set on the remote server. If there is no scheduler servicing the job’s remote partition, the job will be ignored.

After a conversation with Bill, a concern came up. Right now there is one usage file, so pbsfs knows where to look without being told. When we move to N usage files, how will pbsfs know where to look? Will we need a way to pass in a path? Or a way to query a named scheduler? I like that option less, because right now PBS can be down and you can still run pbsfs; if pbsfs has to query the server for the path to sched_priv, the server has to be up.

In any case, this will likely be an interface change. It’ll need to be spelled out in the document.

Bhroam

Bringing back some comments from my Apr 26th note:

I suggest keeping the old way of starting scheduler daemons (via OS startup scripts and/or cron), and not introducing a new PBS Pro interface (for managing scheduler daemons).

If such a facility is desired, I suggest it be tackled in a future RFE for a few reasons: (1) it reduces the effort to get this RFE done (and the main goal is merging in existing commercial-only code into OSS), (2) it reduces the effort to get this RFE correct (and testing this feature will be hard, especially with fail-over), (3) it is better to design the “scheduler config in qmgr” all at once (versus part now to start schedulers and part later to add config), and (4) we expect very few sites to use this version of the MultiSched technology, and so will have ample time to improve it in future releases as we get more feedback.

This change to the design would affect interfaces 1, 2, 3, and 7 in the current version v.24.

The text now under “Notes” at the end includes “When there are multiple scheduler objects configures following things might be broken.”

The design needs to explicitly state what is supported and what is not supported, and for supported features, exactly what they do. I suggest explicitly stating anything that “may be broken” as “not supported” (and/or erroneous). One can always add support for these features in the future if they are desired.

Thx!

I completely agree. We should not be introducing new interfaces to start/stop daemons. These are best controlled by OS/manual(admin)/Cluster manager interfaces. None of the PBS daemons are controlled by qmgr, and I do not see any real use case in going that direction.

I think we should remove the need to store the hostname in the server/scheduler object information as well. I suggest we go by the name of the object for the purpose of accepting a registration from a scheduler. The port is useful information that the scheduler needs in order to bind, but the additional hostname does not help with authentication.

If we change the direction of the connection, i.e., instead of the server connecting to each scheduler, we allow the scheduler to register with the server (over a permanent connection), it could be nicer.

About Interface 9: do we really need to add the partition attribute to the job as well? The use case of moving a queue between partitions can be satisfied by temporarily disabling scheduling, waiting for the associated scheduler to show a status of IDLE, and then making the move, no?