PP-337: Multiple schedulers servicing the PBS cluster


I’ve posted a design proposal for PP-337 to add support for creating, running, and managing multiple schedulers that can service a single PBS complex.

Please have a look at the design proposal and provide your feedback.



Why are you removing job_sort_keys? It is often very difficult to emulate the result of key-ordered sorting using a formula. Instead, why not add job_formula as an acceptable sort key and always use key-ordered sorts?

@dtalcott I understand your concern about removing job_sort_keys. Can you please provide a little more detail on why it is difficult to emulate the result of key-ordered sorting using a formula?

Simplifying somewhat, a job sort formula is a sum of weighted values for job attributes. Sometimes the actual job attributes pass through a function first (e.g., log(), sqrt()), but that doesn’t matter here. Without loss of generality, we can normalize the attribute values to range from 0.0 to 1.0. That is, assume the maximum possible value for the first attribute (value1) is max1 and its minimum possible value is min1. Then we set k1 = (value1 - min1) / (max1 - min1). Do similarly for value2, value3, … . Then the job sort formula looks like

k1 * w1 + k2 * w2 + k3 * w3

where the w’s are the weights for each normalized attribute.

Now, for each attribute, denote the smallest difference in attribute value that matters for sorting as delta1, delta2, … . E.g., for integer priorities, any difference is probably significant, so its delta would be 1. Normalize these deltas the same way you did for values: e1 = delta1 / (max1 - min1).

With this background, we are now ready to try to emulate sort keys with a formula. Starting with the trivial case of one sort key,
the obvious formula is

k1 * 1, or just k1

That is, the weight is 1.

Okay, now let’s add the second key. The defining characteristic of keyed sorts is that values for later keys cannot change the order determined by earlier keys. That is, the first key sorts all the items into an ordered list of bins of items that have the same value for the first key. The second key then sorts within those bins. No possible value for the second key can cause an item to move to a different first key bin.

Applying this rule to the case of two keys, the formula becomes:

k1 + k2 * e1

That is, we have to keep the contribution of k2 in the total value to less than the smallest significant change to k1, which is e1.

Continuing with a third key gives

k1 + k2 * e1 + k3 * e1 * e2

So, the weight of each normalized key is the product of the normalized deltas for all previous keys.

This works, provided we can predict a min, max, and delta for each key. For some possible keys, this is easy (e.g., priority). For others, it’s a little harder (e.g., job age ranges from 0 to PBS’s infinite age (5 years?) and has a delta of one second). But, suppose some site wanted to sort jobs based on the job name. Names can be up to 236 characters long, so our formula has to maintain at least ~1800 bits of precision to distinguish among job names and the weight would require ~500 digits to specify. Floating point keys also require care.

All of this complication goes away with keyed sorts.
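The construction above can be checked mechanically. Here is a minimal Python sketch (not PBS code): the three attribute ranges and deltas are invented, and a 0.5 safety factor is applied to each normalized delta so that a later key's contribution stays *strictly* below the previous key's smallest significant step, as the argument requires.

```python
import random

# Made-up attribute ranges: three integer keys as (min, max, delta).
KEYS = [(0, 100, 1), (0, 50, 1), (0, 10, 1)]

def formula(values):
    """Single-formula score: k1*w1 + k2*w2 + k3*w3, where each weight is
    the product of the normalized deltas (e_i) of all previous keys.
    The 0.5 factor keeps each later contribution strictly below the
    previous key's smallest step, guarding the boundary case where a
    later key sits at its maximum."""
    score, weight = 0.0, 1.0
    for v, (lo, hi, delta) in zip(values, KEYS):
        k = (v - lo) / (hi - lo)           # normalize to [0, 1]
        score += k * weight
        weight *= 0.5 * delta / (hi - lo)  # shrunken e_i
    return score

random.seed(42)
jobs = [tuple(random.randint(lo, hi) for lo, hi, _ in KEYS)
        for _ in range(1000)]

keyed = sorted(jobs)                    # true keyed (lexicographic) sort
by_formula = sorted(jobs, key=formula)  # emulated with one formula
assert keyed == by_formula
```

With wider ranges or smaller deltas the weights shrink geometrically, which is exactly the precision blow-up described above for long string keys.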

@dtalcott Thanks for a detailed example of how deriving a single job_sort_formula could be tricky when dealing with a hierarchy of sorts.
Would it help to expose a way for admins to supply an ordered list of formulas? It is sort of what job_sort_key would do, but represented as formulas.

@arungrover Thanks for putting this out. In reviewing it, I believe it should not remove the prime/non-prime options. If you were to place them on the queues instead of removing them, it would simplify the policy object and allow for some possible improvements. For example, if you set attributes on the queue like primetime_enabled=T nonprime_time=T as the defaults, then each queue would be an anytime queue. If the admin set one to false, then it would be either a prime or a non-prime queue. This would allow us to get rid of the prime_prefix and the nonprime_prefix. Next, we could also allow them to set the weekday, Saturday, and Sunday times at the queue level. Or maybe we extend it to all days of the week and also add a weekends option. This way, if a site works four 10-hour shifts, they could also set Friday as a non-prime day. Another thing that would be useful is to allow the site to set different policy objects at the queue level for when a job is in prime time vs. non-prime time.

I think we should also consider adding a dedicated-queue attribute so we can avoid the dedicated time prefix: an attribute or something along those lines to indicate that a queue is a dedicated queue. I think we should also plan on how to remove the holidays file by adding a time object where the admin could define prime times. It would also be nice if they could specify multiple prime windows in a day. As for defining the non-working days, I think we should define those at the server level and allow admins to add multiple years of information. As the days pass we can remove them from the server, and as new days are added we could sort them so the one closest to today shows up first.

That would handle many use cases, but not all. Look at the various comparison routines in sort.c. Most of those could be converted to formulas, but a significant number of comparisons can order items without assigning them constant numeric values. For example, look at cmp_aoe(). It is used when sorting nodes, but the result of the sort depends on the job under consideration.

The keyed sort code already exists: why not let it be used?

@jon and @dtalcott I’ve made changes according to the review comments you have given.

Looks good. How do you intend to set the prime/non-prime time windows along with the holidays?

The holidays file issue is a tricky one. We certainly could have a time object that defines prime/non-prime time and holidays the same way we do today and have that assigned to a scheduler.
Things get murky if we have a requirement to define more than one prime time and have each of these prime/non-prime times have a different policy object assigned to it.

I still haven’t figured out a simple way of expressing all of this with the scheduler object.


Thanks for the additional information. The design as is looks good for now. I have no further comments.

Hey Arun,
I have some comments

  1. If all schedulers have their default priv and logs directories prefixed by the scheduler name, shouldn’t the default scheduler be named “sched” instead of “pbs_sched”? If it is named pbs_sched, shouldn’t the directories be called pbs_sched_priv and pbs_sched_logs?

  2. You say the directories have to be owned by root with certain permissions. What happens if a directory doesn’t exist when you set the attribute? Does it complain or make the directory?

  3. I’m not fond of the name “job_accumulation_time”. It makes me think of the amount of time a job accumulates something, not the server accumulating jobs. How about wait_before_cycle_starts or something with the word wait in it?
    I still don’t think we need this functionality. The scheduler will handle it on its own. The first couple of cycles will be kicked quickly, but as jobs pile up, they slow down and more jobs get accumulated between cycles. Even if we do implement it, the same effect will happen and jobs will pile up and cycles will slow down. It won’t make much difference in the end if it is there or not.
    I think providing this feature will allow an admin to shoot themselves in the foot. If they set this abnormally high, it is just slowing down how fast a job can run. The whole point of multi sched is making it so jobs are quickly run after they are submitted.
    I guess I need to see data on this before I can be convinced it is truly worthwhile.

  4. What happens if you delete a queue that is associated with a scheduler? Does it disappear from the scheduler’s list, or does qmgr give a queue busy message (similar to how you get a node busy message today)?

  5. How do queues move between schedulers? Do you have to remove it from one before you add it to another? What happens to the queue if you just remove it and not add it?

  6. 3328 is the current default value for log_filter. It’s filtering out 2048, 1024, and 256. You want to come up with a positive log_event that includes everything except those.

  7. What happens if you have just a prime or a non-prime policy, but not both? Is the scheduler left without a policy for half of the day?

  8. In interface 4 you say qmgr -c “s sched policy.backfill_depth=3”. Shouldn’t you be setting that on a policy object?

  9. What happens if a node is associated with a scheduler and a queue? You could have a situation where a node is associated with scheduler A and queue B. If queue B is not also associated with scheduler A, you’ll have a problem. Maybe disallow having both set at the same time?

  10. I know it isn’t listed there, but will a scheduling cycle get started for a prime/non-prime/dedicated queue when prime/non-prime/dedicated time starts? This would fix a long-standing issue with PBS where a new prime/dedicated period starts but the scheduler can wait up to 10 minutes before a new job starts.

  11. Do we need PBS_START_SCHED any more? Can it be deprecated?

  12. Now that the server is starting the schedulers, what happens if you do an init.d/pbs status? It normally shows the pid of the scheduler.
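On point 6, the arithmetic behind 3328 can be sanity-checked. The full-mask value below is an assumption for illustration only, not a real PBS constant:

```python
# log_filter = 3328 excludes exactly these three event classes:
filtered = 2048 | 1024 | 256
assert filtered == 3328

# Hypothetical: if event classes occupied only the low 12 bits, the
# equivalent *inclusive* log_event mask would be the complement within
# that range. 0x0FFF is an assumed full mask, not the real PBS value.
FULL_MASK = 0x0FFF
log_event = FULL_MASK & ~filtered
print(log_event)  # 767 under this assumed mask
```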


After reading @bhroam’s comments I guess I missed point 2. We need to provide the option to not run the schedulers as root. Our goal should be that only the processes/functions that need to run as root do run as root (i.e., the launching of a job as a different user). I would suggest that we provide an option that allows a site to run the scheduler as a non-root user. One option might be to add a server attribute (set server pbsadmin = root (default)) and run the various non-mom processes as this user, so that down the road we run pbs_comm, pbs_sched, and pbs_server as the pbsadmin user defined in qmgr.

Hi @bhroam & @jon,

As per your review comments, I’ve made changes to the document. One of the major changes is that a scheduler can now be configured to run as a specific user; if not specified, it will run under the same user privileges as the server.

I still think that this might be useful. It would help not only when more jobs are queued but also when jobs are ending. In our current implementation we start a cycle as soon as we hit a job-end event. It might happen that the void in resources left by that job would only fit one job, but we pay a large time penalty to run that one job by querying the whole universe from the server.
I’m trying to find some data points to prove my point. I hope I find them :slight_smile:

You bring up an interesting point. How about making an attribute called partition and having sched, node, and queue link to each other through this partition? It would help in moving a queue/node in and out of a partition, and if the partition is not set on any of these objects then they get handled by the default scheduler.
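If the partition idea were adopted, the wiring could look roughly like this in qmgr. The syntax below is illustrative only; names like sched_a and P1 are invented and the final attribute names are not settled:

```shell
# Illustrative only -- attribute/command syntax is not final.
qmgr -c "create sched sched_a"
qmgr -c "set sched sched_a partition = P1"   # partition lives on the scheduler
qmgr -c "set queue workq partition = P1"     # queue now serviced by sched_a
qmgr -c "set node node01 partition = P1"     # node now schedulable by sched_a
# Unsetting returns the object to the default scheduler
qmgr -c "unset queue workq partition"
```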

How about letting admins set prime/non-prime policies only when they have specified the all-time policy? That would remove any ambiguity about which policy to consider.

I’ve added changes to the document to make sure that the queue a node is getting associated with is part of the same partition.

Well, you are right. The only problem I see is that to do this, the server needs to be aware of the prime/non-prime times of each scheduler object. I have not thought it through, but I guess it makes sense to make this a cycle trigger event as well.


Now that the schedulers are all handled by the server, maybe the status of PBS could show only the server’s status, and server start/stop would internally shut down all spawned schedulers too. What do you think?

I’ve made some changes to the document; please have a look at it again.


Hey Arun,
Thanks for making the changes. I have a few more comments.

This data shouldn’t be hard to create. Leave scheduling on and submit N jobs (where N is 10000 or more). Implement the feature and try again. The feature shouldn’t be too difficult to implement as a PoC: basically, when an event happens that would cause a scheduling cycle, add a task to the timed task list if there isn’t one already there.
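A toy sketch of that PoC idea in Python (the CycleTrigger name and its semantics are invented for illustration; a real implementation would live in the server's timed-task list):

```python
import time
import threading

class CycleTrigger:
    """Instead of starting a scheduling cycle on every event, queue one
    timed task and let further events ride along with it."""

    def __init__(self, accumulation_time, run_cycle):
        self.accumulation_time = accumulation_time
        self.run_cycle = run_cycle
        self._lock = threading.Lock()
        self._pending = None

    def event(self):
        # Add a timed task only if there isn't one already queued.
        with self._lock:
            if self._pending is None or not self._pending.is_alive():
                self._pending = threading.Timer(self.accumulation_time,
                                                self._fire)
                self._pending.start()

    def _fire(self):
        with self._lock:
            self._pending = None
        self.run_cycle()

cycles = []
t = CycleTrigger(0.1, lambda: cycles.append(time.time()))
for _ in range(50):   # 50 rapid events...
    t.event()
time.sleep(0.4)       # ...collapse into a single deferred cycle
print(len(cycles))
```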

I’m not sure this will work. If you want a prime and a non-prime policy, it requires you to have an all time policy. Will the all time policy be used in any way? This brings up a good point. If you have an all time policy and a prime/non-prime policy, do you get the union of the two policies? What happens if one of the attributes is set to different values between the two policies?

I’d rather just make it an either/or. You can associate a node with a queue or with a partition, but not both. They really do the same thing: in one case you are associating a node with one queue; in the other you are associating the node with a set of queues.

Now that I think about this more, I think this is a fine thing to do later. Right now prime, non-prime, and dedicated time are still defined in files in sched_priv. This would require the server to read and understand those files. We should probably wait on this until the server directly knows when prime, non-prime, and dedicated time starts and stops.

The init script also shows the mom’s status. The document says it will only show the server’s status. Another thing you could do is if the server is up, use qmgr to query the status of all the schedulers. It probably isn’t worth it though. You say when you do a qterm, it will take down the schedulers. There is the -s option to qterm that takes down the scheduler (along with the server). We should probably just make -s take down all the schedulers and leave qterm to just take down the server.

Other comments from the document:

  1. With the way you have it now, the admin has to create the scheduler directories before creating the scheduler. It also forces them to create the scheduler with the directory attributes on the same line if they are different from the defaults. This is kind of restrictive. Maybe change this so you can create the scheduler without the directories, but can’t set scheduling=True until they are created? You’re already doing this check.
  2. Please move backfill_depth to the policy object instead of the sched object. It seems out of place on the sched object since it is really policy.
  3. I’m not sure if you care about this, but if you try to delete a node that is associated to a queue, you get a node busy error. Do you want to do something similar if a node is associated with a partition?
  4. I would call PBS_START_SCHED a pbs.conf variable instead of an environment variable. While it’s technically true you can set it as an environment variable, that’s more to override what is in pbs.conf.
  5. The default for a scheduler’s partition attribute is none (should it be unset?). Does this mean it has to be set before you can set scheduling=True? If so, you should say what error message will be printed in that case.
  6. In the scheduler section, you say “a list of partitions”. Could you change that to “a comma separated list of partitions”?
  7. There is a slight race condition when changing a queue from one partition to another. You first have to unset it, which will return it to the default scheduler, before you set it again. If a scheduling cycle starts during the short time when it is unset, those jobs might start on the default scheduler’s nodes. Maybe say something like, “It is recommended to turn scheduling=false on the default scheduler before changing a queue from one partition to another.”
  8. I’m guessing there is an order of operations when setting partitions? You first have to “create” a partition by setting it on a scheduler before setting it on a queue or node? The thing that makes me think this is that if a scheduler is deleted, the partition attribute is unset on nodes and queues. If this is true, you should probably say this explicitly. Also if you can’t set an unknown partition on a queue or node, you should say what the error message is when you try and set it.
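For item 7, the recommended sequence could be sketched as follows; the qmgr syntax is illustrative, not final:

```shell
# Illustrative sequence to avoid the race when moving a queue between
# partitions -- syntax is not final.
qmgr -c "set sched default scheduling = false"   # pause the default scheduler
qmgr -c "unset queue workq partition"            # queue briefly has no partition
qmgr -c "set queue workq partition = P2"         # attach to the new partition
qmgr -c "set sched default scheduling = true"    # resume
```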

It’s not quite the same thing. In the current setup of PBS, a partition is the whole cluster, and by assigning nodes to a queue, only jobs in that queue can run on that set of nodes. This proposal just allows sites to shrink the entire cluster to a smaller partition. So in cases where sites have multiple clusters consolidated under a single PBS server, they can still have a scheduler managing each hardware cluster as well as assigning nodes inside each hardware cluster to a particular queue.


We could cause this event to restart the scheduling cycle, or we could queue the partition change event so that it waits until the scheduling cycle ends before the queue is made available to the new partition.

I left it as is because I thought there could be a use case where a queue within a partition has a dedicated node to run on. In that scenario, setting both partition and queue would help.

I agree. If we have a better way of defining multiple prime/non-prime periods along with holidays, then the server can read that data and trigger a sched cycle for each of the configured schedulers.

Since the status of each scheduler can now be seen via qmgr, and the server is responsible for starting and shutting down schedulers, it is probably okay not to show sched status from the init script. I’ve made changes to use the “-s” option and also made it the default in the init script.

There is already a default priv/log directory that the scheduler gets, so the object can still be created even if the directory is not there.


I think a partition is a virtual entity that links nodes, queues, and a scheduler together. I thought moving a node away from a partition shouldn’t cause any error; otherwise we wouldn’t be able to move nodes around.


I thought I’d set it to something that isn’t a partition, so the scheduler wouldn’t actually look at any jobs or queues unless it is set properly. Do you still think it needs to be unset?


This is a very interesting point!
I don’t know how to deal with it right now, but my guess is that we should mark the queue as stopped before the move and then start the queue again once it has been moved. The only problem I see is that if the queue has some running jobs in it, then the scheduler in the second complex will start seeing jobs that are running on nodes which aren’t part of the complex it schedules. Maybe we can document that moving queues may result in unexpected behavior if there are running jobs in them.
What do you think?

I’d say the order shouldn’t matter. If the partition attribute is set on queues/nodes before a scheduler is associated with it, then there will not be anyone to schedule the jobs those queues may have. If it makes things simpler, we can probably leave the partitions as is on queues/nodes if the scheduler object gets deleted.

I’ve made the changes you suggested. Please have a look.


Hi @arungrover,
I have a couple of questions:

  1. What will be the effect of the “qrun” command with multiple schedulers? Is there any limitation here?
  2. What about reservations with respect to multi-scheduler?
  3. For a fresh install with this implementation, there will be a default scheduler “sched” as per the EDD. But will there be a default “policy” object created/associated with “sched”?
  4. Suppose a multi_sched is created like: qmgr -c “c sched multi_sched_1 sched_user=‘abc’”. Will user “abc” be added automatically to the managers list, or does it have to be added explicitly?
  5. In the EDD I see "If no name is specified then PBS server will enable/disable scheduling on default scheduler"
    Will the same apply when associating all other scheduler attributes and policies with the default scheduler as well?
  6. Interface 4: "Attribute job_sort_formula has been moved from server to scheduler policy attribute"
    Will there be backward compatibility for the default scheduler? If not, the upgrade case has to be mentioned for restoring the job_sort_formula configuration.

@arungrover, in Interface 7 it is mentioned that the server will connect to the multiple schedulers on their respective hostnames and ports (which are part of their corresponding sched objects).
This implies that multiple schedulers can be started by the server on a host other than the one where the server is running. If that is the case, it would be better to give more details on how the server is going to start a scheduler on a different host versus the same host.