Throttling job attribute updates from scheduler to server

Hi,

I’m proposing 2 enhancements:

  1. A sched object attribute which will control how often the scheduler sends job attribute updates to the server
  2. Shifting the responsibility of accruing eligible time from the server to the scheduler. This enables (1), because delaying accrue_type updates would then no longer delay the accrual of eligible_time for a job.

The motivation is performance. In my test setup, with 100k jobs, 50k ncpus, node grouping on and eligible time on, the scheduling cycle took around 25 minutes. With attribute throttling, it took only around 8.5 minutes. That’s ~3 times faster.
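To make (1) a bit more concrete, here’s a very rough sketch of the scheduler-side change I have in mind (all structure and function names below are made up for illustration; only pbs_alterjob() is a real IFL call):

```c
/*
 * Rough sketch only: instead of calling pbs_alterjob() as soon as an
 * attribute changes, the scheduler queues the update and flushes the
 * queue every 'attr_update_period' cycles.
 */
#include <stdlib.h>
#include <string.h>
#include <pbs_ifl.h>

struct pending_update {
	char *job_id;			/* job the update applies to */
	struct attropl *attribs;	/* attribute list to send    */
	struct pending_update *next;
};

static struct pending_update *pending_head = NULL;
static int cycles_since_flush = 0;

/* called wherever the scheduler would otherwise call pbs_alterjob() directly */
void queue_attr_update(char *job_id, struct attropl *attribs)
{
	struct pending_update *p = malloc(sizeof(*p));
	p->job_id = strdup(job_id);
	p->attribs = attribs;
	p->next = pending_head;
	pending_head = p;
}

/* called once at the end of every scheduling cycle */
void maybe_flush_updates(int conn, int attr_update_period)
{
	struct pending_update *p;

	if (++cycles_since_flush < attr_update_period)
		return;		/* keep deferring */

	for (p = pending_head; p != NULL; p = p->next)
		pbs_alterjob(conn, p->job_id, p->attribs, NULL);

	/* freeing the list is omitted for brevity */
	pending_head = NULL;
	cycles_since_flush = 0;
}
```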

Sched sends updates for the following job attributes:
ATTR_estimated.soft_walltime: updated when a job exceeds its ATTR_l.soft_walltime, if soft_walltime has been set for the job.
ATTR_accrue_type: updated if the site is using eligible time and a job starts/stops accruing eligible time because it got preempted, or can’t be run, or is a job array whose subjob was run by the scheduler. This might be the only attribute which shouldn’t be delayed.
ATTR_l.walltime: updated for jobs which are run via shrink-to-fit
ATTR_pset: updated for any job that’s run if node grouping/placement sets are used
ATTR_sched_preempted: gets unset when a previously preempted job is run
ATTR_estimated.start_time/exec_vnode: gets updated when a job is calendared by the scheduler.
ATTR_comment: gets updated when a job cannot be run, usually only once per job, not every cycle.

Trade-off:
Users will see a delay in job attributes being updated on their jobs; the delay depends on how often sched cycles occur, and it is customizable. Sites which don’t want any delay can turn off throttling altogether, in which case the behavior will be similar to what it is today, at the expense of worse performance, with one exception: eligible time today is accrued in the server, so when users do a stat, the server can compute the up-to-date eligible time value and return it to them. With this change, even if admins turn throttling off (i.e., the scheduler sends attribute updates every cycle), the value of eligible time seen by a user will be the value that sched last sent to the server, so it can be stale, although it will be accurate at the scheduler’s end and the scheduler will schedule jobs correctly, as it does today.

Before I create a design document, I wanted to know whether the trade-off is acceptable or not. So, please provide feedback and let me know. Specifically, requesting @bhroam, @scc, @billnitzberg and @subhasisb for opinions.

Hey @agrawalravi90
You can’t reverse the ownership of accrue_type. The scheduler is not the only entity that sets it. The server has criteria that will change the accrue type. I think the right answer is to set the accrue type to eligible by default. This is the most likely value for the accrue type. This way the scheduler doesn’t have to change it. If it does have to change it, the job will only gain a little bit of eligible time incorrectly.

You can not delay the setting of walltime for STF jobs. It HAS to be set before we ship the job off to the mom.

sched_preempted is a simple one to get rid of. We should have done it already. When the server took over a lot of the preemption functionality, it took over setting this attribute. There is nothing saying we can’t unset it in pbs_runjob()/pbs_sigjob() when the job is started again. This is one less thing to update.
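To illustrate, the server-side change could be as small as something like this (the attribute index, flag names and helper usage are approximations, not verbatim server code):

```c
/*
 * Illustrative only: roughly where the server could clear the
 * sched_preempted marker itself when a previously preempted job is run
 * again, instead of waiting for the scheduler to send an alterjob.
 */
static void clear_sched_preempted(job *pjob)
{
	/* index name approximate */
	attribute *pattr = &pjob->ji_wattr[(int)JOB_ATR_sched_preempted];

	if (pattr->at_flags & ATR_VFLAG_SET) {
		/* free the old value and mark the attribute modified so it
		 * gets saved/propagated like any other server-side change */
		job_attr_def[(int)JOB_ATR_sched_preempted].at_free(pattr);
		pattr->at_flags |= ATR_VFLAG_MODIFY;
	}
}
```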

You won’t get much by delaying the update of estimated.soft_walltime. It is only set when a job’s soft_walltime is extended. It extends by 100% of its soft_walltime.

estimated.start_time/exec_vnode can change every cycle, but it is for a small handful of jobs (backfill_depth). Still not gaining much here.

The main one is the comment. It will likely get set and not change. The only thing you gain by delaying this is the case where a person is submitting a ton of jobs. You might run a bunch before they get their comment set. Otherwise you are just putting off the pain. Not only that, customers want to know why their job is not running.

As for your 8.5m number, I don’t know if I believe it. This means you probably set only 1/3rd of the attributes. You still need to set the other 2/3rds, but they will happen in future cycles. All 25m will be spent, but just over time. I’m not saying this is a bad thing, I’m just saying your numbers are overly optimistic.

Bhroam

I wonder if there is a way to eliminate or amortize all (or almost all) the attribute setting… ideas:

  1. figure out which attribute settings could be handled without sched-server communication, e.g., are some attributes no longer needed?, could we refactor code to move calculations entirely into the server (or into the scheduler and eliminate the attribute entirely)?
  2. if there isn’t one already, create an API to set multiple attributes in one transaction, and bundle, e.g., at the end of the scheduling cycle do one transaction to set all the attributes that need setting.
  3. if a transaction is happening anyway, can we create a new API that allows issuing the attribute settings at the same time (e.g., bundle the “run” and setting wall time for shrink-to-fit jobs together)?
  4. as Bhroam suggested, maybe changing some defaults would allow setting attributes less often, or we could even have an attribute automatically revert to its default if it hasn’t been updated in a while?

Also, I know it’s a benchmark meant to stress the scheduler (not necessarily real-world realistic), but we should be striving for scheduling performance that “starts all jobs that should be started within seconds of becoming eligible to run” (not minutes). So, if we continue to have a single scheduler per partition, that means scheduling cycles << 1 minute.

It took 8.5 minutes when the scheduler doesn’t send ANY updates. With sched throttling set to send updates every 5 cycles, I got 8.5, 8.5, 8.5, 8.5 and 17 minutes for the 5 cycles. The cycle which sent updates took 17m instead of 25m because I was not sending the updates for pset in my branch. My thinking was that if we remove pset, then we can delete pending updates for jobs that run before the 5th cycle (i.e., no need to send updates for comment or accrue_type if the job ran by the time we decided to send the updates out). I didn’t know about STF jobs; if they can’t be delayed then we can make an exception for them, and hopefully they won’t be too large in number. I was focusing on ‘pset’, ‘accrue_type’ and ‘comment’ as they seemed to be the most prolific.

Can you please explain this? I thought that scheduler was the one which looks at limits and decides what the accrue_type of a job should be.

For sites which use limits, we might still end up with a large number of alterjobs to make eligible jobs ineligible, but if we can’t do anything else about accrue_type then ya, this would be better than what we have today.

That’s exactly it, this will be useful for sites which have a large volume of jobs which go in and out of the system quickly and would prefer faster scheduling over some delay in PBS telling users why their jobs aren’t running. Since the setting is customizable, admins can choose to set throttling lower, or turn it off completely if they don’t desire it.

interesting thought … ‘pset’, ‘comment’ and ‘accrue_type’ are the attributes for which sched sends the most updates to the server. We are deprecating pset. I don’t think we can do much about the comment attribute; it needs to come from sched and needs to be communicated to the server. accrue_type is tricky: it is needed for eligible_time, which today is owned by the server, but sched kind of owns accrue_type, since it is the one that decides whether a job should accrue eligible time or not. That’s why I had proposed moving ownership of eligible_time to sched as well.

We already do this

That’s a good thought, but it will require a change in the pbs_runjob IFL. I guess we could instead add a new IFL call which could do this; it would certainly be useful and would get rid of alterjob calls for ‘pset’, ‘accrue_type’ and ‘walltime’ for STF jobs.
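Just to illustrate, such a call might look something like this (purely hypothetical, loosely modeled on the existing pbs_runjob()/pbs_alterjob() signatures; nothing like it exists today):

```c
/*
 * Hypothetical IFL call: run a job and apply a set of attribute changes
 * (e.g. walltime for a shrink-to-fit job) in the same request, so the
 * scheduler would not need a separate pbs_alterjob() round trip.
 */
int pbs_runjob_with_attrs(int connect,             /* server connection handle */
			  char *job_id,            /* job to run               */
			  char *location,          /* where to run it          */
			  struct attropl *attribs, /* attributes to set        */
			  char *extend);           /* extension string         */

/*
 * Possible usage for an STF job (illustrative; conn, jobid and
 * exec_vnode_str are assumed to come from the surrounding scheduler code):
 *
 *     struct attropl wt = {NULL, ATTR_l, "walltime", "01:30:00", SET};
 *     int rc = pbs_runjob_with_attrs(conn, jobid, exec_vnode_str, &wt, NULL);
 */
```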

Re: “at the end of the scheduling cycle do one transaction to set all the attributes that need setting”

I was thinking we could delay all attribute changes to the end of the cycle and do a single API call to set them all, perhaps asynchronously. (Of course, we could also bundle changes where it makes sense, like wall time for shrink-to-fit.) If we already do that… then… ouch.

So sched today bundles all attr updates for a job into one call, but it’s still one call per job. Did you mean a single IFL call (or a few) for all jobs? That’s an interesting idea … many updates might be common across many jobs, so if we could figure out a way to say “jobs 1 through 10k, update comment to xyz” that’d be great. Or am I totally misunderstanding what you are suggesting?

Yep, I meant one call per scheduling cycle (not one per job per cycle). Thx.

Ok, thanks for clarifying. I’ll have to think more about this. Initial thoughts: sending all updates in a single IFL call might hold up the server, as it is single threaded, so maybe it would be better to club similar updates together; and for multi-server we’ll also have to club the updates relevant to each server, since each job will be owned by a particular server. So the scheduler will have to do some work to create these buckets of updates, which might not be a lot of work. Performance wise, we’ll have to see how expensive these calls will be at the server’s end, but it’ll reduce the total number of alterjobs from tens of thousands to just tens, so it might be worth trying out. Let me know what you think.
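Roughly, the bucketing I’m imagining would look something like this (illustrative names only; in particular, a batched alter call like pbs_alterjob_list() does not exist today):

```c
/*
 * Very rough sketch of the "bucketing" idea: group pending updates by
 * owning server and by identical attribute payload, then send one
 * batched alter per bucket instead of one pbs_alterjob() per job.
 */
struct update_bucket {
	int server_conn;		/* connection to the owning server   */
	struct attropl *attribs;	/* the common attribute payload      */
	char **job_ids;			/* jobs that need this exact payload */
	int njobs;
	struct update_bucket *next;
};

void flush_buckets(struct update_bucket *buckets)
{
	struct update_bucket *b;

	for (b = buckets; b != NULL; b = b->next) {
		/* hypothetical batched IFL call: one request, many jobs */
		pbs_alterjob_list(b->server_conn, b->job_ids, b->njobs,
				  b->attribs, NULL);
	}
}
```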

Hey @agrawalravi90
I like @billnitzberg’s idea of making this call asynchronous. How about we move the throttling to the server? It’s got to be the save to the database that’s slow. The scheduler can make its alterjobs, but the server can immediately reply. The scheduler doesn’t do anything if the alter fails, so there really isn’t any need to hold things up until the server can tell us that. The server can keep a queue of alterjobs. When it isn’t doing anything else, it can start applying them. This way you won’t have that horrid 17m cycle every so often. That 17m is spread over time, whenever the server is free.

Bhroam

Unfortunately not. When I remove all other configuration and use the defaults for the same test, the scheduler runs 50k jobs, cannot run the next 50k, and sends attr updates for only the comment attribute, which does not get saved to the db. The sched cycle still takes around 15 minutes, 7 of which are spent sending updates for the comment attribute.

@bhroam and @billnitzberg I have another proposal: make the job attr updates from sched to server asynchronous, similar to what we are discussing for asyrunjob here: Scheduler can spend 94% of its time waiting for job run ACK

When I make the job attr updates from sched asynchronous, the sched cycle time comes down to 8.5 mins; most of that was just the time taken to run the jobs. If I combine the changes for asyrunjob as well, the sched cycle time goes down to just 1m50s, compared to 25 minutes before.

Hey @agrawalravi90,
I don’t think I was clear in my note. I was suggesting we make job attribute updates asynchronous because the scheduler doesn’t really do anything with the reply. It’ll report the failure, but that isn’t really that interesting. It usually reports a failure because the job was deleted between when the cycle started and when the scheduler looked at it. That really isn’t interesting and might cause alarm because the scheduler failed to do something.

So what I’m suggesting

  1. The scheduler sends the job updates as normal
  2. Don’t have the scheduler wait for a reply from the server. Alter and march on.
  3. Hold the alter job requests in a queue on the server.
  4. When the server is not busy with other important tasks, it can process the queue.

Another slightly different idea is to make it a priority queue where higher priority attributes like accrue_type can be processed first.

What this does is eliminate that harsh 17m cycle you get every 5 cycles when you do send the updates.
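Roughly what I’m picturing (illustrative names only, except find_job(), which is a real server routine):

```c
/*
 * Sketch of the server-side idea: reply to the scheduler's alter
 * immediately, park the request on a queue, and drain the queue when
 * the server has nothing more important to do.
 */
struct deferred_alter {
	char *job_id;
	struct svrattrl *attr_list;	/* decoded attribute list               */
	int priority;			/* e.g. accrue_type higher than comment */
	struct deferred_alter *next;
};

static struct deferred_alter *alter_queue = NULL;

/* called from the modifyjob request handler, after replying to the client */
void enqueue_deferred_alter(struct deferred_alter *req)
{
	/* insertion by priority omitted for brevity */
	req->next = alter_queue;
	alter_queue = req;
}

/* called from the server's main loop when there is no higher-priority work */
void process_deferred_alters(void)
{
	while (alter_queue != NULL) {
		struct deferred_alter *req = alter_queue;
		alter_queue = req->next;

		job *pjob = find_job(req->job_id);
		if (pjob != NULL) {
			/* apply req->attr_list to pjob just as modifyjob would */
		}
		/* if the job is gone, find_job() fails and we drop the update cheaply */

		/* freeing req omitted for brevity */
	}
}
```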

Bhroam

From my perspective, we should try to make everything in PBS asynchronous, so this sounds great to me. This is the only philosophy that will get us to scale. My thinking is that there’s no reason to wait (block) to verify success, especially if (1) success is the usual outcome and failures are exceptions, and (2) detecting failure later is not catastrophic, and (3) we can detect failure later and handle it appropriately (which might just be to log it). Thx!

Thanks for clarifying, Bhroam, and sorry for misunderstanding what you had suggested. If we make alterjob updates asynchronous, do we still want to throttle attribute updates from sched to server? I think it might still be useful, as we might otherwise fill up the TCP buffers, which could start causing slowness. I think I saw some of this in my test: the first 50k runjob calls + 50k alterjob calls to update pset + 10k alterjobs to update eligible time & comment got done in just 10 seconds, but the remaining 40k alterjobs took 1m40s. So it might still be useful to throttle, or to consider Bill’s suggestion of clubbing updates into fewer alterjob calls.

@billnitzberg thanks for your inputs! It sounds like we all agree that at least pbs_alterjob calls from scheduler to server should be made async.

Maybe we don’t need throttling any more. I had thought that saving the jobs to the database was slow, but if it isn’t, then maybe we don’t need to care. We just need to worry about filling up the TCP buffers like you said. We are sending 100k updates. The server will definitely not be servicing them as fast as we are sending them.

Is there a way to test this? Test if we can fill up the buffers by sending too many job updates?

Bhroam

I tried running the same setup, 100k jobs, 50k ncpus, node grouping and eligible time on, with my attribute throttling POC: the sched cycle took 16 seconds compared to 1m50s before. So I think it should still help.

The async stuff is pretty good. However, that will not reduce the amount of traffic to the server, and thus the server will still have to do a whole lot of work. If we can reduce the amount of work we make the server do, then that would obviously help, and the throttling would have helped there.

Bundling all the actions into one call might not be required any more, since that’s roughly what async achieves. Both bundling and async make the server do the same amount of work and basically cut the IO jitter; in fact both may end up transferring almost the same volume of data anyway.

So I think we should do the async and then also do the throttling.

Thanks for your inputs Subhasis. @bhroam let me know what you think, thanks.

The only thing throttling does is remove the need to send some updates, because the job ended between when the scheduler wanted to send the update and when the update was actually sent. This will add more work to the scheduler, because it will have to determine that a job has ended. This is not something the scheduler does today: it queries the entire universe and acts on what is there. It doesn’t care what went away between the last cycle and this one. Once we move to a persistent scheduler, it will handle this case.

If we move back to my idea of the server throttling the updates, then it just needs to keep a queue of updates. When it is not busy, it will handle them. If a job is not there anymore, find_job() will fail and the server has done very little work. If we let things buffer in the TCP buffers, then the server will handle them in order, rather than handling them when it isn’t busy. I personally think a runjob request is higher priority than an alter of a previous job (especially if we’re thinking of delaying them anyway), but if we’re handling IFL calls in order, we’ll handle many alters before a run.

Bhroam