Throttling job attribute updates from scheduler to server

Exactly. I think letting TCP buffer things may not be the right approach to throttling. As you mentioned, a runjob request would be queued behind several hundred alterjobs, so we would not achieve any real benefit. Besides, the server does not know what is waiting for it in the queue, and it generally treats the scheduler's requests with priority, so it usually goes out of its way to handle scheduler messages. There could be two options:

  1. The server reads all alterjob requests, parks them in an internal queue, and applies them in batches at intervals (so the server would still be throttling them, but without depending on a TCP queue). It would still be able to act on a runjob request as quickly as before. A rough sketch of this is below the list.

  2. Maybe we need to think of this beyond throttling. Maybe the scheduler can actually cut down on some of these updates at the source (the scheduler) itself. For example, is it really necessary to update attributes of jobs with very small walltimes? There might be more criteria like this…
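
For option 1, something along these lines is roughly what I have in mind. This is only a sketch with made-up structure and function names, not actual PBS server code: alterjob requests get parked on an internal queue as they are read, and a batch is applied every couple of seconds, while runjob requests would keep being handled immediately.

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define FLUSH_INTERVAL 2   /* seconds between batch applies (made-up value) */

struct pending_alter {
    char jobid[64];
    char attr[64];
    char value[256];
    struct pending_alter *next;
};

static struct pending_alter *alter_head, *alter_tail;
static time_t last_flush;

/* park an alterjob request instead of applying it right away */
static void queue_alter(const char *jobid, const char *attr, const char *value)
{
    struct pending_alter *p = calloc(1, sizeof(*p));

    if (p == NULL)
        return;
    snprintf(p->jobid, sizeof(p->jobid), "%s", jobid);
    snprintf(p->attr, sizeof(p->attr), "%s", attr);
    snprintf(p->value, sizeof(p->value), "%s", value);
    if (alter_tail != NULL)
        alter_tail->next = p;
    else
        alter_head = p;
    alter_tail = p;
}

/* apply everything queued so far, at most once per FLUSH_INTERVAL;
 * meant to be called from the server's main loop */
static void flush_alters(void (*apply)(const char *, const char *, const char *))
{
    struct pending_alter *p, *next;

    if (time(NULL) - last_flush < FLUSH_INTERVAL)
        return;
    for (p = alter_head; p != NULL; p = next) {
        next = p->next;
        apply(p->jobid, p->attr, p->value);
        free(p);
    }
    alter_head = alter_tail = NULL;
    last_flush = time(NULL);
}

static void apply_now(const char *jobid, const char *attr, const char *value)
{
    printf("applying %s %s=%s\n", jobid, attr, value);
}

int main(void)
{
    queue_alter("1234.server", "comment", "Not Running: insufficient resources");
    queue_alter("1234.server", "accrue_type", "1");
    flush_alters(apply_now);   /* flushes right away since last_flush starts at 0 */
    return 0;
}
```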

What say?

We don't need to wait for the job to end, right? We should be able to delete all the "can't run" related attribute updates of a job when it runs, e.g. comment and accrue_type don't need to be updated if the job runs. There might be some job-run-related attribute updates that still need to be sent, but since we are removing pset soon, walltime for STF jobs might be the only job-run-related update, which hopefully will be a small number, and we'll need to send it immediately anyway, so we won't cache it. So, overall, we might be able to avoid many attribute update requests if most jobs run by the time sched has to send updates. And in my POC I didn't see any significant performance loss on the scheduler side from the extra effort of maintaining the attribute update cache.
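
Just to make the idea concrete, the shape of the cache in my POC is roughly like the sketch below (the names here are illustrative, not the actual POC code): the scheduler remembers the deferred "can't run" comment per job, throws it away if the job runs, and only sends whatever is left at the end of the window.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

struct pending_update {
    char jobid[64];
    char comment[256];            /* deferred "can't run" comment */
    struct pending_update *next;
};

static struct pending_update *pending;

/* remember (or overwrite) the deferred comment for a job */
static void cache_comment(const char *jobid, const char *comment)
{
    struct pending_update *p;

    for (p = pending; p != NULL; p = p->next)
        if (strcmp(p->jobid, jobid) == 0)
            break;
    if (p == NULL) {
        p = calloc(1, sizeof(*p));
        if (p == NULL)
            return;
        snprintf(p->jobid, sizeof(p->jobid), "%s", jobid);
        p->next = pending;
        pending = p;
    }
    snprintf(p->comment, sizeof(p->comment), "%s", comment);
}

/* the job ran (or was deleted): its "can't run" updates are stale, drop them */
static void drop_pending(const char *jobid)
{
    struct pending_update **pp = &pending, *p;

    while ((p = *pp) != NULL) {
        if (strcmp(p->jobid, jobid) == 0) {
            *pp = p->next;
            free(p);
            return;
        }
        pp = &p->next;
    }
}

/* at the end of the throttling window, send whatever is still pending */
static void send_pending(void)
{
    struct pending_update *p, *next;

    for (p = pending; p != NULL; p = next) {
        next = p->next;
        printf("alterjob %s comment=\"%s\"\n", p->jobid, p->comment);
        free(p);
    }
    pending = NULL;
}

int main(void)
{
    cache_comment("101.server", "Not Running: not enough free nodes");
    cache_comment("102.server", "Not Running: would exceed queue limit");
    drop_pending("101.server");   /* job 101 ran, so its comment is never sent */
    send_pending();               /* only job 102's comment goes to the server */
    return 0;
}
```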

That won't really help, though. Sched took 2 mins even with async run + async alter, so sched isn't waiting for the server at all; it still takes 2 mins instead of the 16 seconds without alters. It might help the server, but the server doesn't really take that long: when sched takes 2 mins, the server takes only around 10-15 seconds more than sched. So, throttling on the server side might not be that valuable.

@bhroam Just a gentle reminder, would be great if we could converge on this soon, thanks!

Hey @agrawalravi90,
Yes, it is true that we can delete the pending update when we run the job. This will take care of most of the cases. The scheduler still has to handle the case of a qrun -H or a qdel.

As for your numbers of 16s vs 2m, that is not really the reality. Yes, 16s looks nice, but you are just putting off the pain. 5 cycles later, you're going to have a cycle that takes 1:50. It isn't 5 two-minute cycles we are avoiding either. We only really update the comment once, so if we had that 2m cycle and no new jobs came in, the next cycle would be back in that 16s range.

I still don't like the idea of the scheduler handling it because I don't think it is the scheduler's job. The server is the one who really cares, so we should send them all, and the server can then handle them when it wants.

In any case, we’re talking multi-server here. 2m will be much faster because there are N servers handling requests instead of just one.

Bhroam

I agree that sites which don't see lots of new jobs every cycle probably won't benefit much from it, but at worst it will be as good as what we have today, and at best it will be much faster.

I don't think that multi-server, or anything we do in the server really, can bring down the time taken by the scheduler. The 2-minute sched cycle happens when there are async run jobs + async alter jobs. So, even if there are multiple servers, since the calls are async, sched will still take the same amount of time. In fact, with multi-server, things might slow down for the scheduler because each request will need to go to a different server, so there'll be some additional processing needed to figure out which server to send each request to.

I disagree that it will be "much faster". Yes, 16s is less than 2m, but that 2m is coming 5 cycles later. So at worst it is as fast as it is today, with extra complexity in the scheduler. At best, it is probably not that much faster. Keep in mind that your original numbers were based on much longer cycles. The speedup you saw was from 25m to 17m. A lot of jobs started and ended in the four 8-minute cycles. In our case, we're talking about four 16s cycles. The 5th cycle will probably not be much faster than the original 2m cycle.

From what I understand, the sharding logic is very simple and quick. Deciding which server to go to is not going to take any extra time. You do have a point about how multi-server won't help now that we have async calls.

Bhroam

Without any throttling, each cycle will take 2m, right? With throttling set to 5, only 1 in 5 cycles takes 2 minutes. I now feel like we are not on the same page.

After async run + alterjob:

5 cycles without throttling:
2m each, so 10m

5 cycles with throttling set to 5:
first 4 cycles take 16s each, so 64 seconds
5th cycle takes 2m
Total: around 3m, which is much faster, right?

Hey @agrawalravi90,
You have a point. I wasn’t thinking things through correctly. 3m vs 10m is quite a difference.

I have another suggestion that is a hell of a lot simpler and will end up with pretty much the same results.

What I understand from your current design:
We set some tunable to the number of cycles we are going to wait. For this example, let’s say 5.
When the scheduler wants to send attributes to the server, it instead stashes them somewhere in global space. In future cycles, if the scheduler wants to update the attributes again, we update our stash. If a job runs, we dump its entries from the stash. After 5 cycles, we submit the stash to the server. There is still the issue of how to dump the stash when an admin does a qdel or qrun -H.

My idea is much simpler and will achieve mostly the same thing. It relies on the fact that the scheduler does a lot of the same work each and every cycle. Since we're not going to send the updates for 5 cycles, do we care what the scheduler wanted to send in the first 4? Why stash them and rewrite the stash 4 times before we finally send it?

Why don't we keep a counter on the scheduler object of the number of cycles since the server started? The scheduler will query this counter. If the counter % 5 == 0, we send updates for that cycle. Some jobs will get updated in the first cycle we see them; other jobs will have to wait the full 5 cycles before we send an update. We don't care about jobs that were qdel'd or that ended since the last update, because they will be gone. All in all, we still have four 16s cycles and one 2m cycle. With this approach it literally IS four 16s cycles and one 2m cycle. With yours it won't really be: that 3m will be spread more gradually over all 5 cycles, because the jobs won't all come in at once, so the 5th cycle for each job won't land at the same time.
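
In rough C, the sort of thing I'm picturing is below; the names and the tunable are made up for illustration, not a real scheduler attribute:

```c
#include <stdio.h>

#define ATTR_UPDATE_PERIOD 5    /* hypothetical tunable; 1 == today's behavior */

static unsigned int cycle_count;

/* called once at the top of each scheduling cycle: bump the counter and
 * report whether this is a cycle on which we send the deferred updates */
static int send_updates_this_cycle(void)
{
    cycle_count++;
    return (cycle_count % ATTR_UPDATE_PERIOD) == 0;
}

int main(void)
{
    unsigned int i;

    /* only every 5th cycle (5, 10, ...) reports "send updates" */
    for (i = 1; i <= 10; i++)
        printf("cycle %u: %s\n", i,
               send_updates_this_cycle() ? "send updates" : "skip updates");
    return 0;
}
```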

What do you think?

Bhroam

Thanks for the suggestion @bhroam. I had also thought about this but didn't think it would work, because the scheduler itself relies on the attribute updates in some cases. I guess the most problematic one is eligible time: how will we account for that in the next cycle?

We need to either
A) tell the admin that using this feature will affect the accrual of eligible time. It can take several cycles for a modification of accrue_type to reach the server, which means the job will keep accruing the old type of eligible time until the update is sent; or
B) not delay sending accrue_type. I don't really see this being a problem. We don't set accrue_type often, and we especially won't set it often if we start jobs off accruing eligible time, which should be the majority of jobs.

Even though when we designed the eligible time feature we came up with all these different types of time a job could "accrue", it really comes down to two things: accruing eligible time or not. By starting the job in initial time, we're not accruing eligible time. There is no real difference between that and setting it to ineligible time to start. I don't see a problem erring on the side of helping the job rather than hindering it.

Bhroam

Ok, I guess that would be a good enough first step (option B + default accrue_type = eligible time)

About the throttling strategy that you suggested, some questions:

Why not just keep the counter in the scheduler itself? I don't really see any benefit of querying it from the server, other than avoiding a persistent global in sched.

Not sure what you meant by this: how do we decide which jobs will be updated and which won't? Did you mean that we will send updates like walltime for STF jobs or accrue_type for whichever jobs need them, and that updates for the rest of the jobs will be ignored until the 5th cycle?

Again, I'm not sure I understand. Are you suggesting that we send updates for a job after it's been seen by the scheduler 5 times? How will we keep track of the number of cycles for each job? Won't it require us to add a job attribute, which the sched itself would then need to update every cycle?

It's a counter, right? We send the updates every 5th cycle. If a job is submitted in cycle 3, it waits a couple more cycles before being updated. If a job comes in the cycle after we send updates, it waits 5 cycles.

Maybe I misunderstood what you were proposing; I thought you were proposing holding each job's updates for 5 cycles. Since jobs are submitted at different rates, that means we'd be sending updates every cycle, for whichever updates have been held for 5 cycles. Jobs that were submitted 2 cycles ago would need to wait 3 more cycles to have their updates sent. Were you suggesting having a counter like I am, and submitting every 5 cycles regardless of how many times we've seen a job?

I was talking to @arungrover today and he reminded me of something. It isn’t going to be 10m vs 3m. After we update the comments and accrue_type once, we won’t do it again unless it changes. This means each cycle won’t be 2m. The first one will, and the rest will be much shorter.

Can you run a test? Can you run 5 cycles without the throttling RFE? See the cycle lengths for cycles 2-5? I’d be surprised if they are very long. If we really are just delaying the 2m cycle for 5 cycles, where the other 4 cycles would have been very short anyway, this RFE won’t help much.

Bhroam

Yes, basically the scheduler sends updates every n cycles, n = 5 in this case, regardless of how many times it has seen each job. This will be very simple but still effective.

My test case submitted 50k jobs every cycle. If there aren't any new jobs every cycle, then yes, throttling doesn't make sense. The idea is that this will be useful for sites which have a large job submission rate and see many new jobs every cycle. Sites which don't have a large job submission rate probably won't need many of the performance enhancements that we've been planning anyway: multi-server, async job runs, multi-threaded scheduler, etc. Such sites can set n to 1 and the scheduler will then behave exactly the way it does today.

I feel this will help with server load as well, enhancing scalability. If we throttle, we essentially cut down the total number of requests to the server. In a large distributed system of exascale size, it could be okay for updates to be seen a bit later, especially if it is mostly information for the user, like job comments. For example, it might be okay to wait a while (more than a cycle) to update why a job is not running. It would be equivalent to a "very long scheduling cycle" - and if the job runs in the meantime, well, we saved an update. If it does not run even after a couple of cycles, the comment appears…

I mean, the user does not really care whether the comment shows up on the job in the very first cycle - in fact the end user would not know about the cycles anyway, no?

Now, of course, how long is too long is the question?

You have a good point that there is little difference between a long scheduling cycle and several short ones where we send updates at the end of the Nth one.

I still don't think the machinery for keeping track of all of the updates per cycle is necessary. The scheduler wants to update attributes every cycle anyway. If we just send the attribute updates every Nth cycle for every job, I think the result will be the same.

Bhroam

I'm ok with the idea of sending them every Nth cycle without caching them in sched. I was just hoping to also throttle accrue_type, but maybe this is good enough for now, along with changing the default value of accrue_type to eligible so that sites which don't have strict limits in place can still benefit from throttling.

Alright, sounds like we are converging towards this (a rough sketch of how the rules combine follows the list):

  • Send "Cannot run" type attribute updates from sched every Nth cycle; N is admin configurable.
  • If “accrue_type” needs to be updated for a job, then we’ll ignore the throttling window & send all attribute updates for that job.
  • Change the default accrue_type of jobs to “eligible”, this should help reduce the accrue_type related updates.
  • All job run type attribute updates will be sent like before. We are removing pset soon, so hopefully these updates won’t be too prolific.
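
As a rough illustration (the function, flags, and values below are made up, not the final design), the rules above might combine into a single check like this:

```c
#include <stdio.h>

/* decide whether a given attribute update should be sent this cycle */
static int should_send_update(unsigned int cycle_count, unsigned int period,
                              int is_run_related, int accrue_type_changed)
{
    if (is_run_related)
        return 1;               /* e.g. walltime for STF jobs: never throttled */
    if (accrue_type_changed)
        return 1;               /* accrue_type bypasses the throttling window */
    return (cycle_count % period) == 0;   /* "cannot run" updates wait for cycle N */
}

int main(void)
{
    /* a "cannot run" comment update in cycle 3 with N = 5 is held back... */
    printf("%d\n", should_send_update(3, 5, 0, 0));   /* prints 0 */
    /* ...but the same job's accrue_type change would be sent right away */
    printf("%d\n", should_send_update(3, 5, 0, 1));   /* prints 1 */
    return 0;
}
```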

Does that sound good?

Nice summary @agrawalravi90
I like it

Bhroam

Great, thanks Bhroam.

@billnitzberg and @subhasisb, let me know if you have any further comments; otherwise I'll proceed with this.

Hey all,

I’ve created the following design document for this:
https://pbspro.atlassian.net/wiki/spaces/PBSPro/pages/1684045848/New+Sched+attribute+to+throttle+job+attribute+updates

Please provide comments, thanks!