Scheduler can spend 94% of its time waiting for job run ACK

Hi,

I wanted to start a brainstorming discussion on how to fix the problem highlighted in the title. I have been investigating the slowness of the scheduling cycle in PBS.

In my setup, a sched cycle takes around 3 minutes to run 50k jobs in real PBS. I realized that the scheduler was spending the majority of that time just waiting for an ACK from the server: 161 seconds out of a total of 172 seconds, or about 94% of the cycle. So I thought of removing the wait (https://github.com/PBSPro/pbspro/pull/1597), but realized that the scheduler relies on this reply when a runjob hook rejects a job, so that it can free up the job's resources for other jobs in the same cycle; otherwise those resources might get booked by the same job every cycle and cause under-utilization.
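
For concreteness, here is the back-of-the-envelope arithmetic behind those numbers (a trivial sketch, using only the measurements quoted above):

```python
# Back-of-the-envelope arithmetic using the measurements quoted above.
jobs = 50_000        # jobs run in one scheduling cycle
cycle_total = 172.0  # total cycle time in seconds
ack_wait = 161.0     # seconds spent blocked waiting for runjob ACKs

print(f"share of cycle spent waiting: {ack_wait / cycle_total:.0%}")     # ~94%
print(f"average wait per runjob ACK:  {ack_wait / jobs * 1000:.2f} ms")  # ~3.2 ms
print(f"cycle time without the wait:  {cycle_total - ack_wait:.0f} s")   # ~11 s
```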

So, I’m hoping that we can come up with a way to work around this. A few possible options:

  1. Add a sched attribute that tells the scheduler whether or not to care about runjob hooks, so that it won't wait for the ACK when told not to care about them.

  2. Penalize such jobs so that the scheduler gives priority to other jobs and those resources get used.

  3. Mark such jobs as Held for one or a few cycles so that their resources get consumed by other jobs. Since cycles will be much faster, such jobs might not have to wait that long.

  4. Make runjob a scheduler hook instead of a server hook.

Option 4 might intuitively make the most sense, but it's going to be a lot of work. Option 1 seems the safest of the rest, but it will restrict users who want a faster sched while still using a runjob hook. Options 2 and 3 both penalize the job, but 3 seems better: the penalty is just a delay in scheduling, which might not be much if sched cycles take just 12 seconds instead of 3 minutes. Let me know your thoughts on these, or if you can think of other ways to solve this.

Also, I'd like to know some numbers on how users use runjob hooks. Do they on average reject more than 10% of the jobs that the sched asks the server to run? More than 50%? Even a rough estimate will help in testing the solutions out.

Thanks!

Hi Ravi,

AM, for example, uses a runjob hook to reject jobs that do not have budget. However, that functionality would not break with this change. The runjob hook will still, correctly, reject the job. The change in behavior (from not waiting for the ack) is that the scheduler will not be able to reuse the resources that were allocated to a rejected job within the same sched cycle.

I feel that this slight change in behavior would be acceptable. Why? Because I think we already do the equivalent in async run mode, which is now the default.

Long ago, when the sched ran a job, it would wait for the server to ack, but the server would actually wait for the mom to ack that the job had really started. There were mom hooks that could reject the job, or other startup/launch failures could happen, and the scheduler would then be able to re-use the resources for the next job(s) in the same sched cycle.

With async run mode, the sched no longer knows about mom-side rejections or failures within the same cycle. Looking at it from a slightly higher level, that is (almost) the same as not knowing that the server failed/rejected running the job (the server rejection just comes a tad earlier, but in an HTC scenario that difference is not significant).

Perhaps option 1 is good enough; people who need the sched to know about the server's rejection can turn the new behavior off…

Reading this, I think the current behavior is a bug – async job run should be fully async.

So, I would go further and say a slightly modified proposal #1 is the right behavior: no need for a scheduler attribute (there is already an async run switch, isn't there?). I also think that a reasonable expectation for a well-written runjob hook is that it will delay any jobs it rejects (e.g., by putting them on hold or setting a "run after time X" on them), so #3 is basically taken care of already (by a well-written hook).
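
To make the "well-written hook" expectation concrete, here is a minimal sketch of a runjob hook that holds any job it rejects so the scheduler stops retrying it every cycle. `has_budget()` is a hypothetical stand-in for whatever site policy the real hook enforces, and the sketch assumes the runjob event allows the hook to set `Hold_Types` before rejecting; if a given PBS version doesn't allow that, the same delay could be applied by a periodic hook or an external tool that later releases the hold.

```python
# Minimal sketch of a runjob hook that delays the jobs it rejects.
# has_budget() is a hypothetical placeholder for the site's real policy check,
# and setting Hold_Types from a runjob event is an assumption of this sketch.
import pbs

def has_budget(job):
    """Placeholder for the site's real budget/allocation lookup."""
    return True

e = pbs.event()
job = e.job

if not has_budget(job):
    # Hold the job so the scheduler does not pick it (and get it rejected)
    # again on every subsequent cycle; the site releases the hold once the
    # budget situation changes.
    job.Hold_Types = pbs.hold_types("s")
    e.reject("%s rejected: no budget; job held until budget is available" % job.id)

e.accept()
```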

And… great speedup – cool!

Async run job is not something that has existed in PBS for a long time. It was added a few years back, and IMO before proceeding we should find out why async run job was done the way it is today. Why did we decide to run the runjob hook before replying to the scheduler?
Out of all the solutions you have proposed, solution 1 looks promising, and I agree that not keeping it as a switch will probably be better. If we keep a switch, then we will have issues when the switch is flipped while the scheduler is in the middle of a cycle.

I want to play devil's advocate here and list the issues I see with this approach.
As far as I can tell, the PBS server can reject async run job requests in the following cases:

  • Reject when the job is in an invalid state (transit, exiting, staging files, prerun or running)
  • Reject when a job was modified while the scheduler was in a cycle (only in non-throughput mode, so it may not be relevant here)
  • Reject when subjob (or job-range) ids are incorrect, or the subjobs are not in the Queued state
  • Reject when a parent job is being run
  • Reject when a runjob hook rejects the job
  • Reject when subjob creation fails
  • Reject when the server fails to enqueue provisioning requests

Now, all these reasons sound trivial, and it seems like things would correct themselves in subsequent cycles. But the following are the effects of not reporting async-runjob failures:

  • Today, upon receiving a runjob rejection, the scheduler marks the job as can-not-run and releases (within the cycle) all of its allocated resources. In the case of subjobs, it also marks the array parent as can-not-run so that no other subjob from that array job is considered in that particular cycle (see the pseudocode sketch after this list).
    This means that if an array subjob is rejected by a runjob hook and the scheduler never comes to know about it, the scheduler will keep trying to run other subjobs from the same array parent until it fails to find a solution. To me this looks like a black-hole kind of situation, at the very least for that one cycle (or maybe more).

  • If the server stops sending async-run rejects, the scheduler's internal run counts of jobs and resources will be off. It will make incorrect decisions in that cycle and lower the preemption priorities of such jobs (because of soft limits). This will make the user/group/project more susceptible to preemption within that cycle.

  • The scheduler will re-sort jobs (in the case of fairshare and soft limits) when it does not really have to, because the job didn't actually run.
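
Here is the pseudocode sketch referenced above: a rough, invented picture (not the actual sched code, and all names are made up) of the bookkeeping the scheduler does today when it reads a rejection, and which it would silently skip if the reply is never read:

```python
# Rough, invented pseudocode (not the actual sched code) of the bookkeeping
# the scheduler performs today when the server rejects a run request.
def on_runjob_rejected(sched, job):
    job.can_not_run = True            # don't consider this job again this cycle
    sched.release_resources(job)      # hand its allocation back to other jobs

    if job.is_subjob:
        # Stop trying further subjobs of the same array in this cycle,
        # otherwise each one would be tried and rejected in turn.
        job.array_parent.can_not_run = True

    # Undo the accounting done when the run was optimistically assumed to
    # succeed, so soft limits, fairshare usage and preemption priorities
    # stay correct for the rest of the cycle.
    sched.decrement_run_counts(job.owner, job.group, job.project)
```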

I will surely sound like a pessimist here, but maybe we can address this some other way: maybe we can find out why exactly the server takes 160+ seconds to acknowledge 50K run requests when there isn't even a runjob hook in place? Fixing the slowness in the server may not take us all the way to the 12-second figure, but then we also won't lose functionality we have today.

Just copying a reply on another thread that I thought is also relevant here…

From my perspective, we should try to make everything in PBS asynchronous. This is the only philosophy that will get us to scale. My thinking is that there’s no reason to wait (block) to verify success, especially if (1) success is the usual outcome and failures are exceptions, and (2) detecting failure later is not catastrophic, and (3) we can detect failure later and handle it appropriately (which might just be to log it).


I tried moving reply_ack() to the first line of req_runjob(); it still didn't make much difference. So the time is DIS + IO + whatever time the server spends finishing its current tasks before it gets to the runjob request, since the server handles requests serially. The CPU usage didn't go over 20% for the server process during my tests.

Thanks for expressing your concerns; we should definitely test the effect of going async on overall resource utilization and see how we can mitigate any drop it might cause. About preemption: if the sched tells the server to preempt jobs which aren't actually running and the server rejects the preempt request, maybe the scheduler can then adjust its run count, and such irregularities can get corrected. In fact, it should already be doing this, right? Some jobs can be rejected by the mom, so preempting some jobs can fail even today.

Overall, looking at how big an impact this makes, I think we need to figure out how such things can be worked around to allow async, as they might not be worth the performance penalty we are incurring.

This is not what I was trying to say. The scheduler never tries to preempt a job that it just started in the same cycle. What I meant was that if asyrunjob is always considered a success, then the scheduler will increase the run counts for that particular user/group/project. This means other running jobs of this entity (user/group/project) now have lower preemption priority, and those jobs can be suspended. In other words, this user/group/project is penalized because the scheduler assumed that its job ran.
Another thing to keep in mind: if in the future the scheduler moves to a model where it maintains a cache across cycles and just relies on updates from the server, it will become the server's responsibility to tell the scheduler that a job it assumed to be running isn't actually running.

I agree with you, Bill, but I think that mostly applies when an assumed success is simply ignored and has no side effects. In this case, an assumed success affects the preemption priority of the entity's jobs, makes the scheduler re-sort jobs, and makes the scheduler keep trying to run subjobs from the same parent when it shouldn't.
[edit] Maybe it all depends on how frequent asynchronous runjob failures are. Maybe it is not a big thing and I am overthinking :slight_smile:

Ok, but this exists today as well, right? The scheduler assumes that if asyrunjob succeeds then the job ran, when it might actually get rejected by the mom later. The scheduler updates its run counts etc. and might incorrectly preempt jobs today as well, right?

Same with re-sorting jobs and running subjobs from the same parent: if jobs get rejected at the mom instead of the server, we'll see these issues today as well, right?

So maybe these problems happen rarely, and that's why nobody has complained about them. Users who want absolutely consistent behavior can choose to turn high throughput mode off at the cost of performance.

That's true, these things happen when the mom rejects a job too, although in those cases the PBS server holds the job after a few retries, which isn't the case when the server rejects it. I also think that on server rejects (like when there is no budget to run a job), PBS shouldn't penalize the job by putting it on hold.

Maybe you are right and it isn't that big a problem; I can see some consensus building as well. So I'll go with whatever everyone decides.

Thanks

For the same reasons that we are discussing today: we were trying to keep the scheduler consistent with exactly what the server knows. (In hindsight, I think we were shortsighted then :slight_smile: ) It is a fact that the server itself has stale data. We talk about the scheduler getting better resource utilization by reusing the resources of a (server-)rejected job for other jobs in the same sched cycle. However, we do not consider the fact that the server has stale information anyway about which jobs might have ended on nodes (so the scheduler is actually not utilizing those "free" resources either). And that is okay; that is how things are in a large distributed system. Not every component can have an exact, atomic view of the world, and if we attempt that we will severely hamper scalability.

I think I have explained to myself that the slowdown due to rdrpy() is not caused by DIS slowness or network IO lag (the ack was moved to the very first line in the server on receipt of the request, and we know that the data packet in the ack is minuscule). The reason is the lock-stepping of the scheduler's progress with the server. In other words, it is the effect of making a blocking call rather than a truly asynchronous one. So just replacing DIS with flatbuffers will not make any difference in this case (it will in other cases with larger data transfers, but not this one). Sure, the server can be made faster, but unless the entire processing time in the server is reduced to zero, we can never match the performance of an asynchronous call. Now, yes, some optimizations in the scheduler (for the current cycle) will be compromised since its information is stale (that is always the case in a distributed system), and the "system" will have to "catch up"; thus large distributed systems lean towards "eventual consistency" rather than "absolute" consistency (see "CAP Theorem: Revisited" for a CAP refresher).
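
As a toy model of that lock-stepping argument (illustrative only; the per-request split of the ~3.2 ms is made up, only the totals roughly match the earlier measurements):

```python
# Toy model of blocking vs. asynchronous run requests (illustrative only).
def blocking_cycle(n_jobs, t_send, t_server, t_reply):
    # The scheduler sends each request, then blocks until the server has
    # dequeued it, processed it and written the reply back.
    return n_jobs * (t_send + t_server + t_reply)

def async_cycle(n_jobs, t_send):
    # The scheduler only pays for writing each request; server-side work
    # overlaps with the scheduler moving on to the next job.
    return n_jobs * t_send

# Hypothetical split of the ~3.2 ms observed per blocked request.
print(blocking_cycle(50_000, 0.0002, 0.0020, 0.0010))  # ~160 s
print(async_cycle(50_000, 0.0002))                     # ~10 s
```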

Thanks for the inputs, guys. It seems like we are converging towards the following:
The scheduler uses pbs_asyrunjob() instead of pbs_runjob() only when high throughput mode is on, so it should be reasonable to make pbs_asyrunjob() completely asynchronous for users who want high throughput. If such users want to use runjob hooks, they should ensure that their hook penalizes/de-prioritizes the jobs it rejects so that those jobs don't repeatedly get scheduled and rejected by the hook. Users who prefer simpler runjob hooks and want PBS to handle the rejects can turn high throughput mode off, at the cost of performance.

Does that sound like the right direction to go?
