New sched attribute to control runjob wait + making pbs_asyrunjob truly async + deprecating 'throughput_mode'

This follows the discussion here: Scheduler can spend 94% of its time waiting for job run ACK

The scheduler can spend up to 94% of its total time waiting for ACKs from the server for runjob requests, which can obviously have a big impact on performance. So, I’m proposing a new throughput_mode value which will make the scheduler not wait for ACKs from the server for runjob requests. See Scheduler can spend 94% of its time waiting for job run ACK for more details on the motivation.

Design document:

Please provide feedback on this, thanks.


Hi Ravi,

I’d personally prefer a new admin flag to control whether to make it truly async or not. We have a fast runjob hook which can reject less than 10% of jobs in a given cycle. If 1000 jobs get rejected, then we have 1000 jobs’ worth of resources sitting idle. We also don’t attempt 50k jobs in a single cycle, so the current scheduler performance is acceptable even with the ACK delay.


Thanks for the inputs Steve. How about we add an additional value to the ‘throughput_mode’ attribute itself? Maybe change it to a string attribute with values “low”, “med” and “high”, where
“low” -> pbs_runjob, which waits for the reply to travel mom -> server -> sched
“med” -> pbs_asyrunjob, which waits for the server’s reply, but not mom’s
“high” -> pbs_asyrunjob alone, which won’t wait for a reply at all
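To make the three settings concrete, here is a minimal sketch of the mapping; the enum and function names are my own illustrative assumptions (only pbs_runjob and pbs_asyrunjob are real IFL calls, and they are stubbed as strings here rather than called):

```c
#include <assert.h>
#include <string.h>

/* Hypothetical internal representation of the proposed values. */
enum run_mode { RUN_LOW, RUN_MED, RUN_HIGH };

/* Map a throughput_mode string to a mode; unknown strings fall back
 * to the proposed default of "med". */
static enum run_mode parse_mode(const char *val)
{
    if (strcmp(val, "low") == 0)
        return RUN_LOW;
    if (strcmp(val, "high") == 0)
        return RUN_HIGH;
    return RUN_MED;
}

/* Which IFL call the scheduler would issue per mode (names only,
 * for illustration). */
static const char *ifl_call(enum run_mode m)
{
    switch (m) {
    case RUN_LOW:  return "pbs_runjob";          /* wait: mom -> server -> sched */
    case RUN_MED:  return "pbs_asyrunjob+reply"; /* wait for server reply only */
    case RUN_HIGH: return "pbs_asyrunjob";       /* no wait at all */
    }
    return "pbs_runjob";
}
```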

Also, just curious, does your hook penalize the jobs that it rejects, or those which get repeatedly rejected?

Hi Ravi,

I like the low/med/high idea. That’s better than adding a whole new attribute imho.

The jobs aren’t penalized. They could legitimately run the next cycle. Therefore, penalizing or holding them doesn’t help in my case.

Thanks for asking.


Thanks Steve, I’ve modified the proposal to add “low”, “med”, “high” to throughput_mode with “med” being the default.

I like the idea of low, med, and high as well. The difference to the admin is which hooks get the chance to reject the job. With low, both the runjob and begin hooks can reject the job. With med, only the runjob hook can reject the job. With high, neither hook can reject the job and it is up to the admin to somehow penalize the jobs.

My only other thought is to automatically switch from med to high if there are no runjob hooks. At the beginning of the cycle we can check if there are any runjob hooks. If there aren’t any, we set throughput_mode = high. If runjob hooks are submitted after the cycle starts, then so be it. We won’t wait for them until the start of the next cycle. If the admin wants to take care of the penalizing themselves, they can override the value of throughput_mode and set it to high.
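The auto-switch idea above could look roughly like this at the start of a cycle; all names are assumptions, and hook detection is stubbed as a boolean:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical internal representation of the proposed values. */
enum run_mode { RUN_LOW, RUN_MED, RUN_HIGH };

/* At cycle start, upgrade "med" to "high" when no runjob hooks exist:
 * nothing can reject via a runjob hook, so there is no reply worth
 * waiting for. Hooks submitted mid-cycle take effect next cycle. */
static enum run_mode effective_mode(enum run_mode configured,
                                    bool have_runjob_hooks)
{
    if (configured == RUN_MED && !have_runjob_hooks)
        return RUN_HIGH;
    return configured;
}
```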


@agrawalravi90 I’d add another restriction: don’t allow changing throughput_mode while a cycle is running, because if it changes from med to high in the middle of the cycle, the scheduler will wait for an ack on the connection while the server stops sending one.

Another way to solve this problem could be to introduce a third IFL call to run truly asynchronously. The server would decide whether to respond based on the type of batch request. That way, you don’t have to disallow admins from changing the attribute while a scheduling cycle is running.

Interesting thought, I like it.

That might be a bit unreasonable: admins would have to turn scheduling off before they could set this.

I thought about this a bit, and I think that might actually be the best way to go about it. There can be multiple schedulers, each with its own value of throughput_mode, so it might be simplest and safest to rely on the type of batch request received. If we could add a new argument to pbs_runjob() to control this, that would be the cleanest way. But since modifying existing IFL signatures is a strict no-no, this might be the way to go. Now on to the most difficult part: what to name this new batch request :)

@bhroam and @arungrover I’ve addressed your comments and updated the doc. Please let me know what you think. Thanks!

Hi – I’m not sure I understand why we need a “medium” setting. As I understand, @spschaller expressed a concern about low utilization caused by a runjob hook rejecting a bunch of jobs. As the scheduler runs lots of scheduling cycles (every few seconds), wouldn’t any unused cores (caused by rejected jobs) simply be filled in the next cycle? (And, that next cycle would come much sooner… because… asynchronous.)

Also, “throughput_mode” is an existing scheduler setting with type boolean. We shouldn’t break backward compatibility by changing the type to accept low, medium, high. In a case like this, we should introduce a new setting (if it’s needed).

Hi Bill,

In our case, no, the next scheduling cycle wouldn’t necessarily use the resources, since the same condition that rejected the jobs would likely still be present in the next cycle. In other words, until the condition clears, we would be keeping a chunk of resources idle from cycle to cycle. It’s best, for this case, to have the runjob hook tell the scheduler to assign the resources to other jobs.

If you need another attribute to handle new features, then so be it. I just don’t want to break the existing behavior without reasonable recourse.


Hi Steve! OK, got it – we don’t want to break backward compatibility for this behavior either.

I’m curious – can you say more about the higher level goal with using runjob hooks to reject a bunch of jobs (for a while, but not too long). One reason I ask is that having a lot of jobs at the “head of queue” that continuously get rejected may cause other capabilities to behave in unexpected ways (fair share, eligible time, job priorities in general, backfilling, etc.). I’m concerned that this will work for you today, but when you later decide to turn on another feature, there will be unexpected interactions… and, there might be a better way :-).


Thanks for the inputs Bill. About backward compatibility, I was thinking that existing values can still be handled, just as strings “0”/“1”/“t”/“f” etc. instead of booleans, while we tell the users that they are deprecated and that they should use new values. Will that not work?

I don’t think we can just blatantly say we will be 100% backwards compatible. There is a cost that goes with it. We are a >20 year old code base. It is already difficult to maintain the code base. Even straightforward changes can have some unexpected side effects. Every time we just ‘add a new attribute for a new thing’, we leave the old attribute and the old behavior in the code. Not only does it add cruft and make PBS harder to maintain, it makes the product less intuitive. We need to come up with a new attribute name which is probably not going to be as intuitive as the existing one.

Now backwards compatibility IS important, but I think each modification should be looked at individually. If there is little chance of someone depending on the exact same syntax, we should make a modification to better the product.

I believe this is one of those cases. While I have no real data to back this up, I do believe throughput_mode is set by default when PBS is installed, and never changed. The old values can be mapped to new values on upgrade: if the old throughput_mode is false, we set it to low; if it’s true, we set it to medium.
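The upgrade-time mapping described above could be a near one-liner; the function name is hypothetical, and the boolean spellings are the ones qmgr accepts:

```c
#include <assert.h>
#include <string.h>

/* Illustrative sketch: map an old boolean throughput_mode value
 * (True/False/T/F/t/f/1/0) to the proposed string values on upgrade. */
static const char *migrate_throughput_mode(const char *old)
{
    if (old[0] == 'f' || old[0] == 'F' || old[0] == '0')
        return "low";   /* false: scheduler waited for mom -> server -> sched */
    return "med";       /* true: the pbs_asyrunjob path */
}
```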

@billnitzberg you have also said in the past that if we are modifying something the admin only uses, backwards compatibility doesn’t need to be as strict. This definitely falls into that camp.



Thanks @bhroam – I was more concerned about compatibility with an admin who has a saved “qmgr print server”, and reads that back into PBS, esp. during an upgrade, but also to restore a working configuration from the past. If this can be done cleanly, then I have no objections.

However, changing the type of an attribute from boolean to enumeration, and still accepting boolean values, is … ugly. My view is that it is very likely to cause confusion when we all forget this conversation and an admin (or developer) sees “true” as a valid setting and makes assumptions.

A more specific attribute with more specific names (and deprecating the old attribute) would make the product cleaner, e.g., something like “runjob_wait = { none, server_hooks, execution_hooks }”, with a default of “none”.
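To spell out the semantics of that proposal, here is a hedged sketch of which hook rejections would make it back to the scheduler under each value; the struct and function names are invented for illustration, only the three values come from the suggestion above:

```c
#include <assert.h>
#include <string.h>
#include <stdbool.h>

struct wait_policy {
    bool runjob_hook_can_reject;  /* server-side runjob hook */
    bool begin_hook_can_reject;   /* execjob_begin hook on the mom */
};

static struct wait_policy policy_for(const char *runjob_wait)
{
    struct wait_policy p = { false, false };

    if (strcmp(runjob_wait, "execution_hooks") == 0) {
        /* wait all the way to the mom: both rejections reach the sched */
        p.runjob_hook_can_reject = true;
        p.begin_hook_can_reject = true;
    } else if (strcmp(runjob_wait, "server_hooks") == 0) {
        /* wait for the server's reply only */
        p.runjob_hook_can_reject = true;
    }
    /* "none": truly async; neither rejection makes it back this cycle */
    return p;
}
```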

In any case, I defer to consensus among the maintainers – this is an admin setting, so, as you say, there is more flexibility in how we maintain compatibility.


Thanks @billnitzberg. @bhroam, what do you think? I’m a bit torn … I like the idea of the attribute being specific to hooks as it’ll give admins more control, but I also like the idea of one generic setting, “throughput_mode”, which we could use in the future to turn off various features which increase performance at the cost of restricted functionality.

@billnitzberg came up with an interesting point. We can’t break qmgr -c ‘print server’ and the redirection of that back into qmgr, because one of the valid upgrade methods (migration upgrade) uses this. So we would either have to accept the old True/False/T/F/t/f/1/0 values or create a new attribute. I kind of like his choice. With throughput_mode, it isn’t necessarily obvious that your begin hook rejecting the job won’t make its way back to the scheduler; this was just a side effect of us using pbs_asyrunjob(). With his suggested attribute, we are being explicit. We will just have to explain in the guides how the different settings affect performance.

As a note, unless we change our default, it would be server_hooks.


@billnitzberg and @bhroam Thanks for the clarifications. I’ve modified the design to add a new sched attribute instead and deprecate throughput_mode. Please review it once and let me know if it looks ok.

Hey @agrawalravi90
Just a few edits. I am happy with the meat of the feature.
I’d have a section at the top which describes what this all means. What I mean is: if we are waiting on a hook and that hook rejects the job, the rejection will make it back to the scheduler. If this happens, the scheduler will know those resources are not used and they are available for the rest of the cycle. If we don’t wait on a hook, those resources will be considered used for the rest of the cycle. At the start of the next cycle, we will see they are not and start over.

In the different settings, I’d be explicit on what hooks we are waiting on. For svr_hooks, we’re waiting on runjob_hook. For exec_hooks, we’re waiting on the runjob_hook and the execjob_begin hook.

I’d rephrase the guidance at the bottom. Say that if the none setting is used, we wait on neither hook, which means we assume the job’s resources are used for the rest of the cycle. Start off by saying that if the execjob_begin hook consistently rejects the job, the job will be held once its run count exceeds 20. This will not happen for the runjob hook: if the runjob hook consistently rejects the job, those resources will go unused and the system will be underutilized. You can then go on to your advice about what to do.

As a note, you still have throughput_mode=low/med/high in places throughout the design.

In the internals, don’t just say the batch requests, also say the IFL calls. Also, do you want to add something about the new IFL call which is replacing pbs_asyrunjob()?


Thanks @bhroam I’ve added more details about each mode and made each mode specific to the respective hook.

About new IFL call, I don’t really want to make pbs_asynrunjob_ack() a stable, supported, externally visible interface unless it’s necessary. That’s why I’m not documenting it explicitly and just saying that it’ll be the function called to make the new batch_request. I don’t imagine a use case for it outside of the scheduler telling the server to run a job in a certain way, so it’s a purely internal interface in my mind. Let me know what you think.