Just a few edits. I am happy with the meat of the feature.
I’d add a section at the top that explains what waiting on a hook actually means. If we wait for a hook and that hook rejects the job, the reject makes it back to the scheduler. The scheduler then knows those resources are not in use, and they stay available for the rest of the cycle. If we don’t wait on a hook, the scheduler considers those resources used for the rest of the cycle. At the start of the next cycle, we will see that they are not in use and start over.
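A toy model of one cycle might make that section easier to write. This is only a sketch, not scheduler code: `Job`, `run_cycle`, and the single `ncpus` pool are hypothetical names I made up to show the accounting difference between waiting and not waiting.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    ncpus: int
    hook_rejects: bool  # hypothetical flag: True if a hook rejects this run


def run_cycle(jobs, free_ncpus, wait_for_hook):
    """Toy model of one scheduling cycle; returns jobs that really run.

    wait_for_hook=True: a hook reject reaches the scheduler, so the job's
    resources go back into the pool within the same cycle.
    wait_for_hook=False: the scheduler does not see the reject this cycle
    and keeps treating the resources as used until the next cycle.
    """
    running = []
    for job in jobs:
        if job.ncpus > free_ncpus:
            continue  # does not fit in what the scheduler believes is free
        free_ncpus -= job.ncpus  # scheduler assumes the job is running
        if job.hook_rejects:
            if wait_for_hook:
                free_ncpus += job.ncpus  # reject seen: resources freed now
            # not waiting: the reject is only noticed next cycle
            continue
        running.append(job.name)
    return running


jobs = [Job("A", 4, hook_rejects=True), Job("B", 4, hook_rejects=False)]
print(run_cycle(jobs, free_ncpus=4, wait_for_hook=True))   # ['B']
print(run_cycle(jobs, free_ncpus=4, wait_for_hook=False))  # []
```

In the waiting case job B picks up the CPUs that A's reject released; in the non-waiting case those CPUs sit idle until the next cycle.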
In the different settings, I’d be explicit about which hooks we are waiting on. For svr_hooks, we’re waiting on the runjob_hook. For exec_hooks, we’re waiting on both the runjob_hook and the execjob_begin hook.
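Something as small as the mapping below would do. The setting names are the ones used in this review; the dict and the `waits_on` helper are hypothetical and only restate which hook events each setting waits for.

```python
# Hypothetical lookup: setting name -> hook events the scheduler waits on.
HOOKS_WAITED_ON = {
    "none": (),
    "svr_hooks": ("runjob",),
    "exec_hooks": ("runjob", "execjob_begin"),
}


def waits_on(setting, hook_event):
    """Return True if the given setting waits for the given hook event."""
    return hook_event in HOOKS_WAITED_ON[setting]


print(waits_on("svr_hooks", "execjob_begin"))   # False
print(waits_on("exec_hooks", "execjob_begin"))  # True
```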
I’d rephrase the guidance at the bottom. Say that if the none setting is used, we wait on neither hook, which means we assume the job’s resources are used for the rest of the cycle. Start off by saying that if the execjob_begin hook consistently rejects the job, the job will be held once its runcount > 20. This will not happen for the runjob hook. If the runjob hook consistently rejects the job, those resources will not be used and the system will be underutilized. You can then go on to your advice about what to do.
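If it helps the guidance, the difference between the two reject paths can be shown with a toy model. This is a sketch under two assumptions of mine: the job gets one run attempt per cycle, and each execjob_begin reject bumps the run count by one. `cycles_until_held` is a hypothetical name; only the threshold of 20 comes from the point above.

```python
RUN_COUNT_HOLD_LIMIT = 20  # threshold quoted in this review


def cycles_until_held(rejecting_hook, max_cycles=100):
    """Toy model: return the cycle in which the job is held, or None.

    Assumed: one run attempt per cycle, and every execjob_begin reject
    increments the run count. A runjob reject never increments it, so
    the job is never held and is simply retried each cycle.
    """
    run_count = 0
    for cycle in range(1, max_cycles + 1):
        if rejecting_hook == "execjob_begin":
            run_count += 1  # assumed: each rejected attempt counts as a run
            if run_count > RUN_COUNT_HOLD_LIMIT:
                return cycle
        # runjob reject: run count unchanged, job stays queued
    return None


print(cycles_until_held("execjob_begin"))  # 21
print(cycles_until_held("runjob"))         # None
```

The second case is the one the guidance needs to warn about: nothing ever stops the retries, so the resources are set aside and wasted every cycle.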
As a note, you still have throughput_mode=low/med/high in places throughout the design.
In the internals, don’t just mention the batch requests; also mention the IFL calls. Also, do you want to add something about the new IFL call which is replacing pbs_asyrunjob()?