I’m proposing 2 enhancements:
- A sched object attribute which will control how often the scheduler sends job attribute updates to the server
- Shifting the responsibility for accruing eligible time from the server to the scheduler. This is what makes the first enhancement workable: once the scheduler does the accrual, delaying accrue_type updates no longer delays the accrual of eligible_time for a job.
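
To make the first enhancement concrete, here is a minimal Python sketch of the throttling idea. It is not the proposed implementation: the class name, the `send` callable, and the choice of counting the interval in scheduling cycles are all assumptions made for illustration. An interval of 1 corresponds to sending updates every cycle, i.e. throttling turned off.

```python
class ThrottledUpdater:
    """Buffers job attribute updates and sends them at most once per interval.

    Hypothetical sketch: `send` stands in for whatever call the scheduler
    uses to push attribute updates to the server.
    """

    def __init__(self, send, interval_cycles):
        self.send = send                        # callable taking {job_id: {attr: value}}
        self.interval_cycles = interval_cycles  # assumed unit: scheduling cycles
        self.pending = {}
        self.cycles_since_flush = 0

    def queue(self, job_id, attr, value):
        # A later value for the same attribute overwrites the earlier one,
        # so only the newest value is ever sent.
        self.pending.setdefault(job_id, {})[attr] = value

    def end_cycle(self):
        # Called once at the end of every scheduling cycle.
        self.cycles_since_flush += 1
        if self.cycles_since_flush >= self.interval_cycles and self.pending:
            self.send(self.pending)
            self.pending = {}
            self.cycles_since_flush = 0
```

In this sketch the saving comes from two places: fewer round trips to the server, and intermediate values that are overwritten before the flush never being sent at all.
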
The motivation is performance. In my test setup, with 100k jobs, 50k ncpus, node grouping on and eligible time on, the scheduling cycle took around 25 minutes. With attribute throttling, it took only around 8.5 minutes. That’s ~3 times faster.
Sched sends updates for the following job attributes:
- ATTR_estimated.soft_walltime: updated when a job exceeds its ATTR_l.soft_walltime, if soft_walltime has been set for the job.
- ATTR_accrue_type: updated if the site is using eligible time and a job starts or stops accruing eligible time because it got preempted, or can’t be run, or is an array job whose subjob was run by the scheduler. This might be the only attribute which shouldn’t be delayed.
- ATTR_l.walltime: updated for jobs which are run via shrink-to-fit.
- ATTR_pset: updated for any job that’s run if node grouping/placement sets are used.
- ATTR_sched_preempted: gets unset when a previously preempted job is run.
- ATTR_estimated.start_time/exec_vnode: gets updated when a job is calendared by the scheduler.
- ATTR_comment: gets updated when a job cannot be run, usually only once per job, not every cycle.
Users will see a delay in attribute updates on their jobs; how long depends on how often sched cycles occur. The delay is customizable. Sites which don’t want any delay can turn throttling off altogether, in which case the behavior will be similar to what it is today, at the cost of worse performance.
There is one exception. Today, eligible time is accrued in the server, so when users do a stat, the server can compute the up-to-date eligible time value and return it to them. With this change, even if admins turn throttling off (i.e., the scheduler sends attr updates every cycle), the eligible time a user sees will be the value that sched last sent to the server, so it can be stale. It will still be accurate at the scheduler’s end, and the scheduler will schedule jobs correctly, as it does today.
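
The staleness trade-off can be illustrated with a small Python sketch. This is a simplified model, not PBS code: `JobEligibleTime`, `server_side_stat`, and `scheduler_side_stat` are hypothetical names, and times are plain seconds.

```python
from dataclasses import dataclass


@dataclass
class JobEligibleTime:
    accrued: float = 0.0      # seconds accrued up to last_change
    accruing: bool = False    # whether the job is currently accruing
    last_change: float = 0.0  # time of the last accrue state change

    def value_at(self, now):
        # Up-to-date value: add the time elapsed since the last state
        # change, but only while the job is accruing.
        extra = now - self.last_change if self.accruing else 0.0
        return self.accrued + extra


def server_side_stat(job, now):
    # Today (simplified): the server owns the accrual state, so a stat
    # can compute a fresh value at query time.
    return job.value_at(now)


def scheduler_side_stat(last_reported):
    # Proposed (simplified): the server only holds the value the
    # scheduler last sent, so a stat returns that value unchanged.
    return last_reported
```

For example, if a job has been accruing since t=0 and the scheduler last reported at t=60, a stat at t=100 returns 100 seconds under the current model and 60 seconds under the proposed one, until the next update arrives.
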
Before I create a design document, I’d like to know whether this trade-off is acceptable, so please share your feedback. In particular, I’d appreciate opinions from @bhroam, @scc, @billnitzberg and @subhasisb.