Our scheduling cycle takes around 5 minutes. 99% of that time is spent on a small group of ‘demanding’ jobs (roughly 50), while the rest (100 or more) account for the remaining 1% of the cycle. The main problem is that when the number of ‘demanding’ jobs increases, the scheduling cycle can stretch to 30 minutes.
Some of the main parameters to consider for this problem are as follows:
pbs_version = 19.1.3
backfill_prime: false ALL
strict_ordering: True ALL
fair_share: true ALL
~100 compute nodes
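For reference, this is roughly how we collected those values (the sched_config path assumes a default PBS_HOME of /var/spool/pbs; adjust it for your install):
qmgr -c "list server" | grep pbs_version
grep -E "^(backfill_prime|strict_ordering|fair_share)" /var/spool/pbs/sched_priv/sched_config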
This is an example of a job that took 7 seconds of the scheduling cycle:
03/17/2022 09:53:10;0080;pbs_sched;Job;123456.domain;Considering job to run
03/17/2022 09:53:17;0080;pbs_sched;Job;123456.domain;Fairshare usage of entity USER increased due to job becoming a top job.
03/17/2022 09:53:17;0080;pbs_sched;Job;123456.domain;Job is a top job and will run at Sat Mar 19 00:23:25 2022
03/17/2022 09:53:17;0040;pbs_sched;Job;123456.domain;Placement set normal_queu=Yes has too few free resources or is too small
Notice that from 03/17/2022 until Sat Mar 19, the scheduler will spend 7 seconds per cycle recalculating that start time.
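To watch that estimate move between cycles we check something like the following (our understanding is that qstat -T shows the estimated start time of queued jobs, and that estimated.start_time appears in the full job status once the job becomes a top job):
qstat -T 123456
qstat -f 123456 | grep estimated.start_time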
This second, more common example is from a job that requests resources that will never be available, so it will never run:
03/09/2022 10:37:37;0080;pbs_sched;Job;654321.domain;Considering job to run
03/09/2022 10:37:43;0040;pbs_sched;Sched;654321.domain;Can’t find start time estimate Insufficient amount of resource: ngpus
03/09/2022 10:37:43;0040;pbs_sched;Job;654321.domain;Error in calculation of start time of top job
03/09/2022 10:37:43;0040;pbs_sched;Job;654321.domain;Insufficient amount of resource: ngpus
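As a quick sanity check on such jobs, we compare the job's GPU request with what any single node offers (654321 is the example job above; the resource names and values depend on your nodes):
qstat -f 654321 | grep Resource_List.ngpus
pbsnodes -a | grep resources_available.ngpus | sort -u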
In both cases:
- The jobs demand a lot of resources. Cases like example #2 are more common than those like example #1.
- The job owners are very active and submit many jobs, and we believe their fairshare usage plays an important role.
- The scheduler spends the same amount of time on these jobs every cycle unless we hold them (or they eventually run, as in example #1). Holding (or deleting) them solves the problem (see the sketch after this list), but it requires extra communication with the users.
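When we do hold them, it is essentially a loop like this (the ngpus.gt.8 cutoff is only an illustrative stand-in for "asks for more GPUs than any node has"; adjust it to your cluster, and note we are assuming qselect's -l resource comparison here):
for job in $(qselect -s Q -l ngpus.gt.8); do
    qhold -h o $job    # operator hold; we release later with qrls -h o
done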
Important facts:
- We turned off fairshare, but the behaviour did not change: fairshare usage calculations continued to appear in the logs. We turned it off by setting fair_share: false in sched_config and then sending kill -HUP to pbs_sched (see the sketch after this list). Should we restart the whole scheduler instead?
- We tested submitting jobs with the same ‘demanding’ requests from a relatively new user with no fairshare history; the scheduler took only an instant (not seconds) to decide not to run them.
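Concretely, this is what we did, plus the full restart we are wondering about (paths assume a default install under /var/spool/pbs, and the restart line assumes a systemd-based install):
grep '^fair_share' /var/spool/pbs/sched_priv/sched_config    # now reads: fair_share: false ALL
kill -HUP $(pgrep pbs_sched)    # what we did: ask pbs_sched to re-read sched_config
# systemctl restart pbs    # the full daemon restart we are considering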
Ideally, we would like to find the cause of this problem. If that is not possible, we might plan a workaround. The post below mentions the same problem, but caused by a different issue. Is their proposed solution #1 the best workaround for our problem? If so, what should its main characteristics be?
Your ideas are much appreciated.