Scheduling cycle takes a long time due to specific jobs

Our scheduling cycle takes around 5 minutes. 99% of that time is spent on a small group of ‘demanding’ jobs (roughly 50), while the rest (100 or more) account for the remaining 1%. The real problem is that when the number of ‘demanding’ jobs increases, the scheduling cycle grows to 30 minutes.

Some of the main parameters to consider for this problem are as follows:

pbs_version = 19.1.3
backfill_prime: false ALL
strict_ordering: True ALL
fair_share: true ALL
~100 compute nodes

This is an example of a job that took 7 seconds of the scheduling cycle:

03/17/2022 09:53:10;0080;pbs_sched;Job;123456.domain;Considering job to run
03/17/2022 09:53:17;0080;pbs_sched;Job;123456.domain;Fairshare usage of entity USER increased due to job becoming a top job.
03/17/2022 09:53:17;0080;pbs_sched;Job;123456.domain;Job is a top job and will run at Sat Mar 19 00:23:25 2022
03/17/2022 09:53:17;0040;pbs_sched;Job;123456.domain;Placement set normal_queu=Yes has too few free resources or is too small

Notice that from 03/17/2022 until Sat Mar 19, the scheduler will spend 7 seconds per cycle recalculating that estimated start time.

This other, more common example is from a job that requests resources that will never be available, so it will never run:

03/09/2022 10:37:37;0080;pbs_sched;Job;654321.domain;Considering job to run
03/09/2022 10:37:43;0040;pbs_sched;Sched;654321.domain;Can’t find start time estimate Insufficient amount of resource: ngpus
03/09/2022 10:37:43;0040;pbs_sched;Job;654321.domain;Error in calculation of start time of top job
03/09/2022 10:37:43;0040;pbs_sched;Job;654321.domain;Insufficient amount of resource: ngpus

In both cases:

  1. Jobs demand lots of resources. Cases like example #2 are more common than those like example #1.
  2. Job owners are very active, submitting many jobs, and we believe their fairshare usage plays an important role.
  3. The scheduler spends the same amount of time on these jobs every cycle unless we hold them (or they eventually run, as in example #1). Holding or deleting them solves the problem, but it requires extra communication with the users.

Important facts:

  1. We turned fairshare off, but the behaviour did not change; fairshare usage calculations continued to appear in the logs. We turned it off by setting fair_share: false and then sending kill -HUP to the scheduler. Should we restart the whole scheduler instead?
  2. We tested submitting jobs with the same ‘demanding’ requests from a relatively new user with no fairshare history; the scheduler took only an instant (not seconds) to decide not to run them.

Ideally, we would like to find the cause of this problem. If that is not possible, we will plan a workaround. The post below mentions the same symptom, though caused by a different issue. Is its proposed solution #1 the best workaround for our problem, and if so, what should its main characteristics be?

Your ideas are much appreciated.

Please create a queuejob hook that rejects jobs asking for resources that the cluster can never satisfy.
For example: the cluster has 3 nodes, each with 32 cores and 192GB RAM.
The queuejob hook rejects job requests such as:
qsub -l select=1:ncpus=33
qsub -l select=1:ncpus=1:mem=200GB
qsub -l select=4:ncpus=32
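
For what it is worth, here is a minimal sketch of such a queuejob hook, written against the documented pbs hook module. It hardcodes the 3-node / 32-core / 192GB example above (NUM_NODES, NODE_NCPUS and NODE_MEM are illustrative assumptions, as is the hook name validate_select); a real version would derive these from the actual node inventory and also check site-specific resources such as ngpus.

import pbs

# Illustrative cluster capacity for the 3 x 32-core / 192GB example above;
# a production hook would read these from the real node inventory instead.
NUM_NODES = 3
NODE_NCPUS = 32
NODE_MEM = pbs.size("192gb")

try:
    e = pbs.event()
    job = e.job
    sel = job.Resource_List["select"]
    if sel is None:
        # Nothing to validate in this sketch if no select spec was given.
        e.accept()

    total_ncpus = 0
    for chunk in str(sel).split("+"):
        fields = chunk.split(":")
        # A chunk looks like "4:ncpus=32:mem=8gb"; the leading count is optional.
        count = int(fields[0]) if fields[0].isdigit() else 1
        ncpus = 1
        mem = None
        for field in fields:
            if "=" not in field:
                continue
            name, val = field.split("=", 1)
            if name == "ncpus":
                ncpus = int(val)
            elif name == "mem":
                mem = pbs.size(val)
        # Per-chunk checks: a single chunk can never be larger than the largest node.
        if ncpus > NODE_NCPUS:
            e.reject("Chunk requests %d ncpus but the largest node has %d" % (ncpus, NODE_NCPUS))
        if mem is not None and mem > NODE_MEM:
            e.reject("Chunk requests %s mem but the largest node has %s" % (mem, NODE_MEM))
        total_ncpus += count * ncpus

    # Aggregate check: the whole request must fit in the cluster.
    if total_ncpus > NUM_NODES * NODE_NCPUS:
        e.reject("Job requests %d ncpus but the cluster only has %d" % (total_ncpus, NUM_NODES * NODE_NCPUS))

    e.accept()
except SystemExit:
    # accept() and reject() end the hook via SystemExit; let that pass through.
    raise
except Exception as err:
    # Fail open so a bug in the hook does not block all job submissions.
    pbs.logmsg(pbs.LOG_DEBUG, "validate_select hook error: %s" % err)
    pbs.event().accept()

This sketch catches the three qsub examples above, and extending the same per-chunk and aggregate checks to ngpus would also catch requests like example #2, which ask for more GPUs than the cluster will ever have. It could be installed with something like:

qmgr -c "create hook validate_select"
qmgr -c "set hook validate_select event = queuejob"
qmgr -c "import hook validate_select application/x-python default /path/to/validate_select.py"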

  • It is not recommended to use backfill, help_starving_jobs, or strict_ordering together with fair_share; they should be set to false (see the example excerpt below).
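
As a reference only, assuming you keep fairshare as the main policy, the relevant lines in $PBS_HOME/sched_priv/sched_config would then look something like the following (followed by a kill -HUP of pbs_sched so it rereads the file):

backfill_prime: false ALL
help_starving_jobs: false ALL
strict_ordering: false ALL
fair_share: true ALL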

kill -HUP should work

Also, you can kill the scheduler and start it again on its own; the PBS scheduler is stateless (to see whether that makes any difference):
kill -9
source /etc/pbs.conf;$PBS_EXEC/sbin/pbs_sched

If you can share the PBS configuration of your cluster, the number of jobs (single-core, multi-node parallel), the number of users, the scheduling configuration used, etc., then the community members might be able to help.

Also, you can increase the PBS scheduler's log verbosity to find out more.