Hi @adarsh and @bhroam, thanks for your replies.
Here’s the output of the sched_config command:
round_robin: False all
by_queue: True all
strict_ordering: false all
help_starving_jobs: false all
backfill_prime: false ALL
prime_exempt_anytime_queues: false
primetime_prefix: p_
nonprimetime_prefix: np_
job_sort_key: “job_priority HIGH” ALL
node_sort_key: “ncpus HIGH unused” ALL
provision_policy: “aggressive_provision”
resources: “ncpus, mem, arch, host, vnode, aoe, eoe, LIC1, LIC2, LIC3, LIC4, LIC5, LIC6, LIC6, LIC7, LIC8, LIC9, LIC10, LIC11, LIC12, LIC13, LIC14, LIC15, LIC16, LIC17, LIC18, LIC19, LIC20, LIC21”
load_balancing: false ALL
smp_cluster_dist: pack
fair_share: true ALL
unknown_shares: 1
fairshare_usage_res: ncpus*walltime
fairshare_entity: Account_Name
fairshare_decay_time: 00:30:00
fairshare_decay_factor: 0.5
preemptive_sched: true ALL
dedicated_prefix: ded
Please note that I’ve renamed the resources from their license names to LICn to avoid exposing our license usage.
I’ve changed provision_policy to “avoid_provision” as @adarsh suggested, and I will see if that helps. So far, after about 30 minutes, there isn’t much change.
As you can see, we have node_sort_key set to “ncpus HIGH unused” because we want our jobs to be distributed evenly across all nodes by ncpus.
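To make clear what behaviour I expect from that sort key, here is a minimal Python sketch of my mental model. It is only an illustration and not how the scheduler is actually implemented: the node names, core counts, and the helper functions pick_node and place_jobs are all made up, and I am assuming that “unused” means total ncpus minus assigned ncpus on each vnode.

```python
# Hypothetical illustration only: node names, core counts and helper
# functions are made up, and this is not the PBS scheduler's real code.

def pick_node(nodes, request):
    """Return the name of the node with the most unused ncpus that can
    fit the request, or None if no node has enough free cores."""
    candidates = [n for n in nodes if n["total"] - n["assigned"] >= request]
    if not candidates:
        return None
    # Assumed meaning of "ncpus HIGH unused": highest (total - assigned) first.
    best = max(candidates, key=lambda n: n["total"] - n["assigned"])
    return best["name"]


def place_jobs(nodes, requests):
    """Place each ncpus request in turn and record which node it landed on."""
    placements = []
    for req in requests:
        name = pick_node(nodes, req)
        if name is not None:
            for n in nodes:
                if n["name"] == name:
                    n["assigned"] += req
        placements.append(name)
    return placements


nodes = [
    {"name": "node1", "total": 16, "assigned": 0},
    {"name": "node2", "total": 16, "assigned": 0},
    {"name": "node3", "total": 16, "assigned": 0},
]
placements = place_jobs(nodes, [4, 4, 4, 4, 4, 4])
print(placements)
# ['node1', 'node2', 'node3', 'node1', 'node2', 'node3']
```

In other words, I would expect six 4-core jobs to end up spread as two per node rather than packed onto one node.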
Our server’s backfill_depth is set to 0:
qmgr -c 'print server' | grep backfill_depth
set server backfill_depth = 0
@bhroam, yes, we are using “unused” in our node_sort_key. We recently migrated from v19.3 to v20.0.1, and I’ve noticed section “4.9.50.3 Sorting Vnodes According to Load Average”. Do you suggest we follow the periodic hook approach instead?
Any help will be appreciated.
Thanks,
Roy