Created node bucket messages in sched_logs

Hi,
I am trying to debug performance issues in our PBS cluster; we are getting lots of "Scheduling cycle timeout" messages and high latency for qstat/qsub commands.

I've raised the sched_logs verbosity to the maximum and came across the following message (and its variants):

Created node bucket ncpus=32:mem=772173mb:accelerator=False:Preemptable=False

I've noticed that when the scheduler/server behaves well (low latency) I see many more of these messages, and when it behaves badly (scheduling cycle timeouts, high latency) we only get a few of them.

Can someone please shed some light on these messages and help me further debug/fix the issue?

Thanks,
Roy

Could you please share the output of the command below:
source /etc/pbs.conf; cat $PBS_HOME/sched_priv/sched_config | grep -v "#" | grep ^[a-zA-Z]

Also, could you try updating provision_policy in your sched_config and sending the scheduler a kill -HUP, and see whether it helps the situation:
provision_policy: "avoid_provision"
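
For example, something along these lines (a rough sketch, not an exact procedure; it assumes the scheduler daemon pbs_sched runs on this host and that sed/pgrep are available):

source /etc/pbs.conf
# change the provision_policy line in sched_config
sed -i 's/^provision_policy:.*/provision_policy: "avoid_provision"/' $PBS_HOME/sched_priv/sched_config
# tell the scheduler to re-read sched_config
kill -HUP $(pgrep -x pbs_sched)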

Reference: New node placement algorithm

A node bucket is a collection of similar nodes. The scheduler creates buckets regardless of whether they end up being used; it can't tell at query time whether they'll be needed, so it creates them anyway. They do not take long to create, so that isn't your problem.

The resources the scheduler uses for scheduling are listed on the sched_config resources line. The scheduler then optimizes this list every cycle based on what resources are actually requested, so the optimized list can differ from one cycle to the next, which in turn can change the number of buckets. If you have jobs requesting a larger variety of resources, the number of buckets can go up. I highly doubt this is your problem, though. It is probably just an indicator that you have a lot of different jobs in the system.

What is your backfill_depth? An excessively high backfill_depth can slow down scheduling. If it is really high, consider setting opt_backfill_fuzzy to high.
Are you sorting via node_sort_key with "unused"? That forces the scheduler to re-sort the nodes often.
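
For reference, a minimal sketch of how these could be checked and adjusted with qmgr (the value 10 is only an illustration, and the scheduler object is assumed to be named "default"; syntax can vary between PBS versions):

qmgr -c "print server" | grep backfill_depth
qmgr -c "set server backfill_depth = 10"
qmgr -c "set sched default opt_backfill_fuzzy = high"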

Just to name a few.

Bhroam

Hi @adarsh and @bhroam, thank you for your replies.
Here’s the output of the sched_config command:
round_robin: False all
by_queue: True all
strict_ordering: false all
help_starving_jobs: false all
backfill_prime: false ALL
prime_exempt_anytime_queues: false
primetime_prefix: p_
nonprimetime_prefix: np_
job_sort_key: "job_priority HIGH" ALL
node_sort_key: "ncpus HIGH unused" ALL
provision_policy: "aggressive_provision"
resources: "ncpus, mem, arch, host, vnode, aoe, eoe, LIC1, LIC2, LIC3, LIC4, LIC5, LIC6, LIC6, LIC7, LIC8, LIC9, LIC10, LIC11, LIC12, LIC13, LIC14, LIC15, LIC16, LIC17, LIC18, LIC19, LIC20, LIC21"
load_balancing: false ALL
smp_cluster_dist: pack
fair_share: true ALL
unknown_shares: 1
fairshare_usage_res: ncpus*walltime
fairshare_entity: Account_Name
fairshare_decay_time: 00:30:00
fairshare_decay_factor: 0.5
preemptive_sched: true ALL
dedicated_prefix: ded

Please note that I've renamed the resources from the actual license names to LIC* to avoid exposing our license usage.

I've changed provision_policy to "avoid_provision" as @adarsh suggested and will see if that helps; so far, after about 30 minutes, there isn't much change.
As you can see, we have node_sort_key set to "ncpus HIGH unused" - we want our jobs to be distributed equally across all nodes in terms of ncpus.

Our server’s backfill_depth is set to 0:

qmgr -c 'print server' | grep backfill_depth
set server backfill_depth = 0

@bhroam, yes, we are using "unused" in our node_sort_key. We recently migrated from v19.3 to v20.0.1, and I've noticed section "4.9.50.3 Sorting Vnodes According to Load Average" - do you suggest we follow the periodic hook approach instead?
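
For context, my reading of that section is that the approach would look roughly like the following (the resource name last_load and hook name load_hook are placeholders, and the exechost_periodic hook script that would actually publish the load value is not shown):

qmgr -c "create resource last_load type=float, flag=h"
qmgr -c "create hook load_hook"
qmgr -c "set hook load_hook event = exechost_periodic"
qmgr -c "set hook load_hook freq = 60"
qmgr -c "import hook load_hook application/x-python default load_hook.py"
# then in sched_config: node_sort_key: "last_load LOW" ALL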

Any help will be appreciated.

Thanks,
Roy

Thank you for trying this out and sharing the results with us.

Can you use this instead and see whether it helps:
node_sort_key: "ncpus low assigned" ALL

Make sure to kill -HUP the scheduler after making any changes to the sched_config file (so that the scheduler re-reads it).

Thank you @adarsh, but setting:
node_sort_key: "ncpus low assigned" ALL

did not make any difference.

Thanks,
Roy

I don't think changing provision_policy will make a difference. It is just a way to turn off the node bucket node-search algorithm, and that can only help speed things up. The node bucket algorithm kicks in when you request your nodes with -l place=excl. I'd revert that change.
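
For example, a request along these lines (the select values and script name are just placeholders) is the kind of job that goes through the bucket code path:

qsub -l select=2:ncpus=32:mem=64gb -l place=excl job.sh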

Do you have strict_ordering on? Having backfill_depth=0 will cause your system to idle. If you don't have strict_ordering on, it won't make a difference until jobs start starving (wait time > 1 day). After that, it will start idling your system while it waits for the starving jobs to run. Regardless, if it kicks in and starts idling your system, it will only make your cycle faster: once one job can't run, the scheduler just ignores the rest.

Having node_sort_key with "unused" really only slows things down with a significant number of nodes (many thousands). If your cluster is smaller than that, feel free to use it. It doesn't sound like it was the cause of your issues anyway.

At this point you're going to have to look through the scheduler logs and see where the scheduler is spending its time. If there is a significant amount of time between the start of the cycle and the first 'Considering job to run', then it's spending a lot of time querying the universe. This means you have a significantly sized system or workload (like 100k+ jobs) or there is a networking issue. If you see time between 'Considering job to run' and 'Job run', that also leads me to think you have a network issue between the scheduler and the server. There is likely a gap in timestamps between log lines somewhere in the log. Once you figure out where it is, raise the log level, look again, and see if you can narrow down exactly what the scheduler is doing.
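
One rough way to look for those timestamp gaps (a sketch; it assumes the default sched_logs location and that the quoted message strings match your version):

source /etc/pbs.conf
grep -E "Starting Scheduling Cycle|Leaving Scheduling Cycle|Considering job to run|Job run" $PBS_HOME/sched_logs/$(date +%Y%m%d)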

Bhroam
