Created node bucket messages in sched_logs

I don’t think changing provisioning_priority will make a difference. It’s just a way to turn off the node bucket node-search algorithm, and that algorithm can only help speed things up. The node bucket algorithm kicks in if you request your nodes with -lplace=excl. I’d revert that change.
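For context, the node-bucket code path is only used for jobs that request exclusive placement. A request like the following (job script name is illustrative) would go through it:

```
# Jobs requesting exclusive placement use the node-bucket search:
qsub -l select=4:ncpus=8 -l place=excl job.sh
```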

Do you have strict_ordering on? With it on, backfill_depth=0 will cause your system to idle: once one job can’t run, the scheduler just ignores the rest. If you don’t have strict_ordering on, it won’t make a difference until jobs start starving (wait time > 1 day). After that, it’ll start idling your system waiting for the starving jobs to run. Either way, if it kicks in and starts idling your system, it will only make your cycle faster.
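For reference, strict_ordering is a sched_config option, while backfill_depth is usually set through qmgr (exact file paths and attribute placement can vary by PBS version, so treat this as a sketch):

```
# In PBS_HOME/sched_priv/sched_config:
strict_ordering: false ALL

# backfill_depth is typically a qmgr-set attribute, e.g.:
#   qmgr -c "set server backfill_depth = 20"
```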

Having node_group_key set but unused really only slows things down with a significant number of nodes (many thousands). If your cluster is smaller than that, feel free to use it. It doesn’t sound like it was the cause of your issues anyway.
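If you do decide to drop it, node_group_key can be unset through qmgr (shown here as an illustration; check your version’s docs for whether it’s set at the server or queue level in your setup):

```
# Remove an unused node_group_key setting:
qmgr -c "unset server node_group_key"
```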

At this point you’re going to have to look through the scheduler logs and see where the scheduler is spending all of its time. If there is a significant amount of time between the start of the cycle and the first ‘Considering job to run’ message, then it’s spending a lot of time querying the universe. That means you either have a significantly sized system or workload (like 100k+ jobs) or there is a networking issue. If you see time between ‘Considering job to run’ and ‘Job run’, that also leads me to think you have a network issue between the scheduler and the server. There is likely a gap in timestamps between log lines somewhere in the log. Once you figure out where it is, raise the debug level, look again, and see if you can narrow down what exactly the scheduler is doing.
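To find that timestamp gap without eyeballing the whole log, a small script can walk the file and flag consecutive lines whose timestamps are far apart. This is a rough sketch: it assumes the usual ‘MM/DD/YYYY HH:MM:SS;…’ prefix on scheduler log lines, and the threshold and sample lines are made up for illustration.

```python
import re
from datetime import datetime

# Scheduler log lines typically start with "MM/DD/YYYY HH:MM:SS;..."
TS_RE = re.compile(r"^(\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2})")

def find_gaps(lines, threshold_secs=5):
    """Return (line_no, gap_seconds, line) for each line whose timestamp
    is more than threshold_secs after the previous timestamped line."""
    gaps = []
    prev = None
    for i, line in enumerate(lines, start=1):
        m = TS_RE.match(line)
        if not m:
            continue  # skip lines without a leading timestamp
        ts = datetime.strptime(m.group(1), "%m/%d/%Y %H:%M:%S")
        if prev is not None:
            delta = (ts - prev).total_seconds()
            if delta > threshold_secs:
                gaps.append((i, delta, line.rstrip()))
        prev = ts
    return gaps

# Illustrative sample; real logs would be read from the sched_logs file.
sample = [
    "03/15/2024 10:00:00;0080;pbs_sched;Job;;Starting Scheduling Cycle",
    "03/15/2024 10:00:42;0080;pbs_sched;Job;123.server;Considering job to run",
    "03/15/2024 10:00:43;0080;pbs_sched;Job;123.server;Job run",
]
for line_no, secs, text in find_gaps(sample):
    print(f"line {line_no}: {secs:.0f}s gap before: {text}")
```

In this sample the 42-second gap before ‘Considering job to run’ is exactly the kind of dead time worth investigating at a higher debug level.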

Bhroam
