Running large jobs efficiently on a cluster with various node sizes

mikescchen · September 22, 2020, 1:56am

Hi,
We have a cluster whose nodes have different numbers of CPUs mixed together.
With the current settings, we found that the smaller jobs (those requesting fewer CPUs) occupy the nodes.
So when a larger job is queued, the nodes do have free CPUs, but no single node has enough to run the larger job, so it keeps waiting in the queue.
Is there a general suggestion for improving the waiting time and efficiency in such cases?
For example, giving the nodes with fewer CPUs a higher priority, leaving the larger nodes to run the larger jobs?
Or maybe giving the larger jobs a higher priority?

Mike
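
To make the situation above concrete, here is a minimal Python sketch of the two placement orders. It is not PBS code: the node names, CPU counts, job sizes, and the `place_jobs` helper are all hypothetical, and it only illustrates why filling the smaller nodes first can leave a large node free for a large job.

```python
def place_jobs(node_cpus, job_cpus, smallest_first):
    """Place each job on the first node (in the chosen order) with enough free CPUs.

    Returns one entry per job: the node name, or None if the job has to wait.
    All names and sizes here are made up for illustration.
    """
    free = dict(node_cpus)      # free CPUs per node, starts at full capacity
    order = list(free)          # default: the order the nodes were listed in
    if smallest_first:
        # Prefer nodes with fewer total CPUs so big nodes stay empty longer.
        order.sort(key=lambda name: node_cpus[name])
    placement = []
    for need in job_cpus:
        chosen = next((n for n in order if free[n] >= need), None)
        if chosen is not None:
            free[chosen] -= need
        placement.append(chosen)
    return placement


nodes = {"big": 16, "small1": 4, "small2": 4}
jobs = [2, 2, 2, 2, 16]         # four small jobs, then one 16-CPU job

as_listed = place_jobs(nodes, jobs, smallest_first=False)
packed = place_jobs(nodes, jobs, smallest_first=True)
print(as_listed)   # the 16-CPU job cannot be placed
print(packed)      # the 16-CPU job lands on "big"
```

In the first run the small jobs eat into the 16-CPU node, so the large job waits even though 16 CPUs are free cluster-wide. In the second run the small nodes absorb the small jobs and the large node stays whole.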

adarsh · September 22, 2020, 6:24am

Note:

All jobs should request a walltime so that the scheduler can plan well.
Smaller jobs get backfilled in front of larger jobs and push them into the future. If jobs have a walltime defined and you use strict ordering with a backfill depth, this can be avoided.

1. Please check the help_starving_jobs option in sched_config,
or
2. Please enable strict_ordering with backfill_depth set to 4,
or
3. Set eligible_time_enable to true and implement a job_sort_formula.
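
As a rough illustration of the third option, the Python sketch below ranks queued jobs by a formula that rewards both the number of CPUs requested and the time a job has been eligible to run. This is only a model of the idea, not the scheduler's implementation: the `Job` class, the weights, and the sample jobs are assumptions made up for the example, and the actual formula syntax should be taken from the PBS documentation.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    ncpus: int              # CPUs requested by the job
    eligible_hours: float   # how long the job has been eligible to run


def priority(job, cpu_weight=1.0, wait_weight=2.0):
    """Hypothetical sort formula: bigger and longer-waiting jobs score higher."""
    return cpu_weight * job.ncpus + wait_weight * job.eligible_hours


queue = [
    Job("small-a", ncpus=2, eligible_hours=0.5),
    Job("large", ncpus=16, eligible_hours=3.0),
    Job("small-b", ncpus=2, eligible_hours=6.0),
]

# Highest score first, which is the order the jobs would be considered in.
ordered = sorted(queue, key=priority, reverse=True)
print([job.name for job in ordered])
```

With these made-up weights the large job is considered first, and a small job that has waited a long time still outranks a freshly submitted one, so small jobs are not starved either.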
