Optimizing OpenPBS Job Scheduling in a Heterogeneous HPC Environment

Hello everyone :sunglasses:,

Our high-performance computing (HPC) infrastructure is facing the challenge of effectively managing job distribution across a diverse cluster of nodes. Our environment includes a mix of standard CPU, specialized GPU, and other unique hardware configurations. To maximize resource utilization and job throughput, we’re seeking expert guidance on optimizing OpenPBS job scheduling.

We’re particularly interested in best practices for:

  • Resource Allocation: Implementing strategies to match jobs with suitable hardware, setting appropriate resource limits, and prioritizing nodes based on their capabilities.
  • Scheduling Policies: Developing effective scheduling policies that accommodate diverse workloads, ensure fair resource allocation, and prioritize critical jobs.
  • Performance Monitoring: Employing tools or methods to track job performance and resource utilization, and to identify bottlenecks for informed decision-making.
  • Troubleshooting: Avoiding common pitfalls and addressing challenges specific to mixed-cluster environments.

I also came across a resource/article, “Job performance is lower when scheduled through PBS data analytics framework,” which suggests I need to simplify my queries to reduce their complexity, or wait for the budget to reset before retrying the query.

We believe the collective knowledge of this community is invaluable in refining our OpenPBS configuration. Your insights and recommendations will be instrumental in achieving optimal efficiency and performance in our HPC environment.

Thank you :pray: for your time and expertise!!

My two cents as a retired system analyst with several years’ experience at a site with thousands of nodes (with similar, but not identical, hardware), hundreds of users, and dozens of application styles:

  • The primary goal should not be highest machine utilization. The goal is best user productivity, even when that lowers machine utilization.
  • Queueing theory says the fewer queues, the better. An exception in PBS is if the queues are used just to set default resource values. In that case, set the scheduler parameters round_robin and by_queue to false so all jobs sort into one list, irrespective of queue. You can still use the different queues to set priorities. (The first sketch after this list shows these settings, together with the node-priority and ordering ones from the next two bullets.)
  • This also applies to nodes. The more distinct node types you define, the more idle nodes you’ll have. Look at nested PBS placement_set and node_group_key attributes to see if they can replace exclusive types. Or, oftentimes, it suffices to give “special” nodes a lower priority than normal nodes so PBS can use them for normal jobs, but only after all normal nodes are in use.
  • Favor wide jobs (jobs needing more nodes, CPUs, or GPUs) over narrow jobs. Otherwise the wide jobs can starve waiting for enough resources to become available. You should set strict_ordering true. If you do this, you should also enable “backfill” so short, narrow jobs can squeeze in while the wide jobs are waiting for resources (backfill_depth=NN).
  • If you are considering preemption, try everything else first. Preemption lowers overall effective utilization and reduces the productivity of all non-special users.
  • One technique we used was to have a queue with dedicated resources, used only for short (2-hour) testing or debug runs and limited to one job per user. An external script examined the queue frequently, adding appropriate nodes to the queue when needed and freeing them when not. The idea is that you “spend” a few idle nodes in exchange for fast response to requests from the queue. (See the queue-definition sketch after this list.)
  • Encourage users to supply accurate walltime estimates (with a little extra, just in case things are slow that day). Better estimates allow the scheduler to backfill better.
  • If you have regularly scheduled dedicated times (e.g., monthly patching), make sure users know about “shrink to fit” jobs. The user specifies a minimum and a maximum walltime for the job. Normally, the maximum is used, but if that would collide with a dedicated time, the scheduler checks whether a shorter walltime (still at least the minimum) would fit. (See the shrink-to-fit submission sketch after this list.)
  • Speaking of regular patching, see if you can use a rolling reboot instead. That is, at the end of each job, have PBS check whether the node needs to be patched. If so, automatically take the node down, apply the patches, and reboot. The cluster, as a whole, stays running. (A rough epilogue sketch follows the list.)
  • Avoid “machine-size” jobs (jobs that use the whole cluster); half of the cluster is a better maximum. First, if your cluster is big, it’s almost impossible to have all the nodes physically in good shape at the same time (there’s always a flaky DIMM or network cable). Second, while that one user might be productive during this time, all the other users are blocked. Third, if you must run machine-size jobs, do it by extending your regular dedicated times. This reduces the number of times other users need to find something else to do.
  • Consider running a node health check (GitHub - mej/nhc: LBNL Node Health Check) before each job. There is no point in starting a job on a node if the job is going to blow up through no fault of its own (e.g., /var or /tmp is full). (A prologue sketch follows the list.)
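
To make the queue-sorting, node-priority, and ordering suggestions above concrete, here is a rough sketch of the relevant sched_config and qmgr settings. The parameter names (round_robin, by_queue, strict_ordering, node_sort_key, backfill_depth, node priority) are standard PBS ones, but the file path, node names, and values are only illustrative, so check them against your PBS version’s admin guide.

```bash
# Sketch only: paths, node names, and numbers are illustrative.
# In $PBS_HOME/sched_priv/sched_config, sort all jobs into one list,
# run them in strict order, and prefer higher-priority nodes:
#
#   round_robin: False      all
#   by_queue: False         prime
#   by_queue: False         non_prime
#   strict_ordering: True   all
#   node_sort_key: "sort_priority HIGH"   all

# Give "special" nodes a lower priority than normal nodes, and tell the
# scheduler how many top jobs to plan around when backfilling:
qmgr -c "set node gpu001 priority = 10"     # hypothetical node names
qmgr -c "set node cpu001 priority = 100"
qmgr -c "set server backfill_depth = 20"

# Have the scheduler reread sched_config (or restart it):
kill -HUP "$(pgrep -x pbs_sched)"
```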
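
For the short-test queue idea, a minimal qmgr sketch might look like the following; the queue name, limits, and node names are made up, and the node-shuffling script itself is left to you.

```bash
# Hypothetical "debug" queue: 2-hour cap, one running job per user.
qmgr -c "create queue debug queue_type = execution"
qmgr -c "set queue debug resources_max.walltime = 02:00:00"
qmgr -c 'set queue debug max_run = "[u:PBS_GENERIC=1]"'
qmgr -c "set queue debug enabled = true"
qmgr -c "set queue debug started = true"

# The external script can move a node in and out of the queue:
qmgr -c "set node node042 queue = debug"    # hypothetical node name
qmgr -c "unset node node042 queue"          # release it when idle
```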
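
A shrink-to-fit submission, for reference, uses the min_walltime/max_walltime resources; the script name and numbers below are placeholders.

```bash
# The scheduler may shorten the walltime (never below min_walltime) so
# the job still finishes before the next dedicated-time window.
qsub -l select=4:ncpus=32 \
     -l min_walltime=04:00:00 \
     -l max_walltime=12:00:00 \
     my_job.sh
```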
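
One way (of several) to wire up the rolling-reboot check is a MoM epilogue. Everything here is site-specific and only a sketch: the flag file and patch script are hypothetical, and offlining the node with pbsnodes assumes root on the node has operator privilege (otherwise route that step through a management host).

```bash
#!/bin/sh
# Sketch of $PBS_HOME/mom_priv/epilogue: after each job, patch and
# reboot the node if a (hypothetical) flag file says it is due.
if [ -f /var/run/needs_patching ]; then
    pbsnodes -o -C "patching" "$(hostname)"     # offline with a comment
    /usr/local/sbin/apply_patches_and_reboot &  # hypothetical script
fi
exit 0
```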
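
Finally, for the node health check, one common spot is the MoM prologue so NHC runs right before each job starts. The NHC path and the exit-code handling below are assumptions; check the prologue semantics for your PBS version, and note that NHC ships its own resource-manager integration notes.

```bash
#!/bin/sh
# Sketch of $PBS_HOME/mom_priv/prologue: refuse to start the job if the
# node health check fails (NHC install path is an assumption).
/usr/sbin/nhc || exit 1
exit 0
```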

As an aside, don’t try to hide information from users. If your security policies allow it, let your users see information about all jobs, reservations, and nodes. Users often notice things are “not quite right” before you do.

Also, there will always be users who “game the system”. The temptation is to write code to block the specific thing they are doing. It’s better to explain to the user what issue they are causing and work with them to come up with a more cooperative solution to their issue. If that doesn’t work, talk to the user’s manager. When that doesn’t work, then you write code.

You write:

Resource Allocation: Implementing strategies to match jobs with suitable hardware, setting appropriate resource limits, and prioritizing nodes based on their capabilities.

Only the users know what resources their jobs need. If you get them to specify what they want, PBS will take care of finding the right type of nodes.
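
As a sketch of what “specify what they want” looks like from the user side (custom resources such as ngpus or a site-defined gputype are assumptions about how your nodes are labeled):

```bash
# Plain CPU job: two chunks of 48 cores and 180 GB each.
qsub -l select=2:ncpus=48:mem=180gb my_cpu_job.sh

# GPU job: one chunk with 8 cores and 2 GPUs of a particular type.
qsub -l select=1:ncpus=8:ngpus=2:gputype=a100 my_gpu_job.sh
```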

More ramblings after a morning ramble.

Don’t share nodes among jobs. After all, if an app doesn’t use all the resources on multiple nodes, is it really an HPC job? Realistically, though, you will probably have some users who don’t have big jobs, but have thousands of smaller ones (e.g., Monte Carlo search or machine learning). The individual jobs probably use only a few cores and maybe a GPU, all within a single node. You could pack two or four such jobs onto a node, with only minor interference among the jobs.
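
If you do allow packing, the user-visible knob is the place statement; the resource counts below are illustrative.

```bash
# Small task that may share a node with other jobs:
qsub -l select=1:ncpus=4:mem=16gb -l place=shared small_task.sh

# Wide job that must get whole nodes to itself:
qsub -l select=4:ncpus=128 -l place=excl wide_job.sh
```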

There are tools to let the user bundle groups of identical tasks into one PBS job, where the tool takes care of spreading the tasks across the allocated nodes and cores. GNU Parallel (GNU Parallel - GNU Project - Free Software Foundation) is one example. There are also multiple Python implementations, if your users are more familiar with Python.
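
A minimal sketch of the GNU Parallel approach inside a single-node PBS job; tasks.txt (one command per line) and the resource numbers are assumptions.

```bash
#!/bin/bash
#PBS -l select=1:ncpus=32
#PBS -l walltime=04:00:00
# Run the bundled tasks, at most one per allocated core.
cd "$PBS_O_WORKDIR"
parallel -j "$NCPUS" < tasks.txt
# For multi-node jobs, GNU Parallel's --sshloginfile can be pointed at
# $PBS_NODEFILE to spread tasks across all allocated nodes.
```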

If you decide to share nodes, you need to become familiar with the pbs_cgroups hook. The hook does a pretty good job of isolating jobs on a shared node, reducing the instances where one job steals resources from another. The users need to carefully specify all of the resources they need (cores, memory, GPUs, etc.), because that’s all the hook will let them have. Even with pbs_cgroups, there will be some interference.
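
A sketch of what that looks like in practice; the hook config path is a placeholder, and the enable/import steps vary a bit between PBS versions, so treat this as an outline rather than a recipe.

```bash
# Enable the cgroups hook and (optionally) load a site-specific config:
qmgr -c "set hook pbs_cgroups enabled = true"
qmgr -c "import hook pbs_cgroups application/x-config default /path/to/pbs_cgroups.json"

# With cgroups enforcing the limits, users must request everything they
# actually need (cores, memory, GPUs):
qsub -l select=1:ncpus=4:mem=32gb:ngpus=1 -l place=shared my_task.sh
```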

There are downsides to jobs sharing nodes.

  • Unless you get perfect job packing, shared nodes will often have idle resources.
  • Shared nodes make rolling reboots more difficult. You must wait for the last job on the node to finish before starting the update/reboot, which results in even more idle resources.
  • Then, there is billing. It is easy to bill for exclusive nodes: you charge for the whole node for the whole walltime. With shared nodes, you need to come up with billing rates for each component of the node (cores, memory, GPUs, network bandwidth) and charge only for the resources you allocate to the job. As mentioned above, shared nodes are often not fully used, which means you bill less than you would for a full node.