New node placement algorithm

bhroam · April 26, 2018, 1:42am

I’ve written a design document for a new node placement algorithm that is under development. It is much faster than the normal node placement algorithm, but it can only be used in certain circumstances. It gets its speed by reducing the number of variables the scheduler needs to keep track of when placing jobs on nodes.

Take a look and let me know what you think.

https://pbspro.atlassian.net/wiki/spaces/PD/pages/325255181/Node+Bucketing+-+a+new+simpler+node+placement+algorithm

gmatthew · April 26, 2018, 5:17pm

I like it.

You point out that buckets as currently envisioned are incompatible with node-sorting techniques (e.g. avoid_provisioning). This is not an inherent characteristic of buckets. One could imagine allowing buckets to be defined with additional node attributes (e.g. current_aoe) so that avoid_provisioning (or other features) would work. In the limit, one could broaden the description of buckets until you end up with as many buckets as nodes, each bucket containing one node. The scheduling performance benefit of buckets comes primarily from minimizing the number of buckets, so it feels to me like a new scheme of tunables might be waiting for us to define: tunables that would give site admins control over the sweet spot between scheduling feature availability and scheduling performance.
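To make the trade-off concrete, here is a minimal Python sketch, not the scheduler's actual code. It assumes a bucket is simply the set of nodes that share the same values for a chosen list of attributes. The function `build_buckets`, the node dictionaries, and the attribute names `ncpus` and `mem_gb` are hypothetical; only `current_aoe` comes from the discussion above.

```python
from collections import defaultdict

def build_buckets(nodes, key_attrs):
    """Group node names by their values for key_attrs (hypothetical helper)."""
    buckets = defaultdict(list)
    for node in nodes:
        # Nodes with an identical signature land in the same bucket.
        signature = tuple(node.get(attr) for attr in key_attrs)
        buckets[signature].append(node["name"])
    return dict(buckets)

# Made-up nodes for illustration only.
nodes = [
    {"name": "n1", "ncpus": 16, "mem_gb": 64, "current_aoe": "rhel7"},
    {"name": "n2", "ncpus": 16, "mem_gb": 64, "current_aoe": "rhel7"},
    {"name": "n3", "ncpus": 16, "mem_gb": 64, "current_aoe": "sles12"},
    {"name": "n4", "ncpus": 32, "mem_gb": 128, "current_aoe": "sles12"},
]

# Few attributes in the key: few buckets, fewer features the scheduler can honor.
coarse = build_buckets(nodes, ["ncpus", "mem_gb"])
# Adding current_aoe splits buckets so provisioning state can be considered.
fine = build_buckets(nodes, ["ncpus", "mem_gb", "current_aoe"])
# The limiting case: one bucket per node.
per_node = build_buckets(nodes, ["name"])

print(len(coarse), len(fine), len(per_node))
```

Under this assumption, a tunable would amount to choosing which attributes go into the key: each attribute added can only keep the bucket count the same or increase it.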

bhroam · May 5, 2018, 1:24am

I’ve updated the design document. Placement sets are now supported. If they are in use, each placement set has its own group of node buckets. The algorithm is run N times, once for each placement set.
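A rough Python sketch of the per-placement-set loop follows. It is an illustration under stated assumptions rather than the design itself: each placement set maps to its own buckets, buckets are keyed only by `ncpus`, every chunk takes one whole free node, and the first placement set that can hold the job wins. The names `place_in_set`, `place_job`, and the `switch=A`/`switch=B` sets are hypothetical.

```python
def place_in_set(buckets, chunks_needed, ncpus_per_chunk):
    """Pick whole free nodes from buckets whose nodes are large enough."""
    chosen = []
    for (ncpus,), free_nodes in buckets.items():
        if ncpus < ncpus_per_chunk:
            continue  # nodes in this bucket are too small for a chunk
        # Take only as many nodes as are still missing.
        chosen.extend(free_nodes[: chunks_needed - len(chosen)])
        if len(chosen) == chunks_needed:
            return chosen
    return None  # this placement set cannot hold the whole job

def place_job(placement_sets, chunks_needed, ncpus_per_chunk):
    """Run the bucket search once per placement set, in order."""
    for set_name, buckets in placement_sets.items():
        nodes = place_in_set(buckets, chunks_needed, ncpus_per_chunk)
        if nodes is not None:
            return set_name, nodes
    return None

# Made-up placement sets, each with its own group of buckets.
placement_sets = {
    "switch=A": {(8,): ["a1", "a2"]},
    "switch=B": {(8,): ["b1", "b2", "b3"], (16,): ["b4"]},
}

print(place_job(placement_sets, 3, 8))
```

Here the three-chunk job does not fit in `switch=A`, which has only two free nodes, so the search moves on and places it in `switch=B`.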

There are a couple of new restrictions. The algorithm can’t be used if the job is suspended or checkpointed. When a job is suspended or checkpointed, we create a special select statement for it. Each chunk in that select statement has vnode=vn to make sure we place the job back on the resources it was originally running on. There is already a restriction for select=vnode jobs, so this is just a special case of that restriction.
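As an illustration of that check, the sketch below scans a select statement for any chunk that names a vnode. It is a simplified parser written for this example, not the scheduler's: it assumes chunks are separated by `+`, resources by `:`, and that no value contains a quoted `+` or `:`. The helper names `chunk_resources` and `names_specific_vnode` are hypothetical.

```python
def chunk_resources(select):
    """Split a select statement into one resource dict per chunk (simplified)."""
    chunks = []
    for chunk in select.split("+"):
        resources = {}
        for part in chunk.split(":"):
            # A leading chunk count such as "100" has no "=" and is skipped.
            if "=" in part:
                name, value = part.split("=", 1)
                resources[name] = value
        chunks.append(resources)
    return chunks

def names_specific_vnode(select):
    """True if any chunk pins itself to a vnode, which rules out buckets."""
    return any("vnode" in res for res in chunk_resources(select))

# A select of the kind described for a suspended or checkpointed job.
resumed = "1:ncpus=4:vnode=vn1+1:ncpus=4:vnode=vn2"
ordinary = "100:ncpus=1"

print(names_specific_vnode(resumed), names_specific_vnode(ordinary))
```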

The other new restriction is that the algorithm cannot be used on complexes with multi-vnoded hosts. A job can request a large chunk whose resources are spread across multiple vnodes of a single host. The bucket algorithm cannot do this resource spreading, and it can’t determine whether chunks require their resources to be spread across multiple vnodes.
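The following Python sketch shows the kind of case this restriction guards against. It only considers `ncpus`, and the vnode records and helper names (`multi_vnode_hosts`, `chunk_needs_spreading`) are made up for the example.

```python
from collections import defaultdict

def multi_vnode_hosts(vnodes):
    """Return the hosts that expose more than one vnode."""
    by_host = defaultdict(list)
    for vnode in vnodes:
        by_host[vnode["host"]].append(vnode)
    return {host: vs for host, vs in by_host.items() if len(vs) > 1}

def chunk_needs_spreading(vnodes_on_host, ncpus):
    """True if the chunk fits on the host but on no single vnode of it."""
    fits_single = any(v["ncpus"] >= ncpus for v in vnodes_on_host)
    fits_host = sum(v["ncpus"] for v in vnodes_on_host) >= ncpus
    return fits_host and not fits_single

# Made-up complex: hostA has two vnodes, hostB has one.
vnodes = [
    {"name": "hostA[0]", "host": "hostA", "ncpus": 16},
    {"name": "hostA[1]", "host": "hostA", "ncpus": 16},
    {"name": "hostB", "host": "hostB", "ncpus": 32},
]

multi = multi_vnode_hosts(vnodes)
# A 24-cpu chunk fits on hostA only by combining both of its vnodes.
print(sorted(multi), chunk_needs_spreading(multi["hostA"], 24))
```

A placement scheme that treats each vnode as an independent unit would never find a home for the 24-cpu chunk on `hostA`, even though the host as a whole has room.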

bhroam · May 12, 2018, 1:19am

I updated the design again. The bucket algorithm can now be used for place=free jobs (excl is still required). This means you can request -lselect=100:ncpus=1 -l place=free:excl and allow the scheduler to freely place your chunks. The design document has been updated to describe how this is accomplished.
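For a sense of what free placement changes, here is a small Python sketch comparing the minimum number of exclusive nodes a job would occupy under the two arrangements. It assumes identical nodes, counts only `ncpus`, and assumes scatter puts each chunk on its own node. The 16-cpu node size and the helper `min_nodes_needed` are hypothetical, and this says nothing about how the design document actually implements free placement.

```python
import math

def min_nodes_needed(chunks, ncpus_per_chunk, ncpus_per_node, arrangement):
    """Fewest whole nodes that can hold the job (illustrative only)."""
    if ncpus_per_chunk > ncpus_per_node:
        raise ValueError("a chunk does not fit on a single node")
    if arrangement == "scatter":
        return chunks  # one chunk per node
    # With free placement, chunks may be packed onto the same node.
    chunks_per_node = ncpus_per_node // ncpus_per_chunk
    return math.ceil(chunks / chunks_per_node)

# The request from the post: select=100:ncpus=1, on assumed 16-cpu nodes.
print(min_nodes_needed(100, 1, 16, "free"))
print(min_nodes_needed(100, 1, 16, "scatter"))
```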
