I’ve written a design document for a new node placement algorithm that is under development. It is much faster than the normal node placement algorithm, but it can only be used in certain circumstances. It gets its speed by reducing the number of variables the scheduler needs to track when placing jobs on nodes.
You point out that buckets as currently envisioned are incompatible with node-sorting techniques (e.g. avoid_provisioning). This is not an inherent characteristic of buckets. One could imagine allowing buckets to be defined with additional node attributes (e.g. current_aoe), which would allow avoid_provisioning (or other features) to work. In the limit, one could broaden the description of buckets until you end up with as many buckets as nodes, each bucket containing one node. The scheduling performance benefit of buckets comes primarily from minimizing the number of buckets, so it feels to me like a new scheme of tunables might be waiting for us to define: tunables that would give site admins control over the sweet spot between scheduling feature availability and scheduling performance.
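To make the tradeoff concrete, here is a minimal Python sketch, not the scheduler's actual code. It assumes a bucket is simply the group of nodes that share the same values for a chosen list of attributes. The node dictionaries, the `build_buckets` helper, and the sample values are all hypothetical; only current_aoe comes from the discussion above.

```python
from collections import defaultdict

def build_buckets(nodes, key_attrs):
    """Group node names by the values of the attributes in key_attrs."""
    buckets = defaultdict(list)
    for node in nodes:
        # Nodes with identical values for every key attribute share a bucket.
        key = tuple(node.get(attr) for attr in key_attrs)
        buckets[key].append(node["name"])
    return dict(buckets)

# Hypothetical node data, for illustration only.
nodes = [
    {"name": "n1", "ncpus": 16, "current_aoe": "rhel7"},
    {"name": "n2", "ncpus": 16, "current_aoe": "rhel7"},
    {"name": "n3", "ncpus": 16, "current_aoe": "sles12"},
    {"name": "n4", "ncpus": 16, "current_aoe": None},
]

coarse = build_buckets(nodes, ["ncpus"])                 # few buckets, fast
fine = build_buckets(nodes, ["ncpus", "current_aoe"])    # more buckets, more features
per_node = build_buckets(nodes, ["name"])                # the limit: one bucket per node

print(len(coarse), len(fine), len(per_node))
```

With this sample data the three definitions produce 1, 3, and 4 buckets. A tunable of the kind suggested above would amount to letting the site admin choose the attribute list.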
I’ve updated the design document. Placement sets are now supported. If they are in use, each placement set has its own group of node buckets. The algorithm is run N times, once for each placement set.
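As a rough illustration of running once per placement set, the Python sketch below keeps a separate group of buckets for each placement set and tries each set in turn. It is a simplification under stated assumptions, not the design itself: every chunk is assumed to consume one whole node, all chunks are identical, and a job is satisfied from a single bucket. The `switch` attribute, the function names, and the node data are hypothetical.

```python
from collections import defaultdict

def buckets_per_placement_set(nodes, pset_attr, key_attrs):
    """Return {placement set value: {bucket key: [node names]}}."""
    psets = defaultdict(lambda: defaultdict(list))
    for node in nodes:
        key = tuple(node.get(a) for a in key_attrs)
        psets[node.get(pset_attr)][key].append(node["name"])
    return {pset: dict(buckets) for pset, buckets in psets.items()}

def place_job(nodes_needed, psets):
    """Try each placement set in turn (one pass per set).

    Returns (placement set, chosen nodes) for the first set with a
    bucket large enough, or None if no placement set fits the job.
    """
    for pset, buckets in psets.items():
        for members in buckets.values():
            if len(members) >= nodes_needed:
                return pset, members[:nodes_needed]
    return None

# Hypothetical nodes grouped by the switch they hang off.
nodes = [
    {"name": "n1", "switch": "A", "ncpus": 16},
    {"name": "n2", "switch": "A", "ncpus": 16},
    {"name": "n3", "switch": "B", "ncpus": 16},
    {"name": "n4", "switch": "B", "ncpus": 16},
    {"name": "n5", "switch": "B", "ncpus": 16},
]

psets = buckets_per_placement_set(nodes, "switch", ["ncpus"])
result = place_job(3, psets)
print(result)
```

Here a three-node job does not fit in placement set A, so the second pass places it on n3, n4, and n5 in set B.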
There are a couple of new restrictions. The algorithm can’t be used if the job is suspended or checkpointed. When jobs are suspended or checkpointed, we create a special select statement for the job. Each chunk in the select statement has vnode=vn to make sure we place the job back on the resources it was originally running on. There is already a restriction for select=vnode jobs; this is just a special case of that restriction.
The other new restriction is that the algorithm cannot be used on complexes with multi-vnoded hosts. A job can request a large chunk whose resources are spread across multiple vnodes of a single host. The bucket algorithm cannot do this resource spreading, and it can’t determine whether chunks require their resources to be spread across multiple vnodes.
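The two restrictions above can be summarized as an eligibility check. The Python sketch below is only an illustration: `can_use_buckets`, `chunk_requests_vnode`, and the host-to-vnode-count mapping are hypothetical names, and the select parsing is deliberately naive (it splits on `+` and `:` and ignores quoting).

```python
def chunk_requests_vnode(select_spec):
    """True if any chunk in a select specification names a vnode."""
    for chunk in select_spec.split("+"):
        # A chunk looks like "N:res=val:res=val"; the leading count is optional.
        for part in chunk.split(":"):
            if part.split("=", 1)[0] == "vnode":
                return True
    return False

def can_use_buckets(select_spec, suspended_or_checkpointed, vnodes_per_host):
    """Apply the restrictions described above (illustrative only)."""
    if suspended_or_checkpointed:
        # Such jobs get a special select with vnode=vn on every chunk.
        return False
    if chunk_requests_vnode(select_spec):
        return False
    if any(count > 1 for count in vnodes_per_host.values()):
        # A chunk might need resources spread over several vnodes of one host.
        return False
    return True

single_vnode_hosts = {"host1": 1, "host2": 1}   # hypothetical complex
multi_vnode_hosts = {"host1": 2, "host2": 1}    # hypothetical complex

print(can_use_buckets("100:ncpus=1", False, single_vnode_hosts))
print(can_use_buckets("1:ncpus=4:vnode=vn1+1:ncpus=4:vnode=vn2", False, single_vnode_hosts))
print(can_use_buckets("100:ncpus=1", False, multi_vnode_hosts))
```

Only the first call passes. The second is rejected because its chunks name vnodes, and the third because one host has more than one vnode.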
I updated the design again. The bucket algorithm can now be used for place=free jobs (excl is still required). This means you can request -lselect=100:ncpus=1 -l place=free:excl and allow the scheduler to freely place your chunks. The design document now also describes how this is accomplished.