Filtering nodes per the job request

Altair asked me to start a new post on the topic of PP-507, one that starts clean and takes a fresh look at node filtering.

I talked with Dale Talcott and we came up with a list of high-level descriptions of qsub select scenarios. The scheduling context we’re dealing with:

a. Jobs do not share nodes.

b. We have 5+ different node models in our cluster; they differ by number of cores and amount of memory.

At present our users must specify which model(s) their job needs, and they may use 2 or more models in a single job.



High-level description of single chunk scenarios

  1. Select N nodes, PBS may choose any nodes that satisfy the request (no restriction on which models are chosen).

    e.g. select=22:ncpus=16

  2. Select N nodes, PBS must choose nodes that are all the same model.

  3. Place N ranks, PBS may choose any nodes that satisfy the request (no restriction on which models are chosen).

    e.g. select=352:ncpus=1

  4. Place N ranks, PBS must choose nodes that are all the same model.




High-level description of two chunk scenarios

  5. chunk1: Select M nodes, PBS may choose any nodes that satisfy the request (no restriction on which models are chosen).
    chunk2: Select N nodes, PBS may choose any nodes that satisfy the request (no restriction on which models are chosen).

  6. chunk1: Select M nodes, PBS may choose any nodes that satisfy the request (no restriction on which models are chosen).
    chunk2: Select N nodes, PBS must choose nodes that are all the same model.

  7. chunk1: Select M nodes, PBS must choose nodes that are all the same model.
    chunk2: Select N nodes, PBS must choose nodes that are all the same model (but may differ from chunk1).

  8. chunk1: Select M nodes, PBS must choose nodes that are all the same model.
    chunk2: Select N nodes, PBS must choose nodes that are all the same model as chunk1.

  9-12. Same as 5-8, but chunk1 is placing M ranks and chunk2 is placing N ranks.
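
For reference, a two-chunk request with no model restriction (scenario 5) can already be written with today's select syntax, where chunks are joined with “+”; the chunk counts and resource values below are made up for illustration:

    qsub -l select=4:ncpus=16:mem=64gb+22:ncpus=28:mem=128gb job.sh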

Given those scenarios, how might a user put together their select statement to get what they're asking for?




Additional considerations

i. We are working toward a new filesystem feature for our users that will take “extra” memory on their nodes and make it available as a distributed ramfs. This is likely a straightforward arrangement when jobs request N nodes: users just need to ask for enough memory to cover both their rank/process needs and the ramfs. It gets more complicated when users instead ask for N ranks to be placed, as in 9-12 above. We currently plan to place the burden on users to craft their select in the right way, but we wonder if Altair has suggestions on how this could be made easier for users with current PBS, and whether there is future work here to make it yet easier/more intuitive.
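
A purely illustrative sketch of the node-count case (all numbers are made up): if each node needs roughly 24gb for the ranks and should donate another 40gb to the ramfs, the user simply pads the per-chunk memory:

    qsub -l select=10:ncpus=16:mem=64gb job.sh    (24gb for ranks + 40gb donated to the ramfs, per node)

With a rank-count request such as select=160:ncpus=1, there is no obvious per-node chunk on which to hang that extra 40gb, which is the complication described above.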

-Greg

  • You can use a “queuejob” hook to craft the user’s select statement based on the topology of your cluster, or you can create certain select profiles based on the resources requested by the user (a minimal setup sketch follows this list).
  • You can reject jobs via the queuejob hook if their request does not match your site policy (a wrapper script would be useful to have more control and create a meaningful select statement).
  • The ramfs request can be based on mom_dyn_res (MoM dynamic resources); check this section of the PBS Pro Administrator Guide: 5.13.5.1 Dynamic Host-level Resources.
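
To make the hook suggestion concrete, here is a minimal setup sketch; the hook name shape_select and script shape_select.py are hypothetical placeholders:

    qmgr -c "create hook shape_select"
    qmgr -c "set hook shape_select event = queuejob"
    qmgr -c "import hook shape_select application/x-python default shape_select.py"

For the mom_dyn_res pointer, a dynamic host-level resource is declared in each node’s mom_priv/config with a shell-escape line whose script prints the currently available amount; the resource name ramfs_mem and the script path are likewise made up:

    ramfs_mem !/usr/local/sbin/report_ramfs_mem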

Models and Nodes:

  • you can tag the nodes with a custom resource (a string_array, host-level resource) that can hold the different models
    – you can then craft a select statement as below, joining chunks with “+”:
    qsub -l select=10:ncpus=20:model=Test+10:ncpus=10:model=NoTest -- /bin/sleep 100
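
For completeness, a sketch of how that resource and node tagging might be set up (node names and model values are made up; depending on PBS version the resource may also need to be listed on the “resources:” line in sched_config):

    qmgr -c "create resource model type=string_array, flag=h"
    qmgr -c "set node node001 resources_available.model = Test"
    qmgr -c "set node node002 resources_available.model = NoTest"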

Sorry if I have not understood your problem correctly or am not on the same page with what you described.

Thanks for the pointer to mom_dyn_res. We currently use a custom resource for the node model, but in order to support the scenarios I provided there will need to be some change made to the scheduler (at least).

Hi Greg,

First of all, thank you so much for posting your use cases on the forum. It will really help kick-start the discussion on this feature.
I have read the use cases you mentioned and have the following observations:

For use cases 1, 2, 3 and 4: I think PBS can do 1 and 3 even today without any change, and it can also do 2 and 4 if a specific job-wide placement spec is requested (see the example below).
For use cases 5 and 9, one can do this with PBS by not providing any specific placement; PBS will then choose the first set of nodes it can find.
For use cases 6 and 10, it seems like what we will need is a chunk-level placement spec. For example, a user could specify that he/she needs a chunk to run on nodes with the same resource value (place=group=model).
Use cases 7 and 11 are, I think, the tricky ones. You mention that the models may differ between chunks. If in some cases it is OK for the models to be the same, then users can specify a placement set for both chunks.
Use cases 8 and 12 can again be achieved by giving a job-wide placement spec, and then all chunks will run on the same model.
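
As a concrete illustration of that job-wide placement spec for use cases 2 and 4 (the chunk sizes come from the examples earlier in the thread; model is assumed to be a host-level custom resource and job.sh is a placeholder script):

    qsub -l select=22:ncpus=16 -l place=group=model job.sh     (use case 2: 22 node-sized chunks, all on one model)
    qsub -l select=352:ncpus=1 -l place=group=model job.sh     (use case 4: 352 single-rank chunks, all on one model)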

I know there is nothing like a chunk-wide placement set in PBS today, but looking at the use cases I feel that maybe what is needed here is a chunk-wide placement set rather than a filter (especially for the cases where you want the model to be the same within a chunk).

Please let me know what you think.

Sounds good, I’ll need to read up and play around more with placement specs.

Sounds reasonable.

I don’t quite follow - what is trickier about 7 and 11?

-Greg

I did not call it out correctly. Initially I thought that the two chunks must differ in model, but then I realized you said ‘may’. If the requirement were that they must differ between chunks, then it becomes interesting, because while finding a solution for one chunk we would have to make sure we do not use the same placement set as was used for the previous one.

Ahh, gotcha. At this time we don’t need a mechanism to enforce different choices.

Just wanted to point out that it’s critical for this feature to work with scheduling buckets (New node placement algorithm).

Continuing the discussion on this thread.

I disagree, @mkaro. What @gmatthew is requesting is not PP-507 (node filtering). He’s requesting node grouping at the chunk level. This cannot be achieved by culling the node list based on certain criteria.

To this end, I’m going to post here. If we need to move this discussion elsewhere, then we can create a new thread.

Let me go over his use cases one at a time:

Use case 1: With no restrictions, this is standard PBS placement.

Use case 2: This can be achieved with the place=group request. If a job requests -lplace=group=model, then all chunks will be placed on the same model type.

Use case 3: Again, with no restrictions, this is standard PBS placement.

Use case 4: Since this is at the job level, it can be achieved with -lplace=group=model. To PBS a chunk is a chunk; it can be a rank-sized or a node-sized chunk.

Use case 5: To PBS this is the same as #1 or #4. It is a set of chunks that need to be placed without any restrictions.

Use case 6: Now we’re getting into interesting territory. The finest-grain grouping PBS can do today is at the per-job level. This is grouping at the chunk level.

Use case 7: Once again per-chunk grouping, but two different chunks that can be placed on different models.

Use case 8: This is per-job grouping. It can be achieved with -lplace=group=model.

I’m not sure I fully understand 9-12. To me they seem to be the same as 5-8. If N and M are two arbitrary numbers, then saying N+M or M+N seems the same to me. Please correct me if I am mistaken.

I was having a conversation with @billnitzberg about per-chunk grouping and some questions came out of it.

  • When PBS is choosing the model to run a job/chunk on, which model should we attempt first? What type of model sorting should happen? Does it matter? If we use the standard placement set sort, it will be smallest to largest by ncpus and mem.
  • What are you trying to achieve with this feature? Job performance? Correctness? Overall Utilization?
  • Just curious, what type of jobs require two different chunks individually grouped on model?
  • Will you ever need multiple chunks grouped on the same model? Something like (chunk1+chunk2)+(chunk3+chunk4), where chunk1+chunk2 are placed on a single model and chunk3+chunk4 are placed on another model. This can help define syntax.
  • How often will this be used? How much of a performance penalty will you be willing to accept for this grouping feature? Grouping is basically a node search N times on smaller sets of nodes. This can get expensive (even with node buckets).

Thanks,
Bhroam

The difference is that 5-8 deal in nodes and 9-12 deal in ranks. You point out that, to PBS, a chunk is a chunk, so let’s assume this difference is unimportant.

Sorting should happen as it already does.

We are moving away from our current requirement that users must specify the model for every chunk. If a user does not specify model and accepts whatever model(s) are assigned by PBS then it should improve overall utilization and reward those users by allowing them quicker access to nodes that have come free.

The standard use case that requires two different chunks individually grouped on model is chunk1 asking for large memory resources and chunk2 asking for normal memory resources. In general we assume chunk2 can be satisfied using different models, but it may also be the case that chunk1 can be satisfied using different models. It would be good to allow the user to specify grouping for the chunks as needed.

I don’t think our site will need this.

The majority of our jobs will not need per-chunk grouping. We’re willing to search a few more buckets to help out those jobs that need it. :)

-Greg

Yeah, to PBS a chunk is a chunk. Now of course a job requesting nodes or a job requesting ranks might have a different place spec (scatter vs free). From your current use cases, only grouping is needed at the chunk level. Will you need the scatter/free placement as well?

If the difference is between memory requests, is there a reason the normal ‘mem’ resource isn’t enough? Or is it that certain models have larger amounts of memory, and you want to place just the chunks that require the larger amount of memory on those models instead of forcing all the chunks of the job onto the larger-memory models?

If the majority of your jobs will not need per-chunk grouping, then per-job grouping might work for you in a pinch. Per-job grouping can satisfy all your use cases in a sub-par way. In the use case where one chunk has grouping and the other chunk does not, PBS has freedom in where to place the second chunk; with per-job placement, that freedom is limited to the same group as the first chunk. In the case where you have two chunks which each need to be grouped on model, but the models may differ, the scheduler again has freedom in where to place each of the two chunks; with per-job placement, it is once again sub-par, limiting that freedom to having both chunks on the same model.
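
A sketch of that fallback (the chunk sizes, memory values, and model resource are assumptions, and job.sh is a placeholder): a job whose first chunk needs large memory and grouping, and whose second chunk does not, can still be submitted today with a job-wide group, at the cost of also pinning the second chunk to the same model:

    qsub -l select=1:ncpus=28:mem=500gb+10:ncpus=16:mem=64gb -l place=group=model job.sh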

I don’t mean that this isn’t a valuable enhancement. I’m just pointing out that you can kind of get what you want today, while you wait for us to do a better job of placing chunks in groups.

Bhroam