Placement set questions

I have been trying to figure out why qsub wasn’t giving me the behavior I was expecting. I started out thinking this was a hook problem, but I removed all the hooks and it still behaved the same way. I think I have tracked it down to the use/behavior of placement sets. I wanted to confirm some things.

First, I created a custom resource called system:

#
# Create and define resource system
#
create resource system
set resource system type = string
set resource system flag = h

I set the value on each vnode like this:
set node <vnode name> resources_available.system='crux'

I then tried:

qsub -l select=1:system=polaris

Since none of the nodes have system=polaris, I expected the job to be held, but it ran anyway. After much reading and trial and error, I did this:

set server node_group_enable = True
set queue workq node_group_key=system

I then ran the same qsub as above, but it still ran. However, when I changed the last node to have system=polaris it did run the job on that node. So, with that background, here are my questions:

  1. Is it correct that -l select <custom_resource>=<value> is completely ignored unless placement sets are enabled and the resource is set as a key?
  2. node_group_enable is server wide? You can’t/don’t enable it at the queue level?
  3. This seems to be an “optimization” or “suggestion”: I specified a resource constraint that couldn’t be met, yet the job still ran. Is there a way to make placement sets enforce the constraint as a hard requirement?

More generally, I am trying to restrict queues to use only specific nodes. I have a polaris queue and I want it to only use nodes that have system=polaris. I have other similar queue restrictions, but they are just variations on that. Are placement sets the way to do that or is there some completely different mechanism I should be trying?

I looked at Placement sets for fast vs. slow switches problems, but that just prioritizes use. I want a hard separation. Similarly, I looked at Node grouping - config problems. I tried adding set sched do_not_span_psets = True but when I removed the system=polaris it still ran the job on a node without that resource.

I would appreciate any thoughts you might have.

For the scheduler to look at a resource for scheduling, it needs to be added to the resources line in $PBS_HOME/sched_priv/sched_config.
That should solve the problem.
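For example, the custom resource could be appended to the existing resources line in $PBS_HOME/sched_priv/sched_config (the default list below is illustrative and may differ by version; restart or HUP the scheduler afterward):

```
# $PBS_HOME/sched_priv/sched_config
resources: "ncpus, mem, arch, host, vnode, aoe, system"
```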

Unfortunately I’m not certain about your specific questions, maybe @bhroam can answer them?

Hey,
So placement sets are meant for grouping, not hard partitioning. This means you create sets of nodes: all the nodes for a job will come from one of the sets, but you don’t care which set. This started out for network topology, where you want all the nodes of your job on the same high-speed switch, but you don’t care which high-speed switch the job is placed on. As for whether node_group_enable can be set on the queue: not per se, but if you don’t set node_group_key at the server level, then only the queues with node_group_key set will use placement sets.

I think @vstumpf hit on what the problem is. If a resource is not in the sched_config resources: line, the resource will be ignored when placing a job. The only caveat is that boolean resources are always checked, regardless of whether they are on the resources line.

A couple of things about resource matching. If a resource is not on a node (resources_available.foo), then a request for that resource will not match, and the node will be ineligible for the job. If a job does not request the resource at all, the node is eligible. If this is a problem, I would look into setting either a resources_default or a default_chunk on the queue.
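If, for instance, you wanted every job in a polaris queue to pick up the resource without requesting it, a default_chunk could be set like this (the queue name and value are assumptions taken from this thread):

```
qmgr -c "set queue polaris default_chunk.system = polaris"
```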

One last thing: I would add only the resources that are actually required to the resources line. The scheduler uses this line to optimize the amount of work it has to do. Even if a resource is not requested, the scheduler has to do more work looking for it.

If adding resources to the resources line is too bothersome, and you don’t mind disabling this optimization, you can comment out the resources line. All resources will be checked then.

Bhroam

This was exactly the problem. Thanks!

Comments inline, but I have a couple of meta questions. On the scheduler I have experience with, a queue always had specific resources assigned to it, along the lines of a reservation. So the questions are:

  1. What is the best way to get similar behavior? Right now, I am using the system resource and have a hook that will add system= to the select statement for the queues where I want that partitioning, effectively getting the hard partition I want (at least I think it will). Is there a better way?
  2. Should I be trying to do it at all? I am trying to figure out if I am just not thinking “the PBS way” out of habit. Some of the scenarios are obvious. For instance, we had a bigmem queue, which only had the nodes with more memory. That can be accomplished by doing select mem=.
    Thoughts?
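A minimal sketch of the kind of queuejob hook described in point 1, assuming a queue named polaris and the system resource from earlier in the thread (it must run inside the PBS hook environment, so treat it as untested):

```python
# Sketch of a queuejob hook: tag each select chunk with system=polaris
# for jobs submitted to the (assumed) polaris queue.
import pbs

e = pbs.event()
j = e.job
try:
    if j.queue and j.queue.name == "polaris":
        sel = j.Resource_List["select"]
        if sel is not None and "system=" not in str(sel):
            # Append the resource to every chunk of the select spec
            chunks = [c + ":system=polaris" for c in str(sel).split("+")]
            j.Resource_List["select"] = pbs.select("+".join(chunks))
    e.accept()
except Exception as err:
    e.reject("polaris hook failed: %s" % err)
```

The hook would be installed with qmgr (create hook, set its event to queuejob, then import the script).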
bhroam wrote (July 29):

Hey,
So placement sets are meant for grouping, not hard partitioning. This means you create sets of nodes: all the nodes for a job will come from one of the sets, but you don’t care which set. This started out for network topology, where you want all the nodes of your job on the same high-speed switch, but you don’t care which high-speed switch the job is placed on. As for whether node_group_enable can be set on the queue: not per se, but if you don’t set node_group_key at the server level, then only the queues with node_group_key set will use placement sets.

So even if node_group_enable is not set at the server, setting node_group_key on a queue will get the placement set behavior? So, what is the benefit of setting node_group_enable at the server? I saw something about the server creating its own psets. Is that what setting node_group_enable = True does?

I think @vstumpf hit on what the problem is. If a resource is not in the sched_config resources: line, the resource will be ignored when placing a job. The only caveat is that boolean resources are always checked, regardless of whether they are on the resources line.

Yes, that was it.

A couple of things about resource matching. If a resource is not on a node (resources_available.foo), then a request for that resource will not match, and the node will be ineligible for the job. If a job does not request the resource at all, the node is eligible. If this is a problem, I would look into setting either a resources_default or a default_chunk on the queue.

Understood.

One last thing: I would add only the resources that are actually required to the resources line. The scheduler uses this line to optimize the amount of work it has to do. Even if a resource is not requested, the scheduler has to do more work looking for it.

Understood.

If adding resources to the resources line is too bothersome, and you don’t mind disabling this optimization, you can comment out the resources line. All resources will be checked then.

Not bothersome. I just didn’t know I needed to do that.

No, you need node_group_enable for ANY placement sets to work, at the server or queue level. I think Bhroam mentioned that you can skip setting node_group_key (not node_group_enable) at the server level.

There are several ways to do this. You can assign nodes to queues. This is the best way to make sure that there is a 1:1 mapping from nodes to queues. No jobs outside that queue will run on the nodes. To do this you set the ‘queue’ attribute on the nodes to the queue in question. The other way is to set a default_chunk.system on the queue. This will make sure that any job that doesn’t request the system resource gets one. This does allow jobs in other queues to request the system resource for that queue, and it also allows jobs in that queue to request a different system resource. Nodes assigned to queues are probably your best bet.
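In qmgr terms, the node-to-queue association could look like this (the vnode name is a placeholder and the polaris queue is an assumption from this thread):

```
qmgr -c "set node <vnode name> queue = polaris"
```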

We have many customers who have multiple queues for different resources. They either assign nodes to queues or use a custom resource of some kind (string/boolean). The custom resource has the benefit of allowing multiple queues to be funneled to the same nodes. If you assign nodes to a queue, only jobs from that queue can run on those nodes.

node_group_enable is a system-wide switch which turns placement sets on or off. If you turn it off, no placement sets will be used (queue or server). If you turn it on but don’t set node_group_key at the server, only the queues which have node_group_key will use placement sets. That would be a weird situation, though: some queues would be scheduled via placement sets, and others would use one pool of all nodes. The usual setup is one overriding node_group_key at the server level, overridden at the queue level where needed.
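That usual setup might look like this in qmgr (the switch resource and polaris queue are illustrative assumptions; any resource used as a key must already exist):

```
qmgr -c "set server node_group_enable = True"
qmgr -c "set server node_group_key = switch"
qmgr -c "set queue polaris node_group_key = system"
```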

Bhroam

Thanks all. This has been very helpful.

I will likely not set it at the vnode level; that just seems too restrictive to me. Unless there were some situation where there would “never” be a need for sharing (say, someone bought the nodes and they were dedicated), in our world it would end up getting in our way.

Regards,

Bill

Hi Bill,

You could also take a look at section 4.9.2 in the Admin Guide for more information on associating nodes with queues.

Regards,
Sam Goosen

Sam,

Thanks for the reference. I had not found that yet; I had largely ended up there by accident, with the exception of the default chunk on the queue. I will need to test that. It may work fine in some cases, but since it is a default, I need to see whether it can be overridden; if so, it might not be restrictive enough.

I also didn’t understand the difference between doing this and placement sets until this discussion, so I was conflating the documentation for those two things.

Bill

A default can definitely be overridden. If a queue has a default_chunk or resources_default, it will be overridden if the job requests the resource.

Now if this were a job-wide (non-select) resource, you could get close to what you want. If you set resources_max/min/default all to the same value, then a job will be rejected unless it requests the right value or nothing (in which case it picks up the default). We don’t have max/min attributes for chunks, though. The best you’ll be able to do with chunks is to write a qsub hook to reject the job.
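For a job-wide (non-select) resource, the max/min/default trick might look like this in qmgr, assuming a job-wide variant of the system resource and the polaris queue from this thread:

```
qmgr -c "set queue polaris resources_max.system = polaris"
qmgr -c "set queue polaris resources_min.system = polaris"
qmgr -c "set queue polaris resources_default.system = polaris"
```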

Bhroam

Thanks for the info. I need to think about the best way to do this.