I have been trying to figure out why qsub wasn’t giving me the behavior I was expecting. I started out thinking this was a hook problem, but I removed all the hooks and it still behaved the same way. I think I have tracked it down to the use/behavior of placement sets. I wanted to confirm some things.
First, I created a custom resource called system:
#
# Create and define resource system
#
create resource system
set resource system type = string
set resource system flag = h
I set the value on each vnode like this:
set node <vnode name> resources_available.system='crux'
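For anyone wanting to reproduce this, here is a minimal sketch of how the per-vnode assignment could be scripted. The function name and the vnode names in the usage line are hypothetical, and the loop only prints the qmgr directives so they can be reviewed before being applied.

```shell
# Hypothetical helper: print one qmgr directive per vnode.
# $1 is the value for the custom "system" resource; the rest are vnode names.
print_system_directives() {
  value="$1"
  shift
  for vnode in "$@"; do
    # Same directive form as shown above, one line per vnode.
    printf "set node %s resources_available.system='%s'\n" "$vnode" "$value"
  done
}
```

Something like `print_system_directives crux x1 x2 | qmgr` should then apply the directives, since qmgr reads directives from standard input; x1 and x2 are placeholder names.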
I then tried:
qsub -l select system=polaris
Since none of the nodes have system=polaris, I expected the job to be held, but no, it ran. After much reading and trial and error, I did this:
set server node_group_enable = True
set queue workq node_group_key=system
I then ran the same qsub as above, but it still ran. However, when I changed the last node to have system=polaris, it did run the job on that node. So, with that background, here are my questions:
- Is it correct that -l select <custom_resource>=<value> is completely ignored unless placement sets are enabled and the resource is set as the key?
- node_group_enable is server wide? You can’t/don’t enable it at the queue level?
- This seems to be an “optimization” or “suggestion” rather than a requirement. I specified a resource constraint that couldn’t be met, and I was surprised the job ran. Is there a way to make placement sets enforce the constraint strictly?
More generally, I am trying to restrict queues to use only specific nodes. I have a polaris queue and I want it to only use nodes that have system=polaris. I have other similar queue restrictions, but they are just variations on that. Are placement sets the way to do that or is there some completely different mechanism I should be trying?
I looked at Placement sets for fast vs. slow switches problems, but that just prioritizes use. I want a hard separation. Similarly, I looked at Node grouping - config problems. I tried adding set sched do_not_span_psets = True, but when I removed the system=polaris setting, it still ran the job on a node without that resource.
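In case it helps anyone checking the same thing, this is a rough sketch of how one could list which vnodes advertise a given system value. It assumes the usual pbsnodes -a layout (an unindented vnode name followed by indented "resources_available.<name> = <value>" lines); the function name is hypothetical.

```shell
# Hypothetical helper: read pbsnodes -a style text on stdin and print
# the names of vnodes whose resources_available.system equals $1.
vnodes_with_system() {
  awk -v val="$1" '
    /^[^ \t]/ { node = $1 }   # an unindented line starts a new vnode record
    $1 == "resources_available.system" && $3 == val { print node }
  '
}
```

Usage would be along the lines of `pbsnodes -a | vnodes_with_system polaris`; empty output would mean no vnode advertises that value.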
I would appreciate any thoughts you might have.