I have been trying to figure out why qsub wasn’t giving me the behavior I was expecting. I started out thinking this was a hook problem, but I removed all the hooks and it still behaved the same way. I think I have tracked it down to the use/behavior of placement sets. I wanted to confirm some things.
First, I created a custom resource called system:
#
# Create and define resource system
#
create resource system
set resource system type = string
set resource system flag = h
I set the value on each vnode like this:
set node <vnode name> resources_available.system='crux'
I then tried:
qsub -l select system=polaris
Since none of the nodes have system=polaris, I expected the job to be held, but it ran anyway. After much reading and trial and error, I did this:
set server node_group_enable = True
set queue workq node_group_key=system
I then ran the same qsub as above, and the job still ran. However, when I changed the last node to have system=polaris, the job did run on that node. So, with that background, here are my questions:
- Is it correct that -l select <custom_resource>=<value> is completely ignored unless placement sets are enabled and the resource is set as a key?
- node_group_enable is server wide? You can’t/don’t enable it at the queue level?
- Placement sets seem to be an “optimization” or “suggestion” rather than a hard constraint. I specified a resource constraint that couldn’t be met, and I was surprised the job ran. Is there a way to make placement sets enforce the constraint instead?
More generally, I am trying to restrict queues to use only specific nodes. I have a polaris queue and I want it to only use nodes that have system=polaris. I have other similar queue restrictions, but they are just variations on that. Are placement sets the way to do that, or is there some completely different mechanism I should be trying?
I looked at the thread “Placement sets for fast vs. slow switches problems”, but that just prioritizes use. I want a hard separation. Similarly, I looked at “Node grouping - config problems”. I tried adding set sched do_not_span_psets = True, but when I removed system=polaris from the node, the job still ran on a node without that resource.
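For anyone who wants to reproduce this, here is a rough sketch of the configuration steps described above as a shell script. The vnode name node01 and the VNODE and RUN variables are placeholders for this example, not part of my actual setup or of PBS itself. By default the script only prints the qmgr commands; with RUN=1 it assumes qmgr is on the PATH and that you have PBS manager privileges.

```shell
#!/bin/sh
# Sketch only: replays the qmgr steps described in this post.
# VNODE and RUN are hypothetical knobs for this example, not PBS settings.
VNODE="${VNODE:-node01}"   # placeholder vnode name

# Print each command by default; set RUN=1 to pass it to qmgr for real.
run_qmgr() {
    if [ "${RUN:-0}" = "1" ]; then
        qmgr -c "$1"
    else
        printf 'qmgr -c "%s"\n' "$1"
    fi
}

# 1. Define the host-level string resource.
run_qmgr "create resource system"
run_qmgr "set resource system type = string"
run_qmgr "set resource system flag = h"

# 2. Tag a vnode with a value for that resource.
run_qmgr "set node ${VNODE} resources_available.system='crux'"

# 3. Enable placement sets keyed on the resource.
run_qmgr "set server node_group_enable = True"
run_qmgr "set queue workq node_group_key=system"

# 4. The extra setting I tried afterwards.
run_qmgr "set sched do_not_span_psets = True"
```

Running it as-is just lists the commands in order, which makes it easy to compare against the output of qmgr -c "print server" on your own system.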
I would appreciate any thoughts you might have.