Node grouping - config problems

Hi,

I am configuring PbsPro (v14.1.0) to use two kinds of nodes, labeled “compute” and “io”. Basically, “compute” are reasonably new HPC nodes, while “io” are older.

The idea is that:

A. No job should span the two groups (each job must fit within one of the groups).
B. A job may explicitly ask for compute-nodes and get those
(In principle a job may also ask for io-nodes, but that is not critical).
C. Jobs not asking for compute-nodes should run on io-nodes, if any are available (this could be a priority thing).
D. If (for some reason) no io-nodes are available, then all jobs must be able to run on the compute nodes.

Here is what I have found so far (on the present test system, #compute > #io):

I have tried to use the server settings node_group_enable and node_group_key to divide the nodes into two groups. That seems to work, but:
a. Jobs tend to run on the smaller of the two sets (which happens to be the io nodes in the present setup, but I would like to be able to control this). This behaviour seems to be documented, but I am not sure how to get around it.
b. If I ask for more nodes than are presently in "compute", then I get all the compute nodes plus extra nodes from the io group. I would rather have such a job request stay queued, or even fail, than span the two groups.

I have tried to configure the above by defining a new resource (as string or string_array) and then setting resources_available on each node. Example:

qmgr -c 'create resource nodetype type=string, flag=h'
vim /var/spool/pbs/sched_priv/sched_config    # allow scheduling to be based on nodetype
qmgr -c 'set node dn008 resources_available.nodetype="io"'         # for all io-nodes
qmgr -c 'set node dn101 resources_available.nodetype="compute"'    # for all compute-nodes
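
Concretely, the sched_config edit is just appending the new resource to the existing "resources:" line and then HUP-ing (or restarting) pbs_sched so it re-reads the file, roughly like this (the default list may differ between versions; keep whatever is already there and add nodetype):

# in /var/spool/pbs/sched_priv/sched_config
resources: "ncpus, mem, arch, host, vnode, nodetype"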

As an alternative, I have tried to define resources "io" and "compute" (as type long or boolean) and set them on each node, e.g.:
io-nodes:
qmgr -c 'set node NOD resources_available.io = True'
qmgr -c 'set node NOD resources_available.compute = False'
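
The resources themselves were created along the same lines as before, i.e. something like this (assuming host-level booleans):

qmgr -c 'create resource io type=boolean, flag=h'
qmgr -c 'create resource compute type=boolean, flag=h'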

However, if I then specify to get, say, one compute-node:

qsub -l compute=true …

then I get a reply (qstat job comment) noting:

comment = Can Never Run: Insufficient amount of queue resource: compute (True != False)

Similar if I try -l io=true.

However, if I use e.g. "-l compute=false", then the job starts on an io-node. So what's up with that?

Even with this "negated" setting, if I ask for more nodes than are present in a single set, I still get allocated extra nodes from "the other set", which is not what I want.
Presently, I have a setup with 5 compute nodes, so if I ask for six - like this:

qsub -I -l io=false,nodes=6

Then I get the 5 compute nodes plus one io-node.

I imagine that this is a “pretty common” setup to try to make.
Does anybody have ideas or links on what to try?
I have already browsed the admin and reference guides (not read cover-to-cover, though).
And I have tried to follow this recipe:
http://forum.pbsworks.com/index.php?/topic/176-node-properties-in-pbs-pro/
without much success.

Eventually, I might want to let the choice of queue select the node group, but for now I just want to make it work from qsub.

All kinds of hints, links and help will be much appreciated.

Thanks,

/Bjarne

A few comments…
You might want to try the following:
1. To sort the groups the way you want, there are some tricks you can play on the system. You can use fake nodes: create extra vnodes on one of the nodes in the smaller set and then offline them. This way you get control over the ordering of the groups.
2. You need to think about whether you really want to do the above exactly. If you set the default node resource to an io node and allow users to explicitly ask for compute nodes, then jobs will not span the groups, but you lose the priority ordering between the groups.

3. If you do not do 2, then a job will have no default node group requested, and it will span the groups if it is large enough.

To summarize, you may need to compromise, unless you use some hooks, which might not be available in the open-source PBS Pro (I am not quite sure).

HTH.

吴光宇|怀曦智能科技

Thank you for the comments.
I am not quite sure what, or how, to actually configure what you describe, and some of the setup might actually be simpler than you expect.

I am not even sure what you mean by "sort" here. I have already added a resource (as string/string_array) to divide the nodes into two groups.

This I simply do not get.

Yes, that would be ideal. And I won't mind if all users choose to use the compute nodes - we are on an operational system with no rogue users, i.e. all jobs are more or less spawned by the admin group or somebody in the same room. At some point we may take down the io-nodes, and preferably all jobs should then simply use the compute nodes.

I’ll dig some more for setting default resources.

That does not sound promising. The two kinds of nodes have different hardware and are on separate leaf switches, so I really never want any job to span the groups. Each job must fit in one group or be deferred. Obviously, I may set a limit on the maximum number of nodes that any job can use (equal to or lower than the number of nodes in the larger group).

I really thought that node groups would allow this. My present setting includes:

create resource nodetype
set resource nodetype type = string_array
set resource nodetype flag = h
set server node_group_enable = True
set server node_group_key = nodetype

and then using two different flavors for nodetype.
In addition, I have the “compute” resource as mentioned above. Presently, it seems better to define it as integer (long) rather than boolean, because I can then request exactly N compute-nodes with

-l select=N:compute=1
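
For that request to match, the resource has to be defined at host level and set on each compute node, along these lines (flag=h for a plain host-level resource, or flag=nh if you want it consumable; dn101 is just one example node):

qmgr -c 'create resource compute type=long, flag=h'
qmgr -c 'set node dn101 resources_available.compute = 1'    # repeat for each compute node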

However, I still have not figured out how to ensure that jobs never span the node groups, or how to make a unified interface that can either force a job onto the compute group or let it default to the io group while still being allowed to use the compute nodes.

Hints are still welcome.

/Bjarne

Hello Bjarne,
Let me provide some assistance. I’ll refer to A-D in your original post.

First off, the flag you want to set is do_not_span_psets (qmgr -c 's sched do_not_span_psets=True'). This will make sure no job spans placement sets. Such jobs will get a 'Can Never Run' comment. This enforces A.

I’d suggest against using the compute/io resources. Users who want to request compute nodes can just request the placement set resource (nodetype in your case). They’d request qsub -l select=N:nodetype=compute. You can also add a default_chunk.nodetype=io. Doing so will break D though. The placement set sort should be sufficient, so setting the default_chunk should not be required. This enforces B.
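
Spelled out (N, the queue name workq, and job.sh are just placeholders), that looks something like:

qsub -l select=N:nodetype=compute job.sh
qmgr -c 'set queue workq default_chunk.nodetype = io'    # optional default (server or queue level); note this breaks D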

PBS enforces a smallest set first sort on placement sets. There is no way to change this sort. Now with that being said, you can play with the scheduler’s world view a bit and affect the sort. This is what wgy was getting at. You create fake vnodes (with ncpus and memory) and then offline them. They won’t be used, but they will be considered in the calculation for the placement set sort. If at any point your io placement set grows larger than your compute set, you can create fake vnodes and add them to the compute set. This enforces C and D.
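
A minimal sketch of the fake vnode trick (names and sizes here are made up; you could also add the vnodes via a vnode definition file on an existing MoM):

qmgr -c 'create node fakec01'
qmgr -c 'set node fakec01 resources_available.ncpus = 20'
qmgr -c 'set node fakec01 resources_available.mem = 64gb'
qmgr -c 'set node fakec01 resources_available.nodetype = compute'
pbsnodes -o fakec01    # offline it so nothing is ever scheduled on it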

The reason your io/compute booleans were not working was because of the way you requested your job. Requesting qsub -lresource=True is requesting the resource at the queue or server level. I suspect you didn’t set a resources_available.io at the queue level. This would default the boolean to false. The correct way to make your request is qsub -lselect=N:io=True. I suggest against using the old nodes syntax. The select syntax is much more powerful.
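
In other words, the difference is roughly this (job.sh is a placeholder):

qsub -l io=True job.sh             # job-wide request, checked against queue/server resources_available.io
qsub -l select=1:io=True job.sh    # chunk-level request, checked against the node's resources_available.io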

I hope this helps!
Bhroam


Hi Bhroam,

Thank you for the detailed answer, which I have tried out.

Thanks. This works as described.
For reference (for other users reading this), the behaviour is described in the Admin Guide (v13.1), §4.8.33.4.ii.
Upon submission, the sched-log spits out (example):

…;pbs_sched;Job;.;Considering job to run
…;pbs_sched;Job;.;Can’t fit in the largest placement set, and can’t span placement sets
…;pbs_sched;Job;.;Job will never run with the resources currently configured in the complex

The job status gets a comment (seen by eg qstat) reading:

comment = Can Never Run: can’t fit in the largest placement set, and can’t span psets

The job is still stuck in the queue, though. I will look for a setting that makes it possible to automagically kick such jobs out of the queue entirely. (This is a lower-priority issue, but as none of our real jobs are started interactively, I would like an automatic failure to happen.)

Actually, this happens to be perfectly OK in our case. Firstly, the nodes are not identical (the compute nodes are 20-core Haswell systems to be used for e.g. HPC/MPI work, while the io nodes are older 12-core Nehalem systems). Secondly, we do not have any lusers on the system. All jobs are under the control of a small handful of developers setting up operational job systems. Some jobs really must run on the compute nodes, while others could run on either group. I could probably set the selection up based on two different queues, but both kinds of jobs will be issued by the same operational user.

This is a cute trick, and I will keep it in mind - and write it down on our internal howto/tricks list for PBS Pro. For the foreseeable future, the group which I want to be the default happens to be the smaller one.
As we are aiming to power down nodes that are not in use, I expect that we will have to explicitly tell the PBS server that those fake nodes are truly "dead" and not to be woken (and not to complain about them failing to wake). I'll deal with that if the issue arises. +1 for raising this caveat.

I agree that the select syntax is more powerful. However, we have a few “specialities” to consider.
A. We will enforce job-exclusive use of nodes, so we will never have several jobs (or users) on a single node.
B. Some jobs (in particular IO-intensive jobs) just want exclusive access to “a node” (or a node count) to do “their thing”. These jobs interfere with other jobs as they use loads of system resources, so I still want a way to state “give me two nodes” - and not just get scheduled to a single vnode because the resources seem to fit.

To resolve this I have now created a new resource (“infiniband” - type long), and given each node one of these. Then I can ask for -lselect=2:infiniband and get exactly 2 nodes. The nodect limit (http://community.openpbs.org/t/problems-with-resources-max-nodes) then still is enforced and works.
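
In concrete terms, the setup boils down to something like the following (dn101 is an example node, job.sh a placeholder):

qmgr -c 'create resource infiniband type=long, flag=h'
qmgr -c 'set node dn101 resources_available.infiniband = 1'    # repeat for every node
qsub -l select=2:infiniband=1 job.sh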

Thanks a lot guys!

/Bjarne

-lplace=scatter (or in some cases vscatter) is your friend (assuming you specify one chunk per host/vnode you want).

Thanks, Alexis.

Yes, that will work - especially if I modify it to, say, -lplace=scatter:exclhost to avoid interference from other jobs. (Even though I have the MOMs configured to be job-exclusive, the scheduler seems to be unaware of this, so in order for it to do its thing - primarily estimating start times and doing backfilling effectively - I have had success telling it explicitly that the nodes will/should be used exclusively by one job.)
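
A full example of the kind of request I mean (chunk size, ncpus and script name are placeholders):

qsub -l select=2:nodetype=compute:ncpus=20 -l place=scatter:exclhost job.sh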

I am sorry that I did not cross-post to this thread when a similar response was given on a slightly different topic; that is what got me looking into the -lplace syntax.

Many thanks and regards,

/Bjarne