How to take a node offline

Hi, please excuse my simple questions - I’m struggling to understand PBSPro, its terminology and how its documentation is structured.

I would like to reboot a handful of nodes. I would like to mark them offline so that no new jobs are submitted to them. I don’t want to do anything to the running jobs - I’m happy to wait until those jobs finish naturally before rebooting the machines.

I’m struggling to find a couple of what I consider relatively simple tasks.

First, I’d really like to get a one-line status of every node along with its node name. pbsnodes -a | grep ' state ' feels clumsy and lacks the node name.

As per a comment on my previous topic, I guess I’ll need to install jq and start writing a bunch of one liners to put into /usr/local/bin

Second, I’d like to take a select few offline as mentioned. On page AG-546 (section 14.4 “Administration, Managing machines” of the big guide pdf) we find references on how to take a vhost offline but nothing for hosts, so I’ll presume that they’re the same. There are references to machine state, but those aren’t defined anywhere in the section “managing machines”, nor is the definition linked, so it’s hard to parse the difference between offline and down.

The docs do explain how to take a machine offline and suspend the jobs using it, but not how to stop the scheduler from sending more jobs to the node and letting the running jobs finish. Then those docs finish.

There are references to the hooks section of the docs, so I followed that link to page 898, section 5.2.12 “Offlining Bad Vnodes”, which is not exactly what I want, and sure enough none of those options seem appropriate. They do mention setting comments per node, which I thought was interesting. When I look into how to read the comments set on any particular node, or all the comments on all the nodes, I can’t find that documentation either.

Any tips on the following would be appreciated:

  • how to mark a node as “draining”
  • how to list the states of all nodes with the node name
  • where the machine states are defined
  • how to list all the comments against any/all nodes

Please try this command to list all the nodes and their status
pbsnodes -aSj

To get details about individual node or job:
pbsnodes -v nodename
qstat -fx jobid

You can offline / clear all the nodes or a set of nodes with these commands:

to offline the nodes
pbsnodes -o node1 node2 node3
for i in node1 node2 node3; do qmgr -c "set node $i state=offline" ; done
Note: if you offline a node that is running a job, PBS Pro will allow the running job to run to completion and will make sure no new jobs are scheduled onto the offlined node.
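
As a rough sketch of the "offline now, reboot once idle" workflow described above: the helper names drain_nodes and busy_nodes and the PBSNODES_CMD override are made up for illustration, and the parser assumes the usual pbsnodes -a layout (node name unindented, attributes indented, running jobs shown on a "jobs = ..." line). Please check it against the output on your own site before relying on it.

```shell
#!/bin/sh
# Sketch only: drain_nodes and busy_nodes are hypothetical helper names.
# PBSNODES_CMD is an assumed override so the loop can be dry-run.

# Mark every node given as an argument offline, one pbsnodes call per node.
drain_nodes() {
    for node in "$@"; do
        "${PBSNODES_CMD:-pbsnodes}" -o "$node" || echo "could not offline $node" >&2
    done
}

# Read `pbsnodes -a` style output on stdin and print the nodes that still
# carry a "jobs = ..." attribute, i.e. nodes that still have running jobs.
busy_nodes() {
    awk '
        /^[^ \t]/                 { node = $1 }    # unindented line: node name
        $1 == "jobs" && $2 == "=" { print node }   # indented attribute line
    '
}

# Example (needs a live PBS server):
#   drain_nodes node1 node2 node3
#   pbsnodes -a | busy_nodes
```

Once busy_nodes no longer prints a node you offlined, that node should be idle and can be rebooted.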

To clear their status
pbsnodes -r node1 node2 node3
for i in node1 node2 node3; do qmgr -c "set node $i state=free" ; done

qmgr -c "set node NODENAME state=offline"
pbsnodes -o NODENAME

pbsnodes -aSj

pbsnodes -aSj
Note: machine states can be displayed and changed using qmgr (as stated above)

To list the comments on all nodes, use:
pbsnodes -vaS

The command lines below might also help, and you can create aliases for them to keep them handy:

pbsnodes -av | grep -e Mom -e comment |sed 'N;s/\n/ /'
pbsnodes -av | grep -e Mom -e comment | paste -d " " - -

Hope this helps.

Corrected 2019-11-06T00:00:00Z and updated the last comment. Sorry for not knowing this, and thanks to @arwild01 for pointing it out.

@adarsh comments are pretty good… but to the last comment… you can use

pbsnodes -vaS

to see a list of all your nodes, each on one line, where the last column is the comment currently set on that node.

Thank you @arwild01 – I did not know that (or look into the man pages :upside_down_face:)

Thank you! Much appreciated.

So for the first one I was looking for something more like this, thanks for getting me on track.

pbsnodes -aSj | tr -s ' ' | cut -d' ' -f 1,2
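
If the column layout of -S ever shifts, another option is to parse the long pbsnodes -a output instead. The following is only a sketch: node_attr is a made-up helper name, and it assumes the usual layout where the node name sits unindented and each attribute is on an indented "name = value" line.

```shell
#!/bin/sh
# node_attr is a hypothetical helper: it reads `pbsnodes -a` (or -av) output
# on stdin and prints "<node> <value>" for the attribute named in $1.
node_attr() {
    awk -v attr="$1" '
        /^[^ \t]/ { node = $1; next }     # unindented line starts a new node
        $1 == attr && $2 == "=" {
            value = $0
            sub(/^[^=]*= /, "", value)    # drop the leading "   attr = " part
            print node, value
        }
    '
}

# Example (needs a live PBS server):
#   pbsnodes -a  | node_attr state
#   pbsnodes -av | node_attr comment
```

The same function covers both the state listing and the per-node comments, so it could live in /usr/local/bin as a single script.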

Sorry, just to clarify - this is two ways to do the same thing?

Yes, that’s correct. You can use either one of those.

Just as a note, if you want to go the qmgr route, I’d do it all in one command with a comma-separated list of nodes:

qmgr -c "set node node1,node2,node3 state=offline"
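
If the node names are already in a shell list, a small helper can build that comma-separated string. This is just a sketch; join_nodes is a made-up name, not part of PBS.

```shell
#!/bin/sh
# join_nodes is a hypothetical helper: it joins its arguments with commas.
# The subshell keeps the IFS change from leaking into the caller.
join_nodes() {
    ( IFS=,; printf '%s\n' "$*" )
}

# Example (needs a live PBS server):
#   qmgr -c "set node $(join_nodes node1 node2 node3) state=offline"
```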

Bhroam

According to the pbsnodes man page, aren’t these two not exactly the same?
pbsnodes -o // on a non-Cray host with multiple vnodes, this means
// offline all vnodes

qmgr -c "set node state=offline" // offlines just a single vnode
Questions:
1. Are most hosts non-Cray?
2. What is the difference between a vnode and a natural node?
Thank you!

You can offline the nodes:

  1. qmgr : set node NODENAME state=offline
  2. qmgr : set node @default state=offline

vnode and natural node: I know it is a bit confusing.
A vnode, natural node, compute node, and execution node are one and the same.

Please refer to the note from the reference guide below:

In a multi-vnode host (created using a configversion2 file)
For example: cnode1, cnode1[0], cnode1[1]
cnode1: is the natural node or parent vnode
cnode1[0] and cnode1[1]: child vnodes
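
Going by the naming in that example only (child vnodes written as the parent name plus a bracketed index), the parent can be recovered from a child name with plain shell parameter expansion. natural_vnode is a made-up helper name, and the naming pattern is an assumption taken from the example above, not a rule from the guide.

```shell
#!/bin/sh
# natural_vnode is a hypothetical helper: it strips a trailing "[...]"
# suffix, so cnode1[0] becomes cnode1 and a plain cnode1 is left unchanged.
natural_vnode() {
    printf '%s\n' "${1%%\[*}"
}

# Example:
#   natural_vnode 'cnode1[1]'
```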