Hook to Take Nodes Offline

Hi,

I am working on a hook that takes nodes offline. Section 9.7 of the PBS hook documentation (https://www.pbsworks.com/pdfs/PBSHooks14.2.pdf) shows one way to do this, but my requirement is different: I need to offline nodes based on the job's exit code.

The code below writes an output file, pqr.txt, that contains the names of the nodes on which the job was submitted. Sample output: node1, node2 …
My objective is to offline all the nodes listed in pqr.txt.

If I use the following logic in my code:

if ((current_state == pbs.ND_OFFLINE) == 0):
    vnl[new_list[y]].state = pbs.ND_OFFLINE
    vnl[new_list[y]].comment = "offlined node as it is heavily loaded"
    print >> fd_out1, 'Ele is = ' + new_list[y]

only the node on which the job was submitted goes offline. I had requested 3 nodes, so all 3 should have gone offline.

Another approach I tried is to call a bash script as a subprocess from the hook below:

import pbs
import os
import re
import sys
import subprocess

e = pbs.event()
try:
    if e.job.in_ms_mom():
        exit_code = str(e.job.Exit_status)

        execution_vnode = str(e.job.exec_vnode)
        execution_vnode1 = execution_vnode.split('+')

        vnl = e.vnode_list
        new_list = []

        if int(exit_code) != 0:
            report_file3 = str("/home/centos/pqr.txt")
            pbs.logmsg(pbs.LOG_DEBUG, "report_usage file1 is %s" % report_file3)

            fd_out1 = open(report_file3, 'w+')
            print >> fd_out1, 'To: id@gmail.com'
            print >> fd_out1, 'From: id@ndsu.edu'
            print >> fd_out1, 'Subject: Node Taken offline'

            for x in range(len(execution_vnode1)):
                data = execution_vnode1[x].split(':')
                new_list.append(data[0][1:])

            for y in range(len(new_list)):
                current_state = pbs.server().vnode(new_list[y]).state
                if ((current_state == pbs.ND_OFFLINE) == 0):
                    vnl[new_list[y]].comment = "offlined node as it is heavily loaded"
                    print >> fd_out1, '' + new_list[y]
                else:
                    vnl[new_list[y]].comment = "exit status of the job is not negative"
            fd_out1.close()

            mail_cmd="/usr/sbin/sendmail -t \"PBS OSS\" < /home/centos/pqr.txt"
            pbs.logmsg(pbs.LOG_DEBUG, "mail_command  is %s" % mail_cmd)
            os.popen(mail_cmd)

            os.system('sh /home/centos/offline.sh')

except SystemExit:
    pass
except:
    pbs.logmsg(pbs.LOG_DEBUG, "report_usage: failed with %s" % str(sys.exc_info()))
    pass

The bash script, offline.sh, takes its input from a file:

#!/bin/bash

input="/home/centos/input.txt"
while IFS= read -r line
do
    if [[ "$line" = *node* ]];
    then
        pbsnodes -o "$line"
    fi
done < "$input"

The problem I am facing is that the bash script, when run as a subprocess, is unable to perform any action on the vnodes. I don't know whether I am calling the bash script correctly in my code. I have tried subprocess.call() and other methods, but none of them work. When I run the bash script on its own, it performs the desired action. This suggests the script is correct, so why does it not work when called from the hook?

pbsnodes -o is not the recommended method to offline a node.
The recommended method is qmgr -c "set node NODENAME state=offline"

You can use a mom_periodic hook to offline the nodes: the hook reads the file /home//pqr.txt and offlines the nodes based on its content.
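The qmgr form recommended above can be built as an argument list instead of a shell string, which avoids nested quote escaping. The following is a minimal sketch, not a tested hook: the helper name offline_commands is hypothetical, the qmgr path is the one used elsewhere in this thread, and the comment text is assumed to contain no single quotes.

```python
# Path used elsewhere in this thread; adjust it to your installation.
QMGR = "/opt/pbs/bin/qmgr"


def offline_commands(node, comment):
    """Return two qmgr argument lists: one offlines the node, one sets its comment.

    Hypothetical helper for illustration. Assumes `comment` has no single quotes.
    """
    return [
        [QMGR, "-c", "set node %s state = offline" % node],
        [QMGR, "-c", "set node %s comment = '%s'" % (node, comment)],
    ]


# Show what would be run. Each list could be handed to subprocess.call()
# on a host where qmgr is installed and the caller has manager privilege.
for argv in offline_commands("node1", "offlined node as it is heavily loaded"):
    print(" ".join(argv))
```

Because each command is a list, the text after -c reaches qmgr as a single argument and no shell is involved.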

Hi Adarsh,

Are you referring to Section 9.7, Example 9-14, in the Hook Guide?

Thanks,
Rakhen

Could you please share the mom_periodic hook?

Rakhen, it is a bit tricky, and I understand the problem now.
Some workarounds:

  1. Submit node-exclusive jobs, so that no other jobs can run on a node while one job is running there:
    qsub -l select=3:ncpus=3:mem=10gb -l place=excl -- /bin/sleep 100

  2. Use an execjob_begin hook to write the vnodes to a file, and then have a cron job read these files and run the commands in them.
    [ the hook below needs some updates in how it pipes the command to the file ]

import os
import sys
import pbs

e = pbs.event()
j = e.job
jobid = j.id

if e.type == pbs.EXECJOB_BEGIN:
        pbs.logmsg(pbs.LOG_DEBUG, "JOB NODES in EXECJOB BEGIN HOOK")
        pbs.logmsg(pbs.LOG_DEBUG, "JOB NODES in EXECJOB BEGIN HOOK %s " % str(jobid))
        job_nodes = str(j.exec_vnode)
        j.Variable_List['EXECUTION_NODES']= job_nodes
        pbs.logmsg(pbs.LOG_DEBUG, "EXECJOB NODES are is %s" % (job_nodes))
        chunk_list=job_nodes.split('+')
        node_list=[]
        for chunk in chunk_list:
                pbs.logmsg(pbs.LOG_DEBUG, "EXECJOB BEGIN CHUNK %s " % (chunk))
                chunk = chunk.split(':')[0]
                node_list.append(chunk.split('(')[1])
        for node in node_list:
                pbs.logmsg(pbs.LOG_DEBUG, "JOB NODES in EXECJOB BEGIN NODE %s " % str(node))
                # escape the inner quotes for the shell, and append so every node's command is kept
                cmd = "echo \"/opt/pbs/bin/qmgr -c \\\"set node " + str(node) + " state = offline\\\"\" >> /tmp/" + str(jobid) + ".txt"
                pbs.logmsg(pbs.LOG_DEBUG, "JOB NODES in EXECJOB BEGIN CMD %s " % str(cmd))
                os.system(cmd)
                cmd2 = "echo \"/opt/pbs/bin/qmgr -c \\\"set node " + str(node) + " comment = 'offline due to heavy load'\\\"\" >> /tmp/" + str(jobid) + ".txt"
                pbs.logmsg(pbs.LOG_DEBUG, "JOB NODES in EXECJOB BEGIN CMD %s " % str(cmd2))
                os.system(cmd2)
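The chunk parsing in the hook above can be isolated into a small function that is easy to test outside PBS. This is a sketch under assumptions: the function name vnodes_from_exec_vnode is hypothetical, and it assumes exec_vnode has the form (vnode:resource=value...)+(vnode:...), where a single chunk may also contain several vnodes joined by '+'.

```python
def vnodes_from_exec_vnode(exec_vnode):
    """Return the vnode names in an exec_vnode string, in order, without duplicates.

    Hypothetical helper for illustration only.
    """
    names = []
    # Parentheses only group vnodes into chunks, so drop them before splitting on '+'.
    for part in exec_vnode.replace("(", "").replace(")", "").split("+"):
        # The vnode name is everything before the first ':' of each part.
        name = part.split(":")[0].strip()
        if name and name not in names:
            names.append(name)
    return names


sample = "(node1:ncpus=3:mem=10gb)+(node2:ncpus=3:mem=10gb)+(node3:ncpus=3:mem=10gb)"
print(vnodes_from_exec_vnode(sample))  # ['node1', 'node2', 'node3']
```

Stripping the parentheses first means a part such as vn2:ncpus=1) from a multi-vnode chunk does not raise an IndexError, which chunk.split('(')[1] would do.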

Hi Adarsh,

I tried the code you shared in point 2. For me, it puts my job in the hold state.

Could you please try the code below?
With this code I get the names of the nodes on which the job was submitted, which shows that print >> fd_out1, '' + new_list[y] works fine. But my objective of taking the nodes offline is still not achieved.

qmgr -c 'create hook New_ExitStatus_NodeOfflin event="execjob_begin"'
qmgr -c 'import hook New_ExitStatus_NodeOfflin application/x-python default New_ExitStatus_NodeOfflin.py'

import pbs
import os
import re
import sys

e = pbs.event()

try:
    if e.type == pbs.EXECJOB_BEGIN:
        execution_vnode = str(e.job.exec_vnode)
        execution_vnode1 = execution_vnode.split('+')

        vnl = e.vnode_list
        new_list = []

        report_file3 = str("/home/centos/pqr.txt")
        pbs.logmsg(pbs.LOG_DEBUG, "report_usage file3 is %s" % report_file3)

        fd_out1 = open(report_file3, 'w+')
        print >> fd_out1, 'To: @gmail.com'
        print >> fd_out1, 'From: @gmail.com'
        print >> fd_out1, 'Subject: Node Taken offline'

        for x in range(len(execution_vnode1)):
            data = execution_vnode1[x].split(':')
            new_list.append(data[0][1:])

        for y in range(len(new_list)):
            cmd = "echo \"/opt/pbs/bin/qmgr -c \"set node " + str(new_list[y]) + "state = offline \" """+  " > /tmp/" + str(e.job.id) + ".txt"
            pbs.logmsg(pbs.LOG_DEBUG, "JOB NODES in EXECJOB BEGIN CMD %s " % str(cmd))
            os.system(cmd)
            print >> fd_out1, '' + new_list[y]
        fd_out1.close()

        mail_cmd="/usr/sbin/sendmail -t \"PBS OSS\" < /home/centos/pqr.txt"
        pbs.logmsg(pbs.LOG_DEBUG, "mail_command  is %s" % mail_cmd)
        os.popen(mail_cmd)

except SystemExit:
    pass
except:
    pbs.logmsg(pbs.LOG_DEBUG, "report_usage: failed with %s" % str(sys.exc_info()))
    pass

Output that I get: the node status is still free.

Thank you Rakhen, the hooks below worked for me.
I split the task into two hooks:

  1. execjob_begin – writes the /tmp/*_remove.txt file
  2. exechost_periodic – checks whether the local node name is in this file and, if so, offlines the node and sets its comment
    More can be added for cleanup and so on, but the basics work.

EXECJOB_BEGIN hook:

import pbs
import os
import re
import sys

e = pbs.event()

try:
    if e.type == pbs.EXECJOB_BEGIN:
        execution_vnode = str(e.job.exec_vnode)
        execution_vnode1 = execution_vnode.split('+')
        vnl = e.vnode_list
        new_list = []

        report_file3 = str("/home/centos/pqr.txt")
        pbs.logmsg(pbs.LOG_DEBUG, "report_usage file3 is %s" % report_file3)

        fd_out1 = open(report_file3, 'w+')
        print >> fd_out1, 'To: @gmail.com'
        print >> fd_out1, 'From: @gmail.com'
        print >> fd_out1, 'Subject: Node Taken offline'

        for x in range(len(execution_vnode1)):
            data = execution_vnode1[x].split(':')
            new_list.append(data[0][1:])

        for y in range(len(new_list)):
            # append (>>) so the file keeps one command per node instead of only the last node
            cmd = "echo \"/opt/pbs/bin/qmgr -c \\\"set node " + str(new_list[y]) + " state = offline\\\"\" >> /tmp/" + str(e.job.id) + "_remove.txt"
            pbs.logmsg(pbs.LOG_DEBUG, "JOB NODES in EXECJOB BEGIN CMD %s " % str(cmd))
            os.system(cmd)
            print >> fd_out1, '' + new_list[y]
        fd_out1.close()

        mail_cmd = "/usr/sbin/sendmail -t \"PBS OSS\" < /home/centos/pqr.txt"
        pbs.logmsg(pbs.LOG_DEBUG, "mail_command  is %s" % mail_cmd)
        os.popen(mail_cmd)

except SystemExit:
    pass
except:
    pbs.logmsg(pbs.LOG_DEBUG, "report_usage: failed with %s" % str(sys.exc_info()))
    pass

EXECHOST_PERIODIC hook:

import os
import sys
import pbs


pbs.logmsg(pbs.LOG_DEBUG,"START EXECHOST PERIODIC HOOK ")
local_node = pbs.get_local_nodename()
pbs.logmsg(pbs.LOG_DEBUG, "END EXECHOST PERIODIC HOST HOOK local node hostname %s  " % str(local_node))

vnl = pbs.event().vnode_list
txt_files = [f for f in os.listdir('/tmp') if f.endswith('_remove.txt')]
for file in txt_files:
    full_file_path=os.path.join('/tmp', file)
    pbs.logmsg(pbs.LOG_DEBUG, "END EXECHOST PERIODIC HOST HOOK filename in tmp is  %s  " % str(full_file_path))
    f = open(full_file_path, 'r').read()
    pbs.logmsg(pbs.LOG_DEBUG, "END EXECHOST PERIODIC HOST HOOK contents of file are %s  " % str(f))
    if local_node in f:
        pbs.logmsg(pbs.LOG_DEBUG, "END EXECHOST I am here")
        vnl[local_node].state = pbs.ND_OFFLINE
        vnl[local_node].comment = "offlined node as it is heavily loaded"
        pbs.logmsg(pbs.LOG_DEBUG, "END EXECHOST PERIODIC HOST HOOK offline node %s  " % str(local_node))
        #os.remove(full_file_path)
    #pbs.logmsg(pbs.LOG_DEBUG, "END EXECHOST PERIODIC HOST HOOK FULL FILE %s is removed " % full_file_path)
pbs.logmsg(pbs.LOG_DEBUG, "END EXECHOST PERIODIC HOST HOOK")
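One caveat with the periodic hook above: the test local_node in f is a substring match, so a host named node1 would also match a file that only mentions node10. A stricter check can extract the node names from the "set node <name> ..." lines and compare them exactly. The sketch below assumes the file contains the qmgr command lines written by the execjob_begin hook; the helper names nodes_listed and should_offline are hypothetical.

```python
import re

# Matches the node name in lines such as:
#   /opt/pbs/bin/qmgr -c "set node node1 state = offline"
_SET_NODE = re.compile(r"set node\s+(\S+)")


def nodes_listed(text):
    """Return the set of node names found in 'set node <name> ...' commands."""
    return set(_SET_NODE.findall(text))


def should_offline(local_node, text):
    """True only if local_node is listed exactly, not as part of a longer name."""
    return local_node in nodes_listed(text)


contents = '/opt/pbs/bin/qmgr -c "set node node10 state = offline"\n'
print(should_offline("node1", contents))   # False
print(should_offline("node10", contents))  # True
print("node1" in contents)                 # True: the substring test matches node10
```

In the hook, should_offline(local_node, f) could replace the local_node in f test without other changes.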

Hi Rakhen,

If I understand your problem correctly, you need to take nodes offline based on the last job's exit status.

You can use the execjob_epilogue event (which runs after job completion):
[root@vm1 vk]# cat test.py

import pbs


e = pbs.event()
j = e.job

vn = e.vnode_list
pbs.logmsg(pbs.LOG_DEBUG, "job stat: %s" % j.Exit_status)
if int(j.Exit_status) != 0:
    for node in str(j.exec_vnode).split('+'):
        node = node.split(':')[0].replace('(','').replace(')','')
        vn[node].state = pbs.ND_OFFLINE
        vn[node].comment = "Taking node offline"
    # call reject only after the loop: e.reject() ends the hook
    e.reject()

To send mail, you can either call a shell script through subprocess or use Python modules.
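As a rough illustration of the Python-module route, the notification can be built with the standard email package and handed to smtplib. This is only a sketch: the addresses are placeholders, the helper name build_offline_mail is hypothetical, and the send step is left commented out because it assumes an SMTP server listening on localhost.

```python
from email.mime.text import MIMEText


def build_offline_mail(nodes, sender, recipient):
    """Build a plain-text message listing the nodes that were taken offline.

    Hypothetical helper for illustration only.
    """
    msg = MIMEText("\n".join(nodes))
    msg["Subject"] = "Node Taken offline"
    msg["From"] = sender
    msg["To"] = recipient
    return msg


# Placeholder addresses; replace with real ones.
mail = build_offline_mail(["node1", "node2"], "pbs@example.com", "admin@example.com")
print(mail["Subject"])

# Sending is an assumption about the environment (SMTP server on localhost):
# import smtplib
# server = smtplib.SMTP("localhost")
# server.sendmail(mail["From"], [mail["To"]], mail.as_string())
# server.quit()
```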

@vishwaks: Rakhen wanted to offline the nodes based on this text file.

Thanks @adarsh, and yeah, a periodic hook is better for this one.

Hi Vishwaks,

To add to that, my objective is to take multiple nodes offline. For example, if a user requests 3 nodes at job submission and the job's Exit_status != 0, then all three nodes should go offline.

I tried similar code, but the problem is that only one node goes offline, whereas the others remain free after the job ends.

Also, the text file pqr.txt under discussion only records the names of the nodes on which the job was submitted, as fetched from the PBS environment.

Thanks,
Rakhen

Hi Rakhen,

I just checked and observed that the job structure is not shared between moms for a multi-node job, so the exit status will be non-zero on the primary mom and zero on the sister moms.

So, as I understand it, you need a script that reads the accounting logs for 'E' records, finds jobs with a non-zero exit status, and sets state=offline on their nodes (taken from the exec_vnode job parameter).
Run this script periodically (using a server periodic hook or crontab).

With this approach, we have to cache (or store) the last job ID or timestamp read, so that we avoid processing the same records again.
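The accounting-log approach described above could look roughly like the sketch below. It assumes each accounting record has the form "MM/DD/YYYY HH:MM:SS;TYPE;jobid;key=value key=value ..." and that 'E' records carry exec_vnode and Exit_status fields; the function name failed_jobs and the sample records are hypothetical, so check them against your own accounting files.

```python
import re
from datetime import datetime

_EXIT = re.compile(r"(?:^|\s)Exit_status=(-?\d+)")
_VNODE = re.compile(r"(?:^|\s)exec_vnode=(\S+)")


def failed_jobs(lines, last_seen=None):
    """Yield (timestamp, job_id, vnode_names) for 'E' records with non-zero exit status.

    Records stamped at or before `last_seen` are skipped, so a caller can
    store the newest timestamp it handled and pass it in on the next run.
    """
    for line in lines:
        fields = line.rstrip("\n").split(";", 3)
        if len(fields) < 4 or fields[1] != "E":
            continue
        stamp = datetime.strptime(fields[0], "%m/%d/%Y %H:%M:%S")
        if last_seen is not None and stamp <= last_seen:
            continue
        exit_match = _EXIT.search(fields[3])
        vnode_match = _VNODE.search(fields[3])
        if not exit_match or not vnode_match or int(exit_match.group(1)) == 0:
            continue
        chunks = vnode_match.group(1).replace("(", "").replace(")", "").split("+")
        yield stamp, fields[2], [c.split(":")[0] for c in chunks]


# Made-up records for illustration.
sample = [
    "03/01/2019 10:00:00;E;101.server;user=a exec_vnode=(node1:ncpus=1) Exit_status=0",
    "03/01/2019 10:05:00;E;102.server;user=a exec_vnode=(node1:ncpus=1)+(node2:ncpus=1) Exit_status=1",
    "03/01/2019 10:06:00;S;103.server;user=a exec_vnode=(node3:ncpus=1)",
]
for stamp, job_id, vnodes in failed_jobs(sample):
    print("%s %s" % (job_id, ",".join(vnodes)))
```

Comparing on the timestamp alone skips any unread record that shares the same second as the cached value, so storing the last job ID alongside it, as suggested above, is the safer choice.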
