We have a question about the execjob_prologue hook.
When a job is dispatched to a host, our execjob_prologue hook checks some setup before the job starts running. If something on the host is wrong, the hook rejects the job.
The basic function works fine. However, the job does not seem to be requeued to a different host: it is always dispatched back to the same host, runs 20 times, and then ends up in the HELD state.
Could you suggest how to avoid this?
You could offline the vnode if the health check fails. From the Hooks Guide:
```python
import pbs

# Offline every vnode assigned to this job, then reject the job:
vnlist = pbs.event().vnode_list
for v in vnlist.keys():
    vnlist[v].state = pbs.ND_OFFLINE
    vnlist[v].comment = "bad configuration"
pbs.event().reject("not accepting jobs")
```
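For completeness, a hook like the one above is registered with qmgr; the hook name `nodecheck` and the script path are placeholders for illustration:

```shell
qmgr -c "create hook nodecheck"
qmgr -c "set hook nodecheck event = execjob_prologue"
qmgr -c "import hook nodecheck application/x-python default /path/to/nodecheck.py"
```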
Thanks for the reply.
However, we run the health check at the pbsuser level, and I believe the offline action requires pbsadmin privileges.
Any suggestions on this?
+1 to @smgoosen's suggestion.
@squallabc: I wanted to get your input here. What would you like to do:

- With respect to the job? You could check the run_count on the job and put it on user hold, so that an admin can later qalter it or investigate why the job has a run_count of more than 1 (or X).
- With respect to the node that has issues? A root-level or user-level cron job on the nodes could run the health check.

Alternatively, would you be able to get the list of exec vnodes assigned to a job in a runjob hook, check their status via remote ssh from within the node-health script, accept the job if everything is OK, and otherwise put it back in the queue with a message? If an exec vnode in the list is a known bad one, reject the job and requeue it.
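A minimal sketch of the run_count/user-hold idea, assuming an execjob_prologue hook. The threshold `MAX_RETRIES` and the messages are assumptions, not PBS defaults; the decision logic is split into a plain function so it can be tested outside PBS:

```python
# Sketch only: MAX_RETRIES is a site-specific choice, not a PBS default.
MAX_RETRIES = 3

def should_hold(run_count, max_retries=MAX_RETRIES):
    """True once the job has already been re-run more than max_retries times."""
    return run_count is not None and int(run_count) > max_retries

# Inside the actual hook script this would be wired up roughly as:
#   import pbs
#   e = pbs.event()
#   if should_hold(e.job.run_count):
#       e.job.Hold_Types = pbs.hold_types("u")  # user hold; admin can release
#       e.reject("prologue health check failed repeatedly; job held")
```

Holding the job early this way avoids bouncing it 20 times before the server holds it on its own.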
Could I ask why you’d run a node health check as a user and not root?
You can set the hook’s fail_action to offline_vnodes. If the hook encounters an unhandled exception or hits the alarm, it will offline the vnodes.
This works even if the hook’s user is pbsuser.
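For reference, fail_action is a hook attribute set via qmgr; the hook name `healthcheck` below is just a placeholder:

```shell
qmgr -c "set hook healthcheck fail_action = offline_vnodes"
```

Since the MoM applies the fail_action itself when the hook raises an unhandled exception or hits its alarm, the hook body does not need admin privileges to get the vnode offlined.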
Some directories are accessible by the owning user only, so even root cannot read them. That's why we need to run the health check as pbsuser.
Thanks. Let me try this and report back.