We have a question about the execjob_prologue hook.
When a job is dispatched to a host, our execjob_prologue hook checks some setup before the job starts running. If something on the host is wrong, the hook rejects the job.
The basic function works fine. However, the job does not seem to be requeued to a different host: it is always dispatched back to the same host, runs 20 times, and then ends up in the HELD state.
Could you suggest how to avoid this?
You could offline the vnode if the health check fails. From the Hooks Guide:
```python
import pbs

# Offline every vnode assigned to this job, then reject the job:
vnlist = pbs.event().vnode_list
for v in vnlist.keys():
    vnlist[v].state = pbs.ND_OFFLINE
    vnlist[v].comment = "bad configuration"
pbs.event().reject("not accepting jobs")
```
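For completeness, a hook like the one above is registered with qmgr; the hook name `nodecheck` and the script path are placeholders for illustration:

```shell
qmgr -c "create hook nodecheck"
qmgr -c "set hook nodecheck event = execjob_prologue"
qmgr -c "import hook nodecheck application/x-python default /path/to/nodecheck.py"
```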
Thanks for the reply.
However, we run the health check at the pbsuser level, and I believe the offline action requires pbsadmin privileges.
Any suggestions on this?
+1 to @smgoosen's suggestion.
@squallabc: I wanted to get your input here. What would you like to do:

- With respect to the job? You could check the run_count on the job and put it on user hold, so that an admin can later qalter it or investigate why the job has a run_count of more than 1 (or X).
- With respect to the node that has issues? A root-level or user-level cron job on the nodes could run the health check.

Alternatively, would you be able to get the list of exec vnodes assigned to a job in a runjob hook, check their status via remote ssh from within the node-health script, accept the job if everything is OK, and otherwise put it back in the queue with a message? If an exec vnode in the list is a known bad one, reject the job and requeue it.
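A minimal sketch of the run_count/user-hold idea, assuming an execjob_prologue hook. The threshold `MAX_RETRIES` and the messages are assumptions, not PBS defaults; the decision logic is split into a plain function so it can be tested outside PBS:

```python
# Sketch only: MAX_RETRIES is a site-specific choice, not a PBS default.
MAX_RETRIES = 3

def should_hold(run_count, max_retries=MAX_RETRIES):
    """True once the job has already been re-run more than max_retries times."""
    return run_count is not None and int(run_count) > max_retries

# Inside the actual hook script this would be wired up roughly as:
#   import pbs
#   e = pbs.event()
#   if should_hold(e.job.run_count):
#       e.job.Hold_Types = pbs.hold_types("u")  # user hold; admin can release
#       e.reject("prologue health check failed repeatedly; job held")
```

Holding the job early this way avoids bouncing it 20 times before the server holds it on its own.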
Could I ask why you’d run a node health check as a user and not root?
You can set the hook’s fail_action to offline_vnodes. If the hook encounters an unhandled exception or hits the alarm, it will offline the vnodes.
This works even if the hook’s user is pbsuser.
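For reference, fail_action is a hook attribute set via qmgr; the hook name `healthcheck` below is just a placeholder:

```shell
qmgr -c "set hook healthcheck fail_action = offline_vnodes"
```

Since the MoM applies the fail_action itself when the hook raises an unhandled exception or hits its alarm, the hook body does not need admin privileges to get the vnode offlined.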
Some directories are accessible by the owning user only, so even root cannot read them. That's why we need to run the health check as pbsuser.
Thanks. Let me try this and report back.