When a job is dispatched to a host, an execjob_prologue hook checks some setup before the job starts running. If something on the host is wrong, the job is rejected.
The basic function works fine. However, the job does not seem to be requeued; it is always dispatched to the same host, runs 20 times, and then ends up in HELD status.
Could you suggest how to avoid this?
@squallabc: I wanted to get your input here. What would you like to do
with respect to the job? Do you want to check the run_count on that job and put it on user hold, so that an admin can later qalter the job or check why it has a run_count of more than 1 (or X)?
with respect to the node that has issues? For example, a cron job or a user-level cron job on the nodes.
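
For the job-side option (checking run_count and placing a user hold), here is a rough sketch of what a runjob hook could look like. The threshold `MAX_RUN_COUNT` and the helper `should_hold` are made up for illustration. It also assumes that a runjob hook is allowed to change `Hold_Types` on the job when it rejects the event, so please check that against the Hooks Guide for your PBS version.

```python
# Sketch of a runjob hook body; not tested against a live PBS server.
try:
    import pbs  # only available inside the PBS hook interpreter
except ImportError:
    pbs = None  # lets the helper below be exercised outside PBS

MAX_RUN_COUNT = 3  # hypothetical "X": hold once the job has run this often


def should_hold(run_count, limit=MAX_RUN_COUNT):
    """Return True when the job has already been started `limit` times or more."""
    return int(run_count or 0) >= limit


def main():
    e = pbs.event()
    j = e.job
    if should_hold(j.run_count):
        # Assumption: Hold_Types may be set in a runjob hook that rejects.
        j.Hold_Types = pbs.hold_types("u")
        e.reject("run_count is %s; job placed on user hold" % j.run_count)
    else:
        e.accept()


if pbs is not None:
    main()
```

With something like this, the job would stop cycling well before it reaches 20 runs, and an admin can look at it and release or qalter it afterwards.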
Would you be able to get the list of exec vnodes given to a job in a runjob hook and check their health from within the hook script over remote ssh? If the nodes are OK, proceed; otherwise, put the job back in the queue with a message. Or, if one of the exec vnodes in the list is a known bad one, reject the job so that it goes back in the queue.
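
A minimal sketch of that runjob idea follows. Several things here are assumptions: that the runjob event exposes the job's `exec_vnode` string, that vnode names are usable as ssh hostnames, that the account running the hook on the server host has passwordless ssh to the nodes, and that a site script exists at the placeholder path `/path/to/node_health_check.sh`. The helper names are invented for this example.

```python
# Sketch of a runjob hook body; not tested against a live PBS server.
import re
import subprocess

try:
    import pbs  # only available inside the PBS hook interpreter
except ImportError:
    pbs = None  # lets the helpers below be exercised outside PBS


def vnodes_from_exec_vnode(exec_vnode):
    """Extract unique vnode names, in order, from a string such as
    '(nodeA:ncpus=2)+(nodeB:ncpus=1+nodeC:ncpus=1)'."""
    names = []
    for part in re.split(r"[()+]", str(exec_vnode or "")):
        name = part.split(":", 1)[0].strip()
        if name and name not in names:
            names.append(name)
    return names


def ssh_health_check(host, timeout=10):
    """Hypothetical check: run a site script over ssh; exit status 0 means healthy."""
    cmd = ["ssh", "-o", "BatchMode=yes", "-o", "ConnectTimeout=%d" % timeout,
           host, "/path/to/node_health_check.sh"]  # placeholder path
    try:
        return subprocess.call(cmd) == 0
    except OSError:
        return False  # ssh could not be started; treat the node as unhealthy


def first_unhealthy(vnodes, is_healthy=ssh_health_check):
    """Return the first vnode that fails the check, or None if all pass."""
    for vnode in vnodes:
        if not is_healthy(vnode):
            return vnode
    return None


def main():
    e = pbs.event()
    bad = first_unhealthy(vnodes_from_exec_vnode(e.job.exec_vnode))
    if bad is not None:
        e.reject("vnode %s failed the health check; job stays queued" % bad)
    else:
        e.accept()


if pbs is not None:
    main()
```

Keep in mind that rejecting the run request does not by itself stop the scheduler from picking the same vnode again in the next cycle, so the bad node probably still needs to be taken out of service some other way.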