Is there any way to avoid executing jobs on an unstable node?
We recently had a hardware problem with a GPU that causes a segmentation fault in jobs executed on that node.
After the job fails with exit_status=139, PBS starts executing another job on that node, and it fails, and so on.
As a result, a single unstable node behaves like a black hole, making the whole complex useless.
I guess this kind of situation occurs frequently in large systems and that people have established best practices for dealing with it.
Could someone suggest a smart way to handle this situation with PBS Pro?
It would be my pleasure to learn from the wisdom of my predecessors.
You can use an exechost_periodic hook to run health checks on these nodes and, if there is an issue, put the node in the offline state. Manually, this can be done by running the command below:
qmgr -c "set node NODENAME state=offline"
Also, please refer to the PBS Pro Admin guide: https://pbsworks.com/pdfs/PBS14.2.1_BigBook.pdf
5.2.6 Offlining and Clearing Vnodes Using the fail_action Hook Attribute
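To make the hook idea above more concrete, here is a minimal sketch of what an exechost_periodic health-check hook could look like. It is only an illustration under some assumptions: that a failing `nvidia-smi` is a usable sign of the GPU problem (it may not catch the fault described above), and that the helper names `run_quietly` and `gpu_check_passes` are hypothetical names made up for this example. The `pbs` module only exists inside a running hook, so the import is guarded to let the check itself be tried out on its own.

```python
import subprocess


def run_quietly(argv):
    """Run a command, discard its output, and return its exit status."""
    proc = subprocess.Popen(argv, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    proc.communicate()
    return proc.returncode


def gpu_check_passes(command=("nvidia-smi",), runner=None):
    """Return True if the health-check command exits with status 0.

    The choice of nvidia-smi is an assumption; replace it with whatever
    test actually reproduces the failure on your hardware.
    """
    if runner is None:
        runner = run_quietly
    try:
        return runner(list(command)) == 0
    except OSError:
        # Command missing or not executable: treat the node as unhealthy.
        return False


try:
    import pbs  # provided by PBS only when the script runs as a hook
except ImportError:
    pbs = None

if pbs is not None:
    event = pbs.event()
    if not gpu_check_passes():
        # Mark the local vnode offline so the scheduler stops placing jobs here.
        node = pbs.get_local_nodename()
        vnode = event.vnode_list[node]
        vnode.state = pbs.ND_OFFLINE
        vnode.comment = "offlined by health-check hook: GPU check failed"
    event.accept()
```

A script like this would be imported into a hook with qmgr and attached to the exechost_periodic event with a suitable frequency; the hooks documentation has the exact steps. Setting a comment on the vnode makes it easier to see later why the node went offline.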
Thank you for your kind advice.
I understand it is necessary to detect the failure with my hook.
More essentially, it is necessary to clearly define the type/level of failure we need to automatically detect in our complex.
There is no single answer to the question and I’ll continue discussing with my colleagues.
Any comments and advice from experienced administrators would be appreciated.
Hooks can be used to periodically detect all types and levels of failures and can be designed to take the appropriate action for each.
If you can share with us the list of failures that need to be detected, that would be helpful for the discussion. The community will then be aware of it and can share their feedback.
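As one possible starting point for such a list, the symptom from the original post (jobs repeatedly ending with exit_status=139, which is 128 plus the signal number of SIGSEGV) could be turned into a simple rule. The sketch below is only an assumption about what a policy might look like: the function name `should_offline` and the threshold of three consecutive failures are hypothetical, and how the exit statuses are collected from PBS is deliberately left out.

```python
import signal

# Exit status reported when a job is killed by SIGSEGV (128 + 11 = 139).
SEGFAULT_EXIT = 128 + int(signal.SIGSEGV)


def should_offline(exit_statuses, threshold=3, bad_status=SEGFAULT_EXIT):
    """Return True if the last `threshold` jobs on a node all ended with bad_status.

    exit_statuses is the node's job exit statuses, oldest first.
    """
    recent = list(exit_statuses)[-threshold:]
    return len(recent) == threshold and all(s == bad_status for s in recent)
```

For example, `should_offline([0, 139, 139, 139])` is true, while a single successful job in between resets the streak. Requiring several consecutive failures avoids offlining a healthy node because of one buggy user program.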
Thank you for your kind suggestion.
So far we have no clear idea of “the list of failures”.
I’ll talk with you guys again when we come up with an idea.