Avoid using an unstable node

Ikki · April 11, 2018, 2:10am

Hi folks,

Is there any way to avoid executing jobs on an unstable node?

We recently had a hardware problem with a GPU that caused a segmentation fault in jobs executed on that node.
After a job fails with exit_status=139, PBS starts another job on that node, which also fails, and so on.
As a result, a single unstable node behaves like a black hole, making the whole complex useless.

I guess this kind of situation occurs frequently in large systems and that people have established best practices for dealing with it.
Could someone suggest a smart way to handle this situation with PBS Pro?
I would be glad to learn from those with more experience.

Regards,

adarsh · April 11, 2018, 10:42am

You can use an exechost_periodic hook to run health checks on these nodes and, if there is an issue, put the node in the offline state. Manually, this can be achieved by running the command below:
qmgr -c "set node NODENAME state=offline"

Also, please refer to the PBS Pro Admin Guide: https://pbsworks.com/pdfs/PBS14.2.1_BigBook.pdf
5.2.6 Offlining and Clearing Vnodes Using the fail_action Hook Attribute
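
As a rough illustration of the health-check idea, here is a minimal sketch of what such an exechost_periodic hook script might look like. It is not a tested production hook. The probe command (nvidia-smi), the helper names probe_ok and run_hook, and the comment text are all assumptions made for the example; replace the probe with whatever check actually exposes your failure. The pbs calls used (pbs.event(), pbs.get_local_nodename(), vnode_list, pbs.ND_OFFLINE) follow the pattern described in the hooks documentation, but please verify them against the guide for your PBS version.

```python
import os
import subprocess

try:
    import pbs  # provided by PBS only when this script runs as a hook
except ImportError:
    pbs = None  # lets the probe logic be tested outside PBS

# Assumption: an NVIDIA GPU whose failure makes nvidia-smi exit non-zero.
PROBE_CMD = ["nvidia-smi"]


def probe_ok(cmd):
    """Return True only if cmd can be started and exits with status 0."""
    try:
        with open(os.devnull, "w") as devnull:
            rc = subprocess.call(cmd, stdout=devnull, stderr=devnull)
    except OSError:
        # The probe binary is missing or cannot be executed.
        return False
    # A negative rc means the probe was killed by a signal (e.g. -11 for SIGSEGV).
    return rc == 0


def run_hook():
    """Mark the local vnode offline when the probe fails, then accept the event."""
    event = pbs.event()
    if not probe_ok(PROBE_CMD):
        node = pbs.get_local_nodename()
        vnode = event.vnode_list[node]
        vnode.state = pbs.ND_OFFLINE
        vnode.comment = "offlined by health check: GPU probe failed"
    event.accept()


if pbs is not None:
    run_hook()
```

If the script is saved as gpu_health.py (a hypothetical name), it could be installed along these lines: qmgr -c "create hook gpu_health", qmgr -c "set hook gpu_health event = exechost_periodic", qmgr -c "set hook gpu_health freq = 300", and qmgr -c "import hook gpu_health application/x-python default gpu_health.py". The 300-second interval is only an example value. Once the hardware is repaired, the node has to be cleared manually, since this sketch never brings a node back online.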

Ikki · April 23, 2018, 2:24am

Hi @adarsh,

Thank you for your kind advice.
I understand that I need to detect the failure with my own hook.
More fundamentally, we need to clearly define the types and levels of failure that should be detected automatically in our complex.
There is no single answer to that question, and I’ll continue discussing it with my colleagues.
Any comments and advice from experienced administrators would be appreciated.

Regards,

adarsh · April 23, 2018, 10:57am

Hooks can be used to periodically detect all types and levels of failure, and can be designed to take the appropriate action for each.
If you can share the list of failures that need to be detected, that will help the discussion. The community can then see it and share their feedback.

Thank you

Ikki · April 24, 2018, 4:19am

Hi @adarsh,

Thank you for your kind suggestion.
So far we have no clear idea of “the list of failures”.
I’ll talk with you guys again when we come up with an idea.

Regards,
