I have had several instances lately where a node on my cluster has locked up. I can’t ssh to it or ping it, so I basically have to power cycle it.
In all cases OpenPBS thinks the node is free. I guess I was under the mistaken impression that the server periodically checks the nodes and marks a node down if it is unreachable.
Can someone clarify how this works? I’m having problems finding more information.
The problem you have described might not be related to OpenPBS; it might instead be an issue with the system or compute resource itself.
If a node is down or unreachable, the PBS scheduler reschedules the job onto another capable node. If no such node is available, the job remains in the queue until capable resources become available.
If you could provide more context, such as your configuration and logs, that would be helpful.
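As one way to gather that kind of context, here is a rough Python sketch that compares the state PBS reports for each node with a simple ping check. It assumes `pbsnodes -a` prints each node name on an unindented line followed by indented `key = value` lines (including `state = ...`), and that `ping` accepts the Linux-style `-c` and `-W` flags. The function names (`parse_pbsnodes`, `is_reachable`, `suspicious_nodes`, `report`) are made up for illustration and are not part of OpenPBS.

```python
import subprocess


def parse_pbsnodes(text):
    """Map node name -> state string from `pbsnodes -a`-style text.

    Assumed layout: node name at column 0, attributes indented below it.
    """
    states = {}
    current = None
    for line in text.splitlines():
        if not line.strip():
            current = None  # a blank line ends the current node record
            continue
        if not line[0].isspace():
            current = line.strip()  # unindented line: start of a new node
        elif current is not None:
            key, sep, value = line.strip().partition(" = ")
            if sep and key == "state":
                states[current] = value.strip()
    return states


def is_reachable(host, timeout_s=2):
    """Send one ICMP echo with the system ping (Linux flags assumed)."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def suspicious_nodes(states, reachable):
    """Nodes PBS does not flag as down/offline but that fail `reachable`."""
    unusable = {"down", "offline", "state-unknown"}
    bad = []
    for node, state in states.items():
        flags = {part.strip() for part in state.split(",")}
        if not (flags & unusable) and not reachable(node):
            bad.append(node)
    return sorted(bad)


def report():
    """Run on a host where the pbsnodes command is available."""
    out = subprocess.run(
        ["pbsnodes", "-a"], capture_output=True, text=True, check=True
    ).stdout
    for node in suspicious_nodes(parse_pbsnodes(out), is_reachable):
        print(f"{node}: PBS state looks usable but ping failed")
```

Calling `report()` from cron or by hand on the server host would list any node that PBS still shows as usable while it no longer answers ping, which is the mismatch described in the question. The output, together with the server and MoM logs from the same time window, would help narrow down where the problem is.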
Please refer to this documentation: