My primary and secondary servers both have MOM enabled (PBS_START_MOM=1), and they are always used last as execution hosts in the cluster. It is easy to write a simple stonith script, for example one that just runs a command to power off the primary server host (sketched below), but that is very crude. It does not handle the following situations:
1. The primary server's pbs_server crashes and the secondary server needs to take over, but the primary server's mom is still executing jobs.
2. Due to a network failure, the secondary server cannot ping the primary server, but the primary server is running normally and there are jobs executing in the cluster, so the secondary server does not need to take over.
How should the stonith script be adjusted to handle these cases more sensibly?
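For reference, the rough version I mean is just an unconditional power fence, something like the sketch below (the BMC address, credentials, and the use of ipmitool are placeholders for whatever out-of-band power control is actually available):

```bash
#!/bin/bash
# Rough stonith: unconditionally power off the primary server host via its
# BMC. The BMC address and credentials below are placeholders.
PRIMARY_BMC="primary-bmc.example.com"
IPMI_USER="admin"
IPMI_PASS="secret"

# Hard power-off; this also kills any jobs that the primary's pbs_mom is
# still running, which is exactly the problem described above.
ipmitool -I lanplus -H "$PRIMARY_BMC" -U "$IPMI_USER" -P "$IPMI_PASS" chassis power off
```

Anything this blunt will misfire in both situations above.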
It is not recommended to run pbs_mom (the execution component) on the PBS Server host.
This also applies to both hosts of a PBS Server failover pair.
Some thoughts:
Write a watcher script on the PBS Server host: if the network is down or pbs_server is down, this script kills the pbs_mom process (a sketch is below).
When the secondary server's heartbeat is broken, it thinks the primary is unreachable or down and therefore takes over. This is the way it should work, because it cannot tell whether the other side is actually still working and only its connection to the primary is broken. Again, a script can log in to the compute nodes connected to the primary to find out whether the primary is active and healthy (second sketch below).
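A minimal sketch of the watcher idea, assuming it runs periodically (for example from cron) on the server host; the peer hostname and the use of pkill are placeholders:

```bash
#!/bin/bash
# Watcher on the PBS Server host: if pbs_server is no longer running, or the
# network path to the failover peer is gone, stop the local pbs_mom so that
# this host no longer executes jobs. The hostname below is a placeholder.
PEER="pbs-secondary"

server_ok=yes
pgrep -x pbs_server >/dev/null || server_ok=no

net_ok=yes
ping -c 2 -W 2 "$PEER" >/dev/null 2>&1 || net_ok=no

if [ "$server_ok" = no ] || [ "$net_ok" = no ]; then
    logger "pbs watcher: pbs_server=$server_ok network=$net_ok, stopping pbs_mom"
    pkill -TERM -x pbs_mom
fi
```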
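And a sketch of the second idea, to be run from the secondary before it decides the primary is really dead. It assumes passwordless ssh to the compute nodes listed in NODES and approximates "healthy" as "the node can still open a TCP connection to the primary's pbs_server port" (15001 is the default):

```bash
#!/bin/bash
# Ask a few compute nodes whether they can still reach pbs_server on the
# primary. If any of them can, the primary is probably alive and only our
# own link to it is broken, so we should not fence it or take over.
# Hostnames and the node list are placeholders.
PRIMARY="pbs-primary"
NODES="node01 node02 node03"

for n in $NODES; do
    if ssh -o ConnectTimeout=5 "$n" \
           "timeout 5 bash -c 'cat < /dev/null > /dev/tcp/$PRIMARY/15001'" \
           >/dev/null 2>&1
    then
        echo "$n can still reach $PRIMARY: primary looks alive, do not take over"
        exit 1
    fi
done
echo "no node can reach $PRIMARY: safe to fence and take over"
exit 0
```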
Thank you sincerely for your advice, Mr. adarsh.
By the way, when I shut down the primary server host and the secondary server takes over after the failover delay, the job continues running. But when I use qstat -Bf, I can't see the server_host change; it always prints the primary server host's name.
This is my secondary server’s status:
PBS_HOME is shared by the primary and secondary server. If the svrlive file is updated periodically, it seems that stonith isn't necessary, even if only the connection from the primary to the secondary is broken. If the svrlive file stops updating, it means either the primary server's PBS has a problem or the whole cluster does; in the latter case it is meaningless for the secondary server to take over.
But why is nothing written to my svrlive file while a job is executing? I believe I have set up the shared PBS_HOME correctly.
The share on which PBS_HOME resides must have a file-locking mechanism (a global file lock). Otherwise there will be a split-brain situation: both primary and secondary think they are in control and corrupt the datastore by writing to it at the same time.
svrlive is an empty file; please note that only its timestamp is checked. The timestamp is saved in the secondary server's memory and compared at regular intervals. If it has not changed, the secondary tries to open a connection to the primary, and if that fails, it takes over.
When the communication between primary and secondary has broken, the decision then depends only on the svrlive file.
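For external monitoring you can approximate the same logic roughly like this (only a sketch: the PBS_HOME path, the 10-second interval, and the port probe are assumptions, not the server's actual internals):

```bash
#!/bin/bash
# Watch the mtime of svrlive the way the secondary conceptually does:
# if the timestamp stops changing AND the primary no longer answers on its
# server port, we are in the failover condition. Paths/hosts are placeholders.
SVRLIVE="/var/spool/pbs/server_priv/svrlive"
PRIMARY="pbs-primary"
INTERVAL=10

last=$(stat -c %Y "$SVRLIVE")
while sleep "$INTERVAL"; do
    now=$(stat -c %Y "$SVRLIVE")
    if [ "$now" != "$last" ]; then
        last=$now
        continue        # primary is still touching svrlive
    fi
    # Timestamp frozen: probe the primary's pbs_server port directly.
    if timeout 5 bash -c "cat < /dev/null > /dev/tcp/$PRIMARY/15001" 2>/dev/null; then
        echo "$(date): svrlive is stale but the primary still answers"
    else
        echo "$(date): svrlive is stale and the primary is unreachable -> failover condition"
    fi
done
```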
When you submit a job in the failover setup, its job ID would be, say, 100, and it has to stay 100 even after failover to the secondary has happened. We do not want it to become 101; that would mean two separate servers, and it would also be bad for accounting. It is one PBS complex with a failover setup.
Yes, I understand. I can watch the processes to see which server (pbs_sched) is active, but I have to switch hosts to do that. Is there a more intuitive and convenient way to see the change?
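One way to at least do that process check from a single place is a small loop over both hosts, assuming passwordless ssh (hostnames are placeholders):

```bash
#!/bin/bash
# Print at a glance which host of the failover pair currently has an
# active scheduler (the process checked by hand above).
for h in pbs-primary pbs-secondary; do
    if ssh -o ConnectTimeout=5 "$h" 'pgrep -x pbs_sched >/dev/null' 2>/dev/null; then
        echo "$h: pbs_sched is running -> active server"
    else
        echo "$h: pbs_sched is not running (or host unreachable)"
    fi
done
```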
It is a rare event for the timestamp of the svrlive file to stop changing; for example, when the primary server has a problem with its PBS_HOME mount, or when only the primary server's network connection to the cluster fails. Setting up stonith is also meant to deal with exactly these rare events.