My primary and secondary servers both have MOM enabled (PBS_START_MOM=1), and they are always used last as execution hosts in the cluster. It is easy to write a simple stonith script, for example one that just runs a command to power off the primary server host (sketched below), but that is very crude. It does not handle the following situations:
1. The primary server's pbs_server crashes and the secondary server needs to take over, but the primary server's mom is still executing jobs.
2. Due to a network failure, the secondary server cannot ping the primary server, but the primary server is running normally and there are jobs executing in the cluster, so the secondary server does not need to take over.
How should the stonith script be adjusted to handle these cases more sensibly?
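For reference, the rough version I mean is just an unconditional power fence, something like the sketch below (the BMC address, credentials, and the use of ipmitool are placeholders for whatever out-of-band power control is actually available):

```bash
#!/bin/bash
# Rough stonith: unconditionally power off the primary server host via its
# BMC. The BMC address and credentials below are placeholders.
PRIMARY_BMC="primary-bmc.example.com"
IPMI_USER="admin"
IPMI_PASS="secret"

# Hard power-off; this also kills any jobs that the primary's pbs_mom is
# still running, which is exactly the problem described above.
ipmitool -I lanplus -H "$PRIMARY_BMC" -U "$IPMI_USER" -P "$IPMI_PASS" chassis power off
```

Anything this blunt will misfire in both situations above.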
It is not recommended to run pbs_mom (the execution component) on the PBS Server host.
This also applies to both hosts of a PBS Server failover pair.
Some thoughts:
Write a watcher script on the PBS Server host: if the network is down or pbs_server is down, this script kills the pbs_mom process (a sketch is below).
When the secondary server's heartbeat is broken, it thinks the primary is unreachable or down and therefore takes over. This is the way it should work, because it cannot tell whether the other side is actually still working and only its connection to the primary is broken. Again, a script can log in to the compute nodes connected to the primary to find out whether the primary is active and healthy (second sketch below).
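A minimal sketch of the watcher idea, assuming it runs periodically (for example from cron) on the server host; the peer hostname and the use of pkill are placeholders:

```bash
#!/bin/bash
# Watcher on the PBS Server host: if pbs_server is no longer running, or the
# network path to the failover peer is gone, stop the local pbs_mom so that
# this host no longer executes jobs. The hostname below is a placeholder.
PEER="pbs-secondary"

server_ok=yes
pgrep -x pbs_server >/dev/null || server_ok=no

net_ok=yes
ping -c 2 -W 2 "$PEER" >/dev/null 2>&1 || net_ok=no

if [ "$server_ok" = no ] || [ "$net_ok" = no ]; then
    logger "pbs watcher: pbs_server=$server_ok network=$net_ok, stopping pbs_mom"
    pkill -TERM -x pbs_mom
fi
```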
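And a sketch of the second idea, to be run from the secondary before it decides the primary is really dead. It assumes passwordless ssh to the compute nodes listed in NODES and approximates "healthy" as "the node can still open a TCP connection to the primary's pbs_server port" (15001 is the default):

```bash
#!/bin/bash
# Ask a few compute nodes whether they can still reach pbs_server on the
# primary. If any of them can, the primary is probably alive and only our
# own link to it is broken, so we should not fence it or take over.
# Hostnames and the node list are placeholders.
PRIMARY="pbs-primary"
NODES="node01 node02 node03"

for n in $NODES; do
    if ssh -o ConnectTimeout=5 "$n" \
           "timeout 5 bash -c 'cat < /dev/null > /dev/tcp/$PRIMARY/15001'" \
           >/dev/null 2>&1
    then
        echo "$n can still reach $PRIMARY: primary looks alive, do not take over"
        exit 1
    fi
done
echo "no node can reach $PRIMARY: safe to fence and take over"
exit 0
```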
Thank you sincerely for your advice, Mr. adarsh.
By the way, when I shut down the primary server host and the secondary server takes over after the failover delay, the job continues running. But when I use qstat -Bf, I can't see the server_host change; it always prints the primary server host's name.
This is my secondary server’s status:
PBS_HOME is shared by the primary and secondary server. If the svrlive file is updated periodically, it seems that stonith isn't necessary, even if only the connection from the primary to the secondary is broken. If the svrlive file stops updating, it means either the primary server's PBS has a problem or the whole cluster does; in the latter case it is meaningless for the secondary server to take over.
But why is nothing written to my svrlive file while a job is executing? I believe I have set up the shared PBS_HOME correctly.
The share on which PBS_HOME resides must have a file-locking mechanism (a global file lock). Otherwise there will be a split-brain situation: both primary and secondary think they are in control and corrupt the datastore by writing to it at the same time.
svrlive is an empty file; please note that only its timestamp is checked. The timestamp is saved in the secondary server's memory and compared at regular intervals. If it has not changed, the secondary tries to open a connection to the primary, and if that fails, it takes over.
When the communication between primary and secondary has broken, the decision then depends only on the svrlive file.
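For external monitoring you can approximate the same logic roughly like this (only a sketch: the PBS_HOME path, the 10-second interval, and the port probe are assumptions, not the server's actual internals):

```bash
#!/bin/bash
# Watch the mtime of svrlive the way the secondary conceptually does:
# if the timestamp stops changing AND the primary no longer answers on its
# server port, we are in the failover condition. Paths/hosts are placeholders.
SVRLIVE="/var/spool/pbs/server_priv/svrlive"
PRIMARY="pbs-primary"
INTERVAL=10

last=$(stat -c %Y "$SVRLIVE")
while sleep "$INTERVAL"; do
    now=$(stat -c %Y "$SVRLIVE")
    if [ "$now" != "$last" ]; then
        last=$now
        continue        # primary is still touching svrlive
    fi
    # Timestamp frozen: probe the primary's pbs_server port directly.
    if timeout 5 bash -c "cat < /dev/null > /dev/tcp/$PRIMARY/15001" 2>/dev/null; then
        echo "$(date): svrlive is stale but the primary still answers"
    else
        echo "$(date): svrlive is stale and the primary is unreachable -> failover condition"
    fi
done
```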
When you submit a job in the failover setup, its job ID would be, say, 100, and it has to stay 100 even after failover to the secondary has happened. We do not want it to become 101; that would mean two separate servers, and it would also be bad for accounting. It is one PBS complex with a failover setup.
Yes, I understand. I can watch the processes to see which server (pbs_sched) is active, but I have to switch hosts to do that. Is there a more intuitive and convenient way to see the change?
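One way to at least do that process check from a single place is a small loop over both hosts, assuming passwordless ssh (hostnames are placeholders):

```bash
#!/bin/bash
# Print at a glance which host of the failover pair currently has an
# active scheduler (the process checked by hand above).
for h in pbs-primary pbs-secondary; do
    if ssh -o ConnectTimeout=5 "$h" 'pgrep -x pbs_sched >/dev/null' 2>/dev/null; then
        echo "$h: pbs_sched is running -> active server"
    else
        echo "$h: pbs_sched is not running (or host unreachable)"
    fi
done
```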
It is a rare event for the timestamp of the svrlive file to stop changing; for example, when the primary server has a problem with its PBS_HOME mount, or when only the primary server's network connection to the cluster fails. Setting up stonith is also meant to deal with exactly these rare events.