This is to inform the PBSPro community about a new interface for the PBS failover situation.
Despite our best efforts to ensure there is no split-brain scenario, the fact that NFS is used both as a datastore and as a quorum server makes it impossible to completely rule out a split brain (i.e., both primary and secondary decide to become active).
STONITH stands for Shoot The Other Node In The Head. It is an external script that pbs_server calls once it has already decided to become active. An admin can customize this script to call site-specific tools/actions that "shoot" the primary dead, if it is still alive.
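To make this concrete, here is a minimal sketch of what such a site-written script could look like. Everything in it (the hostname, the power_off_primary helper, the IPMI command mentioned in the comment) is a site-specific assumption for illustration, not part of the actual interface:

```shell
#!/bin/sh
# Hypothetical sketch of a stonith script (not shipped with PBS).
# power_off_primary is a stand-in for the site's real fencing action,
# e.g. an ipmitool "chassis power off" against the primary's BMC,
# or a call to a managed PDU.
PRIMARY_HOST="${1:-primary.example.com}"   # hostname is an assumption

power_off_primary() {
    echo "fencing $PRIMARY_HOST"   # replace with the real fencing command
}

if power_off_primary; then
    rc=0    # success: the secondary may proceed to take over
else
    echo "failed to fence $PRIMARY_HOST" >&2
    rc=1    # failure: the caller must not assume the primary is dead
fi
# a real script would end with: exit $rc
```

The exit status is the whole contract: zero means the primary has been fenced, non-zero means the secondary should not assume the primary is dead.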
Thanks Siddharth, I like the name "Shoot The Other Node In The Head".
Could you please explain what happens today if there's a split-brain scenario? Is there an existing automated mechanism through which one of the active servers eventually comes down? If yes, then could you please explain how this is an improvement over it?
Also, will we provide a default stonith script (perhaps something that uses qterm/pbs_terminate())? I like the idea of giving admins the ability to create their own script, but I also feel that it'd be nice if we provided a default.
Hey Ravi, thanks for reviewing this. The name does catch the eye!
I may be wrong here, but I believe there is no mechanism through which one of the servers will shut itself down. This would have to be done through manual intervention of the admin.
As far as supplying a default script goes, it would be nice to have something call qterm remotely, but I imagine there would be too many variables and unknowns for us to verify that our script works as intended. The site admin would have better knowledge of how to handle a failover situation and would almost always have to either make a lot of changes or completely rewrite the script. It might be a better idea to leave it entirely as an option for the admin to configure.
If there's no current mechanism, then this is going to be a very useful enhancement!
I'm not sure I understand the complication with providing a default mechanism, and this is probably because of my lack of knowledge about how the server works. But, from the ref guide, it seems that if one just calls qterm without any arguments, it shuts down the primary server and makes the secondary active. So, wouldn't it be possible, with perhaps some modification to how pbs_terminate works, to just write a Python script which calls pbs_terminate's SWIG-fied call, and have the secondary call it to kill the primary? I imagine that qterm is what admins use today to manually kill the primary when a split-brain scenario happens. Maybe we don't even need a script? Maybe when the secondary becomes active, it just calls pbs_terminate() to ensure that the primary is dead?
In the case where the primary node is not reachable, using qterm does not guarantee that the shutdown worked properly. Also, we want to bring down the node itself, not just the services, because we want to make sure that none of the services on the primary are active. The script written by the admin can cover corner cases where primary services are still running and there is no graceful way to bring them down.
Thanks for explaining, Sid. I still kind of feel that we should come up with a default script, not using qterm, but maybe something that an actual site might use, like Rocks, to bring the server node down, or just give a generic example explaining how one can go about creating the stonith script. But it's up to you; maybe it's not necessary.
A few comments on the EDD:
Maybe name the script "stonith" instead of STONITH? Caps seem a bit odd for a script name.
Why did you choose "unstable" as the change control for interfaces 1 to 7?
Interface 2: could you expand "Executing STONITH script" to also mention that it's going to bring down the primary server at host <hostname>? So, maybe something like "Executing STONITH script to bring down primary server at <hostname>"? It might be useful information for debugging.
Interface 5, could you please mention how many times the secondary server will attempt to take over before it gives up?
Interface 4: I'd suggest adding a prefix to the error message that the script will return, otherwise the messages might appear to be random blobs of text and it may not be immediately obvious that they came from the stonith script. So, maybe something like "Error message returned by stonith:". Or, you could combine this with Interface 3 so that it appears as:
"STONITH script execution failed: <errmsg>"
How about making interface 7 a special case of interface 3? So, if stonith is not found, interface 3 could say "STONITH script execution failed: STONITH script not found <script path>"?
That made me wonder: is there going to be a switch for the admin to say whether they want us to execute stonith or not? Something like "stonith_enable"? Otherwise, a site that doesn't want to use it/isn't aware of it will still see the "stonith script not found" message. What do you think?
Interface 8: "pbs_status_db exit code rc" — you probably forgot to wrap the "rc" in angle brackets, or escape them.
Hey Ravi, I have made a few changes to the design document. Please do have a look. Apart from the changes:
The change control is UNSTABLE because this is an external script, and not something we necessarily define. So it's not something we have direct control over.
The secondary will try to execute the STONITH script until it is able to 1) take over successfully, or 2) get a heartbeat from the primary. It will continue to do this with some wait period between each check.
The "switch" would be the presence of the script. If the script is not present, a sort of warning will be logged saying "STONITH script not found".
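Roughly, the loop I described can be sketched as below. The helper functions (primary_heartbeat, run_stonith, try_takeover) are stand-ins for the server's internal checks, not real PBS functions, and for illustration the heartbeat stays silent and takeover succeeds on the third attempt:

```shell
#!/bin/sh
# Illustrative sketch of the secondary's takeover loop; the helpers
# are assumptions, not PBS internals.
primary_heartbeat() { return 1; }               # pretend the primary is silent
run_stonith() { echo "executing STONITH script"; }
try_takeover() { tries=$((tries + 1)); [ "$tries" -ge 3 ]; }

tries=0
until primary_heartbeat; do
    run_stonith                                 # shoot the primary, if present
    if try_takeover; then
        echo "takeover complete after $tries attempts"
        break
    fi
    sleep 1                                     # wait period between checks
done
```

The loop exits either because the primary's heartbeat came back (stay idle) or because takeover finally succeeded; there is no fixed retry cap.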
While what happens inside the STONITH script is unstable, the interface should be stable. In the same way that hooks are stable, we are creating an interface that allows this script to be called. The name and location of the script are well known and stable as well.
Speaking of hooks, this has the definite feeling of a hook. Is there a reason why this isn't being added as a secondary_startup hook event type? I thought all the new external callouts were going to be done as hooks.
Other comments:
Interface 1: change the perms to 750. It doesn't really matter because the perms on server_priv are 750, but it can't hurt to set them right.
Interface 7: I'd remove this log message. The STONITH script is completely optional for the admin. If they opt out of writing one, why should they have a log message saying the right thing is happening?
I guess changing it to stable does make sense. Have updated it.
The reason I didn't add this as a hook event is that you need a server running (active) on the secondary side, which requires a data service running as well. Since we can't be certain whether the data service on the primary side is up or down when we want to "shoot the other node in the head", we can't start the secondary server to run a "STONITH hook", because that might end up in a split-brain scenario, i.e., two data services active at the same time.
I've made the change for Interface 1.
For Interface 7, maybe we could still have something logged, but toned down a little: "Skipping STONITH"? Or do you think it's not required at all?
I still say it's not required, but the toned-down message looks OK too. The previous message sounded like something was going wrong. The new one is better.