PP-35, PP-729: PBS Failover: STONITH

Hello,

This is to inform the PBSPro community about a new interface for the PBS failover situation.

Despite our best efforts to ensure that there is no split-brain scenario, the fact that NFS is used both as a datastore and as a quorum server makes it impossible to completely rule out the possibility of a split brain (i.e., both the primary and secondary decide to become active).

STONITH stands for Shoot The Other Node In The Head. It is an external script that pbs_server calls once it has already decided to become active. An admin can customize this script to call site-specific tools/actions that “shoot” the primary dead, if it is still alive.
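
For illustration, a site’s stonith script might look something like the sketch below. This is only a sketch: the use of ipmitool, the BMC hostname, and the credentials are all placeholders, since the actual fencing action is entirely site-defined.

```python
#!/usr/bin/env python3
# Hypothetical stonith script: power off the primary's node through
# its BMC using ipmitool. The BMC address and credentials below are
# placeholders; a real site would substitute its own fencing tool.
import subprocess
import sys

PRIMARY_BMC = "primary-bmc.example.com"  # placeholder BMC hostname

def main():
    result = subprocess.run(
        ["ipmitool", "-I", "lanplus", "-H", PRIMARY_BMC,
         "-U", "admin", "-P", "secret", "chassis", "power", "off"],
        capture_output=True, text=True)
    if result.returncode != 0:
        # A non-zero exit code tells pbs_server that the fencing
        # attempt failed; the message explains why.
        sys.stderr.write(result.stderr)
        sys.exit(1)

if __name__ == "__main__":
    main()
```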

Here’s the design document:
https://pbspro.atlassian.net/wiki/spaces/PD/pages/63766571/PP-35+PP-729+PBS+Failover+-+STONITH

JIRA ticket:
https://pbspro.atlassian.net/browse/PP-35

Thanks,
Siddharth

Thanks Siddharth, I like the name “Shoot The Other Node In The Head” :slight_smile:

Could you please explain what happens today if there’s a split-brain scenario? Is there an existing automated mechanism through which one of the active servers eventually comes down? If yes, then could you please explain how this is an improvement over it?

Also, will we provide a default stonith script (perhaps something that uses qterm/pbs_terminate())? I like the idea of giving admins the ability to create their own script, but I also feel that it’d be nice if we provided a default.

Hey Ravi, thanks for reviewing this. The name does catch the eye! :smile:

I may be wrong here, but I believe there is no mechanism through which one of the servers will shut itself down. This would have to be done through manual intervention of the admin.

As far as supplying a default script goes, it would be nice to have something call qterm remotely, but I imagine there will be too many variables and unknowns for us to verify that our script works as intended. The site admin would have better knowledge of how to handle a failover situation and would almost always have to either make a lot of changes or completely rewrite the script. It might be a better idea to leave it entirely as an option for the admin to configure.

Do let me know your thoughts on this.

Thanks for clarifying Siddharth.

If there’s no current mechanism then this is going to be a very useful enhancement!

I’m not sure I understand the complication with providing a default mechanism, and this is probably because of my lack of knowledge about how the server works. But, from the ref guide, it seems that if one just calls qterm without any arguments, it shuts down the primary server and makes the secondary active. So, wouldn’t it be possible, with perhaps some modification to how pbs_terminate works, to just write a Python script that calls pbs_terminate’s SWIG-fied binding, which the secondary runs to kill the primary? I imagine that qterm is what admins do today to manually kill the primary when a split-brain scenario happens. Maybe we don’t even need a script? Maybe when the secondary becomes active, it just calls pbs_terminate() to ensure that the primary is dead?
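
Something along these lines is what I’m imagining. This is a rough sketch only: I’m assuming a SWIG-generated pbs_ifl module that mirrors the C IFL signatures (pbs_connect/pbs_terminate/pbs_disconnect), which may not match the actual bindings.

```python
# Rough sketch: the secondary asks the primary's server to shut down
# via the IFL API. The module name and signatures are assumptions.
import pbs_ifl

PRIMARY = "primary.example.com"  # placeholder primary hostname

conn = pbs_ifl.pbs_connect(PRIMARY)
if conn >= 0:
    # SHUT_IMMEDIATE: shut down without waiting for running jobs
    pbs_ifl.pbs_terminate(conn, pbs_ifl.SHUT_IMMEDIATE, None)
    pbs_ifl.pbs_disconnect(conn)
```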

In the case where the primary node is not reachable, using qterm does not guarantee that the shutdown worked properly. Also, we want to bring down the node itself, and not just the services, because we want to make sure that none of the services on the primary are active. The script written by the admin can cover corner cases where primary services are still running and there is no graceful way to bring them down.

Thanks for explaining Sid. I still kind of feel that we should come up with a default script, not using qterm, but maybe something that an actual site might use, like Rocks, to bring the server node down, or just give a generic example explaining how one can go about creating the stonith script. But it’s up to you; maybe it’s not necessary.

A few comments on the EDD:

  • Maybe name the script ‘stonith’ instead of STONITH? caps seem a bit odd for a script name.
  • Why did you choose ‘unstable’ as change control for interfaces 1 to 7?
  • Interface 2, could you expand “Executing STONITH script” to also mention that it’s going to bring down the primary server at host <hostname>? So, maybe something like “Executing STONITH script to bring down primary server at <hostname>”? It might be useful information for debugging.
  • Interface 5, could you please mention how many times the secondary server will attempt to take over before it gives up?
  • Interface 4: I’d suggest adding a prefix to the error message that the script returns; otherwise it might appear to be a random blob of text, and it may not be immediately obvious that it came from the stonith script. So, maybe something like “Error message returned by stonith:”. Or, you could combine this with Interface 3 so that it appears as:
    “STONITH script execution failed: <errmsg>”
  • How about making interface 7 a special case of interface 3? So, if stonith is not found, interface 3 could say “STONITH script execution failed: STONITH script not found <script path>”?
  • That made me wonder: is there going to be a switch for the admin to say whether they want us to execute stonith or not? Something like “stonith_enable”? Otherwise, a site that doesn’t want to use it/isn’t aware of it will still see the “stonith script not found” message. What do you think?
  • Interface 8: “pbs_status_db exit code rc”, you probably forgot to wrap the ‘rc’ in angle brackets, or escape them.

Hey Ravi, I have made a few changes to the design document. Please do have a look. Apart from the changes:

  • The change control is UNSTABLE because this is an external script, and not something we necessarily define. So it’s not something we have direct control over.
  • The secondary will try to execute the STONITH script until it either 1) takes over successfully, or 2) gets a heartbeat from the primary. It will continue to do this with some wait period between each attempt (see the sketch after this list).
  • The “switch” would be the presence of the script. If the script is not present, a sort of warning will be logged saying “STONITH script not found”.
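
To make the loop concrete, here is a rough sketch of the secondary’s behavior. The helper functions, script path, and wait period are placeholders standing in for server internals, not the actual implementation.

```python
import os
import subprocess
import time

STONITH_SCRIPT = "server_priv/stonith"  # well-known script location
RETRY_WAIT = 10                         # illustrative wait, in seconds

def primary_heartbeat_seen() -> bool:
    return False  # placeholder: would check the primary's heartbeat

def attempt_takeover() -> bool:
    return True   # placeholder: would try to become the active server

def secondary_takeover_loop():
    while True:
        if primary_heartbeat_seen():
            return                    # primary is alive; stay inactive
        if os.path.exists(STONITH_SCRIPT):
            subprocess.run([STONITH_SCRIPT])  # shoot the primary
        # if the script is absent, STONITH is simply skipped
        if attempt_takeover():
            return                    # secondary is now active
        time.sleep(RETRY_WAIT)        # wait before the next attempt
```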

While what happens inside the STONITH script is unstable, the interface should be stable. In the same way that hooks are stable, we are creating an interface that allows this script to be called. The name and location of the script are well known and stable as well.

Speaking of hooks, this has the definite feeling of a hook. Is there a reason why this isn’t being added as a secondary_startup hook event type? I thought all the new external callouts were going to be done as hooks.

Other comments
Interface 1: change the perms to 750. It doesn’t really matter because the perms on server_priv are 750, but it can’t hurt to set them right.

Interface 7: I’d remove this log message. The STONITH script is completely optional for the admin. If they opt out from writing one, why should they have a log message saying the right thing is happening?

I guess changing it to stable does make sense. Have updated it.

The reason I didn’t add this as a hook event is that you need a server running (active) on the secondary side, which would require a data service running as well. Since we can’t be certain whether the data service on the primary side is up or down when we want to “shoot the other node in the head”, we can’t start the secondary server to run a ‘STONITH hook’, because that might end up in a split-brain scenario, i.e., two data services active at the same time.

I’ve made the change for Interface 1.

For Interface 7, maybe we could still have something logged, but toned down a little: “Skipping STONITH”? Or do you think it’s not required at all?

I still say it’s not required, but the toned-down message looks OK too. The previous message sounded like something was going wrong. The new one is better.

Bhroam

Yes, I agree that “STONITH script not found” might have thrown people off. Have updated the design document.