PP-682: PBS Comm to register new connection request for already registered node

This is a TPP-related bug that shows up in an IP failover situation.
At some customer sites, failover is configured in such a way that, when the primary machine fails, the cluster manager starts the daemons on the secondary and fails over the IP address along with them.

The daemon (e.g. mom) is started on another host by the cluster manager but has the same IP address as the primary. The IP address itself is failed over, which is the usual way cluster managers perform failover.

Since the primary machine went down abruptly, the mom's TCP connection to pbs_comm has not yet been broken, so when a new connection comes from the restarted mom, pbs_comm keeps rejecting it, saying the IP address is already registered.

The proposed solution fixes the situation as follows:
Currently, when a connection arrives, pbs_comm checks whether the IP address is already (and still) registered and, if so, drops the new connection.
With the fix, instead of dropping the new connection, it will close the older connection so that the new connection can be accepted.
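
To make the behavior change concrete, here is a minimal sketch in Python. It is not the actual pbs_comm code: `Conn`, `Registry`, `register_old` and `register_new` are hypothetical names invented for illustration, and the sketch assumes the new connection has already passed authentication before registration is attempted.

```python
class Conn:
    """Hypothetical stand-in for a TCP connection held by pbs_comm."""

    def __init__(self, label):
        self.label = label
        self.closed = False

    def close(self):
        self.closed = True


class Registry:
    """Toy address -> connection table (not the real TPP data structures)."""

    def __init__(self):
        self.by_addr = {}

    def register_old(self, addr, conn):
        # Previous behavior: an address that is still registered wins,
        # and the newly arrived connection is dropped.
        if addr in self.by_addr:
            conn.close()
            return False
        self.by_addr[addr] = conn
        return True

    def register_new(self, addr, conn):
        # Proposed behavior: close the older connection for this address
        # and accept the new one in its place.
        old = self.by_addr.get(addr)
        if old is not None:
            old.close()
        self.by_addr[addr] = conn
        return True
```

In the failover case, the stale connection from the failed primary is the "older" entry, so under `register_new` the restarted mom on the secondary gets through on its first attempt rather than being rejected until the stale entry goes away.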

Please review the interface change document for the above solution.

Are there security implications to consider here? What if a user were to mimic the traffic from pbs_mom to pbs_comm in an attempt to get pbs_comm to hang up on a functioning mom? Should pbs_comm first validate that the registered connection has been broken before it drops it and accepts the new request?

It seems to me that this might open us up either to a DoS attack based on IP spoofing or to a security breach by MOM impersonation. I think that’s Mike’s concern as well.

IP spoofing alone is not enough. As before, the comm authenticates a connect request before accepting it. Currently we support two methods: reserved port and munge. Besides, just being able to register with the comm does not mean the other PBS daemons will accept messages from that connection: moms and the server accept messages only when they come from an IP address that they know about. The rest are rejected.
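
As a rough illustration of those two layers, here is a small Python sketch. The function names and addresses are made up for the example, and the authentication step is reduced to a boolean; it only models the logic described above, not how reserved-port or munge checks are actually performed.

```python
def comm_accepts(auth_ok):
    # Layer 1: the comm accepts a connect request only if authentication
    # (reserved port or munge) succeeded. Modeled here as a plain flag.
    return auth_ok


def daemon_accepts(sender_ip, known_ips):
    # Layer 2: a mom or the server accepts a message only when it comes
    # from an IP address it knows about; everything else is rejected.
    return sender_ip in known_ips


def message_delivered(auth_ok, sender_ip, known_ips):
    # A message gets through only if both layers agree.
    return comm_accepts(auth_ok) and daemon_accepts(sender_ip, known_ips)


known = {"10.0.0.5", "10.0.0.6"}  # illustrative addresses only
```

With this model, a spoofed but unauthenticated sender fails at the first check, and an authenticated sender from an address the daemons do not know about fails at the second.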

In any case, this is as secure as it was before. If a site uses munge it will be even more secure.

However, in general we do not yet have encrypted communications, so IP spoofing is possible, as before. If the user has root-level access to the machine or the network, then communications can be compromised. That is why, if communications span a local cluster and a cloud (for example), we should use a secure IP tunnel/VPN between the two sites.