PP-682: PBS Comm to registered new connection request for already registered node

dilip-krishnan · March 31, 2017, 8:17am

This is a TPP-related bug that shows up in an IP failover situation.
At some customer sites, failover is configured in such a way that, when the primary machine fails, the cluster manager starts the daemons on the secondary and fails over the IP address along with them.

The daemon (e.g. mom) is started on another host by the cluster manager but has the same IP address as on the primary (the IP address itself is failed over, which is the usual way cluster managers perform failover).

Since the primary machine went down abruptly, the mom's TCP connection to pbs_comm is not yet broken, and when a new connection comes from the restarted mom, pbs_comm keeps rejecting it, saying the IP address is already registered.

The proposed solution fixes the situation as follows:
Currently, when a connection arrives, pbs_comm checks whether the IP address is already registered (and still registered) and, if so, drops the new connection.
With the change, instead of dropping the new connection, pbs_comm will close the older connection so that the new connection can be accepted.
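
To illustrate the change, here is a minimal Python sketch of the "replace the older connection" behavior. It is not the pbs_comm implementation: `Conn`, `Registry`, `register`, and `lookup` are hypothetical names, and the address and port are made up for the example.

```python
class Conn:
    """Hypothetical stand-in for a TCP connection held by the comm."""

    def __init__(self, label):
        self.label = label
        self.closed = False

    def close(self):
        self.closed = True


class Registry:
    """Hypothetical table mapping a registered address to its connection."""

    def __init__(self):
        self._by_addr = {}

    def register(self, addr, conn):
        # Old behavior: reject `conn` if `addr` is already registered.
        # Proposed behavior: close the older connection and accept the new one.
        old = self._by_addr.get(addr)
        if old is not None and old is not conn and not old.closed:
            old.close()
        self._by_addr[addr] = conn
        return old  # the replaced connection, or None

    def lookup(self, addr):
        return self._by_addr.get(addr)


# Illustrative address only; not taken from any real configuration.
addr = ("10.0.0.5", 15003)
registry = Registry()
stale = Conn("mom on failed primary")
fresh = Conn("mom restarted on secondary")

registry.register(addr, stale)
replaced = registry.register(addr, fresh)
```

After the second `register` call, the stale connection is closed and the lookup returns the connection from the restarted mom.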

Please review the interface change document for the above solution.

mkaro · March 31, 2017, 4:10pm

Are there security implications to consider here? What if a user were to mimic the traffic from pbs_mom to pbs_comm in an attempt to get pbs_comm to hang up on a functioning mom? Should pbs_comm first validate that the registered connection has been broken before it drops it and accepts the new request?

sgombosi · March 31, 2017, 5:05pm

It seems to me that this might open us up either to a DoS attack based on IP spoofing or to a security breach by MOM impersonation. I think that’s Mike’s concern as well.

subhasisb · April 3, 2017, 8:30am

IP spoofing alone is not enough. As before, the comm authenticates a connect request before accepting it. Currently we support two methods: reserved port and munge. Besides, just being able to register with the comm does not mean the other PBS daemons will accept messages from it: moms and the server accept messages only when they come from an IP address that they know about. The rest are rejected.
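
As a rough sketch of those two layers (not the actual PBS code), the checks could look like the following in Python. The function names and addresses are hypothetical. The reserved-port test assumes the usual Unix convention that only root can bind ports below 1024, and munge authentication is not modeled here.

```python
RESERVED_PORT_LIMIT = 1024  # on Unix-like systems, binding below this needs root


def comm_accepts_registration(src_port):
    """Hypothetical reserved-port check done by the comm before registering."""
    return 0 < src_port < RESERVED_PORT_LIMIT


def daemon_accepts_message(src_ip, known_ips):
    """Hypothetical check done by a mom or the server on incoming messages."""
    return src_ip in known_ips


# Illustrative addresses only.
known = {"10.0.0.5", "10.0.0.6"}
```

A connection from an unprivileged source port fails the first check, and a message from an address outside `known` fails the second, even if the sender managed to register with the comm.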

In any case, this is as secure as it was before. If a site uses munge it will be even more secure.

However, in general we do not yet have encrypted communications, so IP spoofing is possible, as before. If the user has root-level access to the machine or the network, then communications can be compromised. That is why, if we have communications spanning a local cluster and cloud (for example), we should use a secure IP tunnel/VPN between the two sites.
