MOM will initiate the dialogue sequence with server

I’ve posted the design here.

Please share your thoughts.

Thanks for posting this, Nithin. A few questions:

  1. It seems like IS_NULL is used by the server to keep track of which moms are up. Can you please explain in the document how that will be achieved if we remove IS_NULL?
  2. Today, when the server sends IS_HELLO, mom replies with its list of jobs and state. Will the same be sent along with IS_HELLOSVR? Will the server send anything back in reply?
  3. How will this work in situations where a server goes down? Will the mom keep trying to connect to the primary? Will it try the secondary?

Thanks, Ravi, for the response.

  1. IS_NULL was used to check whether a mom had come back up after the server marked it as offline. Going forward, we do not need this mechanism, as MOM will initiate a hello exchange anyway.
  2. When a server receives a hello from mom, it sends an ack, which will be followed by the message exchange we have today. I’ve added this missing piece to the doc.
  3. Mom will treat the two servers in its list with equal precedence and will send hello to one of them at random. It can receive a reply only from the active server. Mom will follow the same approach in the future when we have active-active failover. I’ve added this to the doc.
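
To make point 3 concrete, here is a minimal sketch of the selection logic, not the actual mom code. The names `pick_server_order`, `send_hello` and the `try_hello` callback are hypothetical, and retrying the other server when no ack arrives is an assumption on my part rather than something the design states.

```python
import random


def pick_server_order(servers, rng=random):
    """Return the configured servers in random order (equal precedence)."""
    order = list(servers)
    rng.shuffle(order)
    return order


def send_hello(servers, try_hello, rng=random):
    """Send a hello to each server in random order until one acks.

    try_hello(server) is a hypothetical callback that sends IS_HELLOSVR
    and returns True only if that server replies with an ack. Only the
    active server is expected to reply.
    Returns the server that acked, or None if nobody replied.
    """
    for server in pick_server_order(servers, rng):
        if try_hello(server):
            return server
    return None
```

With two configured servers where only the secondary is active, `send_hello(["primary", "secondary"], try_hello)` ends up on the secondary regardless of which one was picked first.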

Let me know what you think…

How will the server know when a mom goes down without IS_NULL?

Pardon my lack of knowledge about the server-mom protocol. Are you saying that mom will send hello messages to the server even after it has already established a connection? So, it sends hello when it comes up, the server replies, and they exchange info. Then, 5 minutes later, the server dies. Does the mom get notified of this? Or will it send another hello after x minutes to the same server, realize that it’s down, and then send the hello to any other server that’s up?

"Server reaching out to every mom it knows does not make much sense if it can share that load between servers. Especially when initial dialogue between server and moms are long and computationally expensive. The proposed design will allow the mom to choose a server and continue the dialogue with it essentially sharing the load between servers."

Again, just trying to understand: if each server knows only about a set of moms, how will server1 run a job on mom2, which it does NOT know about? The scheduler might ask a server to run its jobs on nodes that it doesn’t know about, right?

Both daemons, mom and server, will get notified by the network layer (comm) that the connection is broken. The server will mark the node as down, whereas mom will re-attempt to connect.
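
As a rough illustration of that behaviour (a sketch only; the class names, the `connect` callback and the bounded retry count are all hypothetical and not taken from the PBS code):

```python
class ServerNodeTable:
    """Hypothetical server-side view of node state."""

    def __init__(self):
        self.state = {}

    def on_hello(self, mom):
        # A completed hello exchange marks the node as up.
        self.state[mom] = "up"

    def on_connection_broken(self, mom):
        # comm reported the link as broken: mark the node down.
        self.state[mom] = "down"


class Mom:
    """Hypothetical mom-side reconnect logic."""

    def __init__(self, connect):
        self.connect = connect  # callback returning True on success
        self.connected = False
        self.attempts = 0

    def on_connection_broken(self, max_attempts=3):
        # comm reported the link as broken: re-attempt the connection.
        self.connected = False
        for _ in range(max_attempts):
            self.attempts += 1
            if self.connect():
                self.connected = True
                break
        return self.connected
```

The point is only that the two sides react differently to the same notification: the server updates state and waits, while mom is the one that actively reconnects.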

During the hello exchange, once the server receives the information about the mom, it will get saved to the database. When another server wants to talk to the same mom, it will load that information and do an rpp_open and start communicating.
This PR will make sure that information is written to the database. Server talking to an alien mom is not in the scope of the proposed work.
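
A minimal sketch of the save-then-load idea, using an in-memory SQLite table purely as a stand-in for the PBS datastore. The table name, columns, and helper functions are hypothetical, and the real schema will differ.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open a stand-in datastore with a hypothetical mom_info table."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS mom_info ("
        "name TEXT PRIMARY KEY, host TEXT, port INTEGER)"
    )
    return db


def save_mom(db, name, host, port):
    """Called by the server that completed the hello exchange."""
    db.execute(
        "INSERT OR REPLACE INTO mom_info VALUES (?, ?, ?)",
        (name, host, port),
    )
    db.commit()


def load_mom(db, name):
    """Called by another server that needs to reach the same mom.

    Returns a (host, port) tuple, or None if the mom was never saved.
    """
    return db.execute(
        "SELECT host, port FROM mom_info WHERE name = ?", (name,)
    ).fetchone()
```

In this picture, the second server would call `load_mom` and then open its own stream to the returned address; that second step is the part that is outside this PR.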

Your design mentions sharding moms in a multi-server scenario as one of the motivations, so I think it is important to explain how this will work when moms are sharded. I don’t think it’s out of scope.

Is this how server detects that a mom is down today as well? If yes, then ignore the rest of my comment:
Since TCP doesn’t support detecting dropped connections (Detection of Half-Open (Dropped) Connections), does pbs_comm send keepalive/IS_NULL type messages itself? Or does it just report a connection as down when a message is sent over it? If it’s the latter, then a node getting reported down will be delayed, which might have implications. Since you are proposing to remove the existing mechanism of detecting nodes going down, I think it’s important to explain in your design how this will be handled henceforth.

The server talking to an alien mom will not be a deliverable for this work. But I have already attempted to explain how the sharding of initial dialogue exchange is achieved. The server will save the required information into the database so that another server can make use of it. Let me know if you have any questions.

pbs_comm makes use of TCP keepalive and tcp_user_timeout for detecting dropped connections.
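
For readers unfamiliar with these options, here is a small sketch of how they are typically set on a TCP socket. This is not pbs_comm’s code; the function name and the timing values are assumptions chosen for illustration, and the TCP-level options are applied only where the platform exposes them (they are Linux-specific in places).

```python
import socket


def enable_dead_peer_detection(sock, idle_s=30, interval_s=10,
                               probes=3, user_timeout_ms=60000):
    """Turn on keepalive probing and a send timeout on a TCP socket.

    All timing values here are illustrative assumptions.
    Returns the names of the options that were actually applied.
    """
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    applied = ["SO_KEEPALIVE"]
    tcp_options = (
        ("TCP_KEEPIDLE", idle_s),       # idle seconds before first probe
        ("TCP_KEEPINTVL", interval_s),  # seconds between probes
        ("TCP_KEEPCNT", probes),        # unanswered probes before drop
        ("TCP_USER_TIMEOUT", user_timeout_ms),  # ms unacked data may linger
    )
    for name, value in tcp_options:
        option = getattr(socket, name, None)
        if option is not None:  # skip options this platform lacks
            sock.setsockopt(socket.IPPROTO_TCP, option, value)
            applied.append(name)
    return applied
```

Keepalive covers the idle case (no traffic, peer silently gone), while the user timeout covers the case where data has been sent but is never acknowledged.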

Ok, I guess it can be part of the design that talks about launching multiple servers.

If you are changing the way a node going down is detected, then please provide these details in your design as well. Right now the doc just mentions that the heartbeat protocol will be removed; it doesn’t explain what it will be replaced with.

I see what you are talking about. The rest of the server to mom communication is more tied to this design although it can be delivered using a separate PR. I’ll update the design with the required details.

The use of TCP keepalive and tcp_user_timeout with pbs_comm is already a documented interface. As per my understanding, IS_NULL is not used to determine whether a mom is down.

I’ve updated the design with the details. Please have a look.

Thanks for updating the doc, looks better now!

@subhasisb can you please confirm this?

According to @subhasisb, heartbeat was helpful with UDP. It is not required with TCP connections.

Great, thanks for clarifying, the doc looks fine to me.

@nithinj, a few questions.

The goal is to enable all daemons and clients to reach out to the server for the connection. This work targets in changing the direction of mom connection.

Is there any change in the connection origin on the Scheduler-Server front in the first phase of implementation? I understood that the current goal is to make the connection persistent rather than on-demand, as per the Scheduler design page. Link for EDD. Even the diagram needs an update.

Shall we talk about Mom-Server connection reversal on this page and how we are implementing it? Since there is already a page that explains the correct goal.

The introductory diagram serves the educational purpose of showing how connections are handled now and where we are headed; it does not refer to any of the documents you have mentioned. We are taking the first step by reversing the mom connections, and this discussion is targeted at that, as you have pointed out.

I believe the educational part should reflect concrete decisions that we have made for future releases, or at least for the next immediate release. If we have already decided on Scheduler-Server connection reversal, please share the page that confirms it. If we don’t have any such page, we should fork that discussion out and get feedback from the community before stating any future goals.

The point I am making is that we should propose the change in architecture in advance, rather than declaring it a future goal.

It will be discussed in a separate thread. Quoting from the page - "This work targets in changing the direction of mom ". I have also updated the page to clarify that it is a future vision.

If this work is targeted only for Mom-Server connection, we should talk about that work and its implications on this page.

And thanks for confirming that the “Scheduler-Server” connection reversal is another project and also it will be discussed separately. Then this goal should be shown after the implementation of current work, not the other project’s work.

Thanks for updating the document. Please update the diagram as well, since it may lead to confusion.

Either refer to that page here, remove that part, or update the title to cover both projects. Educational information should be relevant to the work we are doing here.