Communication trouble between pbs_comm, server and MoM when a node’s IP changes

Hi,

Context:
OpenPBS 19.1.3

I have “dynamic” nodes backed by virtual instances. Each time an instance starts, it gets a different IP and a new MoM, but it is still the same node as far as PBS is concerned (identified by a hostname that does not change).

When those dynamic nodes restart, I need to make sure PBS knows about the IP change and resolves the address properly.

From what I have seen, one of the following can happen after restarting my nodes:

  • PBS might try to answer on the wrong IP address (the old one) when the MoM sends its registration request, and I get a “Dest not found” error in the pbs_comm logs.
    This happens, for example, when the “Mom” attribute of a node ends up being set automatically to the old IP address for some reason, instead of keeping the original hostname value. PBS then tries to contact the old IP instead of the new IP that the hostname resolves to.
    As shown below, the MoM registers as a leaf with the new IP 10.0.10.88, and the server (10.0.10.253) attempts to answer to the old address 10.0.10.240:

    Leaf registered address 10.0.10.88:15003
    Pkt from src=10.0.10.253:15001[5], noroute to dest=10.0.10.240:15003, pbs_comm:10.0.10.253:17001: Dest not found at pbs_comm
    
  • If the IP change was detected, PBS still seems to consider the MoM as already registered / known by the server, and it does not send the “hello” back to the MoM. The new MoM then does not communicate as it should despite being “registered”. I can tell because the leaf is registered, but there is no error in the pbs_comm logs and no “Hello” in the MoM logs either.
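
For reference, this is roughly the kind of log parsing I am doing by hand today and would like to avoid. It is only a minimal sketch: the two regular expressions are based solely on the two pbs_comm lines quoted above, the function name `summarize_comm_log` is my own, and I do not know how a leaf disconnect is logged, so a listed address may already be stale.

```python
import re

# Patterns derived only from the two pbs_comm log lines quoted above.
# Other pbs_comm versions may word these messages differently (assumption).
LEAF_RE = re.compile(r"Leaf registered address (\d+\.\d+\.\d+\.\d+):(\d+)")
NOROUTE_RE = re.compile(r"noroute to dest=(\d+\.\d+\.\d+\.\d+):(\d+)")


def summarize_comm_log(lines):
    """Return (registered, unreachable) sets of 'ip:port' strings.

    registered:  addresses seen in "Leaf registered address" lines
    unreachable: destinations seen in "noroute to dest=" lines
    """
    registered = set()
    unreachable = set()
    for line in lines:
        match = LEAF_RE.search(line)
        if match:
            registered.add(f"{match.group(1)}:{match.group(2)}")
            continue
        match = NOROUTE_RE.search(line)
        if match:
            unreachable.add(f"{match.group(1)}:{match.group(2)}")
    return registered, unreachable


# The two log lines from this post, used as sample input.
SAMPLE_LINES = [
    "Leaf registered address 10.0.10.88:15003",
    "Pkt from src=10.0.10.253:15001[5], noroute to dest=10.0.10.240:15003, "
    "pbs_comm:10.0.10.253:17001: Dest not found at pbs_comm",
]

if __name__ == "__main__":
    registered, unreachable = summarize_comm_log(SAMPLE_LINES)
    print("registered leaves:", sorted(registered))
    print("unreachable dests:", sorted(unreachable))
```

An address that appears in the unreachable set but not in the registered set is a candidate for a stale IP, which is the first case described above.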

Questions:

  • Is there a way to find out which MoMs are currently registered with pbs_comm and the server?
    Can I get this information without having to parse the pbs_comm logs myself?
    That would help me identify nodes whose MoM is registered but not communicating, and decide whether I need to restart PBS to force the connections to be reset.

  • Is there a way to “reset” the “Mom” attribute so that my nodes use the hostname properly?
    This is a big issue for me: once the “Mom” attribute is set to the IP instead of the hostname, the node never seems able to connect to a replacement virtual instance, because PBS ignores the hostname and uses that IP address to communicate back with the MoM.

  • Am I missing something? Is there a way to “clear” the MoMs known to the server to reset all this when problems occur?
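
On the “Mom” attribute question, here is a minimal sketch of how I could at least detect affected nodes. It assumes the text comes from `pbsnodes -a` and that each node block starts with an unindented node name followed by indented `key = value` lines; the helper names and the sample text are made up for illustration and are not real output from my cluster.

```python
import ipaddress


def is_ip_literal(value):
    """True if value parses as an IPv4 or IPv6 address rather than a hostname."""
    try:
        ipaddress.ip_address(value)
        return True
    except ValueError:
        return False


def nodes_with_ip_mom(pbsnodes_text):
    """Map node name -> Mom value for nodes whose Mom contains an IP literal.

    Assumes (not verified against every PBS version) that node names are
    unindented and attributes are indented "key = value" lines.
    """
    flagged = {}
    current = None
    for raw in pbsnodes_text.splitlines():
        if not raw.strip():
            continue  # blank line between node blocks
        if not raw[0].isspace():
            current = raw.strip()  # unindented line: start of a node block
            continue
        key, sep, value = raw.strip().partition(" = ")
        if sep and key == "Mom" and current is not None:
            # Mom may hold several comma-separated hosts.
            hosts = [host.strip() for host in value.split(",")]
            if any(is_ip_literal(host) for host in hosts):
                flagged[current] = value
    return flagged


# Hypothetical sample in the assumed layout; names and values are invented.
SAMPLE = """\
node-a
     Mom = node-a.example.internal
     state = free

node-b
     Mom = 10.0.10.240
     state = down
"""

if __name__ == "__main__":
    for node, mom in nodes_with_ip_mom(SAMPLE).items():
        print(f"{node}: Mom is set to an IP ({mom})")
```

In practice the text would come from running `pbsnodes -a` (for example through `subprocess.run`) instead of the sample string. This only detects the problem; it does not fix the attribute.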