This scenario is an edge condition, and not a likely one once pbspro is working, but does it reveal a more fundamental logical flaw? If pbs_comm is in use (i.e., a “stock” build/install where pbs_comm is the expected intra-pbs communications service) and pbs_comm is then killed, blocked, or otherwise made unavailable, the server neither reports the loss of comms with the moms, nor changes mom states, nor tracks actual mom state changes.
Expected behavior: some form of state change for moms with which expected comms fail – down-comms-lost, down-unknown, signal-stolen-by-space-aliens, anything but “no change” and certainly not state=free.
Actual behavior: jobs attempt to place and fail (sister node failed to delete, etc.) when in truth no placement occurred at all, no mom logs report any attempt to place, no comms problems are reported in the server-side logs, no errors are reported via qstat/pbsnodes/qmgr/etc., and server-reported mom states do not change. Server-side comms appear to be handed off to pbs_comm “blind”, without follow-up checks to validate the comms attempt.
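Since the server itself reports nothing here, an independent check from outside PBS can at least tell you whether pbs_comm is accepting connections. A minimal sketch, assuming pbs_comm listens on TCP port 17001 (I believe this is the stock default, but check PBS_COMM settings in your pbs.conf); the function name comm_reachable is my own and not part of PBS. A successful TCP connect only shows that something is listening, not that the comm is routing traffic correctly.

```python
import socket


def comm_reachable(host, port=17001, timeout=2.0):
    """Return True if a plain TCP connection to host:port succeeds.

    port=17001 is an assumed default for pbs_comm; adjust for your site.
    """
    try:
        # create_connection resolves the host and applies the timeout
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # refused, timed out, unreachable, or name resolution failed
        return False
```

Run from the server host against the comm host (and from a mom host as well), this gives a yes/no that can be compared with what pbsnodes claims.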
To reproduce:
Start the 14.1.0 server/sched/comm and at least 1 configured mom so that pbsnodes shows the mom state=free. Kill pbs_comm. The mom stays state=free indefinitely.
Simulate a restart-persistent issue with the comm by editing pbs.conf (or the init/service script) to exclude comm startup, by firewalling off the pbs_comm outbound port traffic, etc., then restart the server/sched. The server restarts without complaint and continues to report the mom as state=free.
Simulate mom fallout by killing the mom, dropping network connectivity, whatever. The server continues to report the mom as state=free. Compound this with the previous test – with the mom down, restart the server [such that pbs_comm does not start, or does not connect with the mom]. The server again restarts and continues reporting the mom, now effectively non-existent, as state=free.
Back out whatever measures were taken to prevent pbs_comm functionality and [re]start pbs_comm. The server begins reporting the actual mom state and All Is Well ™.
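For anyone repeating these steps, it helps to record what the server claims after each one. A rough sketch that pulls the per-node state out of `pbsnodes -a` output, assuming the usual layout (node name at the start of a line, indented `key = value` attribute lines below it, blank line between nodes). The helper names are mine, not anything shipped with PBS.

```python
import subprocess


def parse_node_states(pbsnodes_output):
    """Map node name -> state string from `pbsnodes -a` style text."""
    states = {}
    current = None
    for line in pbsnodes_output.splitlines():
        if not line.strip():
            current = None              # blank line ends a node stanza
            continue
        if not line[0].isspace():
            current = line.strip()      # unindented line is a node name
            continue
        key, sep, value = line.strip().partition(" = ")
        if current and sep and key == "state":
            states[current] = value.strip()
    return states


def current_node_states():
    """Run pbsnodes -a and return the states the server currently reports."""
    result = subprocess.run(
        ["pbsnodes", "-a"], capture_output=True, text=True, check=True
    )
    return parse_node_states(result.stdout)
```

Calling current_node_states() after each step (kill pbs_comm, restart the server, kill the mom, restore pbs_comm) gives a simple before/after record; in the failure described above the mom would be expected to show as free throughout until pbs_comm is back.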