Primary server takeover after failover

Hi all,

I am testing a PBS configuration with failover; PBS version is 22.05.11 on Rocky Linux 8.8. Here’s the situation:

  1. The primary server is active and the secondary “watches”.
  2. I stop the primary (either by stopping the PBS service or by shutting down the node).
  3. The secondary becomes active; failover is smooth and fast.
  4. I submit a few test jobs, then I restart the primary.
  5. The primary takes over again quickly, but when the scheduler comes up it logs: pbs_sched;Svr;pbs_sched;Invalid request (15004) in open_server_conns, Couldn't register the scheduler default with connected server

While the primary is taking over, there is a short interval in which I can see a scheduler PID on both servers; then the one running on the secondary is stopped. This apparently sends the scheduler on the primary into a sort of loop (the above error is repeated over and over) until I kill and restart the scheduler process manually. I also get this other error: Svr;pbs_sched;Scheduler already connected (15230) in open_server_conns, Couldn't register the scheduler default with connected server.

This may be a problem when (i) there is a failover and then (ii) the primary comes back automatically. Or am I missing something, and this is expected behaviour?

As an experiment, I have modified the pre_start_pbs function in pbs_init.d as follows (it does the same thing as the manual kill and restart):

             start_pbs
             echo "a comment"
             stop_pbs
             start_pbs
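
An alternative to restarting unconditionally might be to restart only when the scheduler log actually shows the registration error. This is only a rough sketch: the function name, the 20-line window, and the log path in the usage comment are my assumptions, not anything provided by PBS.

```shell
# Hypothetical helper (not part of PBS): succeeds (status 0) if the last
# lines of the given scheduler log contain the registration failure.
sched_registration_failing() {
    log_file=$1
    tail_lines=${2:-20}   # assumed window; tune as needed

    # An unreadable or missing log is treated as "no failure detected".
    [ -r "$log_file" ] || return 1

    tail -n "$tail_lines" "$log_file" |
        grep -q "Couldn't register the scheduler"
}

# Possible usage (the log location is an assumption, adapt to your site):
#   if sched_registration_failing "$PBS_HOME/sched_logs/$(date +%Y%m%d)"; then
#       : # kill and start the scheduler here, as done manually
#   fi
```

The check only looks at the tail of the log, so an old error further up would not trigger a restart on its own.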

In conclusion, I have three questions:

  1. Is the scheduler loop on the primary expected?
  2. Could it be caused by a sync problem in the DB?
  3. Assuming I cannot solve the problem, is there a more sensible workaround?

Thank you in advance for your help.