I am testing a PBS configuration with failover; PBS version is 22.05.11 on Rocky Linux 8.8. Here’s the situation:
- The primary server is active and the secondary “watches”.
- I stop the primary (either by just stopping the PBS service or by shutting down the node).
- The secondary becomes active; failover is smooth and fast.
- I submit a few test jobs, then I restart the primary.
- The failover back to the primary is once again fast, but when the scheduler comes up it says:
pbs_sched;Svr;pbs_sched;Invalid request (15004) in open_server_conns, Couldn't register the scheduler default with connected server
While the primary is taking over, there is a short interval in which I can see a scheduler PID on both servers; then the one running on the secondary is stopped. This apparently sends the scheduler on the primary into a sort of loop (the above error is repeated over and over) until I manually kill and restart the scheduler process. I also get this other error:
Svr;pbs_sched;Scheduler already connected (15230) in open_server_conns, Couldn't register the scheduler default with connected server.
This may be a problem in a situation in which (i) there is a failover and then (ii) the primary comes back automatically. Or am I missing something, and is this expected behaviour?
As an experiment, I have now modified the pre_start_pbs function in pbs_init.d in the following way (it does the same thing as the manual kill and restart):
    start_pbs
    echo "a comment"
    stop_pbs
    start_pbs
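A narrower variant of the same idea might be to restart the scheduler only when its log actually shows the registration error, rather than unconditionally. The sketch below is untested against PBS itself: the function names, the log file argument, and the restart command are all hypothetical and not part of pbs_init.d. The restart command is passed in by the caller, so nothing here assumes how the scheduler is stopped or started.

```shell
#!/bin/sh
# Sketch only: every name below is hypothetical, not part of pbs_init.d.

# Return 0 if the last lines of the given scheduler log contain the
# registration failure quoted above, 1 otherwise.
# $1 = path to a scheduler log file, $2 = number of trailing lines (default 20)
sched_registration_failed() {
    logfile=$1
    lines=${2:-20}
    [ -r "$logfile" ] || return 1
    # Match on the stable part of the message, after the apostrophe.
    tail -n "$lines" "$logfile" | grep -q "register the scheduler"
}

# Run the supplied restart command only when the failure is detected.
# $1 = path to a scheduler log file, remaining arguments = restart command
restart_sched_if_stuck() {
    logfile=$1
    shift
    if sched_registration_failed "$logfile"; then
        "$@"
    fi
}
```

It would be called as `restart_sched_if_stuck <scheduler log> <restart command>`, where the restart command is whatever I currently run by hand to kill and start the scheduler.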
In conclusion, I have three questions:
- Is it expected for the scheduler on the primary to end up in this loop?
- Could it depend on some synchronization problem in the database?
- Assuming I cannot solve the problem, is there a more sensible workaround?
Thank you in advance for your help.