Primary server take over after failover

tuthmose · November 7, 2023, 4:00pm

Hi all,

I am testing a PBS configuration with failover; PBS version is 22.05.11 on Rocky Linux 8.8. Here’s the situation:

The primary server is active and the secondary “watches”.
I stop the primary (either by stopping only the PBS service or by shutting down the node).
The secondary becomes active; failover is smooth and fast.
I make a few test submissions, then I restart the primary.
The takeover by the primary is once again fast, but when the scheduler comes up it logs: pbs_sched;Svr;pbs_sched;Invalid request (15004) in open_server_conns, Couldn't register the scheduler default with connected server

During the takeover by the primary there is a short interval in which I can see a scheduler PID on both servers; then the scheduler running on the secondary is stopped. This apparently sends the scheduler on the primary into a sort of loop (the above error is repeated over and over) until I kill and restart the scheduler process manually. I also get this other error: Svr;pbs_sched;Scheduler already connected (15230) in open_server_conns, Couldn't register the scheduler default with connected server.
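One way to detect the loop described above is to count how often the registration failure appears in the scheduler log. The following is only a minimal sketch, not part of PBS: the function names count_register_failures and scheduler_is_looping are hypothetical, the default threshold of 5 is an arbitrary assumption, and the log file path has to be supplied by the caller.

```shell
#!/bin/sh
# Hypothetical helper: print how many lines of a log file contain the
# registration failure message quoted in this post.
count_register_failures() {
    logfile=$1
    # grep -c prints the number of matching lines. It exits non-zero when
    # nothing matches, so "|| true" keeps the function safe under "set -e".
    grep -c "Couldn't register the scheduler" "$logfile" 2>/dev/null || true
}

# Hypothetical helper: exit 0 when the failure count reaches the threshold.
scheduler_is_looping() {
    logfile=$1
    threshold=${2:-5}   # assumed default, tune for your site
    count=$(count_register_failures "$logfile")
    # An unreadable or missing file yields an empty count; treat it as 0.
    [ "${count:-0}" -ge "$threshold" ]
}
```

It could be called as scheduler_is_looping /path/to/scheduler/log 3, where the path is whatever file your scheduler writes to (on a default install this is usually a dated file under $PBS_HOME/sched_logs, but that is an assumption to verify locally).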

This may be a problem in a situation in which (i) there is a failover and then (ii) the primary comes back automatically. Or am I missing something, and is this expected behaviour?

As an experiment, I have now modified the pre_start_pbs function in pbs_init.d in the following way (it does the same as the manual kill and restart):

             start_pbs
             echo "a comment"
             stop_pbs
             start_pbs
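If the loop is indeed caused by the primary's scheduler starting while the secondary's scheduler is still shutting down, a gentler alternative to start/stop/start might be to wait for a condition before starting. The following is a minimal sketch, assuming a POSIX shell; wait_until is a hypothetical helper, not a PBS function, and the command that checks the condition is left to the site.

```shell
#!/bin/sh
# Hypothetical helper: run a command repeatedly until it succeeds or a
# timeout (in whole seconds) expires.
# Usage: wait_until TIMEOUT INTERVAL COMMAND [ARGS...]
# Returns 0 as soon as COMMAND succeeds, 1 if the timeout is reached first.
wait_until() {
    timeout=$1
    interval=$2
    shift 2
    elapsed=0
    while ! "$@"; do
        # Give up once the accumulated sleep time reaches the timeout.
        if [ "$elapsed" -ge "$timeout" ]; then
            return 1
        fi
        sleep "$interval"
        elapsed=$((elapsed + interval))
    done
    return 0
}
```

For example, wait_until 30 2 some_check would retry some_check every 2 seconds for up to 30 seconds, where some_check stands for a site-specific test (such as verifying that no scheduler is running on the secondary any more) that you would have to write yourself.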

In conclusion, I have three questions:

Is it expected that the scheduler on the primary ends up in this loop?
Could it depend on some sync problem in the DB?
Assuming I cannot solve the problem, is there a more sensible workaround?

Thank you in advance for your help.
