Primary server unsuccessfully takes over again

wakaka · May 23, 2025, 4:08am

I am testing the failover feature. While a job was running, I shut down the primary server (systemctl stop pbs). The secondary server took over after a delay, and I could see that the pbs_sched process had started on it. After the job completed, I restarted the primary server (systemctl start pbs) and then submitted the next job on the primary server, but the job stayed queued, and the following error kept repeating in the sched log:

Shouldn't the normal behavior be for the secondary server to detect that the primary server is active again and then return to the idle state? That is not what happens now.

adarsh · May 23, 2025, 6:51am

Stop the secondary PBS server first and make sure pbs_sched is not running. Then start the primary server, wait until the services have started, and confirm that qstat -Bf and pbsnodes -av work. Then start the secondary PBS server. Please note that these tests need some time between steps; running them in quick succession can cause issues. Also, always start the scheduler only after the PBS server is up and running.
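The order above could be scripted roughly as follows. This is only a sketch under assumptions: the function names, the timeout and polling values, and the use of pgrep to check for pbs_sched are illustrative and not taken from PBS documentation. It also assumes the pbs systemd unit exists on both hosts and that qstat -Bf and pbsnodes -av exit successfully once the server is answering.

```shell
#!/bin/sh
# Sketch of the fail-back order: stop secondary, start primary, verify, start secondary.
# All timeouts and intervals below are placeholder values (assumptions).

# wait_until TIMEOUT INTERVAL CMD...
# Retries CMD until it succeeds; gives up once TIMEOUT seconds have elapsed.
wait_until() {
    timeout=$1
    interval=$2
    shift 2
    elapsed=0
    while ! "$@" >/dev/null 2>&1; do
        [ "$elapsed" -ge "$timeout" ] && return 1
        sleep "$interval"
        elapsed=$((elapsed + interval))
    done
    return 0
}

# Succeeds when no pbs_sched process is running on this host.
no_sched() {
    ! pgrep -x pbs_sched >/dev/null 2>&1
}

# Step 1, run on the secondary host: stop PBS and confirm the scheduler is gone.
failback_stop_secondary() {
    systemctl stop pbs
    wait_until 60 2 no_sched
}

# Step 2, run on the primary host: start PBS and wait until both checks work.
failback_start_primary() {
    systemctl start pbs
    wait_until 300 5 qstat -Bf && wait_until 300 5 pbsnodes -av
}

# Step 3, run on the secondary host again, only after step 2 has succeeded.
failback_start_secondary() {
    systemctl start pbs
}
```

Each function is meant to be called by hand on the host named in its comment, for example failback_stop_secondary on the secondary, then failback_start_primary on the primary. Polling instead of a fixed sleep avoids moving to the next step before the previous one has really finished.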

wakaka · May 23, 2025, 8:52am

This obviously makes the operation more complicated.
I tried it, and the job stayed in the queued state for a very long time before changing to the running state. The delay is far too long.
The expected startup behavior did not take effect.

adarsh · May 23, 2025, 6:30pm

There will be a time delay, about 5 minutes I suppose.
Also, the nodes will be informed to connect to the primary, and that change takes some time as well.

wakaka · May 24, 2025, 2:30am

Is this inevitable?
And why does the comment disappear when the secondary server takes over?

wakaka · May 27, 2025, 12:04pm

netstat -antlp | grep pbs shows that:

That’s very strange.

So the primary server can never take over.
