Primary server unsuccessfully takes over again

I am testing the failover feature. While a job was running, I shut down the primary server (systemctl stop pbs). The secondary server took over after a delay, and I could see the pbs_sched process start on it. After the job completed, I restarted the primary server (systemctl start pbs) and submitted the next job on the primary server, but the job stayed queued, and the following error kept repeating in the sched log:


Shouldn't the normal behavior be for the secondary server to detect that the primary server is active again and then put itself back into the idle state? That is not what happens now.

  • Stop the secondary PBS server first and make sure pbs_sched is not running. Then start the primary server, wait until the services have started, and confirm that qstat -Bf and pbsnodes -av work. Then start the secondary PBS server. Please note that you need to allow some time between these steps; running them in quick succession can cause issues. Also, always start the scheduler only once the PBS server is up and running.
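
The order described above could be scripted roughly as follows. This is only a sketch under several assumptions: the hostnames pbs-primary and pbs-secondary are hypothetical, both hosts are reachable over ssh, the PBS commands are on the PATH of a non-interactive ssh session, and the retry count and pause are arbitrary values to tune. By default it only prints the commands (DRY_RUN=1) so the sequence can be reviewed first.

```shell
#!/usr/bin/env bash
# Sketch of the suggested restart order; not an official PBS procedure.
set -euo pipefail

PRIMARY="${PRIMARY:-pbs-primary}"        # hypothetical hostname
SECONDARY="${SECONDARY:-pbs-secondary}"  # hypothetical hostname
DRY_RUN="${DRY_RUN:-1}"                  # 1 = print only, 0 = execute
TRIES="${TRIES:-30}"                     # arbitrary retry count
PAUSE="${PAUSE:-10}"                     # arbitrary seconds between retries

run() {
  # Print the command, and execute it only when DRY_RUN=0.
  echo "+ $*"
  [ "$DRY_RUN" = "0" ] || return 0
  "$@"
}

wait_ok() {
  # Retry a command until it succeeds or TRIES attempts are used up.
  local attempt
  for attempt in $(seq 1 "$TRIES"); do
    if run "$@"; then return 0; fi
    sleep "$PAUSE"
  done
  echo "gave up waiting for: $*" >&2
  return 1
}

restore_primary() {
  # 1. Stop the secondary and make sure its scheduler is gone.
  run ssh "$SECONDARY" systemctl stop pbs
  if [ "$DRY_RUN" = "0" ] && ssh "$SECONDARY" pgrep -x pbs_sched >/dev/null; then
    echo "pbs_sched is still running on $SECONDARY" >&2
    return 1
  fi
  # 2. Start the primary and wait until it answers queries.
  run ssh "$PRIMARY" systemctl start pbs
  wait_ok ssh "$PRIMARY" qstat -Bf
  wait_ok ssh "$PRIMARY" pbsnodes -av
  # 3. Only then bring the secondary back.
  run ssh "$SECONDARY" systemctl start pbs
}

# Run the sequence only when asked explicitly: ./restore.sh run
if [ "${1:-}" = "run" ]; then restore_primary; fi
```

Running it as "./restore.sh run" prints the five commands in order; setting DRY_RUN=0 executes them. The script name restore.sh is also just a placeholder.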

This procedure is clearly more complicated.
I tried it, and the job stayed in the queued state for a very long time before it changed to the running state. The delay is far too long.
The expected startup behavior did not take effect.



There will be a time delay, about 5 minutes I suppose.
Also, the nodes have to be informed to connect to the primary, and that change takes some time as well.

Is this inevitable?
And why does the comment disappear when the secondary server takes over?

netstat -antlp | grep pbs shows that:


That’s very strange.


So the primary server can never take over.