I am testing the failover feature. While a job was running, I shut down the primary server (systemctl stop pbs). The secondary server took over after a delay, and I could see the pbs_sched process start on the secondary server. After the job completed, I restarted the primary server (systemctl start pbs) and submitted the next job on the primary server, but the job stayed queued, and the following error kept repeating in the sched log:
Shouldn't the normal behavior be for the secondary server to detect the activity of the primary server and then change itself back to the idle state? Right now that does not work.