I have a PBS installation with two head nodes configured for failover (HA mode):
qstat --version
pbs_version = 22.05.11

Distributor ID: Rocky
Description: Rocky Linux release 8.8 (Green Obsidian)
Release: 8.8
Codename: GreenObsidian

This is the configuration of the two head nodes:
[root@pbs01]# cat /etc/pbs.conf
PBS_COMM_THREADS=4
PBS_EXEC=/opt/pbs
PBS_HOME=/mnt/pbs_share/pbs_spool
PBS_START_SERVER=1
PBS_START_SCHED=1
PBS_START_COMM=1
PBS_START_MOM=0
PBS_SERVER=pbs01.priv.bar.lan
PBS_SECONDARY=pbs02.priv.bar.lan
PBS_CORE_LIMIT=unlimited
PBS_SCP=/usr/bin/scp
PBS_COMM_LOG_EVENTS=1025

[root@pbs02]# cat /etc/pbs.conf
PBS_COMM_THREADS=4
PBS_EXEC=/opt/pbs
PBS_HOME=/mnt/pbs_share/pbs_spool
PBS_START_SERVER=1
PBS_START_SCHED=0
PBS_START_COMM=1
PBS_START_MOM=0
PBS_SERVER=pbs01.priv.bar.lan
PBS_PRIMARY=pbs01.priv.bar.lan
PBS_SECONDARY=pbs02.priv.bar.lan
PBS_CORE_LIMIT=unlimited
PBS_SCP=/usr/bin/scp
PBS_COMM_LOG_EVENTS=1025

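To make the two files easier to compare, here is a minimal Python sketch that diffs them key by key. The helper names `parse_pbs_conf` and `diff_conf` are hypothetical, made up for this illustration and not part of PBS; the two strings are simply the file contents pasted above.

```python
def parse_pbs_conf(text):
    """Parse KEY=VALUE lines of a pbs.conf-style file into a dict."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines, comments, and anything that is not KEY=VALUE.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip()
    return settings


def diff_conf(a, b):
    """Return {key: (value_in_a, value_in_b)} for differing keys; None marks a missing key."""
    keys = sorted(set(a) | set(b))
    return {k: (a.get(k), b.get(k)) for k in keys if a.get(k) != b.get(k)}


# Contents of /etc/pbs.conf on each head node, copied from the post.
PBS01_CONF = """\
PBS_COMM_THREADS=4
PBS_EXEC=/opt/pbs
PBS_HOME=/mnt/pbs_share/pbs_spool
PBS_START_SERVER=1
PBS_START_SCHED=1
PBS_START_COMM=1
PBS_START_MOM=0
PBS_SERVER=pbs01.priv.bar.lan
PBS_SECONDARY=pbs02.priv.bar.lan
PBS_CORE_LIMIT=unlimited
PBS_SCP=/usr/bin/scp
PBS_COMM_LOG_EVENTS=1025
"""

PBS02_CONF = """\
PBS_COMM_THREADS=4
PBS_EXEC=/opt/pbs
PBS_HOME=/mnt/pbs_share/pbs_spool
PBS_START_SERVER=1
PBS_START_SCHED=0
PBS_START_COMM=1
PBS_START_MOM=0
PBS_SERVER=pbs01.priv.bar.lan
PBS_PRIMARY=pbs01.priv.bar.lan
PBS_SECONDARY=pbs02.priv.bar.lan
PBS_CORE_LIMIT=unlimited
PBS_SCP=/usr/bin/scp
PBS_COMM_LOG_EVENTS=1025
"""

differences = diff_conf(parse_pbs_conf(PBS01_CONF), parse_pbs_conf(PBS02_CONF))
for key, (on_pbs01, on_pbs02) in differences.items():
    print(f"{key}: pbs01={on_pbs01} pbs02={on_pbs02}")
```

With the values pasted above, only two keys differ: PBS_START_SCHED (1 on pbs01, 0 on pbs02) and PBS_PRIMARY, which is set on pbs02 but absent from the pbs01 file. I do not know whether either difference is related to the problem described below.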
The master node is pbs01.
Usually, submitted jobs get a job ID made up of a unique number and a suffix, e.g. job 91964 is named 91964.pbs01.priv.bar.lan:
$ qstat -f 91964
Job Id: 91964.pbs01.priv.bar.lan

A few days ago the pbs01 node failed, and pbs02 took over and started serving requests. All newly submitted jobs got the job ID suffix .pbs02.priv.bar.lan, e.g. job 90231 was named 90231.pbs02.priv.bar.lan.
In other words, the suffix changed from pbs01.priv.bar.lan to pbs02.priv.bar.lan.
Many jobs were submitted, and in the meantime the issue affecting pbs01 was resolved. The system then switched back from pbs02 to pbs01.
Now we are stuck because we cannot interact with any job whose suffix is “pbs02.priv.bar.lan”.
When we try to qstat / qdel / qalter one of these jobs, we get the following errors:
$ qstat 90231
qstat: Unknown Job Id 90231.pbs01.priv.bar.lan

$ qstat 90231.pbs02
Connection refused
qstat: cannot connect to server pbs01.priv.bar.lan (errno=15010)

$ qstat 90231.pbs02.priv.bar.lan
Connection refused
qstat: cannot connect to server pbs01.bar.bar.lan (errno=15010)

$ qstat 90231.pbs01.priv.bar.lan
qstat: Unknown Job Id 90231.pbs01.priv.bar.lan

Do you have any suggestions on how to address this issue?
Thank you in advance for the help.