Qstat error: Unknown Job after fallback

I have an installation with two head nodes in a failover (HA) configuration:

  • pbs01
  • pbs02

I’m running:

qstat --version
pbs_version = 22.05.11

on

Distributor ID: Rocky
Description:    Rocky Linux release 8.8 (Green Obsidian)
Release:        8.8
Codename:       GreenObsidian

This is the configuration of the two head nodes:

[root@pbs01]# cat /etc/pbs.conf
PBS_COMM_THREADS=4
PBS_EXEC=/opt/pbs
PBS_HOME=/mnt/pbs_share/pbs_spool
PBS_START_SERVER=1
PBS_START_SCHED=1
PBS_START_COMM=1
PBS_START_MOM=0
PBS_SERVER=pbs01.priv.bar.lan
PBS_SECONDARY=pbs02.priv.bar.lan
PBS_CORE_LIMIT=unlimited
PBS_SCP=/usr/bin/scp
PBS_COMM_LOG_EVENTS=1025

[root@pbs02]#  cat /etc/pbs.conf
PBS_COMM_THREADS=4
PBS_EXEC=/opt/pbs
PBS_HOME=/mnt/pbs_share/pbs_spool
PBS_START_SERVER=1
PBS_START_SCHED=0
PBS_START_COMM=1
PBS_START_MOM=0
PBS_SERVER=pbs01.priv.bar.lan
PBS_PRIMARY=pbs01.priv.bar.lan
PBS_SECONDARY=pbs02.priv.bar.lan
PBS_CORE_LIMIT=unlimited
PBS_SCP=/usr/bin/scp
PBS_COMM_LOG_EVENTS=1025

The primary node is pbs01.
Submitted jobs usually get a job ID made up of a unique number and a server suffix, e.g. job 91964 is named 91964.pbs01.priv.bar.lan

$ qstat -f 91964
Job Id: 91964.pbs01.priv.bar.lan

A few days ago pbs01 failed, and pbs02 took over and started serving requests. All newly submitted jobs got the job ID suffix .pbs02.priv.bar.lan (e.g. 90231.pbs02.priv.bar.lan).
The suffix changed from pbs01.priv.bar.lan to pbs02.priv.bar.lan.

Many jobs were submitted, and in the meantime the issue affecting pbs01 was resolved. The system then switched back from pbs02 to pbs01.

Now we are stuck because we cannot interact with any job whose suffix is “pbs02.priv.bar.lan”.
When we try to qstat / qdel / qalter one of these jobs, we get the following errors:

$ qstat 90231
qstat: Unknown Job Id 90231.pbs01.priv.bar.lan


$ qstat 90231.pbs02
Connection refused

qstat: cannot connect to server pbs01.priv.bar.lan (errno=15010)

$ qstat 90231.pbs02.priv.bar.lan
Connection refused

qstat: cannot connect to server pbs01.bar.bar.lan (errno=15010)

 
$ qstat 90231.pbs01.priv.bar.lan
qstat: Unknown Job Id 90231.pbs01.priv.bar.lan

Do you have any suggestions on how to address this issue?

Thank you in advance for the help.

Have you tried this?

qstat 90231.pbs02.priv.bar.lan@pbs01.priv.bar.lan

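That is, use the form <original job identifier>@<current server>: keep the pbs02 suffix the job was given at submission, and append the server that is active now.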
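If there are many such jobs, the same rewrite can be scripted. Below is a minimal sketch, assuming the job IDs and server name from this thread; job_at_server is a hypothetical helper name, not a PBS command, and it only builds the identifier string without contacting the server.

```shell
#!/bin/sh
# Sketch only: job_at_server is a hypothetical helper, not part of PBS.
# It appends "@<server>" to a job identifier and leaves the original
# ".pbs02..." suffix untouched.
job_at_server() {
    job_id=$1    # identifier as assigned at submission time
    server=$2    # server that is active now
    printf '%s@%s\n' "$job_id" "$server"
}

# Values taken from this thread; replace them with your own.
CURRENT_SERVER=pbs01.priv.bar.lan
for id in 90231.pbs02.priv.bar.lan; do
    job_at_server "$id" "$CURRENT_SERVER"
done
```

The printed identifier can then be passed to qstat, for example qstat "$(job_at_server 90231.pbs02.priv.bar.lan pbs01.priv.bar.lan)". As far as I know qdel and qalter take the same job identifier form, but please check that on your installation before running them in bulk.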


I want to express my gratitude for your help; you solved a lot of problems for me.

I apologize for having wasted your time, but I wasn’t familiar with that syntax.

Thanks!