Failover time very long and job IDs lost

Hi all,

a few days ago our PBS 22.05 installation on Rocky 8.8 switched from the primary to the secondary server because the primary VM was having problems. The mechanism worked, but it took a very long time to finish: about 30 minutes. When I tested it manually a few months ago (with about half as many jobs present in the logs or database), it took the 7-10 minutes that the docs also give as a rule of thumb.
The spool area is on a shared NFS export mounted with the rw,sync,soft options.
What could be the source of the slowdown? Should I expect it to worsen over time?
After we solved the problems, the primary went active once again, and we again had to wait about half an hour.

Also, during the failover some jobs changed prefix, i.e. before the failover a job could be named

12345.primary.mydns

while after it is

56789.secondary.mydns

If I now try to query the second job with qstat, I get “Unknown job”, although it is present in the logs.
What happened?

Thank you for any support

It might be some of these factors:

  1. Job history
  2. Jobs in the queue
  3. Performance of the disk hosting $PBS_HOME (SSD, networked drive, or SAN)
  4. Please note that the job ID will always end in .PBS_PRIMARY (even when the secondary is active)

There will be an interval during which client commands are not served. Even once the services have recovered and the nodes know that the secondary is now the active server, there will still be some disruption to client command queries.
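To illustrate point 4, here is a minimal Python sketch that reads failover settings in the KEY=VALUE style of /etc/pbs.conf and builds the job ID one would expect to query. The sample values reuse the placeholder host names from the question, and the helper names `parse_pbs_conf` and `expected_job_id` are made up for this example; it assumes the ID suffix matches the PBS_PRIMARY value exactly, which may differ on a given site.

```python
def parse_pbs_conf(text):
    """Parse KEY=VALUE lines (pbs.conf style) into a dict, skipping comments."""
    conf = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        conf[key.strip()] = value.strip()
    return conf


def expected_job_id(seq, conf):
    """Assumption: the job ID is the sequence number plus the PBS_PRIMARY name."""
    return f"{seq}.{conf['PBS_PRIMARY']}"


# Placeholder configuration; on a real host read /etc/pbs.conf instead.
sample = """
PBS_HOME=/var/spool/pbs
PBS_PRIMARY=primary.mydns
PBS_SECONDARY=secondary.mydns
"""

conf = parse_pbs_conf(sample)
print(expected_job_id(12345, conf))
```

The resulting string is what one would pass to qstat, regardless of which server is currently active.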

Hi Adarsh,

thank you for your reply. About point 1, do you suggest disabling job history or reducing log verbosity? If this helps, it would point to problems with point 3.
About point 2, there were about 200 jobs; I’d guess that is not a big load. The primary and secondary are VMs with 4 cores and 2 GB of RAM.

About query disruption, what do you mean?
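One rough way to check point 3 directly is to time small synced writes in the directory in question. The Python sketch below is only a crude probe, not a PBS tool: the function name `time_synced_writes` and the default count and size are arbitrary choices for illustration, and it runs against the system temp directory unless pointed elsewhere.

```python
import os
import tempfile
import time


def time_synced_writes(directory, count=100, size=4096):
    """Return the average seconds per write+fsync of `size` bytes in `directory`."""
    payload = b"x" * size
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        start = time.perf_counter()
        for _ in range(count):
            os.write(fd, payload)
            os.fsync(fd)  # force the data to stable storage before continuing
        elapsed = time.perf_counter() - start
    finally:
        os.close(fd)
        os.unlink(path)  # remove the scratch file
    return elapsed / count


# Replace the temp directory with a path under $PBS_HOME to probe the spool area.
avg = time_synced_writes(tempfile.gettempdir())
print(f"{avg * 1000:.2f} ms per synced write")
```

Comparing the figure for a local disk with the figure for the shared spool area gives a first hint of whether storage latency is the bottleneck.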

Please check this guide: pages G-370 and 371 of the PBS Professional 2022.1 Administrator’s Guide


Hi Adarsh,

thank you for the note about /tmp/.pbsrc.UID, which I did not know about or remember. As for the failover time, which the AG also estimates at a few minutes, there is evidently something awry with our config.
I’ll keep you posted when/if there’s a chance to investigate.

Hi,

to provide a follow-up: it was a trivial problem. The NFS area was mounted with sync; my fault (I even wrote it in the first message). Now the failover is very fast, completing within a few minutes.

Thank you
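For anyone who wants to check for the same issue, here is a minimal Python sketch that scans text in the /proc/mounts format for NFS mounts carrying the sync option. The server name, export path and mount point in the sample are hypothetical, and `nfs_mounts_with_sync` is a helper written for this example only.

```python
def nfs_mounts_with_sync(mounts_text):
    """Return (mount_point, options) for NFS mounts whose options include 'sync'."""
    flagged = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue
        _device, mount_point, fstype, options = fields[:4]
        # Match the option exactly so that 'async' is not flagged by mistake.
        if fstype in ("nfs", "nfs4") and "sync" in options.split(","):
            flagged.append((mount_point, options))
    return flagged


# Hypothetical sample; on a live Linux host use open("/proc/mounts").read().
sample = (
    "filer:/export/pbs /var/spool/pbs nfs4 rw,sync,soft,vers=4.2 0 0\n"
    "/dev/sda1 / xfs rw,relatime 0 0\n"
)

for mount_point, options in nfs_mounts_with_sync(sample):
    print(mount_point, options)
```

Any mount point it prints is worth a second look before the next failover test.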
