Failover time very long and job IDs lost

tuthmose · September 20, 2023, 6:01am

Hi all,

a few days ago our PBS 22.05 installation on Rocky 8.8 switched from the primary to the secondary server because the primary VM was having problems. The mechanism worked, but it took a very long time to finish: about 30 minutes. When I tested it manually a few months ago (with about half as many jobs present in the logs or DB), it took the 7-10 minutes that the docs also give as a rule of thumb.
The spool area is a shared NFS export mounted with the rw,sync,soft options.
What can be the source of the slowdown? Should I expect it to worsen over time?
After we solved the problems, the primary went active once again, and we still had to wait about half an hour.

Also, during the failover, some jobs changed prefix, i.e. before the failover a job could be named

12345.primary.mydns

while after it is

56789.secondary.mydns

If I now try to query the second job with qstat, I get “Unknown job”, although it is present in the logs.
What happened?

Thank you for any support

adarsh · September 24, 2023, 8:20pm

It might be some of these factors:

1. Job history
2. Jobs in the queue
3. Performance of the disk hosting $PBS_HOME (SSD, networked drive or SAN)

Please note that the job ID will always be .PBS_PRIMARY (even when the secondary is active).

There will be an interval during which client commands are not answered. Even once the services have recovered and the nodes know that the secondary is now the active server, there will still be some disruption to client command queries.
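Going by the note above about job IDs, one way to spot affected jobs in the logs is to compare the server part of each job ID with the PBS_PRIMARY entry in pbs.conf. The following is only a minimal sketch: the helper names are made up for illustration, the host names are the ones from the first post, and it assumes pbs.conf consists of simple KEY=value lines.

```python
def parse_pbs_conf(text):
    """Parse KEY=value lines (pbs.conf style) into a dict, skipping comments."""
    conf = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        conf[key.strip()] = value.strip()
    return conf


def split_job_id(job_id):
    """Split '12345.primary.mydns' into ('12345', 'primary.mydns')."""
    seq, _, server = job_id.partition(".")
    return seq, server


def uses_primary_name(job_id, conf):
    """True if the server part of the job ID equals PBS_PRIMARY."""
    _, server = split_job_id(job_id)
    return server == conf.get("PBS_PRIMARY")


if __name__ == "__main__":
    # Sample content; on a real host this would be read from /etc/pbs.conf.
    sample_conf = parse_pbs_conf(
        "PBS_PRIMARY=primary.mydns\n"
        "PBS_SECONDARY=secondary.mydns\n"
    )
    for jid in ("12345.primary.mydns", "56789.secondary.mydns"):
        print(jid, uses_primary_name(jid, sample_conf))
```

A job ID for which this returns False would be worth checking against the server logs before assuming the job is lost.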

tuthmose · September 25, 2023, 7:21am

Hi Adarsh,

thank you for your reply. About point 1, do you suggest disabling job history or reducing log verbosity? If this helps, it would point to problems with point 3.
About point 2, there were about 200 jobs; I’d guess that’s not a big load. The primary and secondary are VMs with 4 cores and 2 GB of RAM.

About query disruption, what do you mean?

adarsh · September 25, 2023, 8:39am

Please check G-370 and 371 in the PBS Professional 2022.1 Administrator’s Guide.

tuthmose · September 27, 2023, 5:58am

Hi Adarsh,

thank you for the note about /tmp/.pbsrc.UID, which I did not know about or did not remember. As for the failover time, which the AG also estimates at a few minutes, there is evidently something awry with our config.
I’ll keep you posted when/if there’s a chance to investigate.

tuthmose · November 7, 2023, 3:23pm

Hi,

to provide a follow-up: it was a trivial problem. The NFS area was mounted with sync; my fault (I even wrote it in the first message). Now the failover is very fast, within a few minutes.
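For anyone who wants to check for the same issue, here is a rough sketch that looks for the sync option on the mount holding the spool area. It parses text in the format of /proc/mounts on Linux; the function names are made up for illustration, and /var/spool/pbs is only an assumed location for $PBS_HOME, so adjust it to your site.

```python
def mount_options(mounts_text, mount_point):
    """Return the option list for mount_point from /proc/mounts-style text.

    Returns None if the mount point is not listed.
    """
    found = None
    for line in mounts_text.splitlines():
        fields = line.split()
        # Fields: device, mount point, fs type, options, dump, pass
        if len(fields) >= 4 and fields[1] == mount_point:
            found = fields[3].split(",")  # a later entry overrides an earlier one
    return found


def is_sync_mounted(mounts_text, mount_point):
    """True if mount_point is listed and carries the 'sync' option."""
    opts = mount_options(mounts_text, mount_point)
    return opts is not None and "sync" in opts


if __name__ == "__main__":
    # Sample data; on a real host, read the text from /proc/mounts instead.
    sample = (
        "nfsserver:/export/pbs /var/spool/pbs nfs4 rw,sync,relatime,soft 0 0\n"
        "/dev/sda1 / xfs rw,relatime 0 0\n"
    )
    pbs_home = "/var/spool/pbs"  # assumed $PBS_HOME location
    if is_sync_mounted(sample, pbs_home):
        print(pbs_home, "is mounted with sync")
    else:
        print(pbs_home, "is not mounted with sync (or is not a mount point)")
```

If $PBS_HOME is a subdirectory of the mounted file system rather than the mount point itself, pass the mount point of the enclosing file system.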

Thank you
