A few days ago, a PBS 22.05 installation of ours on Rocky 8.8 switched from the primary to the secondary server because the primary VM was having problems. The mechanism worked, but it took a very long time to finish, about 30 minutes. When I tested it manually a few months ago (with about half of the jobs present in the logs or DB), it took the 7-10 minutes that the docs also give as a rule of thumb.
The spool area is a shared NFS mounted with
What can be the source of the slowdown? Should I expect it to worsen over time?
After we solved the problems, the primary became active once again, and we still had to wait about half an hour.
Also, during the failover, some jobs changed prefix, i.e. before the failover a job could be named
while after it is
If I now try to query the second job with qstat, I get “Unknown job”, although it is present in the logs.
Thank you for any support.
It might be some of these factors:
- Job history
- Jobs in the queue
- Performance of the disk hosting $PBS_HOME (SSD, networked drive, or SAN)
- Please note that the job ID will always be .PBS_PRIMARY (even when the secondary is active)
There will be an interval during which client commands are not served. Even once the services have recovered and the nodes know that the secondary is now the active server, there may still be some disruption to client command queries.
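To put a rough number on the disk-performance factor above, one option is to time small synchronous writes in the directory that hosts $PBS_HOME. The following is only a sketch, not a PBS tool: the function name `fsync_latency` and its parameters are made up for illustration, and it measures write-plus-fsync latency in general rather than anything the server itself does.

```python
import os
import tempfile
import time


def fsync_latency(directory, writes=50, size=4096):
    """Return the mean seconds per small write followed by fsync in `directory`.

    Hypothetical helper for illustration; it is not part of PBS.
    """
    payload = b"x" * size
    # Create a scratch file inside the directory under test.
    fd, path = tempfile.mkstemp(dir=directory, prefix="fsync_probe_")
    try:
        start = time.perf_counter()
        for _ in range(writes):
            os.write(fd, payload)
            os.fsync(fd)  # wait until the data is reported as on stable storage
        elapsed = time.perf_counter() - start
    finally:
        # Always clean up the scratch file, even if a write fails.
        os.close(fd)
        os.unlink(path)
    return elapsed / writes
```

Calling it once on a local disk and once on the shared area (for example `fsync_latency(os.environ["PBS_HOME"])`, assuming that variable is set and the directory is writable) gives two figures to compare; a much larger value on the shared area would point at the storage rather than at the number of jobs.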
Thank you for your reply. About point 1, do you suggest disabling job history or reducing log verbosity? If this helps, it would point to problems with point 3.
About point 2, there were about 200 jobs; I’d guess that’s not a big load. The primary and secondary are VMs with 4 cores and 2 GB of RAM.
About query disruption, what do you mean?
Please check pages G-370 and 371 of the PBS Professional 2022.1 Administrator’s Guide.
Thank you for the note about /tmp/.pbsrc.UID, which I did not know or remember. As for the failover time, which the AG also estimates at a few minutes, there is evidently something awry with our config.
I’ll keep you posted when/if there’s a chance to investigate.
To provide a follow-up: it was a trivial problem. The NFS area was mounted with sync; my fault (I even wrote it in the first message). Now the failover is very fast, within a few minutes.
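For anyone who wants to check for the same issue, a small script can flag NFS mounts that carry the sync option. This is a minimal sketch that assumes the input is text in the /etc/fstab or /proc/mounts layout (device, mount point, type, options); the function name and the sample hosts and paths are invented for illustration.

```python
def nfs_mounts_with_sync(mounts_text):
    """Return (mount_point, options) for NFS mounts whose options include 'sync'."""
    flagged = []
    for line in mounts_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and fstab comments
        fields = line.split()
        if len(fields) < 4:
            continue  # not a complete mount entry
        mount_point, fs_type, options = fields[1], fields[2], fields[3]
        # Compare whole options so that 'async' is not mistaken for 'sync'.
        if fs_type.startswith("nfs") and "sync" in options.split(","):
            flagged.append((mount_point, options))
    return flagged


# Hypothetical sample entries; hosts and paths are made up.
SAMPLE = """\
nas:/export/pbs /var/spool/pbs nfs4 rw,sync,hard 0 0
nas:/export/home /home nfs4 rw,async,hard 0 0
/dev/sda1 / xfs rw,sync 0 0
"""

print(nfs_mounts_with_sync(SAMPLE))
```

On a Linux host the same function can be fed the contents of /proc/mounts, e.g. `nfs_mounts_with_sync(open("/proc/mounts").read())`.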