I am new to PBS Pro. I was able to install and configure a cluster with a single master node and one compute node, and a job ran on the compute node.
How do I configure a failover setup using two PBS Pro servers?
It would be helpful if you could share the steps to configure failover for the PBS Pro server.
You can set it up as you would any service that has LSB scripts (with correct dependencies on the filesystem and IP aliases), but the LSB scripts shipped:
-are not robust (they do not always return the correct codes)
-get confused when PBS_HOME is not mounted
-call PBSPro commands, which is unwise when an unresponsive server is the reason for calling the script.
So you need better ones. I’ll post some shortly.
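To illustrate the first two points, here is a minimal Python sketch (not the scripts mentioned above) of a status check that copes with an unmounted PBS_HOME, calls no PBS commands, and always returns one of the standard LSB status codes. The function name, the PID file argument, and the choice to report "not running" when PBS_HOME is not mounted are my assumptions; check which PID or lock file your installation actually writes.

```python
import os

# Standard LSB "status" exit codes.
LSB_RUNNING = 0        # program is running
LSB_DEAD_PIDFILE = 1   # program is dead but its PID file exists
LSB_NOT_RUNNING = 3    # program is not running
LSB_UNKNOWN = 4        # status cannot be determined


def status_without_pbs_commands(pbs_home, pid_file, ismount=os.path.ismount):
    """Return an LSB status code without invoking any PBS command.

    pbs_home and pid_file are supplied by the caller (hypothetical
    parameters); ismount is injectable so the logic can be tested.
    """
    # Assumption: with PBS_HOME on shared storage, a node that does not
    # have it mounted cannot be the one running the server.
    if not ismount(pbs_home):
        return LSB_NOT_RUNNING
    try:
        with open(pid_file) as f:
            pid = int(f.read().split()[0])
    except FileNotFoundError:
        return LSB_NOT_RUNNING
    except (OSError, ValueError, IndexError):
        return LSB_UNKNOWN
    if pid <= 0:
        return LSB_UNKNOWN
    try:
        os.kill(pid, 0)  # signal 0 only tests whether the process exists
    except ProcessLookupError:
        return LSB_DEAD_PIDFILE
    except PermissionError:
        pass  # process exists but belongs to another user
    return LSB_RUNNING
```

A wrapper script would pass the result to sys.exit() so the cluster manager sees a correct code in every case.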
The server scripts I’ll post do more than check that the processes exist: they use qstat -Bf to ensure the server is also responsive. That also means the monitoring timeout should be long (60 seconds) to avoid triggering spurious failovers on any quasi-hang (DNS lookups, server-side hooks, etc.), and the “start” timeout needs to be VERY long if you have lots of jobs in qstat -x output (recovering jobs can take 10-15 minutes at sites that have millions of jobs, especially if you don’t tune the datastore’s postgresql.conf).
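As a rough sketch of that responsiveness check (again, not the actual scripts), the idea in Python is to run qstat -Bf under a hard timeout and treat a hang, a non-zero exit, or a missing binary as "not responsive". The function name and the injectable cmd parameter are my own; the 60-second default is the monitoring timeout suggested above.

```python
import subprocess


def server_responsive(cmd=("qstat", "-Bf"), timeout=60.0):
    """Return True only if cmd exits with status 0 within timeout seconds.

    cmd defaults to qstat -Bf; it is a parameter so the same logic can be
    exercised with any other command.
    """
    try:
        result = subprocess.run(
            list(cmd),
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,  # the child is killed when this expires
        )
    except subprocess.TimeoutExpired:
        return False  # server hung (or quasi-hung) for too long
    except OSError:
        return False  # command missing or not executable
    return result.returncode == 0
```

Keep the cluster manager's own monitor timeout somewhat above the value passed here, so that the script gets to report the failure itself instead of being killed mid-check.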
You’d also better split off pbs_comm and the scheduler – these are effectively separate services, and pbs_comm is active/active and needs to become a clone set.
Hi @alexis.cousein, not sure whether it will help, but I increased the trust level associated with your login. (Discourse does all sorts of things to keep spam off the site; one of them is restricting what “new users” are allowed to do.) Let me know if it works; if not, I’ll look for other settings to try. Thx!