How to configure a "Failover setup" using two PBS Pro masters?

periasamy21 · February 7, 2018, 7:32am

Hi All,

I am new to PBS Pro. I have installed and configured a cluster with a single master node and a compute node, and a job ran successfully on the compute node.

How do I configure a failover setup using two PBS Pro servers?

It would be helpful if you could share the steps to configure a failover setup of the PBS Pro server.

The failover setup environment is as follows (all hosts are connected through eth0):

VM1: Primary PBS Server, eth0 (10.0.0.1/24)
VM2: Secondary PBS Server, eth0 (10.0.0.2/24)
VM3: NFS and NIS server, eth0 (10.0.0.254/24)
VM4: Execution Host1, eth0 (10.0.0.3/24)
VM5: Execution Host2, eth0 (10.0.0.4/24)

Thank You,
Periasamy

scott · February 7, 2018, 4:38pm

Have you reviewed the PBS Professional Admin Guide v14.2 Section 8 for the details of using PBS Professional’s Fail-Over configuration?

OR, are you asking how to configure PBS Professional in a High Availability environment (e.g., RHCS)?

periasamy21 · February 16, 2018, 6:12am

I am asking how to configure PBS Professional in a High Availability environment (e.g., RHCS).

alexis.cousein · March 14, 2018, 9:33pm

You can set it up as you would any service that has LSB scripts (with correct dependencies on the filesystem and IP aliases), but the LSB scripts shipped with PBS Pro:

- are not robust (they do not always return the correct codes),
- get confused when PBS_HOME is not mounted,
- call PBS Pro commands, which is unwise when an unresponsive server is the reason the script is being called,

so you need better ones. I’ll post some shortly.

The server scripts I’ll post do more than check whether the processes exist: they use qstat -Bf to ensure the server is also responsive. That also means the monitoring timeout should be long (60 seconds) to avoid triggering spurious failovers on any quasi-hang (DNS lookups, server-side hooks, etc.), and the “start” timeout needs to be VERY long if you have lots of jobs in the qstat -x output (recovering jobs can take 10-15 minutes at sites that have millions of jobs, especially if you don’t tune the datastore’s postgresql.conf).
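As a rough illustration of the kind of check described above (not the scripts referred to in this post), here is a minimal Python sketch of a "status" probe that refuses to call any PBS command when the shared filesystem is missing and bounds the wait on qstat -Bf. The function name, the /var/spool/pbs mount point, and the choice to report a timeout as "unknown" are all assumptions made for this example; only qstat -Bf and the 60-second figure come from the thread.

```python
import os
import subprocess

# LSB "status" exit codes used by init scripts.
LSB_RUNNING = 0
LSB_NOT_RUNNING = 3
LSB_UNKNOWN = 4


def pbs_server_status(mount_point="/var/spool/pbs", cmd=("qstat", "-Bf"),
                      timeout=60, require_mount=True):
    """Return an LSB-style status code for the PBS server.

    mount_point is a hypothetical location; it assumes PBS_HOME is
    itself the mount point of the shared filesystem.
    """
    # Without the shared filesystem the server cannot be running here,
    # so answer immediately instead of calling a PBS command.
    if require_mount and not os.path.ismount(mount_point):
        return LSB_NOT_RUNNING
    try:
        result = subprocess.run(cmd, stdout=subprocess.DEVNULL,
                                stderr=subprocess.DEVNULL, timeout=timeout)
    except subprocess.TimeoutExpired:
        # No answer within the monitoring timeout (assumed policy:
        # report "unknown" and let the cluster manager decide).
        return LSB_UNKNOWN
    except OSError:
        # The command could not be executed at all.
        return LSB_UNKNOWN
    return LSB_RUNNING if result.returncode == 0 else LSB_NOT_RUNNING
```

A wrapper script could call `sys.exit(pbs_server_status())` from its "status" action. Whether a timeout should count as "unknown" or as "not running" depends on how the cluster manager treats each code, so that branch is the part most likely to need adjusting for a given site.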

You’d also better split off pbs_comm and the scheduler – these are effectively separate services, and pbs_comm is active/active and needs to become a clone set.

I have a document that describes most of this.

alexis.cousein · March 14, 2018, 9:37pm

Oops – can’t post PDF files and scripts. Lemme see with the admins how to get around this.

billnitzberg · March 20, 2018, 12:44am

Hi @alexis.cousein – not sure whether it will help, but I increased the trust level associated with your login. (Discourse does all sorts of things to keep spam off the site; one of the things it does is restrict what “new users” are allowed to do.) Let me know if it works; if not, I’ll look for other settings to try. Thx!

mkaro · March 20, 2018, 7:12pm

I also added a space in Confluence where files may be uploaded and shared…
https://pbspro.atlassian.net/wiki/spaces/PBSPro/pages/269680659/Attachments
