Failover Setup Issues

Hi,

Below are the details of how I set up failover.

In OpenStack I created four instances:
Instance1 - Primary Server (PS)
Instance2 - Secondary Server (SS)
Instance3 - Node1 (i.e. execution host)
Instance4 - NFS Server

Case 1: When NFS is not being used.
I configured the Primary Server, Secondary Server and Node1 as described in Section 9 of the PBS Admin Guide for failover setup. PS, SS and Node1 are able to see each other.
First I submitted some jobs on the PS and they executed successfully. Then I stopped PBS on the PS; within a few seconds I could see that the SS had taken over control, but it was unable to execute the jobs. This suggests that NFS is needed.

Case 2: When NFS is being used
- Created a PBS_HOME directory, which is now hard mounted on PS, SS, Node1 and the NFS Server.
- Stopped PBS on PS, SS and Node1.
- Changed /etc/pbs.conf so that the PBS_HOME variable points to the shared file system path.
- Copied all the files from /var/spool/pbs to /PBS_HOME/.
- /etc/hosts is also in sync across the hosts.

When I start PBS, it fails to start on any of PS, SS and Node1.
Below is the error message:
Mar 29 19:30:55 primaryserver.novalocal pbs_init.d[17147]: pbs_sched startup failed, exit 1 aborting.
Mar 29 19:30:55 primaryserver.novalocal systemd[1]: pbs.service: control process exited, code=exited status=1
Mar 29 19:30:55 primaryserver.novalocal systemd[1]: Failed to start Portable Batch System.
-- Subject: Unit pbs.service has failed
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel

-- Unit pbs.service has failed.

-- The result is failed.
Mar 29 19:30:55 primaryserver.novalocal systemd[1]: Unit pbs.service entered failed state.
Mar 29 19:30:55 primaryserver.novalocal systemd[1]: pbs.service failed.
Mar 29 19:30:55 primaryserver.novalocal polkitd[3253]: Unregistered Authentication Agent for unix-process:17141:1425045 (system bus name :1.132, object path /org/freedesktop/PolicyKit1/AuthenticationAgent, locale en_US.UTF-8) (disconnect)

While troubleshooting I also encountered the following message:
/etc/init.d/pbs start
Starting PBS
PBS comm
/opt/pbs/sbin/pbs_comm ready (pid=12782), Proxy Name:secondaryserver:17001, Threads:4
pbs_sched: Permission denied (13) in chk_file_sec, Security violation "/PBS_HOME/pbs/sched_priv" resolves to "/PBS_HOME"
pbs_sched startup failed, exit 1 aborting.

I made sure that the UID is the same on PS, SS, Node1 and the NFS Server, and that the shared file system directory has root privileges.

Any leads will be helpful

For Failover:
Yes, a shared file system with a file locking mechanism is required for a PBS Pro failover setup.
Important: Make sure the NFS lock services are up and running.
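A quick way to sanity-check locking on the shared mount is to try taking a lock on a scratch file there. The sketch below is only an illustration: the function name `check_nfs_lock` and the probe file name are made up, it assumes the `flock` utility from util-linux is installed, and a successful probe does not prove that every PBS daemon will be able to lock its files.

```shell
#!/bin/sh
# check_nfs_lock DIR
# Try to take a non-blocking lock on a scratch file inside DIR.
# Returns 0 if the lock could be taken, non-zero otherwise.
# Assumption: DIR is the shared PBS_HOME mount (e.g. /PBS_HOME in this thread).
check_nfs_lock() {
    probe="$1/.lock_probe.$$"               # hypothetical scratch file name
    touch "$probe" 2>/dev/null || return 1  # cannot even create a file there
    flock -n "$probe" -c true               # fails if no lock can be obtained
    rc=$?
    rm -f "$probe"
    return $rc
}
```

For example, `check_nfs_lock /PBS_HOME && echo "locking works"` could be run on the primary and the secondary before starting PBS.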

Please follow these steps:

Primary Server:

  • install PBS Server with the PBS_HOME directory on the NFS
  • submit a couple of jobs to make sure the setup works
  • stop the PBS Services

Secondary Server:

  • install PBS server with PBS_HOME on the local disk
  • stop the PBS Services
  • edit /etc/pbs.conf and point PBS_HOME to the PBS_HOME set up on the NFS
  • start the PBS Services
  • make sure the setup works by submitting a couple of jobs
  • stop the PBS Services

Note: When starting the services

  • start the primary pbs server first
  • start the secondary pbs server next

When stopping the services

  • stop the secondary pbs server first
  • stop the primary pbs server next

Caution: Never stop and start PBS Services on the primary and secondary within a short span of time; always leave some time gap between stopping and starting the services (and vice versa).
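The ordering above can be sketched as a small wrapper. Everything here is illustrative: the hostnames are the ones used in this thread, `GAP=30` is an arbitrary placeholder (the advice is only to leave some gap), running the commands over `ssh` is an assumption, and `RUN=echo` makes it a dry run that just prints what would be executed.

```shell
#!/bin/sh
# Placeholders; override from the environment as needed.
PRIMARY=${PRIMARY:-primaryserver}
SECONDARY=${SECONDARY:-secondaryserver}
GAP=${GAP:-30}      # seconds to wait between the two servers (arbitrary value)
RUN=${RUN-echo}     # default: dry run; set RUN= (empty) to really execute

# pbs_failover start|stop
# start: primary first, then secondary. stop: secondary first, then primary.
pbs_failover() {
    case $1 in
        start) first=$PRIMARY;   second=$SECONDARY ;;
        stop)  first=$SECONDARY; second=$PRIMARY ;;
        *) echo "usage: pbs_failover start|stop" >&2; return 2 ;;
    esac
    $RUN ssh "$first" /etc/init.d/pbs "$1" || return 1
    $RUN sleep "$GAP"
    $RUN ssh "$second" /etc/init.d/pbs "$1"
}
```

With the defaults, `pbs_failover stop` only prints the three commands in order, which makes it easy to review the sequence before running it for real.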

It seems the file/folder permissions of PBS_HOME and its subdirectories have been disturbed. Please follow the steps above; it should work without any issues.
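To narrow down a chk_file_sec "Security violation" like the one quoted earlier, it can help to list which directories on the path are writable by others. This is a rough sketch under an assumption: that the check objects to directories that "other" users can write to. The exact rules are internal to PBS and are not spelled out in this thread. The function name is made up, and `stat -c` is the GNU coreutils form.

```shell
#!/bin/sh
# list_world_writable_ancestors PATH
# Walk from PATH up to / and print "MODE DIR" for every directory whose
# "other" permission bits include write.
# Assumption: this only approximates what chk_file_sec complains about.
list_world_writable_ancestors() {
    dir=$1
    while :; do
        mode=$(stat -c '%a' "$dir") || return 1
        other=$((mode % 10))             # last octal digit = "other" bits
        if [ $((other & 2)) -ne 0 ]; then
            printf '%s %s\n' "$mode" "$dir"
        fi
        # stop at the filesystem root (or "." for a relative path)
        case $dir in /|.) break ;; esac
        dir=$(dirname "$dir")
    done
}
```

For example, `list_world_writable_ancestors /PBS_HOME/pbs/sched_priv` would print a line for `/PBS_HOME` if that directory had been set to mode 777.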

Hi Adarsh,

Thanks for the details. I will do the setup from scratch and will let you know in case I encounter any issues.

Thanks,
Rakhen

Hi Adarsh,

Below are the steps that I followed to configure failover:

1. Created an NFS server with the shared directory named PBS_HOME
2. Hard mounted the shared NFS folder on the primary server instance
3. Installed the PBS Pro server package, pointing it to the shared folder.

When I start the PBS Server I face the issue below:
[image showing the issue]

Rakhen,

Before the installation, did you update /etc/pbs.conf so that PBS_HOME points to the NFS location?
Are you sure you removed the remnants of the previous installation?

Note: The location of PBS_HOME is specified in the file /etc/pbs.conf, but defaults to /var/spool/pbs if not specified. The default for PBS_EXEC is /opt/pbs. You can specify a non-default location for PBS_EXEC via the --prefix option to rpm when installing the new PBS.
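The lookup described in the note can be mimicked with a few lines of shell, which is handy for confirming which PBS_HOME a host will actually use. This is a sketch only: the function name `pbs_home_from_conf` is made up, and it simply reads a `PBS_HOME=` line from the file and falls back to the default mentioned above.

```shell
#!/bin/sh
# pbs_home_from_conf [CONF]
# Print the PBS_HOME set in CONF (default /etc/pbs.conf), or /var/spool/pbs
# if the file is missing or has no PBS_HOME line.
pbs_home_from_conf() {
    conf=${1:-/etc/pbs.conf}
    home=
    if [ -r "$conf" ]; then
        # take the last PBS_HOME= assignment, if there are several
        home=$(sed -n 's/^PBS_HOME=//p' "$conf" | tail -n 1)
    fi
    printf '%s\n' "${home:-/var/spool/pbs}"
}
```

Running `pbs_home_from_conf` on the primary and the secondary should print the same shared path once both are configured for failover.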

Thank you

Hi Adarsh,

I did a fresh start of the entire setup. I had exported PBS_HOME so that it points to the NFS location. As far as I could see, there isn't any /etc/pbs.conf present before the server package is installed.

Thanks

Hi Adarsh,

NFS Server Setup:

vi /etc/selinux/config (Changed to disabled)
yum -y install nfs-utils libnfsidmap
systemctl enable rpcbind
systemctl start nfs-server
systemctl start rpc-statd
systemctl start nfs-idmapd
systemctl start nfslock
mkdir /PBS_HOME
chmod 777 /PBS_HOME/
vi /etc/exports Adding the text in () to /etc/exports (/PBS_HOME *(rw,sync,no_root_squash))
exportfs -r
showmount -e localhost
systemctl restart nfs-server

PrimaryServer Setup

  1. Setting up NFS client on Primary Server
    vi /etc/selinux/config (Make sure selinux is disabled)
    yum -y install nfs-utils libnfsidmap
    systemctl enable rpcbind
    systemctl start rpcbind
    systemctl enable nfslock
    systemctl enable nfs-server
    systemctl start nfs-server
    systemctl enable nfs-idmap
    systemctl start nfs-idmap
    systemctl start nfslock
    mkdir /PBS_HOME
    mount -o rw,hard,intr :/PBS_HOME /PBS_HOME ----For Hard mounting
    df -kh
    cd /PBS_HOME/
    ll
    touch test
    ll
    vi /etc/fstab Adding the text in () in /etc/fstab (:/PBS_HOME /PBS_HOME nfs rw,sync,hard,intr 0 0)
    cd
    umount /PBS_HOME/
    mount -av
    df -kh
    reboot

  2. Installing Server Package
    useradd --system -m pbsdata
    wget https://github.com/PBSPro/pbspro/releases/download/v18.1.4/pbspro_1.8.4.centos7.zip
    unzip pbspro_1.8.4.centos7.zip
    cp pbspro_1.8.4.centos7/pbspro-server-18.1.4-0.x86_64.rpm ~
    export PBS_DATA_SERVICE_USER=pbsdata
    export PBS_HOME=/PBS_HOME
    yum install pbspro-server-18.1.4-0.x86_64.rpm
    vi /etc/pbs.conf (Check whether the parameters are set this way
    PBS_EXEC=/opt/pbs
    PBS_HOME=/PBS_HOME
    PBS_SERVER=primaryserver
    PBS_START_SERVER=1
    PBS_START_SCHED=1
    PBS_START_COMM=1
    PBS_START_MOM=0
    PBS_CORE_LIMIT=unlimited
    PBS_SCP=/bin/scp
    )
    vi /PBS_HOME/mom_priv/config (CHANGE_THIS_TO_PBS_PRO_SERVER_HOSTNAME to your respective hostname)
    vi /etc/hosts (Add your hostname)
    /etc/init.d/pbs start

[root@primaryserver ~]# /etc/init.d/pbs start
Starting PBS
PBS Home directory /PBS_HOME needs updating.
Running /opt/pbs/libexec/pbs_habitat to update it.


*** Error initializing the PBS dataservice
Error details:
Creating the PBS Data Service...
Starting PBS Data Service...
pg_ctl: could not start server
Examine the log output.
Failed to start PBS Data Service
Error starting PBS Data Service
rm: cannot remove '/PBS_HOME/datastore': Directory not empty

[root@primaryserver ~]# cd /PBS_HOME/
[root@primaryserver PBS_HOME]# ls -l
total 4
drwxr-xr-x 2 root root 6 Apr 3 14:16 aux
drwx------ 2 root root 6 Apr 3 14:16 checkpoint
drwxr-xr-x 2 root root 6 Apr 3 14:16 comm_logs
drwx------ 2 pbsdata root 42 Apr 3 17:22 datastore
drwxr-xr-x 2 root root 6 Apr 3 14:16 mom_logs
drwxr-x--x 4 root root 45 Apr 3 14:19 mom_priv
-rw-r--r-- 1 root root 19 Apr 3 14:16 pbs_environment
drwxr-xr-x 2 root root 6 Apr 3 14:16 sched_logs
drwxr-x--- 2 root root 86 Apr 3 14:16 sched_priv
drwxr-xr-x 2 root root 6 Apr 3 14:16 server_logs
drwxr-x--- 6 root root 77 Apr 3 14:16 server_priv
drwxrwxrwt 2 root root 6 Apr 3 17:22 spool
drwxrwxrwt 2 root root 6 Apr 3 14:16 undelivered

I even tried chown postgres:root datastore. When I start PBS again, the ownership reverts to pbsdata and the start fails as well. I also removed datastore and started again, but the same issue persists.

Any idea where I am going wrong?

Thanks

Tested your configuration a minute ago and it worked for me:

[root@headnode ~]# cat /etc/exports
/home   *(rw,sync,no_root_squash,no_subtree_check)
/app    *(rw,sync,no_root_squash,subtree_check) 

[root@headnode ~]# cat /etc/redhat-release 
CentOS Linux release 7.6.1810 (Core) 

[root@headnode ~]#ls -ltr /
drwxr-xr-x   17 root root   295 Apr  5 17:39 app



[root@n1 pbspro_1.8.4.centos7]# cat /etc/fstab 

#
# /etc/fstab
# Created by anaconda on Tue Aug 14 10:37:36 2018
#
# Accessible filesystems, by reference, are maintained under '/dev/disk'
# See man pages fstab(5), findfs(8), mount(8) and/or blkid(8) for more info
#
/dev/mapper/centos-root /                       xfs     defaults        0 0
UUID=5986bc7d-5e77-4e9a-90b3-257778d6318f /boot                   xfs     defaults        0 0
/dev/mapper/centos-swap swap                    swap    defaults        0 0
headnode:/app   /app  nfs defaults 0 0 

[root@n1 pbspro_1.8.4.centos7]# ls -ltr /
drwxr-xr-x   17 root root   295 Apr  5 17:39 app


[root@n1 ~]# ls
anaconda-ks.cfg  Desktop  Documents  Downloads  initial-setup-ks.cfg  Music  pbspro_1.8.4.centos7  pbspro_1.8.4.centos7.zip  Pictures  Public  Templates  Videos

[root@n1 ~]# cd pbspro_1.8.4.centos7/

[root@n1 pbspro_1.8.4.centos7]# ls
COPYRIGHT  LICENSE  pbspro-client-18.1.4-0.x86_64.rpm  pbspro-debuginfo-18.1.4-0.x86_64.rpm  pbspro-execution-18.1.4-0.x86_64.rpm  pbspro-server-18.1.4-0.x86_64.rpm  README.md




[root@n1 pbspro_1.8.4.centos7]# cat /etc/pbs.conf
PBS_EXEC=/opt/pbs
PBS_HOME=/app
PBS_SERVER=n1
PBS_START_SERVER=1
PBS_START_SCHED=1
PBS_START_COMM=1
PBS_START_MOM=0
PBS_CORE_LIMIT=unlimited
PBS_SCP=/usr/bin/scp


[root@n1 pbspro_1.8.4.centos7]# yum install pbspro-server-18.1.4-0.x86_64.rpm 
Loaded plugins: fastestmirror, langpacks
Examining pbspro-server-18.1.4-0.x86_64.rpm: pbspro-server-18.1.4-0.x86_64
Marking pbspro-server-18.1.4-0.x86_64.rpm to be installed
Resolving Dependencies
--> Running transaction check
---> Package pbspro-server.x86_64 0:18.1.4-0 will be installed
--> Processing Dependency: perl(Env) for package: pbspro-server-18.1.4-0.x86_64
Loading mirror speeds from cached hostfile
 * base: mirror.mhd.uk.as44574.net
 * epel: epel.mirror.wearetriple.com
 * extras: mirror.econdc.com
 * updates: mirror.econdc.com
--> Processing Dependency: perl(Switch) for package: pbspro-server-18.1.4-0.x86_64
--> Processing Dependency: postgresql-server for package: pbspro-server-18.1.4-0.x86_64
--> Processing Dependency: tcl for package: pbspro-server-18.1.4-0.x86_64
--> Processing Dependency: tk for package: pbspro-server-18.1.4-0.x86_64
--> Processing Dependency: libhwloc.so.5()(64bit) for package: pbspro-server-18.1.4-0.x86_64
--> Processing Dependency: libpq.so.5()(64bit) for package: pbspro-server-18.1.4-0.x86_64
--> Processing Dependency: libtcl8.5.so()(64bit) for package: pbspro-server-18.1.4-0.x86_64
--> Processing Dependency: libtk8.5.so()(64bit) for package: pbspro-server-18.1.4-0.x86_64
--> Running transaction check
---> Package hwloc-libs.x86_64 0:1.11.8-4.el7 will be installed
---> Package perl-Env.noarch 0:1.04-2.el7 will be installed
---> Package perl-Switch.noarch 0:2.16-7.el7 will be installed
---> Package postgresql-libs.x86_64 0:9.2.24-1.el7_5 will be installed
---> Package postgresql-server.x86_64 0:9.2.24-1.el7_5 will be installed
--> Processing Dependency: postgresql(x86-64) = 9.2.24-1.el7_5 for package: postgresql-server-9.2.24-1.el7_5.x86_64
---> Package tcl.x86_64 1:8.5.13-8.el7 will be installed
---> Package tk.x86_64 1:8.5.13-6.el7 will be installed
--> Running transaction check
---> Package postgresql.x86_64 0:9.2.24-1.el7_5 will be installed
--> Finished Dependency Resolution

Dependencies Resolved

=====================================================================================================================================================================================
 Package                                    Arch                            Version                                    Repository                                               Size
=====================================================================================================================================================================================
Installing:
 pbspro-server                              x86_64                          18.1.4-0                                   /pbspro-server-18.1.4-0.x86_64                           18 M
Installing for dependencies:
 hwloc-libs                                 x86_64                          1.11.8-4.el7                               base                                                    1.6 M
 perl-Env                                   noarch                          1.04-2.el7                                 base                                                     16 k
 perl-Switch                                noarch                          2.16-7.el7                                 base                                                     22 k
 postgresql                                 x86_64                          9.2.24-1.el7_5                             base                                                    3.0 M
 postgresql-libs                            x86_64                          9.2.24-1.el7_5                             base                                                    234 k
 postgresql-server                          x86_64                          9.2.24-1.el7_5                             base                                                    3.8 M
 tcl                                        x86_64                          1:8.5.13-8.el7                             base                                                    1.9 M
 tk                                         x86_64                          1:8.5.13-6.el7                             base                                                    1.4 M

Transaction Summary
=====================================================================================================================================================================================
Install  1 Package (+8 Dependent packages)

Total size: 30 M
Total download size: 12 M
Installed size: 61 M
Is this ok [y/d/N]: y
Downloading packages:
(1/8): perl-Env-1.04-2.el7.noarch.rpm                                                                                                                         |  16 kB  00:00:00     
(2/8): perl-Switch-2.16-7.el7.noarch.rpm                                                                                                                      |  22 kB  00:00:00     
(3/8): postgresql-libs-9.2.24-1.el7_5.x86_64.rpm                                                                                                              | 234 kB  00:00:00     
(4/8): postgresql-9.2.24-1.el7_5.x86_64.rpm                                                                                                                   | 3.0 MB  00:00:01     
(5/8): tcl-8.5.13-8.el7.x86_64.rpm                                                                                                                            | 1.9 MB  00:00:01     
(6/8): hwloc-libs-1.11.8-4.el7.x86_64.rpm                                                                                                                     | 1.6 MB  00:00:01     
(7/8): tk-8.5.13-6.el7.x86_64.rpm                                                                                                                             | 1.4 MB  00:00:01     
(8/8): postgresql-server-9.2.24-1.el7_5.x86_64.rpm                                                                                                            | 3.8 MB  00:00:01     
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Total                                                                                                                                                6.0 MB/s |  12 MB  00:00:02     
Running transaction check
Running transaction test
Transaction test succeeded
Running transaction
  Installing : postgresql-libs-9.2.24-1.el7_5.x86_64                                                                                                                             1/9 
  Installing : 1:tcl-8.5.13-8.el7.x86_64                                                                                                                                         2/9 
  Installing : 1:tk-8.5.13-6.el7.x86_64                                                                                                                                          3/9 
  Installing : postgresql-9.2.24-1.el7_5.x86_64                                                                                                                                  4/9 
  Installing : postgresql-server-9.2.24-1.el7_5.x86_64                                                                                                                           5/9 
  Installing : hwloc-libs-1.11.8-4.el7.x86_64                                                                                                                                    6/9 
  Installing : perl-Switch-2.16-7.el7.noarch                                                                                                                                     7/9 
  Installing : perl-Env-1.04-2.el7.noarch                                                                                                                                        8/9 
  Installing : pbspro-server-18.1.4-0.x86_64                                                                                                                                     9/9 
*** PBS Installation Summary
***
*** Postinstall script called as follows:
*** /opt/pbs/libexec/pbs_postinstall server 18.1.4 /opt/pbs /var/spool/pbs postgres 
***
*** Existing configuration file found: /etc/pbs.conf
***
*** Saving /etc/pbs.conf as /etc/pbs.conf.pre.18.1.4
*** Replacing /etc/pbs.conf with /etc/pbs.conf.18.1.4
*** /etc/pbs.conf has been modified.
*** The original contents have been saved to /etc/pbs.conf.pre.18.1.4
***
*** Registering PBS Pro as a service.
Created symlink from /etc/systemd/system/multi-user.target.wants/pbs.service to /usr/lib/systemd/system/pbs.service.
***
*** PBS_HOME is /app
*** Creating new file /app/pbs_environment
*** WARNING: TZ not set in /app/pbs_environment
***
*** The PBS Pro server has been installed in /opt/pbs/sbin.
*** The PBS Pro scheduler has been installed in /opt/pbs/sbin.
***
*** The PBS Pro communication agent has been installed in /opt/pbs/sbin.
***
*** The PBS Pro MOM has been installed in /opt/pbs/sbin.
***
*** The PBS commands have been installed in /opt/pbs/bin.
***
*** End of /opt/pbs/libexec/pbs_postinstall
  Verifying  : perl-Env-1.04-2.el7.noarch                                                                                                                                        1/9 
  Verifying  : 1:tcl-8.5.13-8.el7.x86_64                                                                                                                                         2/9 
  Verifying  : perl-Switch-2.16-7.el7.noarch                                                                                                                                     3/9 
  Verifying  : hwloc-libs-1.11.8-4.el7.x86_64                                                                                                                                    4/9 
  Verifying  : 1:tk-8.5.13-6.el7.x86_64                                                                                                                                          5/9 
  Verifying  : postgresql-server-9.2.24-1.el7_5.x86_64                                                                                                                           6/9 
  Verifying  : postgresql-libs-9.2.24-1.el7_5.x86_64                                                                                                                             7/9 
  Verifying  : postgresql-9.2.24-1.el7_5.x86_64                                                                                                                                  8/9 
  Verifying  : pbspro-server-18.1.4-0.x86_64                                                                                                                                     9/9 

Installed:
  pbspro-server.x86_64 0:18.1.4-0                                                                                                                                                    

Dependency Installed:
  hwloc-libs.x86_64 0:1.11.8-4.el7          perl-Env.noarch 0:1.04-2.el7 perl-Switch.noarch 0:2.16-7.el7 postgresql.x86_64 0:9.2.24-1.el7_5 postgresql-libs.x86_64 0:9.2.24-1.el7_5
  postgresql-server.x86_64 0:9.2.24-1.el7_5 tcl.x86_64 1:8.5.13-8.el7    tk.x86_64 1:8.5.13-6.el7       

Complete!

[root@n1 pbspro_1.8.4.centos7]# /etc/init.d/pbs start
Starting PBS
PBS Home directory /app needs updating.
Running /opt/pbs/libexec/pbs_habitat to update it.
***
*** Setting default queue and resource limits.
***
Connecting to PBS dataservice....connected to PBS dataservice@n1
*** End of /opt/pbs/libexec/pbs_habitat
Home directory /app updated.
/opt/pbs/sbin/pbs_comm ready (pid=40824), Proxy Name:n1.lab.com:17001, Threads:4
PBS comm
Creating usage database for fairshare.
PBS sched
Connecting to PBS dataservice.....connected to PBS dataservice@n1
Licenses valid for 10000000 Floating hosts
PBS server


[root@n1 pbspro_1.8.4.centos7]# qstat -Bf
Server: n1
    server_state = Active
    server_host = n1.lab.com
    scheduling = True
    total_jobs = 0
    state_count = Transit:0 Queued:0 Held:0 Waiting:0 Running:0 Exiting:0 Begun
	:0 
    default_queue = workq
    log_events = 511
    mail_from = adm
    query_other_jobs = True
    resources_default.ncpus = 1
    default_chunk.ncpus = 1
    scheduler_iteration = 600
    FLicenses = 20000000
    resv_enable = True
    node_fail_requeue = 310
    max_array_size = 10000
    pbs_license_min = 0
    pbs_license_max = 2147483647
    pbs_license_linger_time = 31536000
    license_count = Avail_Global:10000000 Avail_Local:10000000 Used:0 High_Use:
	0
    pbs_version = 18.1.4
    eligible_time_enable = False
    max_concurrent_provision = 5
    power_provisioning = False

Apart from /app *(rw,sync,no_root_squash,subtree_check), have you made any changes to the steps that I listed?

Rakhen,

No changes were made to any folders or directories.
The commands above were executed exactly as shown.
/app was already available on n1 (I did not run exportfs or mount).

Alternatively, could you please try installing locally on the system, starting and then stopping the services, and then copying $PBS_HOME (do not move it) to /PBS_HOME while preserving permissions? After that, update PBS_HOME in /etc/pbs.conf and start the services.
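A minimal sketch of the copy step, assuming GNU `cp` and that it is run as root with the PBS services stopped. The function name `copy_pbs_home` is made up; `cp -a` keeps modes and timestamps, and keeps ownership only when run as root.

```shell
#!/bin/sh
# copy_pbs_home SRC DST
# Copy the contents of SRC into DST, preserving permissions, and leave SRC
# in place (copy, do not move).
copy_pbs_home() {
    src=$1
    dst=$2
    [ -d "$src" ] || { echo "no such directory: $src" >&2; return 1; }
    mkdir -p "$dst" || return 1
    # "src/." copies the contents, including dotfiles, into dst
    cp -a "$src"/. "$dst"/
}
```

In this thread's layout the call would be `copy_pbs_home /var/spool/pbs /PBS_HOME`, followed by changing the PBS_HOME line in /etc/pbs.conf to `PBS_HOME=/PBS_HOME` before starting the services.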

Thank you

Hi Adarsh,

I have successfully configured the failover setup. I just added subtree_check to the export options, in addition to the steps listed above.
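For anyone comparing their own /etc/exports against the working one, here is a small helper that checks whether a given export carries a given option. It is only a sketch: the function name `exports_has_option` is made up, and it handles simple one-line entries like the ones shown in this thread, not every syntax that exports files allow.

```shell
#!/bin/sh
# exports_has_option FILE EXPORT_PATH OPTION
# Returns 0 if the entry for EXPORT_PATH in FILE lists OPTION exactly
# (so "no_subtree_check" does not count as "subtree_check").
exports_has_option() {
    awk -v p="$2" -v o="$3" '
        $1 == p {
            for (i = 2; i <= NF; i++) {
                s = $i
                sub(/^[^(]*\(/, "", s)   # drop "client(" prefix
                sub(/\).*$/, "", s)      # drop trailing ")"
                n = split(s, opts, ",")
                for (j = 1; j <= n; j++) if (opts[j] == o) found = 1
            }
        }
        END { exit !found }
    ' "$1"
}
```

For example, `exports_has_option /etc/exports /PBS_HOME subtree_check || echo "subtree_check missing"` could be run on the NFS server; after editing the file, `exportfs -r` re-exports it as in the setup steps earlier.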

Thanks for your support...
