Below are the details of how I worked on the failover setup.
In OpenStack I created four instances:
Instance1 - Primary Server(PS)
Instance2 - Secondary Server(SS)
Instance3 - Node1(i.e execution host)
Instance4 - NFS Server
Case1: When NFS is not being used.
I configured the Primary Server, Secondary Server and Node1 as described in Section 9 of the PBS Admin Guide for failover setup. PS, SS and Node1 are able to see each other.
First I submitted some jobs on the PS and they executed successfully. Then I stopped PBS on the PS; within a few seconds I could see that the SS had taken over control, but it was unable to execute the jobs. This suggests that we need NFS.
Case2: When NFS is being used
-Created a PBS_HOME directory which is now hard mounted on PS, SS, Node1 and NFS Server.
-Stopped PBS on PS, SS and Node1
-Changed the PBS_HOME variable in /etc/pbs.conf to point to the shared file system path
-Copied all the files from /var/spool/pbs to /PBS_HOME/
-Configuration of /etc/hosts is also in sync
When I start PBS, it fails to start on PS, SS and Node1.
Below is the error message:
Mar 29 19:30:55 primaryserver.novalocal pbs_init.d[17147]: pbs_sched startup failed, exit 1 aborting.
Mar 29 19:30:55 primaryserver.novalocal systemd[1]: pbs.service: control process exited, code=exited status=1
Mar 29 19:30:55 primaryserver.novalocal systemd[1]: Failed to start Portable Batch System.
-- Subject: Unit pbs.service has failed
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
-- Unit pbs.service has failed.
-- The result is failed.
Mar 29 19:30:55 primaryserver.novalocal systemd[1]: Unit pbs.service entered failed state.
Mar 29 19:30:55 primaryserver.novalocal systemd[1]: pbs.service failed.
Mar 29 19:30:55 primaryserver.novalocal polkitd[3253]: Unregistered Authentication Agent for unix-process:17141:1425045 (system bus name :1.132, object path /org/freedesktop/PolicyKit1/AuthenticationAgent, locale en_US.UTF-8) (disconnect)
While troubleshooting I also encountered the following message:
/etc/init.d/pbs start
Starting PBS
PBS comm
/opt/pbs/sbin/pbs_comm ready (pid=12782), Proxy Name:secondaryserver:17001, Threads:4
pbs_sched: Permission denied (13) in chk_file_sec, Security violation ‘/PBS_HOME/pbs/sched_priv’ resolves to ‘/PBS_HOME’
pbs_sched startup failed, exit 1 aborting.
I made sure that the UID is the same on PS, SS, Node1 and the NFS Server, and that the shared file system directory has root privileges.
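To narrow down a chk_file_sec error like the one above, it can help to look at the mode and owner of every component of the reported path, not just the leaf directory. The following is a minimal sketch, not part of PBS: the function name list_path_perms is made up for illustration, and it assumes GNU stat (the -c option) as shipped with CentOS 7.

```shell
# Print "mode owner:group path" for a directory and each of its parents.
# Assumes GNU coreutils stat; list_path_perms is a hypothetical helper name.
list_path_perms() {
    p=$1
    while :; do
        stat -c '%a %U:%G %n' "$p"
        # Stop at the filesystem root (or at "." for relative paths).
        if [ "$p" = "/" ] || [ "$p" = "." ]; then
            break
        fi
        p=$(dirname "$p")
    done
}

# Example with the path from the error message:
# list_path_perms /PBS_HOME/pbs/sched_priv
```

Components that are writable by group or others are worth a closer look. Whether that is what chk_file_sec objects to in this setup is an assumption that still needs to be verified.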
For Failover:
Yes, a shared file system with a file locking mechanism is required for PBS Pro failover setup. Important: Make sure the NFS lock services are up and running.
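A quick way to check that locking actually works on the shared mount is to try taking a lock on a scratch file there. This is only a sketch under assumptions: check_lock is a hypothetical helper, it relies on flock(1) from util-linux and timeout(1) from coreutils, and the 10-second limit is arbitrary.

```shell
# Try to take a non-blocking lock on a scratch file in the given directory.
# check_lock is a hypothetical helper; the 10 s timeout is an arbitrary
# guard in case the lock request hangs instead of failing.
check_lock() {
    dir=$1
    f="$dir/.locktest.$$"
    if timeout 10 flock -n "$f" -c true; then
        echo "lock ok on $dir"
        rc=0
    else
        echo "lock FAILED on $dir"
        rc=1
    fi
    rm -f "$f"
    return "$rc"
}

# Example, run on each host that mounts the share:
# systemctl is-active rpcbind rpc-statd
# check_lock /PBS_HOME
```

A successful lock here is only a sanity check; it does not prove that PBS failover will work.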
Please follow these steps:
Primary Server:
install PBS Server with PBS_HOME directory on the NFS
submit a couple of jobs to make sure the setup works
stop the PBS Services
Secondary Server:
install PBS server with PBS_HOME on the local disk
stop the PBS Services
edit the /etc/pbs.conf and point it to PBS_HOME setup on the NFS
start the PBS Services
make sure the setup works by submitting a couple of jobs
stop the PBS Services
Note: When starting the services
start the primary pbs server first
start the second pbs server next
When stopping the services
stop the secondary pbs server first
stop the primary pbs server next
Caution: Never stop and start the PBS services on the primary and secondary within a short span of time. Always leave some time gap between stopping and starting the services (and vice versa).
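The ordering described in the note could be wrapped in a small script so it is not done by hand each time. This is a sketch under assumptions: the host names are taken from the logs earlier in the thread, the 30-second gap is an arbitrary value rather than a documented requirement, and passwordless ssh as root is assumed. By default it only prints the commands (dry run).

```shell
# Host names, gap and the dry-run switch are assumptions for illustration.
PRIMARY=${PRIMARY:-primaryserver}
SECONDARY=${SECONDARY:-secondaryserver}
GAP=${GAP:-30}     # seconds to wait between the two hosts (arbitrary)
RUN=${RUN-echo}    # default: print commands only; set RUN= to execute

# Start order: primary first, then secondary.
pbs_start_all() {
    $RUN ssh "$PRIMARY" /etc/init.d/pbs start
    $RUN sleep "$GAP"
    $RUN ssh "$SECONDARY" /etc/init.d/pbs start
}

# Stop order: secondary first, then primary.
pbs_stop_all() {
    $RUN ssh "$SECONDARY" /etc/init.d/pbs stop
    $RUN sleep "$GAP"
    $RUN ssh "$PRIMARY" /etc/init.d/pbs stop
}
```

Calling pbs_start_all or pbs_stop_all as-is shows what would be run; exporting an empty RUN before sourcing the script makes it execute the commands.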
It seems the file/folder permissions of PBS_HOME and its subdirectories have been disturbed. Please follow the above steps; it should work without any issues.
Below are the steps that I followed to configure failover:
1. Created an NFS server with the shared directory named PBS_HOME
2. Hard mounted the shared NFS folder on the primary server instance
3. Installed the PBS Pro server package, pointing it to the shared folder.
When I start the PBS server I am facing the issue below.
Before the installation, did you update /etc/pbs.conf so that PBS_HOME points to the NFS location?
Are you sure you removed the remnants of the previous installation?
Note: The location of PBS_HOME is specified in the file /etc/pbs.conf, but defaults to /var/spool/pbs if not specified. The default for PBS_EXEC is /opt/pbs. You can specify a non-default location for PBS_EXEC via the --prefix option to rpm when installing the new PBS.
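To see which PBS_HOME a host would actually use, the lookup described in the note can be mimicked with a few lines of shell. This is an illustrative sketch only: pbs_home_from_conf is a made-up helper, and it simply reads the last PBS_HOME= line from the file and falls back to /var/spool/pbs when there is none.

```shell
# Print the PBS_HOME value from a pbs.conf-style file, or the default
# /var/spool/pbs if the file is missing or has no PBS_HOME line.
# pbs_home_from_conf is a hypothetical helper, not a PBS command.
pbs_home_from_conf() {
    conf=${1:-/etc/pbs.conf}
    home=
    if [ -r "$conf" ]; then
        home=$(sed -n 's/^PBS_HOME=//p' "$conf" | tail -n 1)
    fi
    echo "${home:-/var/spool/pbs}"
}

# Example:
# pbs_home_from_conf            # reads /etc/pbs.conf
# pbs_home_from_conf ./pbs.conf # reads a copy of the file
```

Running it on the primary and the secondary before starting the services is a cheap way to confirm that both point to the same shared location.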
I did a fresh start of the entire setup. I exported PBS_HOME so that it points to the NFS location because, as far as I could see, there isn't any /etc/pbs.conf present before the server package is installed.
vi /etc/selinux/config (Changed to disabled)
yum -y install nfs-utils libnfsidmap
systemctl enable rpcbind
systemctl start nfs-server
systemctl start rpc-statd
systemctl start nfs-idmapd
systemctl start nfslock
mkdir /PBS_HOME
chmod 777 /PBS_HOME/
vi /etc/exports Adding the text in () to /etc/exports (/PBS_HOME *(rw,sync,no_root_squash))
exportfs -r
showmount -e localhost
systemctl restart nfs-server
PrimaryServer Setup
Setting up NFS client on Primary Server
vi /etc/selinux/config (Make sure selinux is disabled)
yum -y install nfs-utils libnfsidmap
systemctl enable rpcbind
systemctl start rpcbind
systemctl enable nfslock
systemctl enable nfs-server
systemctl start nfs-server
systemctl enable nfs-idmap
systemctl start nfs-idmap
systemctl start nfslock
mkdir /PBS_HOME
mount -o rw,hard,intr :/PBS_HOME /PBS_HOME ----For Hard mounting
df -kh
cd /PBS_HOME/
ll
touch test
ll
vi /etc/fstab Adding the text in () in /etc/fstab (:/PBS_HOME /PBS_HOME nfs rw,sync,hard,intr 0 0)
cd
umount /PBS_HOME/
mount -av
df -kh
reboot
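After the reboot it may be worth confirming that /PBS_HOME really came back as a hard NFS mount rather than an empty local directory. The sketch below assumes findmnt from util-linux is available; has_opt and check_mount are hypothetical helper names.

```shell
# Return 0 if option $2 appears in the comma-separated option list $1.
has_opt() {
    case ",$1," in
        *",$2,"*) return 0 ;;
        *) return 1 ;;
    esac
}

# Report whether the file system holding $1 is NFS and hard mounted.
check_mount() {
    fstype=$(findmnt -n -o FSTYPE --target "$1") || return 1
    opts=$(findmnt -n -o OPTIONS --target "$1") || return 1
    case $fstype in
        nfs*) ;;
        *) echo "$1 is on $fstype, not NFS"; return 1 ;;
    esac
    if ! has_opt "$opts" hard; then
        echo "$1 is not hard mounted"
        return 1
    fi
    echo "$1: $fstype, hard mounted"
}

# Example:
# check_mount /PBS_HOME
```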
Installing Server Package
useradd --system -m pbsdata
wget https://github.com/PBSPro/pbspro/releases/download/v18.1.4/pbspro_18.1.4.centos7.zip
unzip pbspro_18.1.4.centos7.zip
cp pbspro_18.1.4.centos7/pbspro-server-18.1.4-0.x86_64.rpm ~
export PBS_DATA_SERVICE_USER=pbsdata
export PBS_HOME=/PBS_HOME
yum install pbspro-server-18.1.4-0.x86_64.rpm
vi /etc/pbs.conf (Check that the parameters are set this way:
PBS_EXEC=/opt/pbs
PBS_HOME=/PBS_HOME
PBS_SERVER=primaryserver
PBS_START_SERVER=1
PBS_START_SCHED=1
PBS_START_COMM=1
PBS_START_MOM=0
PBS_CORE_LIMIT=unlimited
PBS_SCP=/bin/scp
)
vi /PBS_HOME/mom_priv/config (CHANGE_THIS_TO_PBS_PRO_SERVER_HOSTNAME to your respective hostname)
vi /etc/hosts (Add your hostname)
/etc/init.d/pbs start
[root@primaryserver ~]# /etc/init.d/pbs start
Starting PBS
PBS Home directory /PBS_HOME needs updating.
Running /opt/pbs/libexec/pbs_habitat to update it.
*** Error initializing the PBS dataservice
Error details:
Creating the PBS Data Service...
Starting PBS Data Service...
pg_ctl: could not start server
Examine the log output.
Failed to start PBS Data Service
Error starting PBS Data Service
rm: cannot remove ‘/PBS_HOME/datastore’: Directory not empty
I even tried chown postgres:root datastore. When I start PBS again, the ownership reverts to pbsdata and startup fails as well. I also removed datastore and started again, but the same issue persists.
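Since the data service is PostgreSQL, one thing that can be checked independently of PBS is whether the datastore directory has the owner and mode PostgreSQL expects: older PostgreSQL releases, such as the one shipped with CentOS 7, refuse to start unless the data directory is owned by the server user and has mode 0700. The sketch below is only a check, not a fix; check_datastore is a hypothetical helper and it assumes GNU stat.

```shell
# Compare the owner UID and mode of a data directory with what is expected.
# $1 = directory, $2 = numeric UID of the data service user.
# check_datastore is a hypothetical helper; assumes GNU stat (-c).
check_datastore() {
    dir=$1
    want_uid=$2
    uid=$(stat -c '%u' "$dir") || return 1
    mode=$(stat -c '%a' "$dir") || return 1
    rc=0
    if [ "$uid" != "$want_uid" ]; then
        echo "owner uid is $uid, expected $want_uid"
        rc=1
    fi
    if [ "$mode" != "700" ]; then
        echo "mode is $mode, expected 700"
        rc=1
    fi
    if [ "$rc" -eq 0 ]; then
        echo "datastore ownership and mode look ok"
    fi
    return "$rc"
}

# Example, using the data service user created earlier:
# check_datastore /PBS_HOME/datastore "$(id -u pbsdata)"
```

Running the same check on the NFS server and on the client shows whether both sides see the same numeric owner, which is where a UID mismatch would show up.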
No changes were made to any folders or directories.
The command-line steps above were executed as they were.
/app was already available on n1 (I did not run the exportfs and mount steps).
Alternatively, could you please try installing locally on the system, starting and then stopping the services, and then copying $PBS_HOME (do not move it) to /PBS_HOME while preserving permissions? After that, update PBS_HOME in /etc/pbs.conf and start the services.
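The copy step could look roughly like the sketch below. It uses cp -a, which preserves mode, ownership and timestamps (ownership only when run as root), and then compares a permission listing of both trees. The helper names copy_pbs_home and perm_listing are made up for illustration, and /var/spool/pbs is assumed to be the local PBS_HOME since that is the default.

```shell
# Copy the contents of $1 into $2, preserving permissions and ownership.
copy_pbs_home() {
    src=$1
    dst=$2
    mkdir -p "$dst" || return 1
    cp -a "$src"/. "$dst"/
}

# List "mode uid:gid path" for everything below a directory, sorted by path,
# so two trees can be compared. Assumes GNU find and stat.
perm_listing() {
    ( cd "$1" && find . -mindepth 1 -exec stat -c '%a %u:%g %n' {} + | sort -k 3 )
}

# Example, with the PBS services stopped:
# copy_pbs_home /var/spool/pbs /PBS_HOME
# [ "$(perm_listing /var/spool/pbs)" = "$(perm_listing /PBS_HOME)" ] && echo same
```

If the two listings differ after the copy onto NFS, that points at the export options or UID mapping rather than at PBS itself.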