Neither primary or secondary server

Hi,
I am trying to configure failover following the instructions in chapter 9.2.5.3 of the Admin Guide.
PBS is not running on the primary or the secondary host. So far I have configured only the primary server. Previously PBS ran from the shared PBS_HOME location without problems and without failover settings. Then I added the PBS_PRIMARY and PBS_SECONDARY parameters to pbs.conf and tried to start PBS on the primary host. The scheduler does not start and reports: pbs_sched, neither primary or secondary server.

/etc/init.d/pbs start
Starting PBS
PBS comm
/opt/pbs/sbin/pbs_comm ready (pid=72843), Proxy Name:pri:17001, Threads:4
pbs_sched: pbs_sched, neither primary or secondary server
pbs_sched startup failed, exit 1 aborting.
sched log:
07/21/2021 16:14:50;0002;pbs_sched;Svr;Log;Log opened
07/21/2021 16:14:50;0002;pbs_sched;Svr;pbs_sched;pbs_version=20.0.0
07/21/2021 16:14:50;0002;pbs_sched;Svr;pbs_sched;pbs_build=mach=N/A:security=N/A:configure_args=N/A
07/21/2021 16:14:50;0002;pbs_sched;Svr;pbs_sched;hostname=primary.domain;pbs_leaf_name=pri;pbs_mom_node_name=N/A
07/21/2021 16:14:50;0002;pbs_sched;Svr;pbs_sched;ipv4 interface lo: localhost4.localdomain4 
07/21/2021 16:14:50;0002;pbs_sched;Svr;pbs_sched;ipv4 interface eth1: primary.domain 
07/21/2021 16:14:50;0002;pbs_sched;Svr;pbs_sched;ipv4 interface eth2: primary.domain 
07/21/2021 16:14:50;0002;pbs_sched;Svr;pbs_sched;ipv4 interface eth3: pri 
07/21/2021 16:14:50;0002;pbs_sched;Svr;pbs_sched;ipv6 interface lo: localhost6.localdomain6 
07/21/2021 16:14:50;0002;pbs_sched;Svr;pbs_sched;ipv6 interface eth1: primary.domain 
07/21/2021 16:14:50;0002;pbs_sched;Svr;pbs_sched;ipv6 interface eth2: primary.domain 
07/21/2021 16:14:50;0002;pbs_sched;Svr;pbs_sched;ipv6 interface eth3: primary.domain 
07/21/2021 16:15:36;0002;pbs_sched;n/a;setup_env;read environment from /shared/location/pbs_environment
07/21/2021 16:15:36;0001;pbs_sched;Svr;pbs_sched;pbs_sched, neither primary or secondary server
pbs.conf
PBS_PRIMARY=pri
PBS_SECONDARY=sec
PBS_SERVER=primary
PBS_START_SERVER=1
PBS_START_SCHED=1
PBS_START_COMM=1
PBS_START_MOM=0
PBS_EXEC=/opt/pbs
PBS_HOME=/shared/location
PBS_CORE_LIMIT=unlimited
PBS_SCP=/usr/bin/scp
PBS_RCP=/usr/bin/false
PBS_LEAF_NAME=pri
/etc/hosts
127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
::1         localhost localhost.localdomain localhost6 localhost6.localdomain6
192.168.0.1    primary.domain		primary		pri
192.168.0.5    secondary.domain		secondary	sec
192.168.0.2    node01.domain		node01		n1
192.168.0.3    node02.domain		node02		n2
192.168.0.4    server_nfs.domain	server_nfs

There are no server logs because the server has not started yet!
But why? Is this a bug?

First the comm daemon started, then the sched daemon tried to start, but it did not recognize this host as either server, so the whole startup aborted.

How should I do this?
Should I first run PBS without failover settings on the secondary server and then start PBS on the primary server with failover settings? And then shut down PBS on the secondary server, change pbs.conf there for failover, and start PBS on the secondary host?

Regards!

Weird… First I start pbs_server, then pbs_comm, and finally pbs_sched. pbs_sched still fails, even though pbs_server is running:

[root@primary ~]# ps -ax | grep pbs
   3439 pts/0    S+     0:00 grep --color=auto pbs
[root@primary ~]# pbs_server
Notifying Secondary Server that we are taking over
Connecting to PBS dataservice...connected to PBS dataservice@pri
[root@primary ~]# ps -ax | grep pbs
   3495 ?        Ss     0:00 /opt/pbs/sbin/pbs_ds_monitor monitor
   3524 ?        S      0:00 /usr/bin/postgres -D /mnt/pbshome/datastore -p 15007
   3574 ?        Ss     0:00 postgres: postgres pbs_datastore 192.168.0.1(57780) idle
   3578 ?        Ssl    0:00 /opt/pbs/sbin/pbs_server.bin
   3582 pts/0    R+     0:00 grep --color=auto pbs
[root@primary ~]# pbs_comm 
[root@primary ~]# pbs_comm ready (pid=3682), Proxy Name:pri:17001, Threads:4

[root@primary ~]# ps -ax | grep pbs
   3495 ?        Ss     0:00 /opt/pbs/sbin/pbs_ds_monitor monitor
   3524 ?        S      0:00 /usr/bin/postgres -D /mnt/pbshome/datastore -p 15007
   3574 ?        Ss     0:00 postgres: postgres pbs_datastore 192.168.0.1(57780) idle
   3578 ?        Ssl    0:00 /opt/pbs/sbin/pbs_server.bin
   3682 ?        Ssl    0:00 pbs_comm
   3691 pts/0    R+     0:00 grep --color=auto pbs
[root@primary ~]# pbs_sched 
pbs_sched: pbs_sched, neither primary or secondary server
[root@primary ~]#

Hello!
Can anyone help with this?

I found this code in the server/pbsd_main.c file. Can anyone tell me what it does?

/* make sure no other server is running with this home directory */

	(void)sprintf(lockfile, "%s/%s/server.lock", pbs_conf.pbs_home_path,
		PBS_SVR_PRIVATE);
	if ((are_primary = are_we_primary()) == FAILOVER_SECONDARY) {
		strcat(lockfile, ".secondary");
	} else if (are_primary == FAILOVER_CONFIG_ERROR) {
		log_err(-1, msg_daemonname, "neither primary or secondary server");
		return (3);
	}

Please attach gdb to the pbs_sched process and set a breakpoint at the function are_we_primary(). Then we can find out why this function is returning -1.

Hi @subhasisb ,
Thanks for the reply.
I used gdb as you suggested. I found a problem in the way hostnames are compared in the scheduler's are_we_primary() function:

Summary of the gdb session:
[root@primary scheduler]# gdb pbs_sched
GNU gdb (GDB) Red Hat Enterprise Linux 8.2-15.el8
Copyright (C) 2018 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from pbs_sched...done.
(gdb) break are_we_primary
Breakpoint 1 at 0x46ba19: file pbs_sched_utils.cpp, line 1112.
(gdb) n
The program is not being run.
(gdb) run
Starting program: openpbs/src/scheduler/pbs_sched 
Missing separate debuginfos, use: yum debuginfo-install glibc-2.28-151.el8.x86_64
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".

Breakpoint 1, are_we_primary () at pbs_sched_utils.cpp:1112
1112		if ((c = are_we_primary()) == 1) {
Missing separate debuginfos, use: yum debuginfo-install libblkid-2.32.1-27.el8.x86_64 libgcc-8.4.1-1.el8.x86_64 libical-3.0.3-3.el8.x86_64 libicu-60.3-2.el8_1.x86_64 libmount-2.32.1-27.el8.x86_64 libstdc++-8.4.1-1.el8.x86_64 libuuid-2.32.1-27.el8.x86_64 libxcrypt-4.1.1-4.el8.x86_64 openssl-libs-1.1.1g-15.el8_3.x86_64 python3-libs-3.6.8-37.el8.rocky.x86_64 sssd-client-2.4.0-9.el8_4.1.x86_64 systemd-libs-239-45.el8_4.1.x86_64 zlib-1.2.11-17.el8.x86_64
(gdb) n
465			snprintf(server_host, sizeof(server_host), "%s", pbs_conf.pbs_leaf_name);
(gdb) n
466			endp = strchr(server_host, ','); /* find the first name */
(gdb) n
467			if (endp)
(gdb) n
469			endp = strchr(server_host, ':'); /* cut out the port */
(gdb) n
470			if (endp)
(gdb) n
479		if ((pbs_conf.pbs_secondary == NULL) && (pbs_conf.pbs_primary == NULL))
(gdb) n
481		if ((pbs_conf.pbs_secondary == NULL) || (pbs_conf.pbs_primary == NULL))
(gdb) p pbs_conf.pbs_secondary
$1 = 0x6e7b60 "sec"
(gdb) p pbs_conf.pbs_primary
$2 = 0x6e7b40 "pri"
(gdb) n
484		if (get_fullhostname(pbs_conf.pbs_primary, hn1, (sizeof(hn1) - 1)) == -1) {
(gdb) n
489		if (strcmp(hn1, server_host) == 0)
(gdb) p hn1
$3 = "primary.domain", '\000' <repeats 233 times>
(gdb) p server_host
$4 = "pri", '\000' <repeats 75 times>, "cput\000\000\000\000\200\206\272\366\377\177\000\000\300\340\377\377\377\177\000\000\003\000\000\000\000\000\000\000mem\000\377\177\000\000\000\232\060\001a˝p\340\340\377\377\377\177\000\000\b\000\000\000\000\000\000\000walltime\000\220\224\366\377\177\000\000\000\341\377\377\377\177\000\000\r\000\000\000\000\000\000\000\002", '\000' <repeats 78 times>
(gdb) s
sched_main (argc=1, argv=0x7fffffffe368, sched_ptr=<optimized out>) at pbs_sched_utils.cpp:1112
1112		if ((c = are_we_primary()) == 1) {
(gdb) n
1117			log_err(-1, "pbs_sched", "neither primary or secondary server");
(gdb) n
pbs_sched: pbs_sched, neither primary or secondary server
1118			exit(1);
(gdb) n
[Inferior 1 (process 2330) exited with code 01]

It compares the short hostname taken from pbs_leaf_name (the server_host variable) with the FQDN (the hn1 variable) resolved from the PBS_PRIMARY and PBS_SECONDARY values. The strcmp fails for both, so the function returns -1.
When I change the code of the are_we_primary function slightly, the scheduler runs without error.
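
To show the mismatch outside of PBS, here is a minimal standalone sketch. It is not PBS code: the helper names leaf_first_host and same_host are made up for this example, and the FQDN is passed in as a plain string instead of being resolved. It only mirrors the trimming (keep the first name before ',', drop the ":port" part) and the exact strcmp comparison visible in the gdb session above.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper (not part of PBS): copy a PBS_LEAF_NAME-style
 * value into out, keeping only the first name before any ',' and
 * dropping any ":port" suffix. */
void leaf_first_host(const char *leaf, char *out, size_t outlen)
{
	char *endp;

	snprintf(out, outlen, "%s", leaf);
	endp = strchr(out, ','); /* keep only the first name */
	if (endp)
		*endp = '\0';
	endp = strchr(out, ':'); /* cut off the port */
	if (endp)
		*endp = '\0';
}

/* Hypothetical helper: returns 1 if the trimmed leaf name is exactly
 * equal to the given full host name, 0 otherwise. The comparison is a
 * plain strcmp, so a short name never equals an FQDN. */
int same_host(const char *leaf, const char *fqdn)
{
	char host[256];

	leaf_first_host(leaf, host, sizeof(host));
	return strcmp(host, fqdn) == 0;
}
```

With the values from this thread, same_host("pri", "primary.domain") returns 0, while same_host("primary.domain:17001", "primary.domain") returns 1.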

The file containing the are_we_primary function is: src/scheduler/pbs_sched_utils.cpp

Original function:
int are_we_primary()
{
	char server_host[PBS_MAXHOSTNAME + 1];
	char hn1[PBS_MAXHOSTNAME + 1];

	if (pbs_conf.pbs_leaf_name) {
		char *endp;
		snprintf(server_host, sizeof(server_host), "%s", pbs_conf.pbs_leaf_name);
		endp = strchr(server_host, ','); /* find the first name */
		if (endp)
			*endp = '\0';
		endp = strchr(server_host, ':'); /* cut out the port */
		if (endp)
			*endp = '\0';
	} else if ((gethostname(server_host, (sizeof(server_host) - 1)) == -1) ||
		   (get_fullhostname(server_host, server_host, (sizeof(server_host) - 1)) == -1)) {
		log_err(-1, __func__, "Unable to get my host name");
		return -1;
	}

	/* both secondary and primary should be set or neither set */
	if ((pbs_conf.pbs_secondary == NULL) && (pbs_conf.pbs_primary == NULL))
		return 1;
	if ((pbs_conf.pbs_secondary == NULL) || (pbs_conf.pbs_primary == NULL))
		return -1;

	if (get_fullhostname(pbs_conf.pbs_primary, hn1, (sizeof(hn1) - 1)) == -1) {
		log_err(-1, __func__, "Unable to get full host name of primary");
		return -1;
	}

	if (strcmp(hn1, server_host) == 0)
		return 1; /* we are the listed primary */

	if (get_fullhostname(pbs_conf.pbs_secondary, hn1, (sizeof(hn1) - 1)) == -1) {
		log_err(-1, __func__, "Unable to get full host name of secondary");
		return -1;
	}
	if (strcmp(hn1, server_host) == 0)
		return 0; /* we are the secondary */

	return -1; /* cannot be neither */
}
Proposed change:
int are_we_primary()
{
        char server_host[PBS_MAXHOSTNAME + 1];
        char hn1[PBS_MAXHOSTNAME + 1];
        char srvh[PBS_MAXHOSTNAME + 1];

        if (pbs_conf.pbs_leaf_name) {
                char *endp;
                snprintf(server_host, sizeof(server_host), "%s", pbs_conf.pbs_leaf_name);
                endp = strchr(server_host, ','); /* find the first name */
                if (endp)
                        *endp = '\0';
                endp = strchr(server_host, ':'); /* cut out the port */
                if (endp)
                        *endp = '\0';
        } else if ((gethostname(server_host, (sizeof(server_host) - 1)) == -1) ||
                   (get_fullhostname(server_host, server_host, (sizeof(server_host) - 1)) == -1)) {
                log_err(-1, __func__, "Unable to get my host name");
                return -1;
        }

        /* both secondary and primary should be set or neither set */
        if ((pbs_conf.pbs_secondary == NULL) && (pbs_conf.pbs_primary == NULL))
                return 1;
        if ((pbs_conf.pbs_secondary == NULL) || (pbs_conf.pbs_primary == NULL))
                return -1;

        if (get_fullhostname(server_host, srvh, (sizeof(srvh) - 1)) == -1) {
                log_err(-1, __func__, "Unable to get full host name of pbs_leaf_name");
                return -1;
        }


        if (get_fullhostname(pbs_conf.pbs_primary, hn1, (sizeof(hn1) - 1)) == -1) {
                log_err(-1, __func__, "Unable to get full host name of primary");
                return -1;
        }

        if (strcmp(hn1, srvh) == 0)
                return 1; /* we are the listed primary */

        if (get_fullhostname(pbs_conf.pbs_secondary, hn1, (sizeof(hn1) - 1)) == -1) {
                log_err(-1, __func__, "Unable to get full host name of secondary");
                return -1;
        }
        if (strcmp(hn1, srvh) == 0)
                return 0; /* we are the secondary */

        return -1; /* cannot be neither */
}

I added the srvh variable:
char srvh[PBS_MAXHOSTNAME + 1];
which holds the FQDN resolved from the PBS_LEAF_NAME value:
get_fullhostname(server_host, srvh, …
and now two full hostnames are compared:
…strcmp(hn1, srvh)…

I am not a programmer, so maybe this should be solved in another way, but for me it works well.

Regards!

OK, this is not a bug! It was a misunderstanding about how to set the PBS_LEAF_NAME value:

PBS_LEAF_NAME should be the FQDN of the hostname that resolves to the IP address on the internal network,
not the short hostname.
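
For anyone who wants to check a name before putting it into pbs.conf, here is a small sketch. It is an example under assumptions, not PBS code: it uses the standard getaddrinfo() call with AI_CANONNAME, which may not match exactly what get_fullhostname() does internally, and the function names has_domain_part and canonical_name are made up for this example. The "contains a dot" test is only a rough heuristic for spotting a short name.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L
#endif
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netdb.h>

/* Hypothetical check: a name without any '.' is treated as a short
 * name rather than an FQDN. This is a heuristic, nothing more. */
int has_domain_part(const char *name)
{
	return name != NULL && strchr(name, '.') != NULL;
}

/* Hypothetical helper: ask the system resolver for the canonical name
 * of 'name' and store it in out. Returns 0 on success, -1 on failure. */
int canonical_name(const char *name, char *out, size_t outlen)
{
	struct addrinfo hints;
	struct addrinfo *res = NULL;
	int rc;

	memset(&hints, 0, sizeof(hints));
	hints.ai_family = AF_UNSPEC;     /* IPv4 or IPv6 */
	hints.ai_socktype = SOCK_STREAM; /* avoid duplicate entries */
	hints.ai_flags = AI_CANONNAME;   /* request the canonical name */

	rc = getaddrinfo(name, NULL, &hints, &res);
	if (rc != 0) {
		fprintf(stderr, "%s: %s\n", name, gai_strerror(rc));
		return -1;
	}
	if (res == NULL || res->ai_canonname == NULL) {
		if (res != NULL)
			freeaddrinfo(res);
		return -1;
	}
	snprintf(out, outlen, "%s", res->ai_canonname);
	freeaddrinfo(res);
	return 0;
}
```

Here has_domain_part("pri") returns 0 and has_domain_part("primary.domain") returns 1. What canonical_name() reports depends on the local /etc/hosts and DNS setup, so it is best run on the server host itself and compared by eye with the PBS_LEAF_NAME value.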

Regards!