Comm daemon and failover

Hi,
Do we need to set the PBS_COMM_ROUTERS parameter in pbs.conf on the secondary host in a failover configuration?

No, in a failover configuration PBS_COMM_ROUTERS does not need to be set.

Regards,
Subhasis

Thanks for the reply.

OK, so is it normal that the comm daemon does not start on the secondary server while the primary is running and active?

[root@secondary ~]# /etc/init.d/pbs start
Starting PBS
PBS comm
pbs_comm: another PBS comm router running at the same port
PBS server
[root@secondary ~]#

???

Ah no, the error you are seeing means something is occupying the same port on the machine “Secondary”. Running netstat -pant | grep 17001 should show you which process is using that port.
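
A plain grep 17001 matches that string anywhere in the line, including a remote port or a PID. A slightly stricter filter could look like the sketch below. The function name listeners_on_port is made up for this example, and it assumes the Linux net-tools netstat column layout (local address in column 4, state in column 6).

```shell
#!/bin/sh
# listeners_on_port PORT
# Hypothetical helper: reads `netstat -pant`-style lines on stdin and prints
# only TCP sockets in LISTEN state whose local address ends in :PORT.
# Assumes the net-tools column layout: $4 = local address, $6 = state.
listeners_on_port() {
    awk -v port="$1" '
        $1 ~ /^tcp/ && $6 == "LISTEN" && $4 ~ (":" port "$") { print }
    '
}

# Usage (run as root so the PID/program column is filled in):
#   netstat -pant | listeners_on_port 17001
```

No output means nothing on the host is listening on that port, which points away from a real port conflict.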

Hmm… “netstat -pant | grep 17001” displays nothing:

[user@secondary ~]$ sudo ps -ax | grep pbs
 327647 pts/0    S+     0:00 grep --color=auto pbs
[user@secondary ~]$ sudo netstat -pant | grep 17001
[user@secondary ~]$ sudo /etc/init.d/pbs start
Starting PBS
PBS comm
PBS server
[user@secondary ~]$ pbs_comm: another PBS comm router running at the same port

[user@secondary ~]$ sudo netstat -pant | grep 17001
[user@secondary ~]$ sudo ps -ax | grep pbs
 327729 ?        Ss     0:00 /opt/pbs/sbin/pbs_server.bin
 327736 pts/0    S+     0:00 grep --color=auto pbs
[user@secondary ~]$

How can I troubleshoot this?

secondary pbs.conf:
PBS_PRIMARY=pri
PBS_SECONDARY=sec
PBS_SERVER=primary
PBS_START_SERVER=1
PBS_START_SCHED=0
PBS_START_COMM=1
PBS_START_MOM=0
PBS_EXEC=/opt/pbs
PBS_HOME=/shared/location
PBS_CORE_LIMIT=unlimited
PBS_SCP=/usr/bin/scp
PBS_RCP=/usr/bin/false
PBS_LEAF_NAME=sec
PBS_COMM_LOG_EVENTS=2047

primary pbs.conf:
PBS_PRIMARY=pri
PBS_SECONDARY=sec
PBS_SERVER=primary
PBS_START_SERVER=1
PBS_START_SCHED=1
PBS_START_COMM=1
PBS_START_MOM=0
PBS_EXEC=/opt/pbs
PBS_HOME=/shared/location
PBS_CORE_LIMIT=unlimited
PBS_SCP=/usr/bin/scp
PBS_RCP=/usr/bin/false
PBS_LEAF_NAME=pri

gdb output:
(gdb) break lock_out
Breakpoint 1 at 0x4045c0: lock_out. (2 locations)
(gdb) run
Starting program: /opt/pbs/sbin/pbs_comm
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".

Breakpoint 1, lock_out (fds=6, op=2) at pbs_comm.c:228
228             (void) lseek(fds, (off_t) 0, SEEK_SET);
(gdb) n
229             flock.l_type = op;
(gdb) n
231             flock.l_start = 0;
(gdb) n
232             flock.l_len = 0;
(gdb) n
234                     if (fcntl(fds, F_SETLK, &flock) != -1) {
(gdb) n

Breakpoint 1, lock_out (op=2, fds=6) at pbs_comm.c:235
235                             if (op == F_WRLCK) {
(gdb) n
go_to_background () at pbs_comm.c:293
293             rc = fork();
(gdb) n
[Detaching after fork from child process 327935]
294             if (rc == -1) /* fork failed */
(gdb) pbs_comm: another PBS comm router running at the same port
n
296             if (rc > 0)
(gdb) n
297                     exit(0); /* parent goes away, allowing booting to continue */
(gdb) n
[Inferior 1 (process 327920) exited normally]

comm log:
07/28/2021 14:19:49;0002;Comm@sec;Svr;Log;Log opened
07/28/2021 14:19:49;0002;Comm@sec;Svr;Comm@sec;pbs_version=20.0.0
07/28/2021 14:19:49;0002;Comm@sec;Svr;Comm@sec;pbs_build=mach=N/A:security=N/A:configure_args=N/A
07/28/2021 14:19:49;0002;Comm@sec;Svr;Comm@sec;hostname=secondary.domain;pbs_leaf_name=sec;pbs_mom_node_name=N/A
07/28/2021 14:19:49;0002;Comm@sec;Svr;Comm@sec;ipv4 interface lo: localhost4.localdomain4
07/28/2021 14:19:49;0002;Comm@sec;Svr;Comm@sec;ipv4 interface bond0.1: secondary.domain
07/28/2021 14:19:49;0002;Comm@sec;Svr;Comm@sec;ipv4 interface bond0.2: sec
07/28/2021 14:19:49;0002;Comm@sec;Svr;Comm@sec;ipv6 interface lo: localhost6.localdomain6
07/28/2021 14:19:49;0002;Comm@sec;Svr;Comm@sec;ipv6 interface bond0.1: secondary.domain
07/28/2021 14:19:49;0002;Comm@sec;Svr;Comm@sec;ipv6 interface bond0.2: secondary.domain
07/28/2021 14:19:49;0d80;Comm@sec;TPP;Comm@sec(Main Thread);TPP authentication method = resvport

server logs related to secondary host:
07/28/2021 14:36:39;0100;Server@primary;Req;;Type 27 request received from root@secondary.domain, sock=24
07/28/2021 14:36:38;0002;Server@secondary;Svr;Server@secondary;hostname=secondary.domain;pbs_leaf_name=sec;pbs_mom_node_name=N/A
07/28/2021 14:36:38;0002;Server@secondary;Svr;Server@secondary;ipv4 interface lo: localhost4.localdomain4
07/28/2021 14:36:38;0002;Server@secondary;Svr;Server@secondary;ipv4 interface bond0.1: secondary.domain
07/28/2021 14:36:38;0002;Server@secondary;Svr;Server@secondary;ipv4 interface bond0.2: sec
07/28/2021 14:36:38;0002;Server@secondary;Svr;Server@secondary;ipv6 interface lo: localhost6.localdomain6
07/28/2021 14:36:38;0002;Server@secondary;Svr;Server@secondary;ipv6 interface bond0.1: secondary.domain
07/28/2021 14:36:38;0002;Server@secondary;Svr;Server@secondary;ipv6 interface bond0.2: secondary.domain
07/28/2021 14:36:38;0006;Server@secondary;Fil;Server@secondary;Version 20.0.0, started, initialization type = 1
07/28/2021 14:36:38;0002;Server@secondary;Svr;Server@secondary;pbs_server: coming up as Secondary, Primary is pri
^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@ [... the run of ^@ characters continues for many more lines ...]

I saw something like this in the server log:
^@^@^@^@^@

What does that mean?

gdb output with breakpoint set on go_to_background function:

(gdb) run
Starting program: /opt/pbs/sbin/pbs_comm
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".

Breakpoint 1, go_to_background () at pbs_comm.c:514
514                     go_to_background();
(gdb) n
293             rc = fork();
(gdb) n
[Detaching after fork from child process 328580]
294             if (rc == -1) /* fork failed */
(gdb) pbs_comm: another PBS comm router running at the same port
p rc
$6 = 328580
(gdb) n
296             if (rc > 0)
(gdb) n
297                     exit(0); /* parent goes away, allowing booting to continue */
(gdb) n
[Inferior 1 (process 328579) exited normally]
(gdb) n
The program is not being run.

Code of the go_to_background function:
/**
 * @brief
 *		Forks a background process and continues on that, while
 * 		exiting the foreground process. It also sets the child process to
 * 		become the session leader. This function is avaible only on Non-Windows
 * 		platforms and in non-debug mode.
 *
 * @return -  pid_t	- sid of the child process (result of setsid)
 * @retval       >0	- sid of the child process.
 * @retval       -1	- Fork or setsid failed.
 */
static pid_t
go_to_background()
{
	pid_t sid = -1;
	int rc;

	lock_out(lockfds, F_UNLCK);
	rc = fork();
	if (rc == -1) /* fork failed */
		return ((pid_t) -1);
	if (rc > 0)
		exit(0); /* parent goes away, allowing booting to continue */

	lock_out(lockfds, F_WRLCK);
	if ((sid = setsid()) == -1) {
		fprintf(stderr, "pbs_comm: setsid failed");
		return ((pid_t) -1);
	}
	already_forked = 1;
	return sid;
}

I think the problem is with the go_to_background function in the src/server/pbs_comm.c file.
When it forks the comm process, two comm daemon processes end up running at the same time on the same port. Right?

The problem may be more complex. When go_to_background creates the child process of the comm daemon, the child goes through the lock_out function, which reports the error: “another PBS comm router running at the same port”. So the lock_out function is the problematic part in this case.
On the other hand, why does this function work fine on the primary server?

When I kill comm on the primary, comm on the secondary starts fine (when started by hand, of course).

When I run qterm -s on the primary (the server and scheduler are shut down), the scheduler and datastore start automatically on the secondary (the server was started beforehand), but the comm daemon does not stop on the primary. So I run kill -TERM <comm_PID> on the primary; comm then stops on the primary but does not start on the secondary. When I then start pbs_comm on the secondary by hand, it starts fine. qstat -Bf shows that server_host is primary.domain, even though the primary is shut down and the secondary has taken control.
Is this normal behaviour?

The problem is not in the go_to_background() function, and it is not a bug either. This is a problem with your hostname resolution setup.

The actual FQDN of the secondary host should match the FQDN of the name listed in PBS_SECONDARY. In your case, the server_host of the secondary is set via PBS_LEAF_NAME as sec. This does not match the FQDN of the secondary machine, which is probably sec.org.

So, kindly change this to the FQDN:
PBS_LEAF_NAME=sec.fqdn

and then it should work.

If this works, kindly log it as an issue; we will update the code to make it slightly more robust!

I am not sure I understand what I need to do. Do I need to change the PBS_LEAF_NAME variable to the FQDN hostname, or to the short hostname with a dot?

If it should be the FQDN hostname, do I need to change the /etc/hosts configuration to something like this?
192.169.0.1 primary.domain
192.169.0.2 secondary.domain
192.168.0.1 primary pri
192.168.0.2 secondary sec
Here 192.169.x.x is the external network and 192.168.x.x is the internal network that I want PBS to communicate over.

Right now /etc/hosts is configured like this:
192.168.0.1 primary.domain primary pri
192.168.0.2 secondary.domain secondary sec

Given the above /etc/hosts, try setting:
PBS_LEAF_NAME=secondary.domain

Which? With 192.169.x.x or without?

Without the 192.169.x.x network.

So you need this:

and this:

This way, the value in PBS_LEAF_NAME exactly matches the FQDN of the secondary host, secondary.domain.
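
The mismatch described above can be checked mechanically. Below is a rough sketch, not anything PBS ships: the helper names conf_get, canonical_name and leaf_is_canonical are invented for this example. It assumes names are resolved from a hosts-format file, where the first name on a line is the canonical one and the rest are aliases; DNS or other resolver sources are not consulted.

```shell
#!/bin/sh
# conf_get KEY FILE
# Print the value of KEY from a pbs.conf-style file (KEY=value lines).
conf_get() {
    sed -n "s/^$1=//p" "$2" | tail -n 1
}

# canonical_name NAME HOSTS_FILE
# Print the first (canonical) name on the hosts line that lists NAME,
# either as the canonical name itself or as an alias.
canonical_name() {
    awk -v name="$1" '
        { sub(/#.*/, "") }   # strip comments
        { for (i = 2; i <= NF; i++) if ($i == name) { print $2; exit } }
    ' "$2"
}

# leaf_is_canonical PBS_CONF HOSTS_FILE
# Succeeds only if PBS_LEAF_NAME is set and is itself the canonical name
# of its hosts entry (i.e. not a short alias such as "sec").
leaf_is_canonical() {
    leaf=$(conf_get PBS_LEAF_NAME "$1")
    [ -n "$leaf" ] && [ "$leaf" = "$(canonical_name "$leaf" "$2")" ]
}

# Usage:
#   leaf_is_canonical /etc/pbs.conf /etc/hosts || echo "PBS_LEAF_NAME is an alias or unknown"
```

With the /etc/hosts lines shown earlier, PBS_LEAF_NAME=sec fails this check (its canonical name is secondary.domain), while PBS_LEAF_NAME=secondary.domain passes.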

Yes, now it is working! The comm daemons start on the primary and the secondary without errors.
The comm daemon registers its leaves only on the internal network, which is correct.
When I terminate the primary, the secondary takes control (it starts the dataservice and scheduler; comm and the server were started beforehand).
I asked in another post how to set PBS_LEAF_NAME, and the conclusion there was that I needed to use the short hostname, without the domain. So the conclusions from that post are wrong.

The solution and conclusion:
PBS_LEAF_NAME should be the FQDN of the host, and that FQDN should resolve to the IP address on the internal network!
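
The second half of that conclusion (the FQDN must resolve to the internal address) can be sketched as a quick check. The helper names address_of and resolves_on_network are hypothetical, and the lookup only reads a hosts-format file; on a live system, getent hosts NAME shows what the system resolver actually returns.

```shell
#!/bin/sh
# address_of NAME HOSTS_FILE
# Print the address of the first hosts line that lists NAME.
address_of() {
    awk -v name="$1" '
        { sub(/#.*/, "") }   # strip comments
        { for (i = 2; i <= NF; i++) if ($i == name) { print $1; exit } }
    ' "$2"
}

# resolves_on_network NAME PREFIX HOSTS_FILE
# Succeeds if NAME maps to an address starting with PREFIX,
# e.g. PREFIX="192.168." for the internal network in this thread.
resolves_on_network() {
    addr=$(address_of "$1" "$3")
    case "$addr" in
        "$2"*) return 0 ;;
        *)     return 1 ;;
    esac
}

# Usage:
#   resolves_on_network secondary.domain 192.168. /etc/hosts \
#       || echo "secondary.domain is not on the internal network"
```

With the two-network /etc/hosts proposed earlier, the FQDNs sit on 192.169.x.x and this check fails; with the single-network layout it passes.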

Many thanks for your help!
Regards!

@subhasisb I would like to ask, hesitantly: do you have any idea why the server_host attribute shown by qstat -Bf is still set to “primary.domain” after the secondary takes control?
This is not a big problem for me right now, but I am curious whether something is wrong in my configuration again.

Hi, you can look at the PBS admin guide for the right steps to configure this. I think this is still a configuration issue, so yes, this is not correct (server_host should show the secondary host's FQDN when the secondary is active).

From the admin guide:

PBS_PRIMARY (FQDN of hostname)
Hostname of primary server host. If you set PBS_LEAF_NAME on the primary server host, make sure that PBS_PRIMARY matches PBS_LEAF_NAME on the corresponding host. If you do not set PBS_LEAF_NAME on the server host, make sure that PBS_PRIMARY matches the hostname of the server host.

PBS_SECONDARY (FQDN of hostname)
Hostname of secondary server host. If you set PBS_LEAF_NAME on the secondary server host, make sure that PBS_SECONDARY matches PBS_LEAF_NAME on the corresponding host. If you do not set PBS_LEAF_NAME on the server host, make sure that PBS_SECONDARY matches the hostname of the server host.

PBS_SERVER (Hostname)
Name of primary server host. Cannot be longer than 255 characters. If the short name of the server host resolves to the correct IP address, you can use the short name for the value of the PBS_SERVER entry in pbs.conf. If only the FQDN of the server host resolves to the correct IP address, you must use the FQDN for the value of PBS_SERVER.

Please check your settings against this.

Hmm… Everything looks OK:

                primary host      secondary host    compute nodexx
PBS_PRIMARY     primary.domain    primary.domain    primary.domain
PBS_SECONDARY   secondary.domain  secondary.domain  secondary.domain
PBS_SERVER      primary           primary           primary
PBS_LEAF_NAME   primary.domain    secondary.domain  nodexx.domain

I also tried the FQDN for PBS_SERVER, but the result did not change.
It is not that important, I think, so I will not spend more time on this topic.

Regards and thanks for advice!

Hmm, that is strange. Please attach the pbs.conf from the primary and secondary, and the /etc/hosts from both machines.

Regards,
Subhasis

pbs.conf primary:
PBS_PRIMARY=primary.domain
PBS_SECONDARY=primary.domain
PBS_SERVER=primary
PBS_LEAF_NAME=primary.domain
PBS_START_SERVER=1
PBS_START_SCHED=1
PBS_START_MOM=0
PBS_START_COMM=1
PBS_EXEC=/opt/pbs
PBS_HOME=/path/to/shared/pbs_home
PBS_CORE_LIMIT=unlimited
PBS_SCP=/bin/scp
PBS_RCP=/usr/bin/false

pbs.conf secondary:
PBS_PRIMARY=primary.domain
PBS_SECONDARY=secondary.domain
PBS_SERVER=primary
PBS_LEAF_NAME=secondary.domain
PBS_START_SERVER=1
PBS_START_SCHED=0
PBS_START_COMM=1
PBS_START_MOM=0
PBS_EXEC=/opt/pbs
PBS_HOME=/path/to/shared/pbs_home
PBS_CORE_LIMIT=unlimited
PBS_SCP=/bin/scp
PBS_RCP=/usr/bin/false
PBS_COMM_LOG_EVENTS=511

hosts primary:
127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
::1         localhost localhost.localdomain localhost6 localhost6.localdomain6
192.168.0.1       primary.domain       primary		pri
192.168.0.2       secondary.domain     secondary	sec
192.168.0.10      node01.domain        node01		n1

hosts secondary:
127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
::1         localhost localhost.localdomain localhost6 localhost6.localdomain6
192.168.0.1       primary.domain        primary        pri
192.168.0.2       secondary.domain      secondary      sec
192.168.0.10      node01.domain         node01         n1
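
One detail in the files as posted: the primary's pbs.conf has PBS_SECONDARY=primary.domain, while the secondary's pbs.conf (and the summary table earlier) has secondary.domain. Whether that is a typo in the post or the real setting is not clear from the thread, and nobody confirmed it as the cause of the server_host observation. Still, it is the kind of discrepancy a side-by-side comparison catches. A minimal sketch, with the hypothetical helper names conf_get and compare_failover_names:

```shell
#!/bin/sh
# conf_get KEY FILE
# Print the value of KEY from a pbs.conf-style file (KEY=value lines).
conf_get() {
    sed -n "s/^$1=//p" "$2" | tail -n 1
}

# compare_failover_names CONF_A CONF_B
# Report any difference in PBS_PRIMARY or PBS_SECONDARY between two
# pbs.conf files (e.g. copies taken from the primary and the secondary).
# Returns 0 if both keys agree, 1 otherwise.
compare_failover_names() {
    rc=0
    for key in PBS_PRIMARY PBS_SECONDARY; do
        a=$(conf_get "$key" "$1")
        b=$(conf_get "$key" "$2")
        if [ "$a" != "$b" ]; then
            echo "$key differs: '$a' vs '$b'"
            rc=1
        fi
    done
    return $rc
}

# Usage, assuming both files were copied to one machine first:
#   compare_failover_names pbs.conf.primary pbs.conf.secondary
```

Run against the two files above, this reports that PBS_SECONDARY differs between the hosts.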