State-unknown,down

boboshaq · June 30, 2021, 8:45am

Has anyone properly installed OpenPBS 20 on CentOS 8 Stream?
My nodes are still in the state-unknown,down state.

boboshaq · July 6, 2021, 7:35am

logs from server comm daemon:

logs from MoM comm daemon:

boboshaq · July 6, 2021, 12:14pm

And here are the logs with the log level increased to 2025:
from server comm daemon:

And after that, tfd=20 can register at server:
07/06/2021 12:21:57;0c06;Comm@pbssrv;TPP;Comm@pbssrv(Thread 3);tfd=20, Leaf registered address 192.168.1.1:15001

And next:

tfd=22 tries to register the same address that tfd=18 has already registered:

The tfd=18 connection is dropped and the address is registered again immediately

Then tfd=22 registers address 192.168.1.2:15003, which is already registered via tfd=16.
The tfd=16 connection is dropped and the address is registered again immediately:

What is going on???

alexis.cousein · July 8, 2021, 10:03am

You’ve got some really messed up networking on the MoMs. They are trying to register the same leaf addresses to pbs_comm more than once. Either they register the address of the other, or they are both connecting twice to pbs_comm thinking that they must connect to two pbs_comm daemons that are, in fact, identical.
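A minimal sketch of how such repeated registrations could be spotted in a pbs_comm log, assuming the messages follow the "tfd=N, Leaf registered address A" format quoted earlier in this thread. The function name `duplicate_leaf_registrations` is hypothetical and not part of OpenPBS.

```python
import re
from collections import defaultdict

# Matches the "Leaf registered address" message format quoted in this thread.
LEAF_RE = re.compile(r"tfd=(\d+), Leaf registered address (\S+)")


def duplicate_leaf_registrations(log_lines):
    """Map each leaf address to the tfds that registered it, keeping only
    addresses that were registered through more than one tfd."""
    seen = defaultdict(list)
    for line in log_lines:
        match = LEAF_RE.search(line)
        if match is None:
            continue
        tfd, address = int(match.group(1)), match.group(2)
        if tfd not in seen[address]:
            seen[address].append(tfd)
    return {addr: tfds for addr, tfds in seen.items() if len(tfds) > 1}
```

Passing it the lines of the comm log (for example an open file object) returns a dictionary of addresses that show up under several tfds. A repeated address is only a hint: a MoM that legitimately reconnects after a drop will also appear under a new tfd.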

Perhaps posting the output of ifconfig on both MoMs as well as their /etc/pbs.conf files would come in handy.

Your logs posted are all over the place – you had several different issues and you just throw everything onto one heap. That isn’t going to make people understand what the last problem you had was all about. In other cases, you simply elide vital information (like the list of addresses that MoM registers to pbs_comm when it starts up).

By the way: the server caches hostname-IP address pairs once it’s up. If you’re using DHCP and the nodes can swap addresses things aren’t going to work – OpenPBS relies on stable IP/name resolution and reverse resolution.

pbs_comm:192.168.1.11:17001: Dest not found

means that the server actually did not register that address to pbs_comm.

But from your messages that appears to have been a transient issue.

07/06/2021 12:38:04;0001;pbs_mom;Svr;pbs_mom;im_eof, Premature end of message from addr 192.168.1.11:15001 on stream 19
07/06/2021 12:38:04;0001;pbs_mom;Svr;net_down_handler;net down handler called

is because no two TCP connections from MoM can register the same address. In that case the address is stripped as a client and the two connections are dropped, in an attempt to make the “real” owner of the address reconnect and register it after two seconds.

Note: if you have Docker or VMWare running on both hosts and they have interfaces with identical addresses on both MoMs, then you really need to use PBS_LEAF_NAME to tell the different hosts what unique addresses to register with pbs_comm.

If the server did not register its address, then either the server wasn’t up yet (MoMs are trying to set up communication with the server but it’s not up) or there’s a problem. If it wasn’t up yet it’s not a problem, since the server will try to ping all nodes when it comes up.

This has nothing to do with CentOS Stream 8.

boboshaq · July 8, 2021, 2:07pm

Thanks for analysing the problem.
I can’t provide more details about it because I have reinstalled those three machines with another OS.

I don’t know why. There was one pbs_comm on one PBS server with two different addresses on two interfaces. That’s why I use PBS_LEAF_NAME on the server and on the nodes. The nodes have two interfaces too, with two different addresses.

That’s because a post is limited in the number of lines, so I had to split it.

OK, next time I will provide the log from the beginning, starting at the “open log” statement.

I do not use DHCP. All addresses are static.

This is because the server was restarted at that time.

Weird, each machine has two interfaces with different addresses. Those addresses are static.
The /etc/hosts files look like this:
192.168.1.1 pbsserver.domain pbsserver pbsrv
192.168.1.2 pbsnode1.domain pbsnode1 n1
192.168.1.3 pbsnode2.domain pbsnode1 n2

But pbsserver.domain has two different addresses from two interfaces, so I use PBS_LEAF_NAME=pbsrv on the server (and =n1 on the first node, etc.) so that communication goes over one of them. Is this OK, or should I remove the pbsserver.domain entry? I think it is OK.
In pbs.conf, PBS_SERVER=pbsserver.

OK, it is a waste of time to analyze this. As I said, I reinstalled those machines and am setting up OpenPBS from scratch. If I run into the same problems, I will post more details about the configuration here.

Regards!

alexis.cousein · July 8, 2021, 2:47pm

You cannot have the canonical name be listed on two different addresses. That leads to ambiguity when PBSPro uses the canonical name to resolve it back to an address.

Likewise, the server expects PBS_SERVER to resolve only to one address.
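A minimal sketch of how an /etc/hosts file could be checked for names listed under more than one address. It only looks at the text of the file, not at DNS or any other resolver source, and the function name `names_with_multiple_addresses` is hypothetical.

```python
from collections import defaultdict


def names_with_multiple_addresses(hosts_text):
    """Return {name: [addresses]} for every name that the hosts file
    lists under more than one address."""
    addresses = defaultdict(list)
    for raw in hosts_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        address, *names = line.split()
        for name in names:
            if address not in addresses[name]:
                addresses[name].append(address)
    return {name: addrs for name, addrs in addresses.items() if len(addrs) > 1}
```

Applied to the three /etc/hosts lines quoted earlier in the thread, this would report `pbsnode1` under both 192.168.1.2 and 192.168.1.3, because the third line carries the alias `pbsnode1` next to `pbsnode2.domain`. That may simply be a typo in the post, but it is the kind of ambiguity described here.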

boboshaq · July 9, 2021, 7:07am

I know that. pbsserver.domain is the FQDN, not the canonical name.

alexis.cousein · July 9, 2021, 7:34am

The canonicalised name is the name obtained when you resolve a name to an address and reverse resolve that address to the first name supplied by the reverse resolver.

That canonicalised name must not resolve to different IP addresses (i.e. it must appear on only one line in /etc/hosts).
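A minimal sketch of that round trip using Python's standard `socket` module, assuming the system resolver is what PBS would see. The function name `canonical_addresses` is hypothetical, and the resolver functions are passed in as parameters only so the logic can be exercised without a live network.

```python
import socket


def canonical_addresses(name,
                        forward=socket.gethostbyname_ex,
                        reverse=socket.gethostbyaddr):
    """Resolve name to an address, reverse resolve that address to the first
    name the resolver supplies, then resolve that canonical name again.

    Returns (canonical_name, sorted list of addresses it resolves to)."""
    _, _, addresses = forward(name)
    canonical, _aliases, _ = reverse(addresses[0])
    _, _, canonical_addrs = forward(canonical)
    return canonical, sorted(set(canonical_addrs))
```

Calling `canonical_addresses("pbsserver")` on the server and on each node should give the same canonical name with a single address. Whether several /etc/hosts lines for one name are all reported depends on the resolver configuration, so a single address here does not prove the hosts file is clean.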

boboshaq · July 9, 2021, 8:03am

Ahh, OK, so the FQDN is the canonical name, because when I query the short names the resolver always returns the FQDN. Now I understand, thanks for the explanation.

Meanwhile, I have pbsserver.domain on only one line in /etc/hosts.
When I mentioned that:

I had in mind that this name registers all addresses from all interfaces when the comm daemon starts. That’s why we use PBS_LEAF_NAME:

boboshaq · July 9, 2021, 10:22am

Hi,
As I mentioned before, I reinstalled the three machines and set them up as one head node and two compute nodes. History repeated itself: the nodes are in the state-unknown,down state and the comm daemon complains that:

Leaf x.x.x.x:15003 still connected while another leaf connect arrived, dropping existing connection

What am I doing wrong? I should mention that I set up the same configuration on VirtualBox and it works properly there.

boboshaq · July 12, 2021, 10:21am

It seems that this error is not related to the main problem (state-unknown,down), because it appears on the virtual machines too, and there the node states are free.

Is it possible that the communication problems between the server and the nodes are associated with bonded interfaces? Maybe TPP fails with bonded interfaces? Does anyone have any experience with OpenPBS and bonded interfaces?

pbs_comm tries to register a new connection every 15 minutes. Is this the specified behaviour for TPP?

boboshaq · July 13, 2021, 11:49am

Hi,

I have configured bonded interfaces with MTU=9000. Could this be a problem for TPP?

Regards!

boboshaq · July 15, 2021, 9:20am

Hi,
I resolved this issue. The problem was in the switch configuration and concerns the MTU. When you set MTU 9000 on the interfaces, you also need to set the maximum accepted MTU on the switches.
Regards!
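A minimal sketch of how the MTU mismatch described above could be checked from a node, assuming a Linux host with the iputils `ping` (where `-M do` forbids fragmentation) and IPv4. The helper names and the host name in the usage note are illustrative only.

```python
import subprocess

IP_HEADER = 20    # bytes, IPv4 header without options
ICMP_HEADER = 8   # bytes, ICMP echo header


def icmp_payload_for_mtu(mtu):
    """Largest ICMP echo payload that fits in one unfragmented IPv4 packet."""
    return mtu - IP_HEADER - ICMP_HEADER


def jumbo_ping_command(host, mtu=9000, count=3):
    """Build a ping command that sends full-size, do-not-fragment packets."""
    return ["ping", "-M", "do", "-c", str(count),
            "-s", str(icmp_payload_for_mtu(mtu)), host]


def path_carries_mtu(host, mtu=9000):
    """Return True if full-size packets reach the host and are answered."""
    result = subprocess.run(jumbo_ping_command(host, mtu),
                            capture_output=True, text=True)
    return result.returncode == 0
```

For example, `path_carries_mtu("pbsnode1")` run on the server would be expected to return False while the switches still reject 9000-byte frames, even though an ordinary small ping succeeds. A switch that does not accept jumbo frames typically drops them silently, so small packets pass while full-size ones do not.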
