State-unknown,down

Has anyone properly installed OpenPBS 20 on CentOS 8 Stream?
My nodes are still in the unknown,down state.

Logs from the server comm daemon:

Logs from the MoM comm daemon:

And logs with the log level increased to 2025,
from the server comm daemon:

After that, tfd=20 can register with the server:
07/06/2021 12:21:57;0c06;Comm@pbssrv;TPP;Comm@pbssrv(Thread 3);tfd=20, Leaf registered address 192.168.1.1:15001

And next:

tfd=22 tries to register the same address that tfd=18 has already registered:

tfd=18 drops the connection and registers it again immediately.

Then tfd=22 registers address 192.168.1.2:15003, which is already registered via tfd=16.
tfd=16 drops the connection and registers it again immediately:

What is going on?

You’ve got some really messed up networking on the MoMs. They are trying to register the same leaf addresses to pbs_comm more than once. Either they register the address of the other, or they are both connecting twice to pbs_comm thinking that they must connect to two pbs_comm daemons that are, in fact, identical.
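One way to spot this in a pbs_comm log is to group the "Leaf registered address" lines by address and see which addresses were claimed by more than one connection. The following is only a minimal sketch: the regular expression is modelled on the single log line quoted in this thread, the function name `duplicate_leaf_registrations` is hypothetical, and the second and third sample lines are made up in the same format for illustration.

```python
import re
from collections import defaultdict

# Matches the tail of a pbs_comm log line such as:
#   ...;tfd=20, Leaf registered address 192.168.1.1:15001
LEAF_RE = re.compile(r"tfd=(\d+), Leaf registered address (\S+)")


def duplicate_leaf_registrations(log_lines):
    """Return {address: [tfd, ...]} for addresses registered by more than one tfd."""
    seen = defaultdict(list)
    for line in log_lines:
        match = LEAF_RE.search(line)
        if not match:
            continue
        tfd, address = int(match.group(1)), match.group(2)
        if tfd not in seen[address]:
            seen[address].append(tfd)
    return {addr: tfds for addr, tfds in seen.items() if len(tfds) > 1}


if __name__ == "__main__":
    sample = [
        # Line taken from the thread.
        "07/06/2021 12:21:57;0c06;Comm@pbssrv;TPP;Comm@pbssrv(Thread 3);"
        "tfd=20, Leaf registered address 192.168.1.1:15001",
        # Hypothetical lines in the same format, for illustration only.
        "07/06/2021 12:22:01;0c06;Comm@pbssrv;TPP;Comm@pbssrv(Thread 3);"
        "tfd=16, Leaf registered address 192.168.1.2:15003",
        "07/06/2021 12:22:05;0c06;Comm@pbssrv;TPP;Comm@pbssrv(Thread 3);"
        "tfd=22, Leaf registered address 192.168.1.2:15003",
    ]
    print(duplicate_leaf_registrations(sample))
    # {'192.168.1.2:15003': [16, 22]}
```

Any address that shows up in the result is being claimed over two different connections, which is the situation described above.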

Perhaps posting the output of ifconfig on both MoMs as well as their /etc/pbs.conf files would come in handy.

Your logs posted are all over the place – you had several different issues and you just throw everything onto one heap. That isn’t going to make people understand what the last problem you had was all about. In other cases, you simply elide vital information (like the list of addresses that MoM registers to pbs_comm when it starts up).

By the way: the server caches hostname-IP address pairs once it’s up. If you’re using DHCP and the nodes can swap addresses things aren’t going to work – OpenPBS relies on stable IP/name resolution and reverse resolution.

pbs_comm:192.168.1.11:17001: Dest not found

means that the server actually did not register that address to pbs_comm.

But from your messages that appears to have been a transient issue.

07/06/2021 12:38:04;0001;pbs_mom;Svr;pbs_mom;im_eof, Premature end of message from addr 192.168.1.11:15001 on stream 19
07/06/2021 12:38:04;0001;pbs_mom;Svr;net_down_handler;net down handler called

happens because no two TCP connections from MoMs can register the same address. In that case the address is stripped as a client and both connections are dropped, in an attempt to make the “real” owner of the address reconnect and register it after two seconds.

Note: if you have Docker or VMware running on both hosts and they have interfaces with identical addresses on both MoMs, then you really need to use PBS_LEAF_NAME to tell the different hosts what unique addresses to register with pbs_comm.

As for “Dest not found”: either the server wasn’t up yet (MoMs are trying to set up communication with the server but it’s not up) or there’s a problem. If the server wasn’t up yet it’s not a problem, since the server will try to ping all nodes when it comes up.
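To check for identical addresses on several hosts, you can collect the interface addresses of each host (for example from the ifconfig output) and look for overlaps. This is a rough sketch only: the function name `shared_addresses` and the sample data are hypothetical, and 172.17.0.1 merely stands in for a bridge address that a container runtime might create on every host. Leave loopback addresses out of the input, since every host has them.

```python
from collections import defaultdict


def shared_addresses(host_addresses):
    """Map each address found on more than one host to the sorted list of those hosts."""
    owners = defaultdict(set)
    for host, addresses in host_addresses.items():
        for address in addresses:
            owners[address].add(host)
    return {addr: sorted(hosts) for addr, hosts in owners.items() if len(hosts) > 1}


if __name__ == "__main__":
    # Hypothetical data: one unique LAN address per node plus an identical
    # bridge address on both nodes.
    inventory = {
        "n1": ["192.168.1.2", "172.17.0.1"],
        "n2": ["192.168.1.3", "172.17.0.1"],
    }
    print(shared_addresses(inventory))
    # {'172.17.0.1': ['n1', 'n2']}
```

If the result is not empty, those are the addresses that several MoMs could try to register, and PBS_LEAF_NAME on each host should point to a name that resolves to one of the unique addresses instead.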

This has nothing to do with CentOS Stream 8.


Thanks for the analysis of the problem.
I can’t provide more details about it because I have reinstalled those three machines with another OS.

I don’t know why. There was one pbs_comm on one PBS server with two different addresses on two interfaces. That’s why I use PBS_LEAF_NAME on the server and the nodes. The nodes have two interfaces too, with two different addresses.

That’s because a post is limited in the number of lines, so I had to split it.

OK, next time I will provide the log starting from the “open log” statement.

I do not use DHCP. All addresses are static.

That is because the server was restarted at that time.

Weird. Each machine has two interfaces with different addresses. Those addresses are static.
The /etc/hosts files look like this:
192.168.1.1 pbsserver.domain pbsserver pbsrv
192.168.1.2 pbsnode1.domain pbsnode1 n1
192.168.1.3 pbsnode2.domain pbsnode1 n2

But pbsserver.domain has two different addresses on two interfaces, so I use PBS_LEAF_NAME=pbsrv on the server (and PBS_LEAF_NAME=n1 on the first node, etc.) so that communication uses only one of them. Is this OK, or should I remove the pbsserver.domain entry? I think it is OK.
In pbs.conf, PBS_SERVER=pbsserver.

OK, it is a waste of time to analyze this. As I said, I reinstalled those machines and am setting up OpenPBS from the beginning. If I run into the same problems, I will post more details about the configuration here.

Regards!

You cannot have the canonical name be listed on two different addresses. That leads to ambiguity when PBSPro uses the canonical name to resolve it back to an address.

Likewise, the server expects PBS_SERVER to resolve only to one address.

I know that. pbsserver.domain is the FQDN, not the canonical name.

The canonicalised name is the name obtained when you resolve a name to an address and reverse resolve that address to the first name supplied by the reverse resolver.

That canonicalised name must not resolve to different IP addresses (i.e. it must appear on only one line in /etc/hosts).
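Both points can be checked with a few lines of Python. This is a sketch under assumptions, not how PBS itself does it: `canonical_name` simply mirrors the forward-then-reverse lookup described above using the standard `socket` module, and `names_on_multiple_addresses` scans hosts-file text for names listed on more than one address. Both function names are hypothetical, and the second address 10.0.0.1 and the name pbsserver-ib in the sample are invented for illustration.

```python
import socket
from collections import defaultdict


def canonical_name(name):
    """Resolve name to an address, then return the first name the reverse lookup gives."""
    address = socket.gethostbyname(name)
    return socket.gethostbyaddr(address)[0]


def names_on_multiple_addresses(hosts_text):
    """Return {name: [addresses]} for names that appear on more than one address."""
    owners = defaultdict(set)
    for raw_line in hosts_text.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        address, *names = line.split()
        for name in names:
            owners[name].add(address)
    return {n: sorted(a) for n, a in owners.items() if len(a) > 1}


if __name__ == "__main__":
    # Hypothetical hosts file in which the FQDN is listed on two addresses.
    sample = """\
192.168.1.1 pbsserver.domain pbsserver pbsrv
10.0.0.1    pbsserver.domain pbsserver-ib
192.168.1.2 pbsnode1.domain pbsnode1 n1
"""
    print(names_on_multiple_addresses(sample))
    # {'pbsserver.domain': ['10.0.0.1', '192.168.1.1']}
```

On a real host, `canonical_name("pbsserver")` shows which name the resolver treats as canonical; that name should then not appear in the output of the second function.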

Ahh, OK, so the FQDN is the canonical name, because when I query the short names it always returns the FQDN. Now I understand, thanks for the explanation.

Meanwhile, I have pbsserver.domain on only one line in /etc/hosts.
When I mentioned that:

I had in mind that this name registers all addresses from all interfaces when the comm daemon starts; that’s why we use PBS_LEAF_NAME:

Hi,
As I mentioned before, I reinstalled the three machines and set them up as one head node and two nodes. History repeated itself: the nodes are in the unknown,down state and the comm daemon complains:

Leaf x.x.x.x:15003 still connected while another leaf connect arrived, dropping existing connection

What am I doing wrong? I should mention that I used the same configuration on VirtualBox and it works properly.

It seems that this error is not related to the main problem (state unknown,down), because it also occurs on the virtual machines, and there the nodes are in the free state.

Is it possible that the communication problems between the server and the nodes are associated with bonded interfaces? Maybe TPP fails with bonded interfaces? Does anyone have any experience with OpenPBS and bonded interfaces?

pbs_comm tries to register a new connection every 15 minutes. Is this the specified behaviour for TPP?

Hi,

I have configured the bonded interfaces with MTU=9000. Could this be a problem for TPP?

Regards!

Hi,
I resolved this issue. The problem was in the switch configuration and concerns the MTU. When you set MTU 9000 on the interfaces, you also need to set the maximum accepted MTU on the switches.
Regards!
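For anyone hitting the same thing, a quick way to test whether jumbo frames really pass between two hosts is a ping with fragmentation prohibited and a payload sized to fill the MTU. The sketch below only builds that command; it assumes the Linux iputils `ping` (which provides `-M do`), IPv4 without IP options, and uses the hypothetical helper names `ping_payload_for_mtu` and `jumbo_ping_command`.

```python
IPV4_HEADER_BYTES = 20  # IPv4 header without options
ICMP_HEADER_BYTES = 8   # ICMP echo header


def ping_payload_for_mtu(mtu):
    """Largest ICMP echo payload that fits in one unfragmented IPv4 packet."""
    return mtu - IPV4_HEADER_BYTES - ICMP_HEADER_BYTES


def jumbo_ping_command(host, mtu=9000, count=3):
    """Build an iputils ping command that forbids fragmentation (-M do)."""
    payload = ping_payload_for_mtu(mtu)
    return ["ping", "-M", "do", "-c", str(count), "-s", str(payload), host]


if __name__ == "__main__":
    # pbsnode1 is one of the host names used earlier in this thread.
    print(" ".join(jumbo_ping_command("pbsnode1")))
    # ping -M do -c 3 -s 8972 pbsnode1
```

If this ping fails while a default-size ping works, something on the path (such as a switch port) is not accepting frames of that size.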
