Pbs_mom adding every IP address as authorized

I have set up Ansible to install and configure OpenPBS for me, but when I used it recently I got an error where the PBS node does not start.
When I check the logs, it is adding what looks to be every possible IPv4 address as authorized.

06/02/2022 17:08:38;0002;pbs_mom;Svr;pbs_mom;Adding IP address 0.14.6.148 as authorized
06/02/2022 17:08:38;0002;pbs_mom;Svr;pbs_mom;Adding IP address 0.14.6.149 as authorized
06/02/2022 17:08:38;0002;pbs_mom;Svr;pbs_mom;Adding IP address 0.14.6.150 as authorized
06/02/2022 17:08:38;0002;pbs_mom;Svr;pbs_mom;Adding IP address 0.14.6.151 as authorized
06/02/2022 17:08:38;0002;pbs_mom;Svr;pbs_mom;Adding IP address 0.14.6.152 as authorized
06/02/2022 17:08:38;0002;pbs_mom;Svr;pbs_mom;Adding IP address 0.14.6.153 as authorized

Looking earlier in the logs, I can see that it does start talking to the scheduler (lsr-pbssched01):
06/02/2022 15:59:51;0002;pbs_mom;Svr;Log;Log opened
06/02/2022 15:59:51;0002;pbs_mom;Svr;pbs_mom;pbs_version=20.0.0
06/02/2022 15:59:51;0002;pbs_mom;Svr;pbs_mom;pbs_build=mach=N/A:security=N/A:configure_args=N/A
06/02/2022 15:59:51;0002;pbs_mom;Svr;pbs_mom;hostname=lsr-hpc16;pbs_leaf_name=N/A;pbs_mom_node_name=N/A
06/02/2022 15:59:51;0002;pbs_mom;Svr;pbs_mom;ipv4 interface lo: localhost
06/02/2022 15:59:51;0002;pbs_mom;Svr;pbs_mom;ipv4 interface bond0: lsr-hpc16.usc.internal
06/02/2022 15:59:51;0002;pbs_mom;Svr;pbs_mom;ipv6 interface lo: ip6-loopback
06/02/2022 15:59:51;0002;pbs_mom;Svr;pbs_mom;ipv6 interface bond0: lsr-hpc16
06/02/2022 15:59:51;0100;pbs_mom;Svr;parse_config;file config
06/02/2022 15:59:51;0002;pbs_mom;n/a;set_restrict_user_maxsys;setting 999
06/02/2022 15:59:51;0002;pbs_mom;n/a;read_config;max_check_poll = 120, min_check_poll = 10
06/02/2022 15:59:51;0002;pbs_mom;Svr;pbs_mom;Adding IP address 127.0.0.1 as authorized
06/02/2022 15:59:51;0002;pbs_mom;Svr;pbs_mom;Adding IP address 127.0.1.1 as authorized
06/02/2022 15:59:51;0002;pbs_mom;Svr;pbs_mom;Adding IP address 172.23.20.186 as authorized
06/02/2022 15:59:51;0002;pbs_mom;Svr;pbs_mom;Adding IP address 172.23.20.59 as authorized
06/02/2022 15:59:51;0002;pbs_mom;Svr;set_checkpoint_path;Using default checkpoint path.
06/02/2022 15:59:51;0002;pbs_mom;Svr;set_checkpoint_path;Setting checkpoint path to /var/spool/pbs/checkpoint/
06/02/2022 15:59:51;0400;pbs_mom;Hook;resourcedef;hooks_rescdef_checksum(/var/spool/pbs/mom_priv/hooks/resourcedef)=0
06/02/2022 15:59:52;0002;pbs_mom;n/a;ncpus;hyperthreading enabled
06/02/2022 15:59:52;0800;pbs_mom;Node;mom_topology;allocated log buffer, len 132129
06/02/2022 15:59:52;0800;pbs_mom;Node;mom_topology;topology exported
06/02/2022 15:59:52;0800;pbs_mom;Node;mom_topology;attribute 'topology_info = hwloc<?xml version="1.0" encoding="UTF-8"?>
06/02/2022 15:59:52;0800;pbs_mom;Node;mom_topology;resource 'resources_available.mem = 1056838624kb' added
06/02/2022 15:59:52;0800;pbs_mom;Node;mom_topology;resource 'resources_available.ncpus = 64' added
06/02/2022 15:59:52;0002;pbs_mom;n/a;initialize;pcpus=64, OS reports 64 cpu(s)
06/02/2022 15:59:52;0d80;pbs_mom;TPP;pbs_mom(Main Thread);TPP authentication method = resvport
06/02/2022 15:59:52;0c06;pbs_mom;TPP;pbs_mom(Main Thread);TPP leaf node names = 172.23.20.186:15003,127.0.0.1:15003,172.23.20.186:15003
06/02/2022 15:59:52;0d80;pbs_mom;TPP;pbs_mom(Main Thread);Initializing TPP transport Layer
06/02/2022 15:59:52;0d80;pbs_mom;TPP;pbs_mom(Main Thread);Max files allowed = 16384
06/02/2022 15:59:52;0d80;pbs_mom;TPP;pbs_mom(Main Thread);TPP initialization done
06/02/2022 15:59:52;0d80;pbs_mom;TPP;pbs_mom(Main Thread);Connecting to pbs_comm lsr-pbssched01:17001
06/02/2022 15:59:52;0c06;pbs_mom;TPP;pbs_mom(Thread 0);Thread ready
06/02/2022 15:59:52;0006;pbs_mom;Fil;pbs_mom;Version 20.0.0, started, initialization type = 0
06/02/2022 15:59:52;0002;pbs_mom;Svr;pbs_mom;Mom pid = 13453 ready, using ports Server:15001 MOM:15002 RM:15003
06/02/2022 15:59:52;0c06;pbs_mom;TPP;pbs_mom(Thread 0);Registering address 172.23.20.186:15003 to pbs_comm lsr-pbssched01:17001
06/02/2022 15:59:52;0c06;pbs_mom;TPP;pbs_mom(Thread 0);Connected to pbs_comm lsr-pbssched01:17001
06/02/2022 15:59:52;0800;pbs_mom;n/a;uname2release;uname release: 5.4.0-113-generic
06/02/2022 15:59:52;0800;pbs_mom;n/a;mom_get_sample;nprocs: 743, cantstat: 1, nomem: 0, skipped: 663, cached: 0
06/02/2022 15:59:52;0001;pbs_mom;Svr;net_restore_handler;net restore handler called
06/02/2022 15:59:54;0c00;pbs_mom;TPP;pbs_mom(Main Thread);tpp_send;*** sd=0, compr_len=26, len=26, dest_sd=4294967295
06/02/2022 15:59:54;0002;pbs_mom;Svr;pbs_mom;HELLO sent to server at lsr-pbssched01:15001, stream:0
06/02/2022 15:59:54;0400;pbs_mom;Svr;pbs_mom;Received request: 8
06/02/2022 15:59:54;0002;pbs_mom;Svr;pbs_mom;Adding IP address 0.0.0.2 as authorized
06/02/2022 15:59:54;0002;pbs_mom;Svr;pbs_mom;Adding IP address 0.0.0.3 as authorized
06/02/2022 15:59:54;0002;pbs_mom;Svr;pbs_mom;Adding IP address 0.0.0.4 as authorized

But looking in the server log I see the following (my node is 172.23.20.186, lsr-hpc16):
06/02/2022 15:59:51;0001;Server@lsr-pbssched01;Svr;Server@lsr-pbssched01;stream_eof, 172.23.20.186 down
06/02/2022 15:59:51;0002;Server@lsr-pbssched01;Node;172.23.20.186;node down: communication closed
06/02/2022 15:59:51;0100;Server@lsr-pbssched01;Node;172.23.20.186;set_all_state;txt=node down: communication closed mi_modtime=0
06/02/2022 15:59:51;0100;Server@lsr-pbssched01;Node;lsr-hpc16;set_vnode_state;vnode.state=0x102 vnode_o.state=0x102 vnode.last_state_change_time=1654149591 vnode_o.last_state_change_time=1654145570 state_bits=0x2 state_bit_op_type_str=Nd_State_Or state_bit_op_type_enum=1
06/02/2022 15:59:54;0002;Server@lsr-pbssched01;Node;172.23.20.186;Hello from MoM on port=15002
06/02/2022 15:59:57;0d80;Server@lsr-pbssched01;TPP;Server@lsr-pbssched01(Thread 0);Received UPDATE from pbs_comm
06/02/2022 15:59:57;0001;Server@lsr-pbssched01;Svr;net_restore_handler;net restore handler called

The pbs.conf file on the node is:
cat /etc/pbs.conf
#
PBS_SERVER=lsr-pbssched01
PBS_START_SERVER=0
PBS_START_SCHED=0
PBS_START_COMM=1
PBS_START_MOM=1
PBS_EXEC=/opt/pbs
PBS_HOME=/var/spool/pbs
PBS_CORE_LIMIT=unlimited
PBS_SCP=/usr/bin/scp

The other nodes are all working fine and the Ansible playbook used to work fine. I suspect I am missing something basic but can't put my finger on it. Any help will be appreciated.

  1. PBS_START_COMM=0
    You do not need to run PBS_COMM on the compute nodes. Please set it to 0 and restart the pbs services on the compute node.
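As a rough illustration of that check, here is a minimal Python sketch that parses pbs.conf-style KEY=VALUE text and flags a host that starts pbs_comm without starting the server. The helper names (`parse_pbs_conf`, `compute_node_warnings`) are made up for this example and are not part of PBS; the sample text is the pbs.conf posted above.

```python
def parse_pbs_conf(text):
    """Parse KEY=VALUE lines, skipping blank lines and '#' comments."""
    conf = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep:
            conf[key.strip()] = value.strip()
    return conf


def compute_node_warnings(conf):
    """Return warnings for settings a MoM-only host does not need (hypothetical check)."""
    warnings = []
    if conf.get("PBS_START_SERVER") == "0" and conf.get("PBS_START_COMM") == "1":
        warnings.append(
            "PBS_START_COMM=1 on a host that does not run the server; "
            "pbs_comm is not needed on a compute node"
        )
    return warnings


# The pbs.conf from the original post.
SAMPLE = """#
PBS_SERVER=lsr-pbssched01
PBS_START_SERVER=0
PBS_START_SCHED=0
PBS_START_COMM=1
PBS_START_MOM=1
PBS_EXEC=/opt/pbs
PBS_HOME=/var/spool/pbs
PBS_CORE_LIMIT=unlimited
PBS_SCP=/usr/bin/scp
"""

conf = parse_pbs_conf(SAMPLE)
warnings = compute_node_warnings(conf)
for message in warnings:
    print(message)
```

On a real node the text would come from /etc/pbs.conf rather than the embedded sample.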

The "Adding IP address ... as authorized" messages are not an issue in themselves; pbs_mom is recognising all the active interfaces on that compute node:
# ip addr | grep "state UP"
For example, on a compute node with one active network interface:
06/02/2022 09:52:36;0002;pbs_mom;Svr;pbs_mom;Adding IP address 192.168.1.100 as authorized
06/02/2022 09:52:36;0002;pbs_mom;Svr;pbs_mom;Adding IP address 127.0.0.1 as authorized

Thanks for getting back to me.
That list of authorised IP addresses was just an excerpt. It looks to be literally adding every single IPv4 address from 0.0.0.2 to 0.255.255.255.
The service does not start while it is iterating through every IP address.
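To put that range in perspective, here is a small Python sketch that counts the addresses between 0.0.0.2 and 0.255.255.255 and estimates how far through the range a given log entry is. It assumes the endpoints are exactly as described above and that the addresses are added in ascending order, which is what the excerpts suggest; `progress` is a helper invented for this example.

```python
import ipaddress

# Endpoints as described in the post (an assumption about the full range).
first = ipaddress.IPv4Address("0.0.0.2")
last = ipaddress.IPv4Address("0.255.255.255")

# Number of addresses in the inclusive range.
total = int(last) - int(first) + 1


def progress(addr):
    """Return (addresses added so far, fraction of the range) for a logged address."""
    done = int(ipaddress.IPv4Address(addr)) - int(first) + 1
    return done, done / total


print(total)  # 16777214

# Addresses taken from the log excerpts in this thread.
for addr in ("0.5.58.34", "0.7.28.40", "0.14.6.148"):
    done, fraction = progress(addr)
    print(f"{addr}: {done} entries added, {fraction:.1%} of the range")
```

With nearly 16.8 million entries to log, it is not surprising that the MoM appears never to finish starting.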

I have set PBS_START_COMM=0 and restarted the PBS service.
It is still iterating through every IP address.

It adds the addresses from the local machine during its initialisation:
06/02/2022 15:59:51;0002;pbs_mom;Svr;pbs_mom;Adding IP address 127.0.0.1 as authorized
06/02/2022 15:59:51;0002;pbs_mom;Svr;pbs_mom;Adding IP address 127.0.1.1 as authorized
06/02/2022 15:59:51;0002;pbs_mom;Svr;pbs_mom;Adding IP address 172.23.20.186 as authorized
06/02/2022 15:59:51;0002;pbs_mom;Svr;pbs_mom;Adding IP address 172.23.20.59 as authorized

but it then gets a message
06/02/2022 15:59:54;0002;pbs_mom;Svr;pbs_mom;HELLO sent to server at lsr-pbssched01:15001, stream:0
06/02/2022 15:59:54;0400;pbs_mom;Svr;pbs_mom;Received request: 8

which then makes it start to iterate through every IP address, excerpt below
06/03/2022 07:42:03;0002;pbs_mom;Svr;pbs_mom;Adding IP address 0.7.28.29 as authorized
06/03/2022 07:42:03;0002;pbs_mom;Svr;pbs_mom;Adding IP address 0.7.28.30 as authorized
06/03/2022 07:42:03;0002;pbs_mom;Svr;pbs_mom;Adding IP address 0.7.28.31 as authorized
06/03/2022 07:42:03;0002;pbs_mom;Svr;pbs_mom;Adding IP address 0.7.28.32 as authorized
06/03/2022 07:42:03;0002;pbs_mom;Svr;pbs_mom;Adding IP address 0.7.28.33 as authorized
06/03/2022 07:42:03;0002;pbs_mom;Svr;pbs_mom;Adding IP address 0.7.28.34 as authorized
06/03/2022 07:42:03;0002;pbs_mom;Svr;pbs_mom;Adding IP address 0.7.28.35 as authorized
06/03/2022 07:42:03;0002;pbs_mom;Svr;pbs_mom;Adding IP address 0.7.28.36 as authorized
06/03/2022 07:42:03;0002;pbs_mom;Svr;pbs_mom;Adding IP address 0.7.28.37 as authorized
06/03/2022 07:42:03;0002;pbs_mom;Svr;pbs_mom;Adding IP address 0.7.28.38 as authorized
06/03/2022 07:42:03;0002;pbs_mom;Svr;pbs_mom;Adding IP address 0.7.28.39 as authorized
06/03/2022 07:42:03;0002;pbs_mom;Svr;pbs_mom;Adding IP address 0.7.28.40 as authorized

here is the output of my ip addr command

 ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: ens160: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether 00:50:56:99:04:62 brd ff:ff:ff:ff:ff:ff
    inet 172.23.20.51/24 brd 172.23.20.255 scope global ens160
       valid_lft forever preferred_lft forever
    inet6 fe80::250:56ff:fe99:462/64 scope link
       valid_lft forever preferred_lft forever

I have set PBS_START_COMM=0 and restarted the service; there is still no change, and the node keeps adding every IP address as authorized.
This is the output from ip addr:

1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: eno1: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN group default qlen 1000
    link/ether 90:b1:1c:42:67:ec brd ff:ff:ff:ff:ff:ff
3: eno2: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN group default qlen 1000
    link/ether 90:b1:1c:42:67:ee brd ff:ff:ff:ff:ff:ff
4: eno3: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN group default qlen 1000
    link/ether 90:b1:1c:42:67:f0 brd ff:ff:ff:ff:ff:ff
5: eno4: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN group default qlen 1000
    link/ether 90:b1:1c:42:67:f2 brd ff:ff:ff:ff:ff:ff
6: enp70s0f0: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 qdisc mq master bond0 state UP group default qlen 1000
    link/ether 3e:a3:3f:a9:ed:8d brd ff:ff:ff:ff:ff:ff
7: enp70s0f1: <NO-CARRIER,BROADCAST,MULTICAST,SLAVE,UP> mtu 1500 qdisc mq master bond0 state DOWN group default qlen 1000
    link/ether 3e:a3:3f:a9:ed:8d brd ff:ff:ff:ff:ff:ff
8: bond0: <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether 3e:a3:3f:a9:ed:8d brd ff:ff:ff:ff:ff:ff
    inet 172.23.20.186/24 brd 172.23.20.255 scope global bond0
       valid_lft forever preferred_lft forever
    inet6 fe80::3ca3:3fff:fea9:ed8d/64 scope link
       valid_lft forever preferred_lft forever

The service is still adding every IP address; excerpt below:

06/03/2022 07:48:03;0002;pbs_mom;Svr;pbs_mom;Adding IP address 0.5.58.0 as authorized
06/03/2022 07:48:03;0002;pbs_mom;Svr;pbs_mom;Adding IP address 0.5.58.1 as authorized
06/03/2022 07:48:03;0002;pbs_mom;Svr;pbs_mom;Adding IP address 0.5.58.2 as authorized
06/03/2022 07:48:03;0002;pbs_mom;Svr;pbs_mom;Adding IP address 0.5.58.3 as authorized
06/03/2022 07:48:03;0002;pbs_mom;Svr;pbs_mom;Adding IP address 0.5.58.4 as authorized
06/03/2022 07:48:03;0002;pbs_mom;Svr;pbs_mom;Adding IP address 0.5.58.5 as authorized
06/03/2022 07:48:03;0002;pbs_mom;Svr;pbs_mom;Adding IP address 0.5.58.6 as authorized
06/03/2022 07:48:03;0002;pbs_mom;Svr;pbs_mom;Adding IP address 0.5.58.7 as authorized
06/03/2022 07:48:03;0002;pbs_mom;Svr;pbs_mom;Adding IP address 0.5.58.8 as authorized
06/03/2022 07:48:03;0002;pbs_mom;Svr;pbs_mom;Adding IP address 0.5.58.9 as authorized
06/03/2022 07:48:03;0002;pbs_mom;Svr;pbs_mom;Adding IP address 0.5.58.10 as authorized
06/03/2022 07:48:03;0002;pbs_mom;Svr;pbs_mom;Adding IP address 0.5.58.11 as authorized
06/03/2022 07:48:03;0002;pbs_mom;Svr;pbs_mom;Adding IP address 0.5.58.12 as authorized
06/03/2022 07:48:03;0002;pbs_mom;Svr;pbs_mom;Adding IP address 0.5.58.13 as authorized
06/03/2022 07:48:03;0002;pbs_mom;Svr;pbs_mom;Adding IP address 0.5.58.14 as authorized
06/03/2022 07:48:03;0002;pbs_mom;Svr;pbs_mom;Adding IP address 0.5.58.15 as authorized
06/03/2022 07:48:03;0002;pbs_mom;Svr;pbs_mom;Adding IP address 0.5.58.16 as authorized
06/03/2022 07:48:03;0002;pbs_mom;Svr;pbs_mom;Adding IP address 0.5.58.17 as authorized
06/03/2022 07:48:03;0002;pbs_mom;Svr;pbs_mom;Adding IP address 0.5.58.18 as authorized
06/03/2022 07:48:03;0002;pbs_mom;Svr;pbs_mom;Adding IP address 0.5.58.19 as authorized
06/03/2022 07:48:03;0002;pbs_mom;Svr;pbs_mom;Adding IP address 0.5.58.20 as authorized
06/03/2022 07:48:03;0002;pbs_mom;Svr;pbs_mom;Adding IP address 0.5.58.21 as authorized
06/03/2022 07:48:03;0002;pbs_mom;Svr;pbs_mom;Adding IP address 0.5.58.22 as authorized
06/03/2022 07:48:03;0002;pbs_mom;Svr;pbs_mom;Adding IP address 0.5.58.23 as authorized
06/03/2022 07:48:03;0002;pbs_mom;Svr;pbs_mom;Adding IP address 0.5.58.24 as authorized
06/03/2022 07:48:03;0002;pbs_mom;Svr;pbs_mom;Adding IP address 0.5.58.25 as authorized
06/03/2022 07:48:03;0002;pbs_mom;Svr;pbs_mom;Adding IP address 0.5.58.26 as authorized
06/03/2022 07:48:03;0002;pbs_mom;Svr;pbs_mom;Adding IP address 0.5.58.27 as authorized
06/03/2022 07:48:03;0002;pbs_mom;Svr;pbs_mom;Adding IP address 0.5.58.28 as authorized
06/03/2022 07:48:03;0002;pbs_mom;Svr;pbs_mom;Adding IP address 0.5.58.29 as authorized
06/03/2022 07:48:03;0002;pbs_mom;Svr;pbs_mom;Adding IP address 0.5.58.30 as authorized
06/03/2022 07:48:03;0002;pbs_mom;Svr;pbs_mom;Adding IP address 0.5.58.31 as authorized
06/03/2022 07:48:03;0002;pbs_mom;Svr;pbs_mom;Adding IP address 0.5.58.32 as authorized
06/03/2022 07:48:03;0002;pbs_mom;Svr;pbs_mom;Adding IP address 0.5.58.33 as authorized
06/03/2022 07:48:03;0002;pbs_mom;Svr;pbs_mom;Adding IP address 0.5.58.34 as authorized

This doesn’t help much, but that sequence of messages says that the server replied to the MoM’s HELLO with an IS_REPLYHELLO (8) message. One part of the IS_REPLYHELLO message is a list of cluster address ranges, which the MoM then adds to its list of authorized addresses. This suggests the MoM was fed bad data by the server.
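One way to see what ranges the MoM ended up with is to collapse the "Adding IP address" log lines back into contiguous ranges. The following is a minimal Python sketch of that idea, not anything from the PBS code base: `authorized_ranges` and `suspicious` are names invented here, and the sample lines are copied from the MoM log earlier in the thread. Flagging 0.0.0.0/8 is based on that block being reserved and never used for real hosts.

```python
import ipaddress
import re

ADDR_RE = re.compile(r"Adding IP address (\d+\.\d+\.\d+\.\d+) as authorized")
RESERVED = ipaddress.ip_network("0.0.0.0/8")  # "this network", not valid for hosts


def authorized_ranges(log_lines):
    """Collapse consecutive 'Adding IP address' entries into (first, last, count)."""
    ranges = []
    for line in log_lines:
        match = ADDR_RE.search(line)
        if not match:
            continue
        addr = ipaddress.IPv4Address(match.group(1))
        if ranges and int(addr) == int(ranges[-1][1]) + 1:
            ranges[-1][1] = addr  # extends the current run
        else:
            ranges.append([addr, addr])  # starts a new run
    return [(str(a), str(b), int(b) - int(a) + 1) for a, b in ranges]


def suspicious(ranges):
    """Return the ranges that start inside the reserved 0.0.0.0/8 block."""
    return [r for r in ranges if ipaddress.IPv4Address(r[0]) in RESERVED]


# Lines copied from the MoM log posted in this thread.
SAMPLE_LOG = [
    "06/02/2022 15:59:51;0002;pbs_mom;Svr;pbs_mom;Adding IP address 127.0.0.1 as authorized",
    "06/02/2022 15:59:51;0002;pbs_mom;Svr;pbs_mom;Adding IP address 127.0.1.1 as authorized",
    "06/02/2022 15:59:51;0002;pbs_mom;Svr;pbs_mom;Adding IP address 172.23.20.186 as authorized",
    "06/02/2022 15:59:51;0002;pbs_mom;Svr;pbs_mom;Adding IP address 172.23.20.59 as authorized",
    "06/02/2022 15:59:54;0400;pbs_mom;Svr;pbs_mom;Received request: 8",
    "06/02/2022 15:59:54;0002;pbs_mom;Svr;pbs_mom;Adding IP address 0.0.0.2 as authorized",
    "06/02/2022 15:59:54;0002;pbs_mom;Svr;pbs_mom;Adding IP address 0.0.0.3 as authorized",
    "06/02/2022 15:59:54;0002;pbs_mom;Svr;pbs_mom;Adding IP address 0.0.0.4 as authorized",
]

ranges = authorized_ranges(SAMPLE_LOG)
for first, last, count in ranges:
    print(f"{first} - {last} ({count} addresses)")
print("suspicious:", suspicious(ranges))
```

Run against the full MoM log, a single enormous range starting at 0.0.0.2 would point at one bad range entry in the reply rather than many separate ones.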

So, either the server and MoM are not running the same version of the protocol, or the server sent the MoM bogus information. Check out the network addresses on the server and check its startup messages for network info.

I think the IS_REPLYHELLO message might go through pbs_comm, so it might be involved also. Check its version and logs.

Thanks very much for the tip Dale, I will see if I can get anything sensible from a tcpdump and I will also investigate the server version.

So the tcpdump was a dry hole; there was nothing there that I could see that looked useful.
I spun up a new test scheduler, did a fresh install and had no issues adding the new node, so I suspect that there is a version issue.
All of the nodes and servers (working and not working) report pbs_version = 20.0.0 when I run sudo /opt/pbs/sbin/pbs_mom --version

Is it possible there has been a change in the MoM protocol since last year (when I built the server) and the version has not been incremented?
I tried building older versions from the PBS git repository (v20.0.0 and release_20_0_branch) but kept getting issues with the Python version: it wanted 3.5 and I have 3.8, which is the same on all the nodes (working and not working), so I think I built off roughly the same version.

So, barring any other suggestions, it seems that I need to rebuild my server?

To close the loop on this one: I ended up rebuilding the server and it is all fine now. Both the old and new nodes connect with no problems.
Thanks for trying to help.
