I have set up Ansible to install and configure OpenPBS for me, but when I used it recently I got an error where the PBS node does not start.
When I check the logs, the MoM is adding what looks to be every possible IPv4 address as authorized:
06/02/2022 17:08:38;0002;pbs_mom;Svr;pbs_mom;Adding IP address 0.14.6.148 as authorized
06/02/2022 17:08:38;0002;pbs_mom;Svr;pbs_mom;Adding IP address 0.14.6.149 as authorized
06/02/2022 17:08:38;0002;pbs_mom;Svr;pbs_mom;Adding IP address 0.14.6.150 as authorized
06/02/2022 17:08:38;0002;pbs_mom;Svr;pbs_mom;Adding IP address 0.14.6.151 as authorized
06/02/2022 17:08:38;0002;pbs_mom;Svr;pbs_mom;Adding IP address 0.14.6.152 as authorized
06/02/2022 17:08:38;0002;pbs_mom;Svr;pbs_mom;Adding IP address 0.14.6.153 as authorized
Looking at the logs earlier, I can see that it does start talking to the scheduler (lsr-pbssched01):
06/02/2022 15:59:51;0002;pbs_mom;Svr;Log;Log opened
06/02/2022 15:59:51;0002;pbs_mom;Svr;pbs_mom;pbs_version=20.0.0
06/02/2022 15:59:51;0002;pbs_mom;Svr;pbs_mom;pbs_build=mach=N/A:security=N/A:configure_args=N/A
06/02/2022 15:59:51;0002;pbs_mom;Svr;pbs_mom;hostname=lsr-hpc16;pbs_leaf_name=N/A;pbs_mom_node_name=N/A
06/02/2022 15:59:51;0002;pbs_mom;Svr;pbs_mom;ipv4 interface lo: localhost
06/02/2022 15:59:51;0002;pbs_mom;Svr;pbs_mom;ipv4 interface bond0: lsr-hpc16.usc.internal
06/02/2022 15:59:51;0002;pbs_mom;Svr;pbs_mom;ipv6 interface lo: ip6-loopback
06/02/2022 15:59:51;0002;pbs_mom;Svr;pbs_mom;ipv6 interface bond0: lsr-hpc16
06/02/2022 15:59:51;0100;pbs_mom;Svr;parse_config;file config
06/02/2022 15:59:51;0002;pbs_mom;n/a;set_restrict_user_maxsys;setting 999
06/02/2022 15:59:51;0002;pbs_mom;n/a;read_config;max_check_poll = 120, min_check_poll = 10
06/02/2022 15:59:51;0002;pbs_mom;Svr;pbs_mom;Adding IP address 127.0.0.1 as authorized
06/02/2022 15:59:51;0002;pbs_mom;Svr;pbs_mom;Adding IP address 127.0.1.1 as authorized
06/02/2022 15:59:51;0002;pbs_mom;Svr;pbs_mom;Adding IP address 172.23.20.186 as authorized
06/02/2022 15:59:51;0002;pbs_mom;Svr;pbs_mom;Adding IP address 172.23.20.59 as authorized
06/02/2022 15:59:51;0002;pbs_mom;Svr;set_checkpoint_path;Using default checkpoint path.
06/02/2022 15:59:51;0002;pbs_mom;Svr;set_checkpoint_path;Setting checkpoint path to /var/spool/pbs/checkpoint/
06/02/2022 15:59:51;0400;pbs_mom;Hook;resourcedef;hooks_rescdef_checksum(/var/spool/pbs/mom_priv/hooks/resourcedef)=0
06/02/2022 15:59:52;0002;pbs_mom;n/a;ncpus;hyperthreading enabled
06/02/2022 15:59:52;0800;pbs_mom;Node;mom_topology;allocated log buffer, len 132129
06/02/2022 15:59:52;0800;pbs_mom;Node;mom_topology;topology exported
06/02/2022 15:59:52;0800;pbs_mom;Node;mom_topology;attribute 'topology_info = hwloc<?xml version="1.0" encoding="UTF-8"?>
06/02/2022 15:59:52;0800;pbs_mom;Node;mom_topology;resource 'resources_available.mem = 1056838624kb' added
06/02/2022 15:59:52;0800;pbs_mom;Node;mom_topology;resource 'resources_available.ncpus = 64' added
06/02/2022 15:59:52;0002;pbs_mom;n/a;initialize;pcpus=64, OS reports 64 cpu(s)
06/02/2022 15:59:52;0d80;pbs_mom;TPP;pbs_mom(Main Thread);TPP authentication method = resvport
06/02/2022 15:59:52;0c06;pbs_mom;TPP;pbs_mom(Main Thread);TPP leaf node names = 172.23.20.186:15003,127.0.0.1:15003,172.23.20.186:15003
06/02/2022 15:59:52;0d80;pbs_mom;TPP;pbs_mom(Main Thread);Initializing TPP transport Layer
06/02/2022 15:59:52;0d80;pbs_mom;TPP;pbs_mom(Main Thread);Max files allowed = 16384
06/02/2022 15:59:52;0d80;pbs_mom;TPP;pbs_mom(Main Thread);TPP initialization done
06/02/2022 15:59:52;0d80;pbs_mom;TPP;pbs_mom(Main Thread);Connecting to pbs_comm lsr-pbssched01:17001
06/02/2022 15:59:52;0c06;pbs_mom;TPP;pbs_mom(Thread 0);Thread ready
06/02/2022 15:59:52;0006;pbs_mom;Fil;pbs_mom;Version 20.0.0, started, initialization type = 0
06/02/2022 15:59:52;0002;pbs_mom;Svr;pbs_mom;Mom pid = 13453 ready, using ports Server:15001 MOM:15002 RM:15003
06/02/2022 15:59:52;0c06;pbs_mom;TPP;pbs_mom(Thread 0);Registering address 172.23.20.186:15003 to pbs_comm lsr-pbssched01:17001
06/02/2022 15:59:52;0c06;pbs_mom;TPP;pbs_mom(Thread 0);Connected to pbs_comm lsr-pbssched01:17001
06/02/2022 15:59:52;0800;pbs_mom;n/a;uname2release;uname release: 5.4.0-113-generic
06/02/2022 15:59:52;0800;pbs_mom;n/a;mom_get_sample;nprocs: 743, cantstat: 1, nomem: 0, skipped: 663, cached: 0
06/02/2022 15:59:52;0001;pbs_mom;Svr;net_restore_handler;net restore handler called
06/02/2022 15:59:54;0c00;pbs_mom;TPP;pbs_mom(Main Thread);tpp_send;*** sd=0, compr_len=26, len=26, dest_sd=4294967295
06/02/2022 15:59:54;0002;pbs_mom;Svr;pbs_mom;HELLO sent to server at lsr-pbssched01:15001, stream:0
06/02/2022 15:59:54;0400;pbs_mom;Svr;pbs_mom;Received request: 8
06/02/2022 15:59:54;0002;pbs_mom;Svr;pbs_mom;Adding IP address 0.0.0.2 as authorized
06/02/2022 15:59:54;0002;pbs_mom;Svr;pbs_mom;Adding IP address 0.0.0.3 as authorized
06/02/2022 15:59:54;0002;pbs_mom;Svr;pbs_mom;Adding IP address 0.0.0.4 as authorized
But looking in the server log, I see the following (my node is 172.23.20.186, lsr-hpc16):
06/02/2022 15:59:51;0001;Server@lsr-pbssched01;Svr;Server@lsr-pbssched01;stream_eof, 172.23.20.186 down
06/02/2022 15:59:51;0002;Server@lsr-pbssched01;Node;172.23.20.186;node down: communication closed
06/02/2022 15:59:51;0100;Server@lsr-pbssched01;Node;172.23.20.186;set_all_state;txt=node down: communication closed mi_modtime=0
06/02/2022 15:59:51;0100;Server@lsr-pbssched01;Node;lsr-hpc16;set_vnode_state;vnode.state=0x102 vnode_o.state=0x102 vnode.last_state_change_time=1654149591 vnode_o.last_state_change_time=1654145570 state_bits=0x2 state_bit_op_type_str=Nd_State_Or state_bit_op_type_enum=1
06/02/2022 15:59:54;0002;Server@lsr-pbssched01;Node;172.23.20.186;Hello from MoM on port=15002
06/02/2022 15:59:57;0d80;Server@lsr-pbssched01;TPP;Server@lsr-pbssched01(Thread 0);Received UPDATE from pbs_comm
06/02/2022 15:59:57;0001;Server@lsr-pbssched01;Svr;net_restore_handler;net restore handler called
The pbs.conf file on the node is:
cat /etc/pbs.conf
#
PBS_SERVER=lsr-pbssched01
PBS_START_SERVER=0
PBS_START_SCHED=0
PBS_START_COMM=1
PBS_START_MOM=1
PBS_EXEC=/opt/pbs
PBS_HOME=/var/spool/pbs
PBS_CORE_LIMIT=unlimited
PBS_SCP=/usr/bin/scp
The other nodes are all working fine, and the Ansible playbook used to work fine. I suspect I am missing something basic but can't put my finger on it. Any help will be appreciated.
PBS_START_COMM=0
You do not need to run PBS_COMM on the compute nodes. Please set it to 0 and restart the pbs services on the compute node.
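If you want to check this setting across several nodes from your playbook, here is a minimal Python sketch. The function names and the rule itself (flagging a host that has `PBS_START_SERVER=0` but `PBS_START_COMM=1`) are my own illustration of the advice above, not something shipped with PBS, and treating a missing key as "0" is an assumption made only for this check.

```python
def parse_pbs_conf(text):
    """Return KEY -> value for every non-comment KEY=VALUE line."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blank lines, comment lines, and anything that is not KEY=VALUE.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        settings[key.strip()] = value.strip()
    return settings


def comm_enabled_on_compute_node(settings):
    """True when the host does not start the server but still starts pbs_comm.

    A missing key is treated as "0" here; that default is an assumption
    of this sketch.
    """
    return (settings.get("PBS_START_SERVER", "0") == "0"
            and settings.get("PBS_START_COMM", "0") == "1")


# Sample taken from the pbs.conf posted earlier in this thread.
SAMPLE = """\
#
PBS_SERVER=lsr-pbssched01
PBS_START_SERVER=0
PBS_START_SCHED=0
PBS_START_COMM=1
PBS_START_MOM=1
"""

if __name__ == "__main__":
    conf = parse_pbs_conf(SAMPLE)
    if comm_enabled_on_compute_node(conf):
        print("PBS_START_COMM=1 on a host that does not start the server")
```

On a real node you would pass the contents of /etc/pbs.conf instead of `SAMPLE`.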
This is not an issue; it is recognising all the active interfaces on that compute node: # ip addr | grep "state UP"
For example on the compute node with one active network interface:
06/02/2022 09:52:36;0002;pbs_mom;Svr;pbs_mom;Adding IP address 192.168.1.100 as authorized
06/02/2022 09:52:36;0002;pbs_mom;Svr;pbs_mom;Adding IP address 127.0.0.1 as authorized
Thanks for getting back to me,
That list of authorised IP addresses was just an excerpt. It looks to be literally adding every single IPv4 address from 0.0.0.2 to 0.255.255.255.
The service does not start while it is iterating through every IP address.
I have set PBS_START_COMM=0 and restarted the PBS service.
It is still iterating through every IP address.
It adds the addresses from the local machine during its initialisation:
06/02/2022 15:59:51;0002;pbs_mom;Svr;pbs_mom;Adding IP address 127.0.0.1 as authorized
06/02/2022 15:59:51;0002;pbs_mom;Svr;pbs_mom;Adding IP address 127.0.1.1 as authorized
06/02/2022 15:59:51;0002;pbs_mom;Svr;pbs_mom;Adding IP address 172.23.20.186 as authorized
06/02/2022 15:59:51;0002;pbs_mom;Svr;pbs_mom;Adding IP address 172.23.20.59 as authorized
But it then gets a message:
06/02/2022 15:59:54;0002;pbs_mom;Svr;pbs_mom;HELLO sent to server at lsr-pbssched01:15001, stream:0
06/02/2022 15:59:54;0400;pbs_mom;Svr;pbs_mom;Received request: 8
This then makes it start to iterate through every IP address; excerpt below:
06/03/2022 07:42:03;0002;pbs_mom;Svr;pbs_mom;Adding IP address 0.7.28.29 as authorized
06/03/2022 07:42:03;0002;pbs_mom;Svr;pbs_mom;Adding IP address 0.7.28.30 as authorized
06/03/2022 07:42:03;0002;pbs_mom;Svr;pbs_mom;Adding IP address 0.7.28.31 as authorized
06/03/2022 07:42:03;0002;pbs_mom;Svr;pbs_mom;Adding IP address 0.7.28.32 as authorized
06/03/2022 07:42:03;0002;pbs_mom;Svr;pbs_mom;Adding IP address 0.7.28.33 as authorized
06/03/2022 07:42:03;0002;pbs_mom;Svr;pbs_mom;Adding IP address 0.7.28.34 as authorized
06/03/2022 07:42:03;0002;pbs_mom;Svr;pbs_mom;Adding IP address 0.7.28.35 as authorized
06/03/2022 07:42:03;0002;pbs_mom;Svr;pbs_mom;Adding IP address 0.7.28.36 as authorized
06/03/2022 07:42:03;0002;pbs_mom;Svr;pbs_mom;Adding IP address 0.7.28.37 as authorized
06/03/2022 07:42:03;0002;pbs_mom;Svr;pbs_mom;Adding IP address 0.7.28.38 as authorized
06/03/2022 07:42:03;0002;pbs_mom;Svr;pbs_mom;Adding IP address 0.7.28.39 as authorized
06/03/2022 07:42:03;0002;pbs_mom;Svr;pbs_mom;Adding IP address 0.7.28.40 as authorized
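Rather than scrolling through millions of lines, the log can be collapsed into runs of consecutive addresses. This is a rough Python sketch; `authorized_runs` and `span_size` are names made up for this post, and the regular expression only assumes the "Adding IP address ... as authorized" wording seen in the excerpts above. For scale, 0.0.0.2 through 0.255.255.255 is 2^24 - 2 = 16,777,214 addresses, so roughly that many log lines.

```python
import ipaddress
import re

# Matches the message text seen in the MoM log excerpts in this thread.
ADDED = re.compile(r"Adding IP address (\d+\.\d+\.\d+\.\d+) as authorized")


def authorized_runs(lines):
    """Collapse logged addresses into (first, last, count) runs of consecutive IPs."""
    values = set()
    for line in lines:
        match = ADDED.search(line)
        if match:
            values.add(int(ipaddress.IPv4Address(match.group(1))))
    runs = []
    for value in sorted(values):
        if runs and value == runs[-1][1] + 1:
            runs[-1][1] = value          # extend the current run
        else:
            runs.append([value, value])  # start a new run
    return [(str(ipaddress.IPv4Address(a)), str(ipaddress.IPv4Address(b)), b - a + 1)
            for a, b in runs]


def span_size(first, last):
    """Number of addresses from first to last, inclusive."""
    return int(ipaddress.IPv4Address(last)) - int(ipaddress.IPv4Address(first)) + 1


# A few lines copied from the excerpts in this thread.
SAMPLE_LINES = [
    "06/02/2022 15:59:51;0002;pbs_mom;Svr;pbs_mom;Adding IP address 127.0.0.1 as authorized",
    "06/02/2022 15:59:51;0002;pbs_mom;Svr;pbs_mom;Adding IP address 127.0.1.1 as authorized",
    "06/02/2022 15:59:51;0002;pbs_mom;Svr;pbs_mom;Adding IP address 172.23.20.186 as authorized",
    "06/02/2022 15:59:51;0002;pbs_mom;Svr;pbs_mom;Adding IP address 172.23.20.59 as authorized",
    "06/03/2022 07:42:03;0002;pbs_mom;Svr;pbs_mom;Adding IP address 0.7.28.29 as authorized",
    "06/03/2022 07:42:03;0002;pbs_mom;Svr;pbs_mom;Adding IP address 0.7.28.30 as authorized",
    "06/03/2022 07:42:03;0002;pbs_mom;Svr;pbs_mom;Adding IP address 0.7.28.31 as authorized",
]

if __name__ == "__main__":
    for first, last, count in authorized_runs(SAMPLE_LINES):
        print(f"{first} - {last} ({count} addresses)")
```

Passing an open MoM log file instead of `SAMPLE_LINES` should show the four legitimate addresses as single-entry runs and the runaway block as one very long run.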
Here is the output of my ip addr command:
ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: ens160: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether 00:50:56:99:04:62 brd ff:ff:ff:ff:ff:ff
    inet 172.23.20.51/24 brd 172.23.20.255 scope global ens160
       valid_lft forever preferred_lft forever
    inet6 fe80::250:56ff:fe99:462/64 scope link
       valid_lft forever preferred_lft forever
I have set the PBS_START_COMM=0 and restarted the service, still no change to the node adding every IP address as authorized.
This is the output from ip addr:
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: eno1: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN group default qlen 1000
link/ether 90:b1:1c:42:67:ec brd ff:ff:ff:ff:ff:ff
3: eno2: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN group default qlen 1000
link/ether 90:b1:1c:42:67:ee brd ff:ff:ff:ff:ff:ff
4: eno3: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN group default qlen 1000
link/ether 90:b1:1c:42:67:f0 brd ff:ff:ff:ff:ff:ff
5: eno4: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN group default qlen 1000
link/ether 90:b1:1c:42:67:f2 brd ff:ff:ff:ff:ff:ff
6: enp70s0f0: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 qdisc mq master bond0 state UP group default qlen 1000
link/ether 3e:a3:3f:a9:ed:8d brd ff:ff:ff:ff:ff:ff
7: enp70s0f1: <NO-CARRIER,BROADCAST,MULTICAST,SLAVE,UP> mtu 1500 qdisc mq master bond0 state DOWN group default qlen 1000
link/ether 3e:a3:3f:a9:ed:8d brd ff:ff:ff:ff:ff:ff
8: bond0: <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
link/ether 3e:a3:3f:a9:ed:8d brd ff:ff:ff:ff:ff:ff
inet 172.23.20.186/24 brd 172.23.20.255 scope global bond0
valid_lft forever preferred_lft forever
inet6 fe80::3ca3:3fff:fea9:ed8d/64 scope link
valid_lft forever preferred_lft forever
The service is still adding every IP address; excerpt below:
06/03/2022 07:48:03;0002;pbs_mom;Svr;pbs_mom;Adding IP address 0.5.58.0 as authorized
06/03/2022 07:48:03;0002;pbs_mom;Svr;pbs_mom;Adding IP address 0.5.58.1 as authorized
06/03/2022 07:48:03;0002;pbs_mom;Svr;pbs_mom;Adding IP address 0.5.58.2 as authorized
06/03/2022 07:48:03;0002;pbs_mom;Svr;pbs_mom;Adding IP address 0.5.58.3 as authorized
06/03/2022 07:48:03;0002;pbs_mom;Svr;pbs_mom;Adding IP address 0.5.58.4 as authorized
06/03/2022 07:48:03;0002;pbs_mom;Svr;pbs_mom;Adding IP address 0.5.58.5 as authorized
06/03/2022 07:48:03;0002;pbs_mom;Svr;pbs_mom;Adding IP address 0.5.58.6 as authorized
06/03/2022 07:48:03;0002;pbs_mom;Svr;pbs_mom;Adding IP address 0.5.58.7 as authorized
06/03/2022 07:48:03;0002;pbs_mom;Svr;pbs_mom;Adding IP address 0.5.58.8 as authorized
06/03/2022 07:48:03;0002;pbs_mom;Svr;pbs_mom;Adding IP address 0.5.58.9 as authorized
06/03/2022 07:48:03;0002;pbs_mom;Svr;pbs_mom;Adding IP address 0.5.58.10 as authorized
06/03/2022 07:48:03;0002;pbs_mom;Svr;pbs_mom;Adding IP address 0.5.58.11 as authorized
06/03/2022 07:48:03;0002;pbs_mom;Svr;pbs_mom;Adding IP address 0.5.58.12 as authorized
06/03/2022 07:48:03;0002;pbs_mom;Svr;pbs_mom;Adding IP address 0.5.58.13 as authorized
06/03/2022 07:48:03;0002;pbs_mom;Svr;pbs_mom;Adding IP address 0.5.58.14 as authorized
06/03/2022 07:48:03;0002;pbs_mom;Svr;pbs_mom;Adding IP address 0.5.58.15 as authorized
06/03/2022 07:48:03;0002;pbs_mom;Svr;pbs_mom;Adding IP address 0.5.58.16 as authorized
06/03/2022 07:48:03;0002;pbs_mom;Svr;pbs_mom;Adding IP address 0.5.58.17 as authorized
06/03/2022 07:48:03;0002;pbs_mom;Svr;pbs_mom;Adding IP address 0.5.58.18 as authorized
06/03/2022 07:48:03;0002;pbs_mom;Svr;pbs_mom;Adding IP address 0.5.58.19 as authorized
06/03/2022 07:48:03;0002;pbs_mom;Svr;pbs_mom;Adding IP address 0.5.58.20 as authorized
06/03/2022 07:48:03;0002;pbs_mom;Svr;pbs_mom;Adding IP address 0.5.58.21 as authorized
06/03/2022 07:48:03;0002;pbs_mom;Svr;pbs_mom;Adding IP address 0.5.58.22 as authorized
06/03/2022 07:48:03;0002;pbs_mom;Svr;pbs_mom;Adding IP address 0.5.58.23 as authorized
06/03/2022 07:48:03;0002;pbs_mom;Svr;pbs_mom;Adding IP address 0.5.58.24 as authorized
06/03/2022 07:48:03;0002;pbs_mom;Svr;pbs_mom;Adding IP address 0.5.58.25 as authorized
06/03/2022 07:48:03;0002;pbs_mom;Svr;pbs_mom;Adding IP address 0.5.58.26 as authorized
06/03/2022 07:48:03;0002;pbs_mom;Svr;pbs_mom;Adding IP address 0.5.58.27 as authorized
06/03/2022 07:48:03;0002;pbs_mom;Svr;pbs_mom;Adding IP address 0.5.58.28 as authorized
06/03/2022 07:48:03;0002;pbs_mom;Svr;pbs_mom;Adding IP address 0.5.58.29 as authorized
06/03/2022 07:48:03;0002;pbs_mom;Svr;pbs_mom;Adding IP address 0.5.58.30 as authorized
06/03/2022 07:48:03;0002;pbs_mom;Svr;pbs_mom;Adding IP address 0.5.58.31 as authorized
06/03/2022 07:48:03;0002;pbs_mom;Svr;pbs_mom;Adding IP address 0.5.58.32 as authorized
06/03/2022 07:48:03;0002;pbs_mom;Svr;pbs_mom;Adding IP address 0.5.58.33 as authorized
06/03/2022 07:48:03;0002;pbs_mom;Svr;pbs_mom;Adding IP address 0.5.58.34 as authorized
This doesn’t help much, but that sequence of messages says that the server replied to the MoM’s HELLO with an IS_REPLYHELLO (8) message. One part of the IS_REPLYHELLO message is a list of cluster address ranges, which the MoM then adds to its list of authorized addresses. This suggests the MoM was fed bad data by the server.
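To illustrate how one bad range entry could produce millions of "Adding IP address" lines, here is a small Python sketch. The (start address, number of extra addresses) pair layout is an assumption made for illustration, I have not checked it against the source, and `expand_ranges` and `MAX_REASONABLE` are hypothetical names, not PBS code or settings.

```python
import ipaddress

# Arbitrary illustrative cap for this sketch; not a PBS setting.
MAX_REASONABLE = 65536


def expand_ranges(ranges, limit=MAX_REASONABLE):
    """Yield every address covered by (start, extra) pairs.

    Each pair is ASSUMED to mean "start, plus `extra` further consecutive
    addresses". A range larger than `limit` is rejected instead of expanded.
    """
    for start, extra in ranges:
        if extra + 1 > limit:
            raise ValueError(f"range at {start} covers {extra + 1} addresses")
        base = int(ipaddress.IPv4Address(start))
        for offset in range(extra + 1):
            yield str(ipaddress.IPv4Address(base + offset))


if __name__ == "__main__":
    # Sane data: the two single addresses seen in the MoM's startup log.
    print(list(expand_ranges([("172.23.20.59", 0), ("172.23.20.186", 0)])))

    # Bogus data: one entry starting at 0.0.0.2 with a huge extent.
    try:
        list(expand_ranges([("0.0.0.2", 16777213)]))
    except ValueError as err:
        print("rejected:", err)
```

Without the size check, the second call would walk from 0.0.0.2 up to 0.255.255.255, which matches the pattern in the MoM log.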
So, either the server and MoM are not running the same version of the protocol, or the server sent the MoM bogus information. Check out the network addresses on the server and check its startup messages for network info.
I think the IS_REPLYHELLO message might go through pbs_comm, so it might be involved also. Check its version and logs.
So the tcpdump was a dry hole; there was nothing there that I could see that looked useful.
I spun up a new test scheduler and did a fresh install, and had no issues adding the new node, so I suspect that there is a version issue.
All of the nodes and servers (working and not working) report pbs_version = 20.0.0 when I run sudo /opt/pbs/sbin/pbs_mom --version
Is it possible there has been a change in the MoM protocol since last year (when I built the server) and the version has not been incremented?
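Since the version string is the same everywhere, one way to tell whether two hosts really run the same build is to compare checksums of the installed binaries. A minimal Python sketch follows; `sha256_of` and `group_by_digest` are names invented for this post, and the only path taken from this thread is /opt/pbs/sbin/pbs_mom. Identical digests mean identical binaries; different digests only show the builds differ, not that the protocol changed.

```python
import hashlib
from collections import defaultdict


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def group_by_digest(digests):
    """Group host names by digest: {digest: [host, ...]}."""
    groups = defaultdict(list)
    for host, digest in digests.items():
        groups[digest].append(host)
    return dict(groups)


if __name__ == "__main__":
    # Run on each host and collect the output, for example via Ansible.
    print(sha256_of("/opt/pbs/sbin/pbs_mom"))
```

Feeding the collected per-host digests to `group_by_digest` shows at a glance whether the failing node sits in a different group from the working ones. The same comparison could be made for the server and pbs_comm binaries.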
I tried building older versions from the PBS git repository (v20.0.0 and release_20_0_branch) but kept getting issues with the Python version: it wanted 3.5 and I have 3.8. The Python version is the same on all the nodes (working and not working), so I think I built from roughly the same version.
So, barring any other suggestions, it seems that I need to rebuild my server?
To close the loop on this one I ended up rebuilding the server and it is all fine now, both the old and new nodes connect no worries.
Thanks for trying to help