Hi @adarsh, thanks again.
Yes, this is definitely a communication problem, but something odd is going on. The node can connect to the headnode, but the connection is closed prematurely. I attached strace to pbs_mom, and it tries to talk to the server but fails immediately.
To narrow things down, I’ve enabled pbs_mom on the headnode itself, so the network is not involved, and the same issue occurs. Here is the requested output:
On the headnode:
[root@headnode ~]# cat /etc/hosts | grep -e headnode -e n01
172.26.255.254 headnode headnode.domain.tld
172.26.0.1 n01 n01.domain.tld
172.27.0.1 n01-ib0 n01-ib0.domain.tld
[root@headnode ~]# pbs_hostn -v headnode
primary name: headnode (from gethostbyname())
aliases: headnode.domain.tld
address length: 4 bytes
address: 172.26.255.254 (4278131372 dec) name: headnode
[root@headnode ~]# pbs_hostn -v n01
primary name: n01 (from gethostbyname())
aliases: n01.domain.tld
address length: 4 bytes
address: 172.26.0.1 (16784044 dec) name: n01
On the compute node:
[root@n01 ~]# cat /etc/hosts | grep -e headnode -e n01
[root@n01 ~]# pbs_hostn -v headnode
primary name: headnode.domain.tld (from gethostbyname())
aliases: -none-
address length: 4 bytes
address: 172.26.255.254 (4278131372 dec) name: headnode.domain.tld
[root@n01 ~]# pbs_hostn -v n01
primary name: n01.domain.tld (from gethostbyname())
aliases: -none-
address length: 4 bytes
address: 172.26.0.1 (16784044 dec) name: n01.domain.tld
[root@n01 ~]#
Note that /etc/hosts on the compute node has no entries for these hosts, because name resolution there is provided by DNS:
[root@n01 ~]# nslookup headnode
Server: 172.26.255.254
Address: 172.26.255.254#53
Name: headnode.domain.tld
Address: 172.26.255.254
[root@n01 ~]# nslookup headnode.domain.tld
Server: 172.26.255.254
Address: 172.26.255.254#53
Name: headnode.domain.tld
Address: 172.26.255.254
[root@n01 ~]# nslookup 172.26.255.254
254.255.26.172.IN-ADDR.ARPA name = headnode.domain.tld.
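The forward and reverse lookups above can also be cross-checked in one step on each machine. This is only a minimal sketch using Python's standard `socket` module; the function name `resolution_report` and the shape of its result are my own assumptions, not part of PBS. It reports whether the primary name returned by the forward lookup matches the name returned by the reverse lookup. A mismatch is not necessarily the cause of the problem here, it is just something to compare between the two hosts.

```python
import socket


def resolution_report(host,
                      forward=socket.gethostbyname_ex,
                      reverse=socket.gethostbyaddr):
    """Resolve `host` forward, reverse-resolve each address, and compare names.

    `forward` and `reverse` default to the system resolver; they are
    parameters only so the check can be exercised without a network.
    """
    # Forward lookup: (primary name, alias list, address list)
    primary, aliases, addresses = forward(host)
    report = {
        "query": host,
        "primary": primary,
        "aliases": list(aliases),
        "reverse": {},
        "consistent": True,
    }
    for addr in addresses:
        # Reverse lookup: (primary name, alias list, address list)
        rev_primary, _rev_aliases, _rev_addrs = reverse(addr)
        report["reverse"][addr] = rev_primary
        if rev_primary != primary:
            # Forward and reverse lookups disagree on the primary name
            report["consistent"] = False
    return report
```

Running `resolution_report("headnode")` and `resolution_report("n01")` on both the headnode and the compute node gives results that can be compared side by side with the pbs_hostn output.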
For the last experiment, with MOM running on the headnode, the service starts but stays in a non-functioning state.
[root@headnode ~]# ps ax | grep -i pbs
2588 ? Ssl 0:00 /opt/pbs/sbin/pbs_comm
2611 ? Ssl 0:00 /opt/pbs/sbin/pbs_mom
2709 ? Ssl 0:00 /opt/pbs/sbin/pbs_sched
3216 ? Ss 0:00 /opt/pbs/sbin/pbs_ds_monitor monitor
3270 ? S 0:00 /usr/bin/postgres -D /var/spool/pbs/datastore -p 15007
3447 ? Ss 0:00 postgres: postgres pbs_datastore 172.26.255.254(57626) idle
3564 ? Ssl 0:00 /opt/pbs/sbin/pbs_server.bin
15578 pts/77 S+ 0:00 grep --color=auto -i pbs
[root@headnode ~]# ss -tlpn | grep -i pbs
LISTEN 0 1000 0.0.0.0:17001 0.0.0.0:* users:(("pbs_comm",pid=2588,fd=15))
LISTEN 0 256 0.0.0.0:15001 0.0.0.0:* users:(("pbs_server.bin",pid=3564,fd=9))
LISTEN 0 256 0.0.0.0:15002 0.0.0.0:* users:(("pbs_mom",pid=2611,fd=7))
LISTEN 0 256 0.0.0.0:15003 0.0.0.0:* users:(("pbs_mom",pid=2611,fd=8))
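To confirm from another machine that these listeners are reachable, a plain TCP connect test is enough. The sketch below is an assumption-laden example, not a PBS tool: `probe` and `PBS_PORTS` are hypothetical names, and the port-to-daemon mapping is copied from the ss output above. Keep in mind that a successful connect only shows the port accepts connections; it will not reveal why the connection is dropped afterwards.

```python
import socket

# Port -> daemon, taken from the `ss -tlpn` output on the headnode
PBS_PORTS = {
    15001: "pbs_server.bin",
    15002: "pbs_mom",
    15003: "pbs_mom",
    17001: "pbs_comm",
}


def probe(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and resolution failures
        return False


def probe_all(host, ports=PBS_PORTS, timeout=2.0):
    """Probe every port in `ports` and return {port: reachable}."""
    return {port: probe(host, port, timeout) for port in ports}
```

For example, `probe_all("headnode")` run on n01 returns a dictionary with one True/False entry per port.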
PS: I was reading about PTL. I’m thinking of recompiling OpenPBS with PTL enabled so we can at least see what’s wrong. Or is that a bad idea?