Hi Adarsh and chloeadams,
I ran a few more tests. I created a set of virtual machines on vSphere consisting of a primary (gmtest01), a secondary (gmtest02), a submission host (gmtest03) and one MoM (gmtest04). The pbs.conf file was configured on the primary and secondary as follows:
[root@gmtest02 ~]# cat /etc/pbs.conf
PBS_COMM_THREADS=4
PBS_EXEC=/opt/pbs
PBS_HOME=/pbs_spool
PBS_START_SERVER=1
PBS_START_SCHED=0 # 1 on primary
PBS_START_COMM=1
PBS_START_MOM=0
PBS_SERVER=gmtest01.sched.local
PBS_PRIMARY=gmtest01.sched.local
PBS_SECONDARY=gmtest02.sched.local
PBS_CORE_LIMIT=unlimited
PBS_SCP=/usr/bin/scp
The /etc/hosts file had no entries for these machines. Name resolution on the primary and on the secondary yielded the same results:
[root@gmtest02 ~]# nmap gmtest02
Starting Nmap 7.92 ( https://nmap.org ) at 2024-07-14 11:46 CEST
Nmap scan report for gmtest02 (172.16.47.137)
Host is up (0.000095s latency).
Not shown: 998 closed tcp ports (reset)
PORT STATE SERVICE
22/tcp open ssh
111/tcp open rpcbind
Nmap done: 1 IP address (1 host up) scanned in 0.11 seconds
You have new mail in /var/spool/mail/root
[root@gmtest02 ~]# ping gmtest02
PING gmtest02.sched.local (172.16.47.137) 56(84) bytes of data.
64 bytes from gmtest02.sched.local (172.16.47.137): icmp_seq=1 ttl=64 time=0.017 ms
64 bytes from gmtest02.sched.local (172.16.47.137): icmp_seq=2 ttl=64 time=0.013 ms
^C
--- gmtest02.sched.local ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1002ms
rtt min/avg/max/mdev = 0.013/0.015/0.017/0.002 ms
[root@gmtest02 ~]# pbs_hostn -v gmtest02
primary name: gmtest02.sched.local (from gethostbyname())
aliases: -none-
address length: 4 bytes
address: 172.16.47.137 (2301563052 dec) name: gmtest02.sched.local
This seems fine, except that pbs_hostn does not show the short name as an alias, which I would have considered irrelevant. I created a simple script that submits a chain of dependent jobs:
[davide@gmtest03 ~]$ cat ./concatenation.sh
#!/bin/bash
# clean
rm -f first.sh subseq_*.sh # test_seed.e* test_seed.o*
# submit first job
cat > first.sh << EoI
#!/bin/bash
#PBS -l select=1:ncpus=1:mem=1GB -r n -l walltime=00:01:30 -q workq -N test_seed
echo "first / seed"
sleep 30
stress -c 1
exit
EoI
chmod u+x first.sh
previous=$(qsub ./first.sh)
echo "seed ID $previous"
nj=2
for i in $(seq 1 $nj); do
cat > "subseq_$i.sh" << EoI
#!/bin/bash
#PBS -l select=1:ncpus=1:mem=1GB -r n -l walltime=00:01:30 -q workq -N test_seed -W depend=afterok:$previous@gmtest01.sched.local
echo "job number $i"
sleep 30
stress -c 1
exit
EoI
chmod u+x "./subseq_$i.sh"
previous=$(qsub ./subseq_$i.sh)
echo "ID n. $i is $previous"
done
exit
I then stopped the primary, waited for the failover and ran the script. The result was:
[davide@gmtest03 ~]$ ./concatenation.sh
seed ID 1002.gmtest01.sched.local
ID n. 1 is 1003.gmtest01.sched.local
ID n. 2 is 1004.gmtest01.sched.local
[davide@gmtest03 ~]$ qstat -n1
gmtest01.sched.local:
Req'd Req'd Elap
Job ID Username Queue Jobname SessID NDS TSK Memory Time S Time
--------------- -------- -------- ---------- ------ --- --- ------ ----- - -----
1002.gmtest01.* davide workq test_seed 1269 1 1 1gb 00:01 R 00:00 gmtest04/0
Jobs 1003 and 1004 are nowhere to be found. If I look in the logs, I find:
07/14/2024 11:36:35;00a0;Server@gmtest02;Req;req_reject;Reject reply code=15008, aux=0, type=5, from davide@172.16.47.135
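A small sketch for pulling the reply code and the client address out of reject lines like the one above. The helper names reject_code and reject_client are hypothetical and only meant as an illustration; the sample line is copied from the log entry shown here.

```shell
#!/bin/bash
# Hypothetical helpers: extract fields from a PBS server log "req_reject" line.

# Print the numeric reply code (e.g. 15008) from a reject line.
reject_code() {
    printf '%s\n' "$1" | sed -n 's/.*Reject reply code=\([0-9][0-9]*\),.*/\1/p'
}

# Print the client address that follows "from user@" on the line.
reject_client() {
    printf '%s\n' "$1" | sed -n 's/.*from [^@]*@\(.*\)$/\1/p'
}

line='07/14/2024 11:36:35;00a0;Server@gmtest02;Req;req_reject;Reject reply code=15008, aux=0, type=5, from davide@172.16.47.135'
echo "code=$(reject_code "$line") client=$(reject_client "$line")"
# prints: code=15008 client=172.16.47.135
```

Assuming the default layout, the server log for that day should be under /pbs_spool/server_logs (PBS_HOME/server_logs), so grepping that directory for req_reject would list the candidate lines. As far as the PBS error list goes, 15008 appears to correspond to PBSE_BADHOST (access from host not allowed, or unknown host), but that mapping is worth verifying against the installed version.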
If I now add entries to /etc/hosts with the IP address, FQDN and short name, and restart PBS in the proper order, it works:
[root@gmtest02 ~]# systemctl status pbs
● pbs.service - Portable Batch System
Loaded: loaded (/opt/pbs/libexec/pbs_init.d; enabled; vendor preset: disabled)
Active: active (running) since Sun 2024-07-14 11:57:59 CEST; 8min ago
Docs: man:pbs(8)
Process: 1136 ExecStart=/opt/pbs/libexec/pbs_init.d start (code=exited, status=0/SUCCESS)
Tasks: 9
Memory: 18.7M
CGroup: /system.slice/pbs.service
├─1387 /opt/pbs/sbin/pbs_comm
├─1447 /opt/pbs/sbin/pbs_server.bin
├─1807 /opt/pbs/sbin/pbs_ds_monitor monitor
├─1839 /usr/bin/postgres -D /pbs_spool/datastore -p 15007
├─1849 postgres: logger process
├─1851 postgres: checkpointer process
├─1852 postgres: writer process
├─1853 postgres: wal writer process
├─1854 postgres: autovacuum launcher process
├─1855 postgres: stats collector process
├─1856 postgres: bgworker: logical replication launcher
├─1892 postgres: postgres pbs_datastore 172.16.47.137(36796) idle
└─1905 /opt/pbs/sbin/pbs_sched
[davide@gmtest03 ~]$ ./concatenation.sh
seed ID 1014.gmtest01.sched.local
ID n. 1 is 1015.gmtest01.sched.local
ID n. 2 is 1016.gmtest01.sched.local
[davide@gmtest03 ~]$ qstat -n1
gmtest01.sched.local:
Req'd Req'd Elap
Job ID Username Queue Jobname SessID NDS TSK Memory Time S Time
--------------- -------- -------- ---------- ------ --- --- ------ ----- - -----
1014.gmtest01.* davide workq test_seed 1611 1 1 1gb 00:01 R 00:00 gmtest04/0
1015.gmtest01.* davide workq test_seed -- 1 1 1gb 00:01 H -- --
1016.gmtest01.* davide workq test_seed -- 1 1 1gb 00:01 H -- --
In addition, pbs_hostn -v now reports the short name as an alias:
[root@gmtest02 ~]# pbs_hostn -v gmtest01
primary name: gmtest01.sched.local (from gethostbyname())
aliases: gmtest01
address length: 4 bytes
address: 172.16.47.137 (2301563052 dec) name: gmtest01.sched.local
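For reference, a minimal sketch of the /etc/hosts layout mentioned above (IP address, FQDN, short name). The helper name hosts_line is hypothetical, and the short name is assumed to be the part of the FQDN before the first dot; the example uses the gmtest02 address from the earlier ping output.

```shell
#!/bin/bash
# Hypothetical helper: build one /etc/hosts line as "IP FQDN SHORTNAME".
hosts_line() {
    ip=$1
    fqdn=$2
    short=${fqdn%%.*}   # assumption: short name = FQDN up to the first dot
    printf '%s %s %s\n' "$ip" "$fqdn" "$short"
}

hosts_line 172.16.47.137 gmtest02.sched.local
# prints: 172.16.47.137 gmtest02.sched.local gmtest02
```

With a line of this form in place, the resolver can return the short name as an alias of the FQDN, which matches the "aliases:" field in the pbs_hostn output above.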
I am not sure whether the two things are related, but these are the differences I found. To me, DNS appears to be working properly. What else can I check?
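One further comparison that can be made on each node, independent of PBS, is whether the system resolver returns the short name alongside the FQDN. The sketch below assumes getent is available (glibc systems); the function names has_alias and report_aliases are hypothetical.

```shell
#!/bin/bash
# Hypothetical helpers: check whether a short name is among the names
# returned by "getent hosts" (output format: ADDRESS NAME [ALIAS...]).

# Succeeds if $2 appears as a whole field after the address in $1.
has_alias() {
    printf '%s\n' "$1" | awk -v s="$2" '
        { for (i = 2; i <= NF; i++) if ($i == s) found = 1 }
        END { exit !found }'
}

# For each short name given, report what the resolver returns.
report_aliases() {
    for short in "$@"; do
        if line=$(getent hosts "$short"); then
            if has_alias "$line" "$short"; then
                echo "$short: short name returned by the resolver"
            else
                echo "$short: resolves, but without the short name"
            fi
        else
            echo "$short: does not resolve"
        fi
    done
}

# Run only when host names are passed on the command line.
if [ "$#" -gt 0 ]; then
    report_aliases "$@"
fi
```

Saved under an arbitrary name such as check_alias.sh, it could be run as `./check_alias.sh gmtest01 gmtest02 gmtest03 gmtest04` on each node, once without and once with the /etc/hosts entries, to see whether the difference reported by pbs_hostn shows up at the resolver level as well.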