Secondary server does not manage job dependency

tuthmose · July 9, 2024, 12:03pm

Hi all,

I am facing an unexpected behaviour and I am not sure whether it makes sense or I have a problem. I have a primary / secondary PBS setup (pbs_version = 22.05.11) sharing an NFS area on Rocky Linux (Rocky Linux release 8.8 (Green Obsidian)). There is no SELinux and no firewall, and the primary and secondary servers have static IPs. pbs.conf on the primary looks like:

PBS_COMM_THREADS=4
PBS_EXEC=/opt/pbs 
PBS_HOME=/share/pbs_spool 
PBS_START_SERVER=1
PBS_START_SCHED=1
PBS_START_COMM=1
PBS_START_MOM=0
PBS_SERVER=pbs01.example.com
PBS_PRIMARY=pbs01.example.com
PBS_SECONDARY=pbs02.example.com
PBS_CORE_LIMIT=unlimited
PBS_SCP=/usr/bin/scp
PBS_COMM_LOG_EVENTS=1025

and on the secondary:

PBS_COMM_THREADS=4
PBS_EXEC=/opt/pbs 
PBS_HOME=/share/pbs_spool 
PBS_START_SERVER=1
PBS_START_SCHED=0
PBS_START_COMM=1
PBS_START_MOM=0
PBS_SERVER=pbs01.example.com
PBS_PRIMARY=pbs01.example.com
PBS_SECONDARY=pbs02.example.com
PBS_CORE_LIMIT=unlimited
PBS_SCP=/usr/bin/scp
PBS_COMM_LOG_EVENTS=1025

I dare say it’s not a very customized setup. The system can fail over from primary to secondary seamlessly and back to primary as well. However, when pbs02 is active I can submit single jobs but not jobs with a dependency. If I try, qsub yields a job ID that is not present in the db (no qstat result) and the server logs show:

07/09/2024 08:58:22;00a0;Server@pbs02;Req;req_reject;Reject reply code=15008, aux=0, type=5, from tuthmose@pbs02

According to the RG, 15008 is PBSE_BADHOST: "Access from host not allowed". Why?
And why does it allow single jobs? The result is the same whether I use:

just job number
complete job ID: 1234.pbs01.example.com
specify I am on pbs02: 1234.pbs01.example.com@pbs02.example.com

I have tried setting acl_host_enable=True and acl_hosts=pbs02.example.com but it yields:

qsub: Access from host not allowed, or unknown host

And the docs actually say that by default all hosts are enabled. What am I doing wrong?

tuthmose · July 10, 2024, 8:08pm

Hi all,

it was a rather silly thing in the end: the /etc/hosts files on the servers had the short name before the FQDN. pbs_hostn was resolving everything correctly, but using the full name in the job ID did not work. I switched the order in /etc/hosts and now it works fine. I suspect that it would also have worked with the previous order when using 1234.pbs01.example.com@pbs01.example.com, but I am not sure (will try), since qstat and similar commands always show the full name.
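
For anyone checking the same thing, here is a minimal sketch that flags entries in a hosts-format file where a short name is listed before the FQDN. The function name short_name_first is made up for this example. It assumes the usual "address canonical-name aliases..." layout, where the first name after the address is the canonical one, and it skips loopback entries.

```shell
#!/bin/bash
# short_name_first FILE  (hypothetical helper, not part of PBS)
# Print every entry of a hosts-format FILE whose first name has no dot
# while a later name on the same line does, i.e. the short name is
# listed before the FQDN.
short_name_first() {
    awk '
        /^[ \t]*(#|$)/ { next }                     # skip comments and blank lines
        {
            sub(/#.*/, "")                          # drop trailing comments
            if ($1 ~ /^127\./ || $1 == "::1") next  # ignore loopback entries
            if (NF < 3 || $2 ~ /\./) next           # first name is already an FQDN
            for (i = 3; i <= NF; i++)
                if ($i ~ /\./) { print; break }     # FQDN found after a short name
        }
    ' "$1"
}
```

Running short_name_first /etc/hosts on each server should print nothing once the FQDN comes first on every line.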

I am perplexed, however. On these machines I have a DNS service running and /etc/resolv.conf is set correctly; the servers can ping each other using IP, short name or FQDN. However, if I comment out all the contents of /etc/hosts, pbs_hostn does not resolve names anymore. Does PBS need /etc/hosts?

adarsh · July 11, 2024, 6:44am

PBS needs name resolution to be correct and to work reliably (/etc/resolv.conf should tell it where to look).
State-unknown,down - #12 by alexis.cousein
DNS services might fail, or might come up late after a power interruption, so keeping the details of all the hosts of the cluster in /etc/hosts (identical across server and compute nodes) is the safest option.

tuthmose · July 11, 2024, 6:53am

Hi Adarsh,

thank you for your reply. So having /etc/hosts may be handy but it is not mandatory. If pbs_hostn does not resolve names without it, should I look at my DNS? But why do normal ping and lookups work?

adarsh · July 11, 2024, 6:53pm

Thank you Giordano. Ping only checks whether the destination is reachable, using ICMP. It does not do the name resolution / reverse address resolution check.
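
A rough way to run that kind of check by hand is sketched below. It assumes getent is available (it resolves through the sources listed in /etc/nsswitch.conf), and the function name fwd_rev_check is invented for this example. It is not what pbs_server does internally, only a forward-then-reverse comparison of the canonical names.

```shell
#!/bin/bash
# fwd_rev_check NAME  (hypothetical helper, not part of PBS)
# Resolve NAME with getent, resolve the returned address back to a name,
# and report whether the two canonical names agree.
fwd_rev_check() {
    local name fwd addr canon rev back
    name=$1
    fwd=$(getent hosts "$name") || { echo "forward lookup failed: $name"; return 1; }
    # getent prints "address canonical-name [aliases...]"; use the first line only
    addr=$(echo "$fwd" | awk 'NR == 1 { print $1 }')
    canon=$(echo "$fwd" | awk 'NR == 1 { print $2 }')
    rev=$(getent hosts "$addr") || { echo "reverse lookup failed: $addr"; return 1; }
    back=$(echo "$rev" | awk 'NR == 1 { print $2 }')
    if [ "$canon" = "$back" ]; then
        echo "ok: $name -> $addr -> $back"
    else
        echo "mismatch: $name -> $addr ($canon) but $addr -> $back"
        return 1
    fi
}
```

Calling it with both the short name and the FQDN of each server, on every host of the cluster, shows whether they all agree on the same canonical name.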

chloeadams · July 13, 2024, 4:51pm

Thanks for your reply, it really worked for me.

tuthmose · July 14, 2024, 10:47am

Hi Adarsh and chloeadams,

I did a few more tests. I created a set of virtual machines on vSphere consisting of a primary (gmtest01), a secondary (gmtest02), a submission host (gmtest03) and one MoM (gmtest04). The pbs.conf file was configured on primary/secondary as:

[root@gmtest02 ~]# cat /etc/pbs.conf
PBS_COMM_THREADS=4
PBS_EXEC=/opt/pbs
PBS_HOME=/pbs_spool
PBS_START_SERVER=1
PBS_START_SCHED=0 # 1 on primary
PBS_START_COMM=1
PBS_START_MOM=0
PBS_SERVER=gmtest01.sched.local
PBS_PRIMARY=gmtest01.sched.local
PBS_SECONDARY=gmtest02.sched.local
PBS_CORE_LIMIT=unlimited
PBS_SCP=/usr/bin/scp

The file /etc/hosts did not have any entries. Resolution on primary or secondary yielded the same results:

[root@gmtest02 ~]# nmap gmtest02
Starting Nmap 7.92 ( https://nmap.org ) at 2024-07-14 11:46 CEST
Nmap scan report for gmtest02 (172.16.47.137)
Host is up (0.000095s latency).
Not shown: 998 closed tcp ports (reset)
PORT    STATE SERVICE
22/tcp  open  ssh
111/tcp open  rpcbind

Nmap done: 1 IP address (1 host up) scanned in 0.11 seconds
You have new mail in /var/spool/mail/root
[root@gmtest02 ~]# ping gmtest02
PING gmtest02.sched.local (172.16.47.137) 56(84) bytes of data.
64 bytes from gmtest02.sched.local (172.16.47.137): icmp_seq=1 ttl=64 time=0.017 ms
64 bytes from gmtest02.sched.local (172.16.47.137): icmp_seq=2 ttl=64 time=0.013 ms
^C
--- gmtest02.sched.local ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1002ms
rtt min/avg/max/mdev = 0.013/0.015/0.017/0.002 ms
[root@gmtest02 ~]# pbs_hostn -v gmtest02
primary name: gmtest02.sched.local (from gethostbyname())
aliases:            -none-
     address length:  4 bytes
     address:        172.16.47.137   (2301563052 dec)  name:  gmtest02.sched.local

which seems OK, except that pbs_hostn does not show the short name, which I would have considered irrelevant. I created a simple script that generates concatenated jobs:

[davide@gmtest03 ~]$ cat ./concatenation.sh
#!/bin/bash

# clean up the scripts generated by a previous run (subseq_1.sh, subseq_2.sh, ...)
rm -f first.sh subseq_?.sh # test_seed.e* test_seed.o*

# submit first job

cat > first.sh << EoI
#!/bin/bash
#PBS -l select=1:ncpus=1:mem=1GB -r n -l walltime=00:01:30 -q workq -N test_seed

echo "first / seed"
sleep 30
stress -c 1
exit
EoI

chmod u+x first.sh
previous=$(qsub ./first.sh)
echo "seed ID $previous"

nj=2
for i in $(seq 1 $nj); do
cat > "subseq_$i.sh" << EoI
#!/bin/bash
#PBS -l select=1:ncpus=1:mem=1GB -r n -l walltime=00:01:30 -q workq -N test_seed -W depend=afterok:$previous@gmtest01.sched.local

echo "job number $i"
sleep 30
stress -c 1
exit
EoI
chmod u+x "./subseq_$i.sh"
previous=$(qsub ./subseq_$i.sh)
echo "ID n. $i is $previous"
done

exit

I then stopped the primary, waited for failover and ran the script; the result was:

[davide@gmtest03 ~]$ ./concatenation.sh
seed ID 1002.gmtest01.sched.local
ID n. 1 is 1003.gmtest01.sched.local
ID n. 2 is 1004.gmtest01.sched.local
[davide@gmtest03 ~]$ qstat -n1

gmtest01.sched.local:
                                                            Req'd  Req'd   Elap
Job ID          Username Queue    Jobname    SessID NDS TSK Memory Time  S Time
--------------- -------- -------- ---------- ------ --- --- ------ ----- - -----
1002.gmtest01.* davide   workq    test_seed    1269   1   1    1gb 00:01 R 00:00 gmtest04/0

Jobs 1003 and 1004 are nowhere to be found. The server log shows:

07/14/2024 11:36:35;00a0;Server@gmtest02;Req;req_reject;Reject reply code=15008, aux=0, type=5, from davide@172.16.47.135
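
Since qsub prints a job ID even when the server then rejects the request, a script like the one above can go wrong silently. A possible guard is sketched here. The wrapper name submit_checked is made up, and it relies only on qstat returning a non-zero status for a job ID the server does not know. A job that has already finished would trip the check as well.

```shell
#!/bin/bash
# submit_checked SCRIPT [qsub options...]  (hypothetical wrapper)
# Submit with qsub, then ask the server about the returned ID with qstat.
# Prints the job ID on success and fails if the server does not know it.
submit_checked() {
    local id
    id=$(qsub "$@") || return 1
    if ! qstat "$id" > /dev/null 2>&1; then
        echo "job $id was returned by qsub but is unknown to the server" >&2
        return 1
    fi
    echo "$id"
}
```

In the loop, previous=$(submit_checked "./subseq_$i.sh") || exit 1 would stop at the first job that the server drops.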

If I now add /etc/hosts files with IP address, FQDN and short name, and restart PBS in the proper order, it works:

[root@gmtest02 ~]# systemctl status pbs
● pbs.service - Portable Batch System
   Loaded: loaded (/opt/pbs/libexec/pbs_init.d; enabled; vendor preset: disabled)
   Active: active (running) since Sun 2024-07-14 11:57:59 CEST; 8min ago
     Docs: man:pbs(8)
  Process: 1136 ExecStart=/opt/pbs/libexec/pbs_init.d start (code=exited, status=0/SUCCESS)
    Tasks: 9
   Memory: 18.7M
   CGroup: /system.slice/pbs.service
           ├─1387 /opt/pbs/sbin/pbs_comm
           ├─1447 /opt/pbs/sbin/pbs_server.bin
           ├─1807 /opt/pbs/sbin/pbs_ds_monitor monitor
           ├─1839 /usr/bin/postgres -D /pbs_spool/datastore -p 15007
           ├─1849 postgres: logger process
           ├─1851 postgres: checkpointer process
           ├─1852 postgres: writer process
           ├─1853 postgres: wal writer process
           ├─1854 postgres: autovacuum launcher process
           ├─1855 postgres: stats collector process
           ├─1856 postgres: bgworker: logical replication launcher
           ├─1892 postgres: postgres pbs_datastore 172.16.47.137(36796) idle
           └─1905 /opt/pbs/sbin/pbs_sched
[davide@gmtest03 ~]$ ./concatenation.sh
seed ID 1014.gmtest01.sched.local
ID n. 1 is 1015.gmtest01.sched.local
ID n. 2 is 1016.gmtest01.sched.local
[davide@gmtest03 ~]$ qstat -n1

gmtest01.sched.local:
                                                            Req'd  Req'd   Elap
Job ID          Username Queue    Jobname    SessID NDS TSK Memory Time  S Time
--------------- -------- -------- ---------- ------ --- --- ------ ----- - -----
1014.gmtest01.* davide   workq    test_seed    1611   1   1    1gb 00:01 R 00:00 gmtest04/0
1015.gmtest01.* davide   workq    test_seed     --    1   1    1gb 00:01 H   --   --
1016.gmtest01.* davide   workq    test_seed     --    1   1    1gb 00:01 H   --   --

and pbs_hostn -v resolves the short name:

[root@gmtest02 ~]# pbs_hostn -v gmtest01
primary name: gmtest01.sched.local (from gethostbyname())
aliases:           gmtest01
     address length:  4 bytes
     address:        172.16.47.137   (2301563052 dec)  name:  gmtest01.sched.local

I am not sure whether the two things are related, but these are the differences that I found. To me the DNS is working properly. What else can I check?
