Job gets stuck in a queue after a fresh install

Hello folks,

I installed PBS Pro 19.1.1 on a clean CentOS 7 virtual machine, but I cannot get a single job in the queue to run.

When I start PBS as root, everything looks fine:

[root@pbshost ~]# /etc/init.d/pbs start
Starting PBS
PBS comm
/opt/pbs/sbin/pbs_comm ready (pid=37824), Proxy Name:pbshost:17001, Threads:4
PBS mom
Creating usage database for fairshare.
PBS sched
Connecting to PBS dataservice…connected to PBS dataservice@pbshost
Licenses valid for 10000000 Floating hosts
PBS server

Then I check the pbs status, and again everything seems to be working fine.

[root@pbshost ~]# /etc/init.d/pbs status
pbs_server is pid 38207
pbs_mom is pid 37854
pbs_sched is pid 37866
pbs_comm is 37824

At this point, if I submit a job, it just sits in the queue and never runs. To figure out what was happening, I left the root shell and checked the PBS status again as a regular user; it shows that only pbs_mom is running:

[hanxiao@localhost mail]$ /etc/init.d/pbs status
pbs_server is not running
pbs_mom is pid 37854
pbs_sched is not running
pbs_comm is not running

When I check the server_logs file, I see the following message. It shows a ‘Failed to resolve address’ error:

05/27/2019 20:44:13;0002;Server@pbshost;Svr;Server@pbshost;connected to PBS dataservice@pbshost
05/27/2019 20:44:13;0086;Server@pbshost;Svr;pbs_python_ext_quick_start_interpreter;--> Python Interpreter quick started, compiled with version:'2.7.5 (default, Apr 9 2019, 14:30:50)
[GCC 4.8.5 20150623 (Red Hat 4.8.5-36)]' <--
05/27/2019 20:44:13;0086;Server@pbshost;Svr;pbs_python_ext_quick_start_interpreter;--> Inserted Altair PBS Python modules dir '/opt/pbs/lib/python/altair' <--
05/27/2019 20:44:13;0002;Server@pbshost;n/a;setup_env;read environment from /var/spool/pbs/pbs_environment
05/27/2019 20:44:13;0c06;Server@pbshost;TPP;Server@pbshost(Main Thread);TPP leaf node names = 192.168.83.130:15001,127.0.0.1:15001,192.168.83.130:15001,192.168.122.1:15001
05/27/2019 20:44:13;0002;Server@pbshost;Svr;Server@pbshost;Server pid = 38207 ready; using ports Server:15001 Scheduler:15004 MOM:15002 RM:15003
05/27/2019 20:44:13;0c06;Server@pbshost;TPP;Server@pbshost(Thread 0);Thread ready
05/27/2019 20:44:13;0c06;Server@pbshost;TPP;Server@pbshost(Thread 0);Registering address 192.168.83.130:15001 to pbs_comm
05/27/2019 20:44:13;0c06;Server@pbshost;TPP;Server@pbshost(Thread 0);Registering address 192.168.122.1:15001 to pbs_comm
05/27/2019 20:44:13;0c06;Server@pbshost;TPP;Server@pbshost(Thread 0);Connected to pbs_comm pbshost:17001
05/27/2019 20:44:13;0c06;Server@pbshost;TPP;tpp_open(Main Thread);Failed to resolve address, err=0
05/27/2019 20:44:13;0001;Server@pbshost;Svr;Server@pbshost;Success (0) in mom_ping_need, rpp_open to localhost, port 15003

Here is where I’m really confused. I have set PBS_SERVER to pbshost and added the IP address of pbshost to the /etc/hosts file.

pbs.conf

PBS_EXEC=/opt/pbs
PBS_SERVER=pbshost
PBS_START_SERVER=1
PBS_START_SCHED=1
PBS_START_COMM=1
PBS_START_MOM=1
PBS_HOME=/var/spool/pbs
PBS_CORE_LIMIT=unlimited
PBS_SCP=/bin/scp

and the /etc/hosts file:

127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4
::1 localhost localhost.localdomain localhost6 localhost6.localdomain6
192.168.83.130 pbshost

The IP address ‘192.168.83.130’ is what I got when I ran the ifconfig command:

[root@pbshost ~]# ifconfig
eno16777736: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
        inet 192.168.83.130 netmask 255.255.255.0 broadcast 192.168.83.255
        inet6 fe80::20c:29ff:fe53:4996 prefixlen 64 scopeid 0x20
        ether 00:0c:29:53:49:96 txqueuelen 1000 (Ethernet)
        RX packets 3860 bytes 299161 (292.1 KiB)
        RX errors 0 dropped 0 overruns 0 frame 0
        TX packets 627 bytes 58139 (56.7 KiB)
        TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0

lo: flags=73<UP,LOOPBACK,RUNNING> mtu 65536
        inet 127.0.0.1 netmask 255.0.0.0
        inet6 ::1 prefixlen 128 scopeid 0x10
        loop txqueuelen 1000 (Local Loopback)
        RX packets 11126 bytes 1582114 (1.5 MiB)
        RX errors 0 dropped 0 overruns 0 frame 0
        TX packets 11126 bytes 1582114 (1.5 MiB)
        TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0

virbr0: flags=4099<UP,BROADCAST,MULTICAST> mtu 1500
        inet 192.168.122.1 netmask 255.255.255.0 broadcast 192.168.122.255
        ether 52:54:00:76:80:42 txqueuelen 1000 (Ethernet)
        RX packets 0 bytes 0 (0.0 B)
        RX errors 0 dropped 0 overruns 0 frame 0
        TX packets 0 bytes 0 (0.0 B)
        TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0

Not sure what I did wrong here. Any help is appreciated!! THANKS!!

The pbs_server was running at this point, since it could accept the job and queue it.

It would be good if you could please check:

  1. SELinux is disabled
  2. Ports 15001 to 15007 and 17001 are open between the components of the PBS Pro cluster
  3. Firewalls are disabled
  4. Whether PBS_SERVER is resolvable (forward and reverse) to the hostname/IP address
    #pbs_hostn -v pbshost (run it from the compute node(s) and the PBS server)
  5. Run strace on the server while it is running, submit a job, and check the strace output.
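
For item 4 on a single-machine setup like this one, the forward and reverse mapping in /etc/hosts can also be checked by hand. The following is only a rough sketch: hosts_forward and hosts_reverse are hypothetical helper names, they read a hosts-format file and never query DNS, and they ignore trailing comments on a line. They complement pbs_hostn rather than replace it.

```shell
# Forward lookup: print the first IP whose line in a hosts-format file lists NAME.
hosts_forward() {   # usage: hosts_forward NAME [FILE]
    awk -v name="$1" '
        /^[ \t]*#/ { next }                              # skip comment lines
        { for (i = 2; i <= NF; i++) if ($i == name) { print $1; exit } }
    ' "${2:-/etc/hosts}"
}

# Reverse lookup: print the first name listed for IP (the canonical name).
hosts_reverse() {   # usage: hosts_reverse IP [FILE]
    awk -v ip="$1" '
        /^[ \t]*#/ { next }                              # skip comment lines
        $1 == ip && NF >= 2 { print $2; exit }
    ' "${2:-/etc/hosts}"
}
```

With the /etc/hosts file shown earlier, hosts_forward pbshost should print 192.168.83.130 and hosts_reverse 192.168.83.130 should print pbshost. An empty result from either one means the mapping is missing from the file.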

Thank you for the quick response adarsh!

I followed your advice and checked those items, but couldn’t find anything wrong.

  1. SELinux is indeed disabled.

[hanxiao@localhost sbin]$ sestatus
SELinux status: disabled

  2. I’m running on a single virtual machine, so the server and the execution host are the same machine. I used nmap to scan pbshost and found that ports 15002 to 15004 are open. Ports 15001, 17001, 15005, etc. are not reported as open, but they are not blocked either. Could this be a problem?

[hanxiao@localhost sbin]$ nmap pbshost

Starting Nmap 6.40 ( http://nmap.org ) at 2019-05-28 01:08 PDT
Nmap scan report for pbshost (192.168.83.130)
Host is up (0.00021s latency).
Not shown: 995 closed ports
PORT STATE SERVICE
22/tcp open ssh
111/tcp open rpcbind
15002/tcp open unknown
15003/tcp open unknown
15004/tcp open unknown

Nmap done: 1 IP address (1 host up) scanned in 0.08 seconds

  3. I have made sure that the firewall is disabled.

[root@pbshost ~]# sudo firewall-cmd --state
not running

  4. PBS_SERVER seems to be resolvable, judging by the output of the pbs_hostn command:

[root@pbshost ~]# pbs_hostn -v pbshost
primary name: pbshost (from gethostbyname())
aliases: -none-
     address length: 4 bytes
     address: 192.168.83.130 (2186520768 dec) name: pbshost

I don’t really know how to run strace on the server, and I’m still confused about whether the server is running or not. From the output below, you can see that after I ran qsub, a job with id=14 was created. If I run qstat, I can see that 14.pbshost is in the queue. But when I check the status, it shows that pbs_server is not running.

==================================================================
[hanxiao@localhost sbin]$ echo 'sleep 30' | qsub

14.pbshost

[hanxiao@localhost sbin]$ /etc/init.d/pbs status

pbs_server is not running
pbs_mom is pid 44762
pbs_sched is not running
pbs_comm is not running
[hanxiao@localhost sbin]$ qstat
Job id            Name             User              Time Use S Queue
----------------  ---------------- ----------------  -------- - -----
11.pbshost        STDIN            hanxiao                  0 Q m1
14.pbshost        STDIN            hanxiao                  0 Q m1

==================================================================

Any more ideas on what’s going on?
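
One way to tell whether the daemons are running, independent of what the init script reports to a regular user, is to look at the process table directly. The sketch below is only an illustration: find_pbs_daemons is a hypothetical helper, and it assumes the server process may appear under the name pbs_server.bin.

```shell
# Read "PID COMMAND" lines on stdin (for example from: ps -e -o pid= -o comm=)
# and print "DAEMON PID" for every PBS daemon that is found.
find_pbs_daemons() {
    awk '
        { name = $2; sub(/^.*\//, "", name) }            # drop any leading path
        name == "pbs_server.bin" { name = "pbs_server" } # assumed name of the server binary
        name == "pbs_server" || name == "pbs_mom" || name == "pbs_sched" || name == "pbs_comm" {
            print name, $1
        }
    '
}
```

It would be run as: ps -e -o pid= -o comm= | find_pbs_daemons. With the server PID in hand, root can attach strace to it while a job is submitted, for example with strace -f -p PID -o /tmp/pbs_server.strace (the output path here is arbitrary).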

Thank you for running these checks and sharing the information.

  1. Please check and share the output of this command
    ps -ef | grep pbs_

  2. Please check the status as below
    systemctl status pbs

  3. Please share the output of the below commands
    pbsnodes -av
    qstat -answ1

Please find the outputs below:

  1. [hanxiao@localhost sbin]$ ps -ef | grep pbs_
    root 44732 1 0 00:53 ? 00:00:00 /opt/pbs/sbin/pbs_comm
    root 44762 1 0 00:53 ? 00:00:00 /opt/pbs/sbin/pbs_mom
    root 44774 1 0 00:53 ? 00:00:00 /opt/pbs/sbin/pbs_sched
    root 45041 1 0 00:53 ? 00:00:00 /opt/pbs/sbin/pbs_ds_monitor monitor
    postgres 45109 45090 0 00:53 ? 00:00:00 postgres: postgres pbs_datastore 192.168.83.130(39324) idle
    root 45118 1 0 00:53 ? 00:00:00 /opt/pbs/sbin/pbs_server.bin
    hanxiao 46028 21538 0 01:33 pts/0 00:00:00 grep --color=auto pbs_

  2. [hanxiao@localhost sbin]$ systemctl status pbs
    ● pbs.service - Portable Batch System
    Loaded: loaded (/opt/pbs/libexec/pbs_init.d; enabled; vendor preset: disabled)
    Active: inactive (dead) since Mon 2019-05-27 17:33:09 PDT; 8h ago
    Docs: man:pbs(8)

  3. [hanxiao@localhost sbin]$ pbsnodes -av
    localhost
    Mom = localhost
    ntype = PBS
    state = state-unknown,down
    pcpus = 1
    resources_available.host = localhost
    resources_available.ncpus = 1
    resources_available.vnode = localhost
    resources_assigned.accelerator_memory = 0kb
    resources_assigned.hbmem = 0kb
    resources_assigned.mem = 0kb
    resources_assigned.naccelerators = 0
    resources_assigned.ncpus = 0
    resources_assigned.vmem = 0kb
    resv_enable = True
    sharing = default_shared

[hanxiao@localhost sbin]$ qstat -answ1

pbshost:
                                                            Req'd  Req'd   Elap
Job ID          Username Queue    Jobname    SessID NDS TSK Memory Time  S Time
--------------- -------- -------- ---------- ------ --- --- ------ ----- - -----
11.pbshost      hanxiao  m1       STDIN          --   1   1     --    -- Q    -- --
   Not Running: Not enough free nodes available
14.pbshost      hanxiao  m1       STDIN          --   1   1     --    -- Q    -- --
   Not Running: Not enough free nodes available

Thank you for the information.

  1. Your PBS services are up and running, but the MoM is marked down because of a name resolution problem
  2. The /etc/init.d/pbs script is probably broken
  3. Please follow the steps below:
    #qmgr -c "d n @default"
    #qmgr -c "create node pbshost"
  4. systemctl restart pbs
  5. pbsnodes -av

I started with the two qmgr commands, but encountered some ‘Unknown Host’ errors. Did I miss something?

[root@pbshost ~]# qmgr -c “d n @default
Unknown Host.
qmgr: cannot connect to server n
Unknown Host.
qmgr: cannot connect to server @default
[root@pbshost ~]# qmgr -c “create node pbshost”
Unknown Host.
qmgr: cannot connect to server node
Unknown Host.
qmgr: cannot connect to server pbshost”
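
The error messages above name n, @default, node and pbshost” as servers. That pattern is consistent with the shell splitting the command on spaces because the pasted quotes were typographic (“ ”) rather than ASCII ("). This is an inference from the output, not something confirmed in the thread. The difference can be reproduced without PBS; count_args is a throwaway helper for this demonstration:

```shell
# Print how many arguments the shell passes to a command.
count_args() { echo "$#"; }

# Typographic quotes are ordinary characters to the shell, so the words are
# split on spaces into: -c, “create, node, pbshost”
count_args -c “create node pbshost”     # prints 4

# ASCII double quotes group the words into: -c, create node pbshost
count_args -c "create node pbshost"     # prints 2
```

If that is what happened here, retyping the qmgr commands with plain ASCII double quotes should make qmgr receive the whole directive as the single argument to -c.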

Please set a static hostname on your system. As your shell prompt shows, it is still using localhost.

  1. Check these commands
    hostname
    hostname -f
    ping pbshost
    nslookup pbshost
  2. hostnamectl set-hostname pbshost
  3. check /etc/hosts against pbshost
  4. reboot the system
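
These checks come down to whether the machine's own host name agrees with PBS_SERVER. A small sketch of that comparison follows. It assumes pbs.conf lives at /etc/pbs.conf (the thread does not give its path), and pbs_conf_server and server_matches_host are hypothetical helper names.

```shell
# Print the PBS_SERVER value from a pbs.conf-style file (default: /etc/pbs.conf).
pbs_conf_server() {   # usage: pbs_conf_server [CONF]
    sed -n 's/^PBS_SERVER=//p' "${1:-/etc/pbs.conf}" | head -n 1
}

# Succeed only when the short form of HOSTNAME equals the short form of PBS_SERVER.
server_matches_host() {   # usage: server_matches_host HOSTNAME [CONF]
    server=$(pbs_conf_server "$2")
    [ -n "$server" ] && [ "${server%%.*}" = "${1%%.*}" ]
}
```

For example, server_matches_host "$(hostname)" || echo "hostname and PBS_SERVER differ" prints a warning when the two names do not agree.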

Hey,

Good catch! Not sure why it showed as localhost, but I had set it as pbshost before and it shows up as pbshost now.

I tried your suggestions, and the nslookup command seems to expose something wrong:

[hanxiao@pbshost ~]$ hostname
pbshost

[hanxiao@pbshost ~]$ hostname -f
pbshost

[hanxiao@pbshost ~]$ ping pbshost

PING pbshost (192.168.83.130) 56(84) bytes of data.
64 bytes from pbshost (192.168.83.130): icmp_seq=1 ttl=64 time=0.076 ms
64 bytes from pbshost (192.168.83.130): icmp_seq=2 ttl=64 time=0.078 ms
64 bytes from pbshost (192.168.83.130): icmp_seq=3 ttl=64 time=0.054 ms
64 bytes from pbshost (192.168.83.130): icmp_seq=4 ttl=64 time=0.055 ms
64 bytes from pbshost (192.168.83.130): icmp_seq=5 ttl=64 time=0.081 ms
64 bytes from pbshost (192.168.83.130): icmp_seq=6 ttl=64 time=0.073 ms
64 bytes from pbshost (192.168.83.130): icmp_seq=7 ttl=64 time=0.095 ms
^C
--- pbshost ping statistics ---
7 packets transmitted, 7 received, 0% packet loss, time 6008ms
rtt min/avg/max/mdev = 0.054/0.073/0.095/0.014 ms
[hanxiao@pbshost ~]$ nslookup pbshost
Server: 192.168.83.2
Address: 192.168.83.2#53

** server can't find pbshost: SERVFAIL

The nslookup seems to identify pbshost as 192.168.83.2, but in my /etc/hosts file, it is set as 192.168.83.130. This is also confirmed by the output of the ping command.

You do not have a DNS server set up, so the output from nslookup looks reasonable to me.
The line ‘Server: 192.168.83.2’ is the name server that was queried (from /etc/resolv.conf), not the address of pbshost.
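
The reason ping succeeds while nslookup fails is that nslookup sends its query straight to the DNS server and does not read /etc/hosts, whereas ping and most other programs go through the system resolver, whose lookup order is set by the hosts: line in /etc/nsswitch.conf. The sketch below prints that order; hosts_lookup_order is a hypothetical helper name.

```shell
# Print the lookup sources configured on the "hosts:" line of an
# nsswitch.conf-style file (default: /etc/nsswitch.conf).
hosts_lookup_order() {   # usage: hosts_lookup_order [FILE]
    awk '$1 == "hosts:" { $1 = ""; sub(/^[ \t]+/, ""); print; exit }' \
        "${1:-/etc/nsswitch.conf}"
}
```

If files comes before dns in the result, an entry in /etc/hosts is found before DNS is consulted. Running getent hosts pbshost follows the same order, so it is a closer match to what applications see than nslookup is.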

Thanks a lot for all your input adarsh. I’ll probably switch to another version of CentOS and keep trying.

hey adarsh,

I figured out what was wrong with my setup. Even though I had set pbshost as the server, the node name was set to localhost when I created the node. After I corrected this, everything worked fine.

Thanks again for your input!
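
The mismatch described here was already visible in the earlier pbsnodes -av output, where the only node was named localhost, had Mom = localhost and was in state state-unknown,down. A rough sketch for spotting this at a glance is shown below. summarize_pbsnodes is a hypothetical helper, and it assumes the usual layout in which node names start in the first column and attributes are indented name = value lines.

```shell
# Read `pbsnodes -av` output on stdin and print "NODE MOM STATE" for each node.
summarize_pbsnodes() {
    awk '
        function emit() { if (node != "") print node, mom, state }
        /^[^ \t]/ { emit(); node = $1; mom = "-"; state = "-"; next }  # new node block
        $1 == "Mom"   && $2 == "=" { mom = $3 }
        $1 == "state" && $2 == "=" { state = $3 }
        END { emit() }
    '
}
```

It would be run as pbsnodes -av | summarize_pbsnodes. The thread does not show the exact commands used for the final fix; the steps suggested earlier were to delete the existing node with qmgr, create a node named pbshost, restart PBS, and check pbsnodes -av again.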

Thank you @hanxiao. Nice one!