Hello folks,
I installed pbspro 19.1.1 on a clean CentOS 7 virtual machine, but I cannot get even a single job to run.
When I start PBS as root, everything looks fine:
[root@pbshost ~]# /etc/init.d/pbs start
Starting PBS
PBS comm
/opt/pbs/sbin/pbs_comm ready (pid=37824), Proxy Name:pbshost:17001, Threads:4
PBS mom
Creating usage database for fairshare.
PBS sched
Connecting to PBS dataservice…connected to PBS dataservice@pbshost
Licenses valid for 10000000 Floating hosts
PBS server
Then I check the PBS status, and again everything appears to be running:
[root@pbshost ~]# /etc/init.d/pbs status
pbs_server is pid 38207
pbs_mom is pid 37854
pbs_sched is pid 37866
pbs_comm is 37824
At this point, if I submit a job, it just sits in the queue and never executes. To find out what is happening, I left the root shell and checked the PBS status again as a regular user. This time it reports that only pbs_mom is running:
[hanxiao@localhost mail]$ /etc/init.d/pbs status
pbs_server is not running
pbs_mom is pid 37854
pbs_sched is not running
pbs_comm is not running
When I check the server_logs file, I see the following, including a ‘Failed to resolve address’ error:
05/27/2019 20:44:13;0002;Server@pbshost;Svr;Server@pbshost;connected to PBS dataservice@pbshost
05/27/2019 20:44:13;0086;Server@pbshost;Svr;pbs_python_ext_quick_start_interpreter;--> Python Interpreter quick started, compiled with version:'2.7.5 (default, Apr 9 2019, 14:30:50)
[GCC 4.8.5 20150623 (Red Hat 4.8.5-36)]' <--
05/27/2019 20:44:13;0086;Server@pbshost;Svr;pbs_python_ext_quick_start_interpreter;--> Inserted Altair PBS Python modules dir '/opt/pbs/lib/python/altair' <--
05/27/2019 20:44:13;0002;Server@pbshost;n/a;setup_env;read environment from /var/spool/pbs/pbs_environment
05/27/2019 20:44:13;0c06;Server@pbshost;TPP;Server@pbshost(Main Thread);TPP leaf node names = 192.168.83.130:15001,127.0.0.1:15001,192.168.83.130:15001,192.168.122.1:15001
05/27/2019 20:44:13;0002;Server@pbshost;Svr;Server@pbshost;Server pid = 38207 ready; using ports Server:15001 Scheduler:15004 MOM:15002 RM:15003
05/27/2019 20:44:13;0c06;Server@pbshost;TPP;Server@pbshost(Thread 0);Thread ready
05/27/2019 20:44:13;0c06;Server@pbshost;TPP;Server@pbshost(Thread 0);Registering address 192.168.83.130:15001 to pbs_comm
05/27/2019 20:44:13;0c06;Server@pbshost;TPP;Server@pbshost(Thread 0);Registering address 192.168.122.1:15001 to pbs_comm
05/27/2019 20:44:13;0c06;Server@pbshost;TPP;Server@pbshost(Thread 0);Connected to pbs_comm pbshost:17001
05/27/2019 20:44:13;0c06;Server@pbshost;TPP;tpp_open(Main Thread);Failed to resolve address, err=0
05/27/2019 20:44:13;0001;Server@pbshost;Svr;Server@pbshost;Success (0) in mom_ping_need, rpp_open to localhost, port 15003
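For anyone triaging a similar log, the records above can be filtered mechanically. The following is a minimal Python sketch, not part of PBS. It assumes each record is six semicolon-separated fields (the field names timestamp, event_code, daemon, obj_type, obj_name and message are illustrative labels of my own), and it skips continuation lines such as the "[GCC ...]" one.

```python
from typing import List, NamedTuple


class LogRecord(NamedTuple):
    # Field names are illustrative labels, not official PBS terminology.
    timestamp: str
    event_code: str
    daemon: str
    obj_type: str
    obj_name: str
    message: str


def parse_log(text: str) -> List[LogRecord]:
    """Split each line into six ';'-separated fields; skip lines that do not fit."""
    records = []
    for line in text.splitlines():
        parts = line.strip().split(";", 5)
        if len(parts) == 6:
            records.append(LogRecord(*parts))
    return records


def find_failures(records: List[LogRecord]) -> List[LogRecord]:
    """Return records whose message mentions a failure (case-insensitive)."""
    return [r for r in records if "fail" in r.message.lower()]


# Three lines copied from the log excerpt above.
SAMPLE = """\
05/27/2019 20:44:13;0c06;Server@pbshost;TPP;Server@pbshost(Thread 0);Connected to pbs_comm pbshost:17001
05/27/2019 20:44:13;0c06;Server@pbshost;TPP;tpp_open(Main Thread);Failed to resolve address, err=0
05/27/2019 20:44:13;0001;Server@pbshost;Svr;Server@pbshost;Success (0) in mom_ping_need, rpp_open to localhost, port 15003
"""

if __name__ == "__main__":
    for rec in find_failures(parse_log(SAMPLE)):
        print(rec.timestamp, rec.obj_name, rec.message, sep=" | ")
```

On the sample, only the tpp_open record is selected. The line that follows it is logged as "Success (0)" even though it sits right after the resolve failure, so a keyword filter like this is only a first pass.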
This is where I’m really confused: I have set PBS_SERVER to pbshost and added the IP address of pbshost to the /etc/hosts file.
pbs.conf:
PBS_EXEC=/opt/pbs
PBS_SERVER=pbshost
PBS_START_SERVER=1
PBS_START_SCHED=1
PBS_START_COMM=1
PBS_START_MOM=1
PBS_HOME=/var/spool/pbs
PBS_CORE_LIMIT=unlimited
PBS_SCP=/bin/scp
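A quick way to double-check which daemons this file enables is to read it as plain KEY=VALUE pairs. Here is a minimal Python sketch under that assumption. The helper names parse_pbs_conf and enabled_daemons are hypothetical, not a PBS API.

```python
from typing import Dict, List


def parse_pbs_conf(text: str) -> Dict[str, str]:
    """Read KEY=VALUE lines, ignoring blank lines and '#' comments."""
    conf = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep:
            conf[key.strip()] = value.strip()
    return conf


def enabled_daemons(conf: Dict[str, str]) -> List[str]:
    """List PBS_START_* entries set to 1, e.g. 'SERVER' for PBS_START_SERVER=1."""
    prefix = "PBS_START_"
    return sorted(
        key[len(prefix):]
        for key, value in conf.items()
        if key.startswith(prefix) and value == "1"
    )


# The pbs.conf contents shown above.
SAMPLE = """\
PBS_EXEC=/opt/pbs
PBS_SERVER=pbshost
PBS_START_SERVER=1
PBS_START_SCHED=1
PBS_START_COMM=1
PBS_START_MOM=1
PBS_HOME=/var/spool/pbs
PBS_CORE_LIMIT=unlimited
PBS_SCP=/bin/scp
"""

if __name__ == "__main__":
    conf = parse_pbs_conf(SAMPLE)
    print("server name:", conf.get("PBS_SERVER"))
    print("daemons enabled:", enabled_daemons(conf))
```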
and the /etc/hosts file:
127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4
::1 localhost localhost.localdomain localhost6 localhost6.localdomain6
192.168.83.130 pbshost
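To confirm what these entries actually map, the file can be parsed into a name-to-address table. Here is a minimal Python sketch, assuming the usual "address name [alias ...]" layout with "#" comments (the function names are hypothetical). It only inspects the file text; it does not consult DNS or the resolver order the system really uses.

```python
import ipaddress
from typing import Dict, List


def parse_hosts(text: str) -> Dict[str, List[str]]:
    """Map every host name and alias to the addresses listed for it."""
    table: Dict[str, List[str]] = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments
        if not line:
            continue
        address, *names = line.split()
        for name in names:
            table.setdefault(name, []).append(address)
    return table


def non_loopback_addresses(table: Dict[str, List[str]], name: str) -> List[str]:
    """Addresses for `name` that are not 127.0.0.0/8 or ::1."""
    return [
        addr for addr in table.get(name, [])
        if not ipaddress.ip_address(addr).is_loopback
    ]


# The /etc/hosts contents shown above.
SAMPLE = """\
127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4
::1 localhost localhost.localdomain localhost6 localhost6.localdomain6
192.168.83.130 pbshost
"""

if __name__ == "__main__":
    table = parse_hosts(SAMPLE)
    for name in ("pbshost", "localhost"):
        print(name, "->", non_loopback_addresses(table, name) or "loopback only")
```

With these entries, pbshost maps to a routable address while localhost maps only to loopback addresses, which may matter given that the last log line mentions rpp_open to localhost.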
The IP address 192.168.83.130 is what ifconfig reports:
[root@pbshost ~]# ifconfig
eno16777736: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
        inet 192.168.83.130 netmask 255.255.255.0 broadcast 192.168.83.255
        inet6 fe80::20c:29ff:fe53:4996 prefixlen 64 scopeid 0x20
        ether 00:0c:29:53:49:96 txqueuelen 1000 (Ethernet)
        RX packets 3860 bytes 299161 (292.1 KiB)
        RX errors 0 dropped 0 overruns 0 frame 0
        TX packets 627 bytes 58139 (56.7 KiB)
        TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
lo: flags=73<UP,LOOPBACK,RUNNING> mtu 65536
        inet 127.0.0.1 netmask 255.0.0.0
        inet6 ::1 prefixlen 128 scopeid 0x10
        loop txqueuelen 1000 (Local Loopback)
        RX packets 11126 bytes 1582114 (1.5 MiB)
        RX errors 0 dropped 0 overruns 0 frame 0
        TX packets 11126 bytes 1582114 (1.5 MiB)
        TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
virbr0: flags=4099<UP,BROADCAST,MULTICAST> mtu 1500
        inet 192.168.122.1 netmask 255.255.255.0 broadcast 192.168.122.255
        ether 52:54:00:76:80:42 txqueuelen 1000 (Ethernet)
        RX packets 0 bytes 0 (0.0 B)
        RX errors 0 dropped 0 overruns 0 frame 0
        TX packets 0 bytes 0 (0.0 B)
        TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
Not sure what I did wrong here. Any help is appreciated!! THANKS!!