Test job waits in queue

watzinki · March 1, 2022, 9:05am

Hi All,
I am new to PBS Pro. I installed the PBS server and execution RPM packages on the master and the slave node, respectively. I tried to set everything up following instructions on the web; however, when I run a test job, it waits in the queue forever. Any help would be appreciated.

adarsh · March 1, 2022, 11:04am

Please share the output of the commands below:

qstat -answ1

pbsnodes -av

qmgr -c "print server"

watzinki · March 1, 2022, 3:14pm

I have one master node, called hep-node0, and one slave node. I set up the master node so that I could use it for job execution as well.

Below is the output of the commands I executed as directed.

[ali_0@hep-node0 ~]$ qstat -answ1
hep-node0:
                                                             Req'd  Req'd   Elap
Job ID          Username Queue Jobname        SessID NDS TSK Memory Time  S Time

1014.hep-node0  ali_0    batch example-job.sh     --   1   1     -- 00:00 Q    -- --
   Not Running: Not enough free nodes available
1015.hep-node0  ali_0    batch STDIN              --   1   1     -- 00:00 Q    -- --
   Not Running: Not enough free nodes available
[ali_0@hep-node0 ~]$ pbsnodes -av
ali_2
Mom = hep-node2
ntype = PBS
state = state-unknown,down
pcpus = 1
resources_available.host = hep-node2
resources_available.ncpus = 1
resources_available.vnode = ali_2
resources_assigned.accelerator_memory = 0kb
resources_assigned.hbmem = 0kb
resources_assigned.mem = 0kb
resources_assigned.naccelerators = 0
resources_assigned.ncpus = 0
resources_assigned.vmem = 0kb
comment = node down: communication closed
resv_enable = True
sharing = default_shared
last_state_change_time = Mon Feb 28 23:02:12 2022

[ali_0@hep-node0 ~]$ qmgr -c “print server”
Unknown Host.
qmgr: cannot connect to server server”

adarsh · March 1, 2022, 3:53pm

Please check the output of pbsnodes -av

The node status is down. The reason might be that the pbs_mom service is down on the compute node.

You can check this with systemctl status pbs (on the compute node) or with ps -ef | grep pbs_mom.
This is why your job stays in the queued state: there aren't enough resources available to run it.
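
To spot such nodes quickly, the pbsnodes output can be filtered for vnodes whose state is anything other than free. This is only a minimal sketch, assuming the layout shown in this thread (a vnode name line followed by "key = value" lines). The function name nodes_not_free and the abridged sample text are illustrative; on a live system you would pipe pbsnodes -av into the function instead.

```shell
#!/bin/sh
# Print "<vnode>: <state>" for every vnode whose state is not "free".
# Reads pbsnodes-style text on stdin.
nodes_not_free() {
    awk '
        NF == 0       { next }              # skip blank separator lines
        !/ = /        { node = $1; next }   # a line without " = " names a vnode
        $1 == "state" { if ($3 != "free") print node ": " $3 }
    '
}

# Abridged sample in the same shape as the output posted in this thread.
nodes_not_free <<'EOF'
ali_2
     Mom = hep-node2
     state = state-unknown,down
hep-node0
     Mom = hep-node0
     state = free
EOF
```

Note that a busy but healthy node would also be listed, since the filter only tests for "free".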

Make sure

pbs_mom service is up and running
cat /etc/pbs.conf | grep PBS_START_MOM
PBS_START_MOM=1

Nothing is wrong with the qmgr command itself; you need to type it instead of copy-pasting it.
The copy-pasted version contains special characters (curly quotes instead of plain double quotes), and hence it fails.
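
If retyping is inconvenient, pasted text can be passed through a small filter that turns typographic quotes back into plain ASCII quotes before it is used. This is a sketch only; the helper name straighten_quotes is made up for this example and is not part of PBS.

```shell
#!/bin/sh
# Replace typographic (curly) quotes with plain ASCII quotes.
# Each quote character gets its own expression so the filter does not
# depend on the locale's handling of multibyte bracket expressions.
straighten_quotes() {
    sed -e 's/“/"/g' -e 's/”/"/g' -e "s/‘/'/g" -e "s/’/'/g"
}

# The command as it appears after copy-pasting from the forum page.
printf '%s\n' 'qmgr -c “print server”' | straighten_quotes
```

The cleaned-up line can then be pasted into the terminal, or you can simply type qmgr -c "print server" by hand.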

watzinki · March 1, 2022, 5:38pm

Thank you for your replies. The pbs_mom service seems to be up and running on the master node “hep-node0”. Below is the output of the “systemctl status pbs” command.
[ali_0@hep-node0 ~]$ systemctl status pbs
● pbs.service - Portable Batch System
Loaded: loaded (/opt/pbs/libexec/pbs_init.d; enabled; vendor preset: disabled)
Active: active (running) since Mon 2022-02-28 23:02:07 EET; 1 day 2h ago
Docs: man:pbs(8)
Process: 1245 ExecStart=/opt/pbs/libexec/pbs_init.d start (code=exited, status=0/SUCCESS)
Tasks: 14
Memory: 17.1M
CGroup: /system.slice/pbs.service
├─1412 /opt/pbs/sbin/pbs_comm
├─1466 /opt/pbs/sbin/pbs_mom
├─1516 /opt/pbs/sbin/pbs_sched
├─1877 /opt/pbs/sbin/pbs_ds_monitor monitor
├─1921 /usr/bin/postgres -D /var/spool/pbs/datastore -p 15007
├─1929 postgres: logger process
├─1931 postgres: checkpointer process
├─1932 postgres: writer process
├─1933 postgres: wal writer process
├─1934 postgres: autovacuum launcher process
├─1935 postgres: stats collector process
├─2075 postgres: postgres pbs_datastore 192.168.1.1(59718) idle
└─2076 /opt/pbs/sbin/pbs_server.bin

Feb 28 23:02:06 hep-node0.com systemd[1]: Starting Portable Batch System…
Feb 28 23:02:07 hep-node0.com systemd[1]: Started Portable Batch System.
Feb 28 23:02:07 hep-node0.com su[1610]: (to postgres) root on none
Feb 28 23:02:07 hep-node0.com su[1687]: (to postgres) root on none
Feb 28 23:02:07 hep-node0.com su[1754]: (to postgres) root on none
Feb 28 23:02:07 hep-node0.com su[1792]: (to postgres) root on none
Feb 28 23:02:07 hep-node0.com su[1878]: (to postgres) root on none
Feb 28 23:02:12 hep-node0.com pbs_init.d[1245]: Starting PBS in background

[ali_0@hep-node0 ~]$ qmgr -c “print server”

# Create queues and set their attributes.

# Create and define queue batch

create queue batch
set queue batch queue_type = Execution
set queue batch resources_default.ncpus = 1
set queue batch resources_default.nodect = 1
set queue batch resources_default.nodes = 1
set queue batch resources_default.walltime = 00:00:36
set queue batch enabled = True
set queue batch started = True

# Set server attributes.

set server scheduling = True
set server acl_roots = username@*
set server operators = username@*
set server default_queue = batch
set server log_events = 511
set server mail_from = adm
set server query_other_jobs = True
set server resources_default.ncpus = 1
set server default_chunk.ncpus = 1
set server scheduler_iteration = 600
set server flatuid = True
set server resv_enable = True
set server node_fail_requeue = 310
set server max_array_size = 10000
set server pbs_license_min = 0
set server pbs_license_max = 2147483647
set server pbs_license_linger_time = 31536000
set server eligible_time_enable = False
set server max_concurrent_provision = 5
set server max_job_sequence_id = 9999999

adarsh · March 1, 2022, 9:47pm

Thank you for the above information.

The vnode name should also be hep-node2; I am not sure why it is ali_2.

For example, a setup similar to yours would produce this output:
[root@demo ~]# pbsnodes -av | grep demo
demo
     Mom = demo
     resources_available.host = demo
     resources_available.vnode = demo
[root@demo ~]# cat /etc/hosts | grep demo
192.168.64.128 demo
[root@demo ~]# cat /etc/pbs.conf | grep MOM
PBS_START_MOM=1
[root@demo ~]# cat /var/spool/pbs/mom_priv/config | grep client
$clienthost demo
[root@demo ~]# ps -ef | grep pbs_mom
root        1620       1  0 Feb28 ?        00:00:07 /opt/pbs/sbin/pbs_mom
root     1129595 1126933  0 21:45 pts/0    00:00:00 grep --color=auto pbs_mom

Please make sure that:

SELinux is disabled and the system has been rebooted
ports 15001 to 15009 and 17001 are open for communication (or firewalld / iptables is completely disabled)
the IP address is static, and the hostname is up to date in /etc/hosts or resolvable via DNS
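
The name-resolution item in the checklist above can be verified with a small lookup against a hosts-format file. This is a rough sketch: the function name hosts_lookup is made up for illustration, and the address used for hep-node2 in the sample file is a placeholder, not a value taken from this thread. Without a second argument the function reads /etc/hosts.

```shell
#!/bin/sh
# Print the IP address that a hosts-format file maps to a hostname.
# Exit status is non-zero when the name is not listed.
#   $1 = hostname, $2 = hosts file (defaults to /etc/hosts)
hosts_lookup() {
    awk -v name="$1" '
        {
            sub(/#.*/, "")                  # drop comments
            for (i = 2; i <= NF; i++)
                if ($i == name) { print $1; found = 1; exit }
        }
        END { exit (found ? 0 : 1) }
    ' "${2:-/etc/hosts}"
}

# Demo against a temporary file; 192.168.1.2 is a hypothetical address.
tmp=$(mktemp)
printf '192.168.1.1 hep-node0\n192.168.1.2 hep-node2   # placeholder IP\n' > "$tmp"
hosts_lookup hep-node2 "$tmp"
hosts_lookup hep-node9 "$tmp" || echo "hep-node9 is not listed"
rm -f "$tmp"
```

Run it on both machines: each node should be able to resolve its own name and the name of the other node.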

watzinki · March 2, 2022, 7:14am

Dear @adarsh, thanks again for your answers. I got the master node running. I can run jobs on it, but not on the slave node, which is the ali_2 machine, whose hostname is “hep-node2”.

SELinux is already disabled and both machines have been rebooted. I also set up passwordless SSH access between the two machines.
With “systemctl status pbs” on the slave node, I can see that PBS is running there. However, when I tried to register the ali_2 machine as a slave node with the command qmgr -c "create node hep-node2", I got the error message “No route to host qmgr: cannot connect to server”.
My naive first thought was that this is because the ports need to be opened. If you agree, could you please elaborate a bit on how to open these ports on CentOS 7? Also, do these ports have to be open on both the slave and the master node, or on the slave only?

adarsh · March 2, 2022, 8:10am

Please check:

DNS / static IP / hostname / /etc/hosts (network address resolution is important)
systemctl stop firewalld ; systemctl disable firewalld
# otherwise, allow the above-mentioned ports in firewalld
firewall-cmd --zone=public --permanent --add-port=15001/tcp # repeat for each port
firewall-cmd --reload


Yes, the ports have to be open on both the master and the slave node.

If you are interested, please refer to: Building a PBS Professional Virtual Test Cluster with Ubuntu
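
Repeating the firewall-cmd line by hand for every port is error-prone, so the full set of commands can be generated with a loop. The sketch below is a dry run that only prints the commands for ports 15001 to 15009 and 17001 (the ports named earlier in this thread); the function name pbs_firewall_commands is made up for this example. Review the output, then run the printed commands as root on each node.

```shell
#!/bin/sh
# Dry run: print one firewall-cmd invocation per PBS port, then the reload.
# Nothing is changed on the system; the commands are only echoed.
pbs_firewall_commands() {
    for port in $(seq 15001 15009) 17001; do
        echo "firewall-cmd --zone=public --permanent --add-port=${port}/tcp"
    done
    echo "firewall-cmd --reload"
}

pbs_firewall_commands
```

Printing first rather than executing keeps the example harmless on machines where firewalld is not installed or where a different zone is in use.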

watzinki · March 2, 2022, 11:09am

Thank you again!

It seems the master node sees the slave and vice versa.
After the firewall settings, the pbsnodes -av output on both the master node and the slave node is as follows.
One more question: is there a way to check whether submitted jobs are executed on the slave node as well? Although the nodes seem able to contact each other, I suspect the test job is only running on the master node.

hep-node0
Mom = hep-node0
ntype = PBS
state = free
pcpus = 4
resources_available.arch = linux
resources_available.host = hep-node0
resources_available.mem = 16265032kb
resources_available.ncpus = 4
resources_available.vnode = hep-node0
resources_assigned.accelerator_memory = 0kb
resources_assigned.hbmem = 0kb
resources_assigned.mem = 0kb
resources_assigned.naccelerators = 0
resources_assigned.ncpus = 0
resources_assigned.vmem = 0kb
resv_enable = True
sharing = default_shared
last_state_change_time = Wed Mar 2 12:08:35 2022
last_used_time = Wed Mar 2 12:20:41 2022

hep-node2
Mom = hep-node2
ntype = PBS
state = free
pcpus = 1
resources_available.arch = linux
resources_available.host = hep-node2
resources_available.mem = 8007520kb
resources_available.ncpus = 4
resources_available.vnode = hep-node2
resources_assigned.accelerator_memory = 0kb
resources_assigned.hbmem = 0kb
resources_assigned.mem = 0kb
resources_assigned.naccelerators = 0
resources_assigned.ncpus = 0
resources_assigned.vmem = 0kb
resv_enable = True
sharing = default_shared
last_state_change_time = Wed Mar 2 12:08:35 2022

adarsh · March 3, 2022, 7:51am

Nice one!

Try this:

Submit as below:
qsub -l select=1:ncpus=1 -l place=excl -- /bin/sleep 1000
qsub -l select=1:ncpus=1 -l place=excl -- /bin/sleep 1000
qstat -answ1  # this command should show you on which node(s) each job is running
ssh <node>
ps -ef | grep sleep
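
The node lookup in the steps above can also be scripted. The following is a hedged sketch that assumes the single-line qstat -answ1 layout seen in this thread, with the job state in the tenth column and the execution host in the last one; it would break on job names that contain spaces. The function name running_job_hosts and the sample lines are illustrative.

```shell
#!/bin/sh
# Print "<job id> <exec host>" for every running job.
# Reads qstat -answ1 style text on stdin.
# Assumes column 10 is the state and the last column is the exec host.
running_job_hosts() {
    awk '$10 == "R" { print $1, $NF }'
}

# Sample lines in the shape of the qstat output; on a live system use:
#   qstat -answ1 | running_job_hosts
running_job_hosts <<'EOF'
Job ID          Username Queue Jobname SessID NDS TSK Memory Time  S Time
2047.hep-node0  ali_0    batch STDIN     5732   1   1     -- 200:0 R 00:02 hep-node0/0
2048.hep-node0  ali_0    batch STDIN       --   1   1     -- 200:0 Q    -- --
EOF
```

If every running job reports the same host even with place=excl, the second node is not accepting jobs and its MoM log is the next place to look.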

watzinki · March 3, 2022, 9:02am

@adarsh I really appreciate your help and patience thus far. As I suspected in my previous message, jobs are not sent to the slave node but run on the master node only. The qstat -answ1 command outputs the following.
Job ID          Username Queue Jobname SessID NDS TSK Memory Time  S Time

2047.hep-node0  ali_0    batch STDIN     5732   1   1     -- 200:0 R 00:02 hep-node0/0
   Job run at Thu Mar 03 at 17:05 on (hep-node0:ncpus=1)
2048.hep-node0  ali_0    batch STDIN       --   1   1     -- 200:0 H    -- --
   job held, too many failed attempts to run

Also, I opened the ports you mentioned above using the commands you gave. However, some of the ports still seem to be closed. Could this be the reason?

[ali_0@hep-node0 ~]$ sudo nmap -p 15001-15009 192.168.1.1
[sudo] password for ali_0:

Starting Nmap 6.40 ( http://nmap.org ) at 2022-03-03 17:17 EET
Nmap scan report for hep-node0 (192.168.1.1)
Host is up (0.00012s latency).
PORT STATE SERVICE
15001/tcp open unknown
15002/tcp open unknown
15003/tcp open unknown
15004/tcp open unknown
15005/tcp closed unknown
15006/tcp closed unknown
15007/tcp open unknown
15008/tcp closed unknown
15009/tcp closed unknown

adarsh · March 3, 2022, 4:13pm

Thank you @watzinki, my pleasure!

Please check the MoM logs on the second node:
source /etc/pbs.conf ; cd $PBS_HOME/mom_logs and check the log for the day (YYYYMMDD)

Suspected causes on the second node:

the user account does not exist or its password is not set
the home directory is missing
permission issues
The MoM logs should be able to tell you what the issue is.
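
As a convenience, the path of the current day's MoM log can be built from PBS_HOME and the date. This is a sketch under the assumption that the log files are named YYYYMMDD, as described above. The fallback /var/spool/pbs is the location seen elsewhere in this thread and may differ on your system, and the function name mom_log_path is made up for this example.

```shell
#!/bin/sh
# Print the path of today's MoM log file.
# PBS_HOME normally comes from /etc/pbs.conf; /var/spool/pbs is only a
# fallback assumption based on the paths shown in this thread.
mom_log_path() {
    echo "${PBS_HOME:-/var/spool/pbs}/mom_logs/$(date +%Y%m%d)"
}

log=$(mom_log_path)
if [ -r "$log" ]; then
    tail -n 20 "$log"          # show the most recent entries
else
    echo "no readable MoM log at $log"
fi
```

Run it on the execution node (not the server), usually as root, since the log directory is typically not world-readable.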

watzinki · March 6, 2022, 9:08pm

Dear @adarsh, I have been trying to figure out what is causing the issue by myself for the last few days, but with no luck so far. I set up passwordless login among the nodes, and I can verify it by SSHing from one node to another, back and forth, without entering a password. However, it still seems that no job is executed on the slave node, only on the master node. Below is a dump of the log file of the slave node, hep-node2. I hope you have other suggestions.
Thanks in advance.

03/06/2022 21:00:12;0100;pbs_mom;Req;;Type 1 request received from root@192.168.1.1:15001, sock=1
03/06/2022 21:00:12;0100;pbs_mom;Req;;Type 5 request received from root@192.168.1.1:15001, sock=1
03/06/2022 21:00:12;0028;pbs_mom;Job;2067.hep-node0;No Password Entry for User ali_0
03/06/2022 21:00:12;0008;pbs_mom;Job;2067.hep-node0;kill_job
03/06/2022 21:00:12;0100;pbs_mom;Job;2067.hep-node0;hep-node2 cput=00:00:00 mem=0kb
03/06/2022 21:00:12;0100;pbs_mom;Job;2067.hep-node0;Obit sent
03/06/2022 21:00:12;0100;pbs_mom;Req;;Type 6 request received from root@192.168.1.1:15001, sock=1
03/06/2022 21:00:12;0080;pbs_mom;Job;2067.hep-node0;delete job request received
03/06/2022 21:00:12;0008;pbs_mom;Job;2067.hep-node0;kill_job
03/06/2022 21:00:12;0100;pbs_mom;Req;;Type 1 request received from root@192.168.1.1:15001, sock=1
03/06/2022 21:00:12;0100;pbs_mom;Req;;Type 5 request received from root@192.168.1.1:15001, sock=1
03/06/2022 21:00:12;0028;pbs_mom;Job;2067.hep-node0;No Password Entry for User ali_0
03/06/2022 21:00:12;0008;pbs_mom;Job;2067.hep-node0;kill_job
03/06/2022 21:00:12;0100;pbs_mom;Job;2067.hep-node0;hep-node2 cput=00:00:00 mem=0kb
03/06/2022 21:00:12;0100;pbs_mom;Job;2067.hep-node0;Obit sent
03/06/2022 21:00:12;0100;pbs_mom;Req;;Type 6 request received from root@192.168.1.1:15001, sock=1
03/06/2022 21:00:12;0080;pbs_mom;Job;2067.hep-node0;delete job request received
03/06/2022 21:00:12;0008;pbs_mom;Job;2067.hep-node0;kill_job
03/06/2022 21:00:12;0100;pbs_mom;Req;;Type 1 request received from root@192.168.1.1:15001, sock=1
03/06/2022 21:00:12;0100;pbs_mom;Req;;Type 5 request received from root@192.168.1.1:15001, sock=1
03/06/2022 21:00:12;0028;pbs_mom;Job;2067.hep-node0;No Password Entry for User ali_0
03/06/2022 21:00:12;0008;pbs_mom;Job;2067.hep-node0;kill_job
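
Because the same entries repeat many times, it can help to condense a log like the one above into one line per distinct job message with a count. This is a minimal sketch, assuming the semicolon-separated layout visible in the dump (timestamp;event;daemon;type;job id;message); the function name summarize_mom_log is illustrative.

```shell
#!/bin/sh
# Count distinct "<job id>: <message>" pairs among the Job records of a
# MoM log read from stdin, most frequent first.
summarize_mom_log() {
    awk -F';' '$4 == "Job" { print $5 ": " $6 }' | sort | uniq -c | sort -rn
}

# Three records copied from the dump above; on the node itself, feed the
# day's log file to the function instead.
summarize_mom_log <<'EOF'
03/06/2022 21:00:12;0028;pbs_mom;Job;2067.hep-node0;No Password Entry for User ali_0
03/06/2022 21:00:12;0008;pbs_mom;Job;2067.hep-node0;kill_job
03/06/2022 21:00:12;0028;pbs_mom;Job;2067.hep-node0;No Password Entry for User ali_0
EOF
```

Condensed this way, the line that precedes each kill_job stands out from the routine request records.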

adarsh · March 7, 2022, 7:52am

The “No Password Entry for User ali_0” line is the issue. It seems there is a problem with the user account on hep-node2: either the account does not exist there, or its password is not set.
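
One way to confirm this on hep-node2 is to ask the system whether it knows the account at all. The sketch below uses the standard id utility; the function name user_exists is made up for this example, and ali_0 is the user name from the log above. How the account is then created (local useradd, LDAP, NIS, ...) depends on the site and is not covered here.

```shell
#!/bin/sh
# Succeed when the system's user database has an entry for the given name.
user_exists() {
    id -u "$1" >/dev/null 2>&1
}

# Run this on the execution node that rejects the job.
if user_exists ali_0; then
    echo "ali_0 is known on $(hostname)"
else
    echo "ali_0 has no account on $(hostname)"
fi
```

Passwordless SSH between the nodes does not prove anything here if it was tested as a different user, so the check should be made for the exact name that appears in the MoM log.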
