Test job waits in queue

Hi All,
I am new to PBS Pro. I installed the PBS server and execution RPM packages on the master node and the slave node respectively. I tried to set everything up following instructions on the web; however, when I run a test job it waits in the queue forever. Any help would be appreciated.

Please share the output of the commands below:

  1. qstat -answ1
  2. pbsnodes -av
  3. qmgr -c “print server”

I have one master node, called hep-node0, and one slave node. I set up the master node so that I can use it for job execution as well.

Below is the output of the commands I executed as directed.

[ali_0@hep-node0 ~]$ qstat -answ1
hep-node0:
                                                              Req'd  Req'd   Elap
Job ID          Username Queue Jobname         SessID NDS TSK Memory Time  S Time
--------------- -------- ----- --------------- ------ --- --- ------ ----- - -----
1014.hep-node0  ali_0    batch example-job.sh      --   1   1     -- 00:00 Q    --
   Not Running: Not enough free nodes available
1015.hep-node0  ali_0    batch STDIN               --   1   1     -- 00:00 Q    --
   Not Running: Not enough free nodes available
[ali_0@hep-node0 ~]$ pbsnodes -av
ali_2
Mom = hep-node2
ntype = PBS
state = state-unknown,down
pcpus = 1
resources_available.host = hep-node2
resources_available.ncpus = 1
resources_available.vnode = ali_2
resources_assigned.accelerator_memory = 0kb
resources_assigned.hbmem = 0kb
resources_assigned.mem = 0kb
resources_assigned.naccelerators = 0
resources_assigned.ncpus = 0
resources_assigned.vmem = 0kb
comment = node down: communication closed
resv_enable = True
sharing = default_shared
last_state_change_time = Mon Feb 28 23:02:12 2022

[ali_0@hep-node0 ~]$ qmgr -c “print server”
Unknown Host.
qmgr: cannot connect to server server”

Please check the output of pbsnodes -av:

  1. The node state is down. The reason might be:
  • The pbs_mom service is down on the compute node (check with systemctl status pbs, or ps -ef | grep pbs_mom).
  • Because the node is down, there aren't enough resources available to run your job, which is why it stays in the queued state.

Make sure

  • pbs_mom service is up and running
  • cat /etc/pbs.conf | grep PBS_START_MOM
    PBS_START_MOM=1

Nothing is wrong with the setup here; you need to type the command instead of copy-pasting it.
The pasted version contains typographic (curly) quote characters rather than plain ASCII quotes, and hence it fails.
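A quick way to see why: the quotes picked up from a web page are not the ASCII quote character at all, so the shell never strips them and qmgr ends up treating server” as a hostname, which matches the "Unknown Host" output above. A small sketch using od:

```shell
# The typographic quote from a web page is a 3-byte UTF-8 character,
# which the shell passes through literally as part of the argument:
printf '“' | od -An -tx1   # -> e2 80 9c  (U+201C LEFT DOUBLE QUOTATION MARK)
# The ASCII quote the shell actually parses and removes is a single byte:
printf '"' | od -An -tx1   # -> 22
```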

Thank you for your replies. The pbs_mom service seems to be up and running on the master node hep-node0. Below is the output of the systemctl status pbs command.
[ali_0@hep-node0 ~]$ systemctl status pbs
● pbs.service - Portable Batch System
Loaded: loaded (/opt/pbs/libexec/pbs_init.d; enabled; vendor preset: disabled)
Active: active (running) since Mon 2022-02-28 23:02:07 EET; 1 day 2h ago
Docs: man:pbs(8)
Process: 1245 ExecStart=/opt/pbs/libexec/pbs_init.d start (code=exited, status=0/SUCCESS)
Tasks: 14
Memory: 17.1M
CGroup: /system.slice/pbs.service
├─1412 /opt/pbs/sbin/pbs_comm
├─1466 /opt/pbs/sbin/pbs_mom
├─1516 /opt/pbs/sbin/pbs_sched
├─1877 /opt/pbs/sbin/pbs_ds_monitor monitor
├─1921 /usr/bin/postgres -D /var/spool/pbs/datastore -p 15007
├─1929 postgres: logger process
├─1931 postgres: checkpointer process
├─1932 postgres: writer process
├─1933 postgres: wal writer process
├─1934 postgres: autovacuum launcher process
├─1935 postgres: stats collector process
├─2075 postgres: postgres pbs_datastore 192.168.1.1(59718) idle
└─2076 /opt/pbs/sbin/pbs_server.bin

Feb 28 23:02:06 hep-node0.com systemd[1]: Starting Portable Batch System…
Feb 28 23:02:07 hep-node0.com systemd[1]: Started Portable Batch System.
Feb 28 23:02:07 hep-node0.com su[1610]: (to postgres) root on none
Feb 28 23:02:07 hep-node0.com su[1687]: (to postgres) root on none
Feb 28 23:02:07 hep-node0.com su[1754]: (to postgres) root on none
Feb 28 23:02:07 hep-node0.com su[1792]: (to postgres) root on none
Feb 28 23:02:07 hep-node0.com su[1878]: (to postgres) root on none
Feb 28 23:02:12 hep-node0.com pbs_init.d[1245]: Starting PBS in background

[ali_0@hep-node0 ~]$ qmgr -c "print server"

#
# Create queues and set their attributes.
#
#
# Create and define queue batch
#

create queue batch
set queue batch queue_type = Execution
set queue batch resources_default.ncpus = 1
set queue batch resources_default.nodect = 1
set queue batch resources_default.nodes = 1
set queue batch resources_default.walltime = 00:00:36
set queue batch enabled = True
set queue batch started = True

#
# Set server attributes.
#

set server scheduling = True
set server acl_roots = username@*
set server operators = username@*
set server default_queue = batch
set server log_events = 511
set server mail_from = adm
set server query_other_jobs = True
set server resources_default.ncpus = 1
set server default_chunk.ncpus = 1
set server scheduler_iteration = 600
set server flatuid = True
set server resv_enable = True
set server node_fail_requeue = 310
set server max_array_size = 10000
set server pbs_license_min = 0
set server pbs_license_max = 2147483647
set server pbs_license_linger_time = 31536000
set server eligible_time_enable = False
set server max_concurrent_provision = 5
set server max_job_sequence_id = 9999999

Thank you for the above information

This vnode name should also be hep-node2; I am not sure why it is ali_2.

For example, a setup similar to yours would produce output like this:
[root@demo ~]# pbsnodes -av | grep demo
demo
     Mom = demo
     resources_available.host = demo
     resources_available.vnode = demo
[root@demo ~]# cat /etc/hosts | grep demo
192.168.64.128 demo
[root@demo ~]# cat /etc/pbs.conf | grep MOM
PBS_START_MOM=1
[root@demo ~]# cat /var/spool/pbs/mom_priv/config | grep client
$clienthost demo
[root@demo ~]# ps -ef | grep pbs_mom
root        1620       1  0 Feb28 ?        00:00:07 /opt/pbs/sbin/pbs_mom
root     1129595 1126933  0 21:45 pts/0    00:00:00 grep --color=auto pbs_mom

Please make sure

  1. SELinux is disabled and the system is rebooted
  2. ports 15001 to 15009 and 17001 are open for communication (or the firewall / iptables is completely disabled)
  3. the static IP address, hostname, and /etc/hosts are up to date / DNS-resolvable

Dear @adarsh, thanks again for your answers. I got the master node running and I can run jobs on it, but not on the slave node, the ali_2 machine, whose hostname is hep-node2.

SELinux is already disabled and both machines have been rebooted. I also set up passwordless SSH access between the two machines.
With systemctl status pbs on the slave node, I see that PBS is running there. However, when I tried to register the ali_2 machine as a slave node via qmgr -c "create node hep-node2", I got the error message "No route to host
qmgr: cannot connect to server". So my first, naive thought was that this is due to the ports needing to be opened. If you agree, could you please elaborate a bit on how to open these ports on CentOS 7? Also, do the ports have to be open on both the slave and the master node, or on the slave only?

Please check

  1. DNS / static IP / hostname / /etc/hosts (network address resolution is important)
  2. systemctl stop firewalld ; systemctl disable firewalld
     # otherwise, allow the ports mentioned above in firewalld:
     firewall-cmd --zone=public --permanent --add-port=15001/tcp   # repeat for each of the ports
     firewall-cmd --reload
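The per-port firewall-cmd calls above can be looped; a minimal sketch for CentOS 7 firewalld, assuming the default public zone (run as root, on both nodes):

```shell
# Open the PBS ports 15001-15009 plus 17001 permanently,
# then reload firewalld so the new rules take effect.
for port in 1500{1..9} 17001; do
    firewall-cmd --zone=public --permanent --add-port="${port}/tcp"
done
firewall-cmd --reload
```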

Yes, on both nodes.

If you are interested, please refer: Building a PBS Professional Virtual Test Cluster with Ubuntu

Thank you again!

It seems the master node sees the slave, and vice versa.
After the firewall settings, the pbsnodes -av output on both the master node and the slave node is as follows.
One more question: is there a way to check whether submitted jobs are executed on the slave node as well? Although the nodes can evidently contact each other, I suspect the test job runs only on the master node.

hep-node0
Mom = hep-node0
ntype = PBS
state = free
pcpus = 4
resources_available.arch = linux
resources_available.host = hep-node0
resources_available.mem = 16265032kb
resources_available.ncpus = 4
resources_available.vnode = hep-node0
resources_assigned.accelerator_memory = 0kb
resources_assigned.hbmem = 0kb
resources_assigned.mem = 0kb
resources_assigned.naccelerators = 0
resources_assigned.ncpus = 0
resources_assigned.vmem = 0kb
resv_enable = True
sharing = default_shared
last_state_change_time = Wed Mar 2 12:08:35 2022
last_used_time = Wed Mar 2 12:20:41 2022

hep-node2
Mom = hep-node2
ntype = PBS
state = free
pcpus = 1
resources_available.arch = linux
resources_available.host = hep-node2
resources_available.mem = 8007520kb
resources_available.ncpus = 4
resources_available.vnode = hep-node2
resources_assigned.accelerator_memory = 0kb
resources_assigned.hbmem = 0kb
resources_assigned.mem = 0kb
resources_assigned.naccelerators = 0
resources_assigned.ncpus = 0
resources_assigned.vmem = 0kb
resv_enable = True
sharing = default_shared
last_state_change_time = Wed Mar 2 12:08:35 2022

Nice one!

Try this, submitting as below:

qsub -l select=1:ncpus=1 -l place=excl -- /bin/sleep 1000
qsub -l select=1:ncpus=1 -l place=excl -- /bin/sleep 1000
qstat -answ1   # this command shows you on which node(s) each job is running
ssh <node>
ps -ef | grep sleep

@adarsh I really appreciate your help and patience thus far. As I suspected in my previous message, jobs are not sent to the slave node; they run only on the master node. The qstat -answ1 command outputs the following.
                                                      Req'd  Req'd   Elap
Job ID          Username Queue Jobname SessID NDS TSK Memory Time  S Time
--------------- -------- ----- ------- ------ --- --- ------ ----- - -----
2047.hep-node0  ali_0    batch STDIN     5732   1   1     -- 200:0 R 00:02 hep-node0/0
   Job run at Thu Mar 03 at 17:05 on (hep-node0:ncpus=1)
2048.hep-node0  ali_0    batch STDIN       --   1   1     -- 200:0 H    --
   job held, too many failed attempts to run

Also, I had opened the ports you mentioned above using the commands you gave. However, some of them still seem to be closed. Might this be the reason?

[ali_0@hep-node0 ~]$ sudo nmap -p 15001-15009 192.168.1.1
[sudo] password for ali_0:

Starting Nmap 6.40 ( http://nmap.org ) at 2022-03-03 17:17 EET
Nmap scan report for hep-node0 (192.168.1.1)
Host is up (0.00012s latency).
PORT STATE SERVICE
15001/tcp open unknown
15002/tcp open unknown
15003/tcp open unknown
15004/tcp open unknown
15005/tcp closed unknown
15006/tcp closed unknown
15007/tcp open unknown
15008/tcp closed unknown
15009/tcp closed unknown
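If nmap is not at hand on the other node, bash's built-in /dev/tcp pseudo-device gives a quick probe of the same range (a sketch assuming bash and the coreutils timeout command; 192.168.1.1 is hep-node0's address from the scan above):

```shell
# Probe TCP ports 15001-15009 on a host and report open/closed for each.
host=192.168.1.1        # adjust to the node you want to test
for port in 1500{1..9}; do
    if timeout 1 bash -c "exec 3<>/dev/tcp/${host}/${port}" 2>/dev/null; then
        echo "${port} open"
    else
        echo "${port} closed"
    fi
done
```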

Thank you @watzinki, pleasure!

Please check the MoM logs on the second node:
source /etc/pbs.conf ; cd $PBS_HOME/mom_logs and check the log file for the day (YYYYMMDD).

On the second node, suspect:

  1. the user account does not exist, or no password entry is set
  2. the home directory is missing
  3. permission issues
  4. the mom logs should be able to tell you the issue
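The first three suspects can be checked directly on the compute node; a sketch assuming the job owner is ali_0, as in this thread:

```shell
# 1. Does the user resolve through NSS (local /etc/passwd, LDAP, NIS, ...)?
getent passwd ali_0 || echo "no passwd entry for ali_0"

# 2. Does the user's home directory exist?
home=$(getent passwd ali_0 | cut -d: -f6)
[ -d "$home" ] || echo "home directory '$home' is missing"

# 3. Who owns it, and with what permissions?
ls -ld "$home"
```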

Dear @adarsh, I have been trying to figure out what is causing the issue by myself for the last few days, but no luck so far. I set up passwordless login among the nodes; I can verify it by sshing from one node to the other, back and forth, without entering a password. However, it still seems that no job is executed on the slave node, only on the master node. Below is a dump of the log file from the slave node, hep-node2. I hope you might have other suggestions.
Thanks in advance.

03/06/2022 21:00:12;0100;pbs_mom;Req;;Type 1 request received from root@192.168.1.1:15001, sock=1
03/06/2022 21:00:12;0100;pbs_mom;Req;;Type 5 request received from root@192.168.1.1:15001, sock=1
03/06/2022 21:00:12;0028;pbs_mom;Job;2067.hep-node0;No Password Entry for User ali_0
03/06/2022 21:00:12;0008;pbs_mom;Job;2067.hep-node0;kill_job
03/06/2022 21:00:12;0100;pbs_mom;Job;2067.hep-node0;hep-node2 cput=00:00:00 mem=0kb
03/06/2022 21:00:12;0100;pbs_mom;Job;2067.hep-node0;Obit sent
03/06/2022 21:00:12;0100;pbs_mom;Req;;Type 6 request received from root@192.168.1.1:15001, sock=1
03/06/2022 21:00:12;0080;pbs_mom;Job;2067.hep-node0;delete job request received
03/06/2022 21:00:12;0008;pbs_mom;Job;2067.hep-node0;kill_job
03/06/2022 21:00:12;0100;pbs_mom;Req;;Type 1 request received from root@192.168.1.1:15001, sock=1
03/06/2022 21:00:12;0100;pbs_mom;Req;;Type 5 request received from root@192.168.1.1:15001, sock=1
03/06/2022 21:00:12;0028;pbs_mom;Job;2067.hep-node0;No Password Entry for User ali_0
03/06/2022 21:00:12;0008;pbs_mom;Job;2067.hep-node0;kill_job
03/06/2022 21:00:12;0100;pbs_mom;Job;2067.hep-node0;hep-node2 cput=00:00:00 mem=0kb
03/06/2022 21:00:12;0100;pbs_mom;Job;2067.hep-node0;Obit sent
03/06/2022 21:00:12;0100;pbs_mom;Req;;Type 6 request received from root@192.168.1.1:15001, sock=1
03/06/2022 21:00:12;0080;pbs_mom;Job;2067.hep-node0;delete job request received
03/06/2022 21:00:12;0008;pbs_mom;Job;2067.hep-node0;kill_job
03/06/2022 21:00:12;0100;pbs_mom;Req;;Type 1 request received from root@192.168.1.1:15001, sock=1
03/06/2022 21:00:12;0100;pbs_mom;Req;;Type 5 request received from root@192.168.1.1:15001, sock=1
03/06/2022 21:00:12;0028;pbs_mom;Job;2067.hep-node0;No Password Entry for User ali_0
03/06/2022 21:00:12;0008;pbs_mom;Job;2067.hep-node0;kill_job

This is the issue: "No Password Entry for User ali_0" means the user account does not exist on hep-node2, or its password entry is not set.
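The execution node needs the job owner's account to exist locally (the server config above has flatuid = True, so PBS assumes the same username on every node). A minimal sketch of the usual fix, assuming local accounts with no LDAP/NIS and reusing the UID from hep-node0 so file ownership stays consistent:

```shell
# Run on hep-node2: create ali_0 with the same UID it has on hep-node0.
# (The passwordless SSH already set up in this thread makes the lookup easy.)
uid=$(ssh hep-node0 id -u ali_0)
sudo useradd -m -u "$uid" ali_0
getent passwd ali_0      # should now print the new entry; then re-run the test job
```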