Jobs stuck in queue, need to use root access to qrun to initiate each job

Dear All,

I recently installed OpenPBS version 20.0.0 and am trying to overcome a problem with scheduling my submitted jobs. I was able to create a queue (named batch) and submit jobs to it; however, the jobs stay queued unless I force them to run with the “qrun” command as the root user. I checked the system resources, and they do not appear to be the cause. I am looking for help in determining what could be done to fix this issue. My overall goal is to send jobs to the queue and have the job scheduler run them as system resources become available. Please find below some details regarding my system.

We are running OpenPBS on a single server.

OUTPUT of /etc/hosts:
10.18.62.60 cc-3dfr.bcrc.local cc-3dfr
-------------------------------------------------------------------------------------------------------
OUTPUT of /etc/pbs.conf
PBS_SERVER=cc-3dfr
PBS_START_SERVER=1
PBS_START_SCHED=1
PBS_START_COMM=1
PBS_START_MOM=1
PBS_EXEC=/opt/pbs
PBS_HOME=/var/spool/pbs
PBS_CORE_LIMIT=unlimited
PBS_SCP=/usr/bin/scp
-------------------------------------------------------------------------------------------------------
OUTPUT of pbsnodes -av
(base) [modeleval@cc-3dfr etc]$ pbsnodes -av
cc-3dfr
Mom = cc-3dfr.bcrc.local
ntype = PBS
state = free
pcpus = 24
resources_available.arch = linux
resources_available.host = cc-3dfr
resources_available.mem = 148244804kb
resources_available.ncpus = 24
resources_available.vnode = cc-3dfr
resources_assigned.accelerator_memory = 0kb
resources_assigned.hbmem = 0kb
resources_assigned.mem = 0kb
resources_assigned.naccelerators = 0
resources_assigned.ncpus = 0
resources_assigned.vmem = 0kb
resv_enable = True
sharing = default_shared
license = l
last_state_change_time = Tue Nov 29 03:49:03 2022
last_used_time = Wed Dec 21 14:33:06 2022
------------------------------------------------------------------------------------------------------
ADDITIONAL COMMANDS:
(base) [modeleval@cc-3dfr ~]$ hostname
cc-3dfr
(base) [modeleval@cc-3dfr ~]$ hostname -f
cc-3dfr.bcrc.local
(base) [modeleval@cc-3dfr etc]$ pbs_hostn -v cc-3dfr
primary name: cc-3dfr.bcrc.local (from gethostbyname())
aliases: cc-3dfr
address length: 4 bytes
address: 10.18.62.60 (1514672650 dec) name: cc-3dfr.bcrc.local

(base) [modeleval@cc-3dfr ~]$ ping cc-3dfr
PING cc-3dfr.bcrc.local (10.18.62.60) 56(84) bytes of data.
64 bytes from cc-3dfr.bcrc.local (10.18.62.60): icmp_seq=1 ttl=64 time=0.063 ms
64 bytes from cc-3dfr.bcrc.local (10.18.62.60): icmp_seq=2 ttl=64 time=0.071 ms
(base) [modeleval@cc-3dfr ~]$ nslookup cc-3dfr
Server: 10.18.10.5
Address: 10.18.10.5#53

Name: cc-3dfr.bcrc.local
Address: 10.18.62.60

Any guidance on this problem would be greatly appreciated! Thank you!

Best,

Bill McLaughlin and Thomas Parry

Please submit two jobs as shown below and share the output of the following commands:

qstat -Bf
qmgr -c 'p q batch'
qsub -q batch -- /bin/sleep 100
qsub -q batch -- /bin/sleep 100
qstat -answ1
qstat -f <jobid>     (example: qstat -f 101)
tracejob <jobid>     (example: tracejob 101)

If the job is in the queue

  • check the scheduler logs for this job ID (if you do not see much information, increase the log level and submit another job)
  • check the server logs for this job ID (if you do not see much information, increase the log level and submit another job)
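
As a rough guide to what "increase the log level" means here: the server's log_events attribute is a bitmask, and each bit enables one class of log message. The sketch below decodes a mask into class names. The class names and their bit order are written from memory of the PBS documentation, so treat them as an assumption and check them against the reference guide for your version; `decode_log_events` is a hypothetical helper, not a PBS command.

```shell
#!/bin/sh
# Decode a PBS log_events bitmask into the event classes it enables.
# ASSUMPTION: class names and bit order (lowest bit first) are recalled
# from the PBS documentation; verify them for your PBS version.
decode_log_events() {
    mask=$1
    bit=1
    enabled=""
    for name in error system admin job job_usage security sched \
                debug debug2 resv debug3 debug4; do
        # Keep the class name if its bit is set in the mask
        if [ $(( mask & bit )) -ne 0 ]; then
            enabled="$enabled $name"
        fi
        bit=$(( bit * 2 ))
    done
    # Strip the leading space left by the first append
    echo "${enabled# }"
}

decode_log_events 511
```

A value of 511 leaves the higher debug classes off. Something like `qmgr -c 'set server log_events = 2047'`, run as a PBS manager, should turn more of them on; the scheduler's log level is configured separately.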

If qrun runs the job without nodes being specified, it is a special scheduler iteration that runs it. In that iteration the scheduler ignores the state of the queue (is it started?) and any limits.

As someone else said: crank up the log events and look at the scheduler logs, and also look at the job comment, which is normally set by the scheduler when it considers a job but does not run it.

If there is no comment, the scheduler hasn’t considered the job; perhaps the server “scheduling” attribute is set to false.
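
Both points can be checked from the command line: the comment appears in the `qstat -f` output, and the server attribute can be listed with qmgr. A minimal sketch, assuming the comment fits on one line of the `qstat -f` output (long comments wrap onto continuation lines, which this ignores); `job_comment` is a hypothetical helper name.

```shell
#!/bin/sh
# Print the "comment" attribute from `qstat -f` output read on stdin.
# If no comment line is present, say so instead of printing nothing.
job_comment() {
    awk '
        /^[ \t]*comment = / {
            sub(/^[ \t]*comment = /, "")   # drop the attribute name
            print
            found = 1
        }
        END { if (!found) print "(no comment set)" }
    '
}

# Usage on a live system (replace <jobid> with a real job ID):
#   qstat -f <jobid> | job_comment
#   qmgr -c 'list server scheduling'
```

If the helper prints "(no comment set)", that is consistent with the scheduler never having looked at the job.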

Thank you for taking the time to reply! Here are the outputs for your requested commands.

1.

(base) [modeleval@cc-3dfr ~]$ qstat -Bf
Server: cc-3dfr
server_state = Active
server_host = cc-3dfr.bcrc.local
scheduling = True
total_jobs = 0
state_count = Transit:0 Queued:0 Held:0 Waiting:0 Running:0 Exiting:0 Begun:0
log_events = 511
mailer = /usr/sbin/sendmail
mail_from = adm
query_other_jobs = True
resources_default.ncpus = 1
default_chunk.ncpus = 1
scheduler_iteration = 600
flatuid = True
resv_enable = True
node_fail_requeue = 310
max_array_size = 10000
pbs_license_min = 0
pbs_license_max = 2147483647
pbs_license_linger_time = 31536000
license_count = Avail_Global:1000000 Avail_Local:1000000 Used:0 High_Use:0
pbs_version = 20.0.0
eligible_time_enable = False
max_concurrent_provision = 5
max_job_sequence_id = 9999999

2.

(base) [modeleval@cc-3dfr ~]$ qmgr -c 'p q batch'
# Create queues and set their attributes.
# Create and define queue batch
create queue batch
set queue batch queue_type = Execution
set queue batch resources_default.ncpus = 1
set queue batch resources_default.nodect = 1
set queue batch resources_default.nodes = 1
set queue batch resources_default.walltime = 01:00:00
set queue batch enabled = True
set queue batch started = True

3.

(base) [modeleval@cc-3dfr ~]$ qsub -q batch -- /bin/sleep 100
2537.cc-3dfr

4.

(base) [modeleval@cc-3dfr ~]$ qsub -q batch -- /bin/sleep 100
2538.cc-3dfr

5.

(base) [modeleval@cc-3dfr ~]$ qstat -answ1
cc-3dfr:
                                                     Req'd  Req'd   Elap
Job ID        Username  Queue Jobname SessID NDS TSK Memory Time  S Time
------------  --------- ----- ------- ------ --- --- ------ ----- - -----
2537.cc-3dfr  modeleval batch STDIN       --   1   1     -- 01:00 Q    -- --
2538.cc-3dfr  modeleval batch STDIN       --   1   1     -- 01:00 Q    -- --

6.

(base) [modeleval@cc-3dfr ~]$ qstat -f 2537
Job Id: 2537.cc-3dfr
Job_Name = STDIN
Job_Owner = modeleval@cc-3dfr
job_state = Q
queue = batch
server = cc-3dfr
Checkpoint = u
ctime = Thu Dec 29 15:05:51 2022
Error_Path = cc-3dfr:/home/modeleval/STDIN.e2537
Hold_Types = n
Join_Path = n
Keep_Files = n
Mail_Points = a
mtime = Thu Dec 29 15:05:51 2022
Output_Path = cc-3dfr:/home/modeleval/STDIN.o2537
Priority = 0
qtime = Thu Dec 29 15:05:51 2022
Rerunable = True
Resource_List.ncpus = 1
Resource_List.nodect = 1
Resource_List.nodes = 1
Resource_List.place = scatter
Resource_List.select = 1:ncpus=1
Resource_List.walltime = 01:00:00
substate = 10
Variable_List = PBS_O_HOME=/home/modeleval,PBS_O_LANG=en_US.UTF-8,
PBS_O_LOGNAME=modeleval,
PBS_O_PATH=/home/modeleval/anaconda3/bin:/home/modeleval/anaconda3/con
dabin:/home/modeleval/google-cloud-sdk/bin:/usr/local/bin:/usr/bin:/usr
/local/sbin:/usr/sbin:/opt/pbs/bin:/home/modeleval/ResiRole/FEATURE/fea
ture-3.0.0/src:/home/modeleval/ResiRole/FEATURE/feature-3.0.0/tools/bin
:/home/modeleval/bin,PBS_O_MAIL=/var/spool/mail/modeleval,
PBS_O_SHELL=/bin/bash,PBS_O_WORKDIR=/home/modeleval,PBS_O_SYSTEM=Linux,
PBS_O_QUEUE=batch,PBS_O_HOST=cc-3dfr
etime = Thu Dec 29 15:05:51 2022
Submit_arguments = -q batch -- /bin/sleep 100
executable = <jsdl-hpcpa:Executable>/bin/sleep</jsdl-hpcpa:Executable>
argument_list = <jsdl-hpcpa:Argument>100</jsdl-hpcpa:Argument>
project = _pbs_project_default
Submit_Host = cc-3dfr

7.

(base) [modeleval@cc-3dfr ~]$ tracejob 2537
Job: 2537.cc-3dfr
12/29/2022 15:05:51 S enqueuing into batch, state Q hop 1
12/29/2022 15:05:51 S Job Queued at request of modeleval@cc-3dfr, owner = modeleval@cc-3dfr, job name = STDIN, queue = batch

Additionally, here are the logs for the submitted job IDs (2537 and 2538):
Scheduler_log:

12/29/2022 14:59:12;0002;pbs_sched;Svr;Log;Log opened
12/29/2022 14:59:12;0002;pbs_sched;Svr;pbs_sched;pbs_version=20.0.0
12/29/2022 14:59:12;0002;pbs_sched;Svr;pbs_sched;pbs_build=mach=N/A:security=N/A:configure_args=N/A
12/29/2022 14:59:12;0002;pbs_sched;Svr;pbs_sched;hostname=cc-3dfr;pbs_leaf_name=N/A;pbs_mom_node_name=N/A
12/29/2022 14:59:12;0002;pbs_sched;Svr;pbs_sched;ipv4 interface lo: cc-3dfr.bcrc.local
12/29/2022 14:59:12;0002;pbs_sched;Svr;pbs_sched;ipv4 interface eno1: cc-3dfr.bcrc.local
12/29/2022 14:59:12;0002;pbs_sched;Svr;pbs_sched;ipv4 interface virbr0: cc-3dfr
12/29/2022 14:59:12;0002;pbs_sched;Svr;pbs_sched;ipv6 interface lo: cc-3dfr.bcrc.local
12/29/2022 14:59:12;0002;pbs_sched;Svr;pbs_sched;ipv6 interface eno1: cc-3dfr
12/29/2022 14:59:12;0002;pbs_sched;n/a;setup_env;read environment from /var/spool/pbs/pbs_environment
12/29/2022 14:59:12;0006;pbs_sched;Fil;pbs_sched;Version 20.0.0, started, initialization type = 0
12/29/2022 14:59:12;0002;pbs_sched;Svr;sched_main;/opt/pbs/sbin/pbs_sched startup pid 438795
12/29/2022 14:59:12;0040;pbs_sched;Fil;fairshare usage;Creating usage database for fairshare
12/29/2022 14:59:12;0080;pbs_sched;Req;;Launching 12 worker threads
12/29/2022 14:59:16;0001;pbs_sched;Svr;pbs_sched;Access from host not allowed, or unknown host (15008) in open_server_conns, Couldn’t register the scheduler default with connected server

Server_Log

12/29/2022 15:05:51;0100;Server@cc-3dfr;Job;2537.cc-3dfr;enqueuing into batch, state Q hop 1
12/29/2022 15:05:51;0008;Server@cc-3dfr;Job;2537.cc-3dfr;Job Queued at request of modeleval@cc-3dfr, owner = modeleval@cc-3dfr, job name = STDIN, queue = batch
12/29/2022 15:05:53;0100;Server@cc-3dfr;Req;;Type 0 request received from root@cc-3dfr, sock=19
12/29/2022 15:05:53;0100;Server@cc-3dfr;Req;;Type 95 request received from root@cc-3dfr, sock=20
12/29/2022 15:05:53;0100;Server@cc-3dfr;Req;;Type 98 request received from root@cc-3dfr, sock=19
12/29/2022 15:05:53;00a0;Server@cc-3dfr;Req;req_reject;Reject reply code=15008, aux=0, type=98, from root@cc-3dfr
12/29/2022 15:05:55;0100;Server@cc-3dfr;Req;;Type 0 request received from root@cc-3dfr, sock=19
12/29/2022 15:05:55;0100;Server@cc-3dfr;Req;;Type 1 request received from modeleval@cc-3dfr, sock=17
12/29/2022 15:05:55;0100;Server@cc-3dfr;Job;2538.cc-3dfr;enqueuing into batch, state Q hop 1
12/29/2022 15:05:55;0008;Server@cc-3dfr;Job;2538.cc-3dfr;Job Queued at request of modeleval@cc-3dfr, owner = modeleval@cc-3dfr, job name = STDIN, queue = batch
12/29/2022 15:05:55;0100;Server@cc-3dfr;Req;;Type 95 request received from root@cc-3dfr, sock=20
12/29/2022 15:05:55;0100;Server@cc-3dfr;Req;;Type 98 request received from root@cc-3dfr, sock=19
12/29/2022 15:05:55;00a0;Server@cc-3dfr;Req;req_reject;Reject reply code=15008, aux=0, type=98, from root@cc-3dfr
12/29/2022 15:05:57;0100;Server@cc-3dfr;Req;;Type 0 request received from root@cc-3dfr, sock=19
12/29/2022 15:05:57;0100;Server@cc-3dfr;Req;;Type 95 request received from root@cc-3dfr, sock=20
12/29/2022 15:05:57;0100;Server@cc-3dfr;Req;;Type 98 request received from root@cc-3dfr, sock=19
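
The same code, 15008, shows up in both logs: the scheduler reports "Access from host not allowed, or unknown host (15008)", and the server rejects type 98 requests from root@cc-3dfr with reply code 15008. To see how often this happens and for which requester, the reject lines can be summarised from the server log. A minimal sketch, assuming the semicolon-separated layout shown above; `pbs_rejects` is a hypothetical helper name, and the log path in the usage comment is a placeholder.

```shell
#!/bin/sh
# Summarise rejected requests in a PBS server log.
# Reads the files given as arguments, or stdin when none are given.
# Output: one line per (code, requester) pair, prefixed by its count.
pbs_rejects() {
    awk -F';' '
        $5 == "req_reject" {
            code = $6
            sub(/.*code=/, "", code)   # keep what follows "code="
            sub(/,.*/, "", code)       # cut at the first comma
            who = $6
            sub(/.* from /, "", who)   # keep the requester
            count["code=" code " from=" who]++
        }
        END { for (k in count) print count[k], k }
    ' "$@" | sort
}

# Usage (path is a placeholder for the server log file):
#   pbs_rejects /path/to/server_log
```

A steady stream of identical rejects from root on the server host itself would point at how the server identifies that host rather than at the jobs or the queue.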

Sincerely,

Bill and Thomas

Thank you for the suggestion, Alexis.

FYI we ran the following command.

qmgr -c "set server scheduling=true"

After running it, the jobs still remain stuck in the queue.

Also, when we force a job to run from the queue, this is the command we use as root user:

qrun -H cc-3dfr <jobid>

Is there another attribute that we possibly need to modify?

Sincerely,

Bill and Thomas

Please check the following:

  1. Make sure ports 15001 to 15009 are open for communication.
  2. Make sure no firewall is blocking those ports and that SELinux is disabled; if you disable it now, reboot the system afterwards.
  3. See "Qstat : invalid credential => due to name resolution" (post #4 by adarsh)
    [ use both the short name and the FQDN in the clientsfile and check whether that does the trick ]
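
One way to check the first item is to filter the output of `ss -tln` for the port range mentioned above. A minimal sketch, assuming the usual `ss` column layout where the fourth field is the local address and port; `pbs_ports_listening` is a hypothetical helper name.

```shell
#!/bin/sh
# From `ss -tln` output on stdin, print each listening port in the
# range 15001-15009 once, in ascending order.
pbs_ports_listening() {
    awk '{
        n = split($4, parts, ":")   # port is the last colon-separated piece
        port = parts[n] + 0         # non-numeric text (header) becomes 0
        if (port >= 15001 && port <= 15009) print port
    }' | sort -un
}

# Usage on a live system:
#   ss -tln | pbs_ports_listening
```

A listening socket only shows that a daemon is bound to the port; it does not prove that a firewall lets traffic through. On systems that have them, `getenforce` reports the SELinux mode and `firewall-cmd --list-all` shows the active firewalld rules.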

Dear Adarsh,

Thank you for your comments.

Please note that we had to upgrade the system, and in the process I had to reinstall OpenPBS and PostgreSQL. Please see the new thread entitled “Difficulty with installation”; any advice on how to proceed from there would be appreciated.

Sincerely,

Bill McLaughlin