Job stuck in queue, multiple servers

Greetings,

I have installed OpenPBS on a workstation (a single machine runs everything; it can be accessed remotely through ssh).
I was able to submit jobs without problems until a couple of months ago. I hadn't used it for a while, and now that I'm back the jobs I submit stay stuck in the queue. There are no other jobs running. If I delete the jobs, no error files are created. I checked the logs and nothing looks wrong.

I did notice one detail, though: when I run pbsnodes -av, I see two nearly identical node entries with different states (Stale and free). Their names differ only in capitalization:

sysadmin@Precision-7920-Tower:~/testVASP/newtest$ pbsnodes -av
precision-7920-tower
Mom = precision-7920-tower
ntype = PBS
state = Stale
pcpus = 20
resources_available.arch = linux
resources_available.host = precision-7920-tower
resources_available.mem = 97495476kb
resources_available.ncpus = 20
resources_available.vnode = precision-7920-tower
resources_assigned.accelerator_memory = 0kb
resources_assigned.hbmem = 0kb
resources_assigned.mem = 0kb
resources_assigned.naccelerators = 0
resources_assigned.ncpus = 0
resources_assigned.vmem = 0kb
resv_enable = True
sharing = default_shared
license = l
last_state_change_time = Wed Jul 20 14:45:18 2022
last_used_time = Tue Jan 18 22:15:56 2022

Precision-7920-Tower
Mom = precision-7920-tower
ntype = PBS
state = free
pcpus = 20
resources_available.arch = linux
resources_available.host = precision-7920-tower
resources_available.mem = 97495476kb
resources_available.ncpus = 20
resources_available.vnode = Precision-7920-Tower
resources_assigned.accelerator_memory = 0kb
resources_assigned.hbmem = 0kb
resources_assigned.mem = 0kb
resources_assigned.naccelerators = 0
resources_assigned.ncpus = 0
resources_assigned.vmem = 0kb
resv_enable = True
sharing = default_shared
license = l
last_state_change_time = Wed Jul 20 14:45:18 2022
last_used_time = Wed Jul 20 13:16:42 2022

I wonder if this could be the issue.
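
A quick way to spot this kind of duplicate is to compare node names while ignoring case. The following is a minimal sketch, not an official PBS tool: find_case_duplicates is a hypothetical helper name, and it assumes each node record starts with the node name on its own unindented line without an "=" sign, as in the listing above.

```shell
#!/bin/sh
# Sketch: read `pbsnodes -av`-style text on stdin and print groups of node
# names that are identical except for letter case.
# find_case_duplicates is a hypothetical helper, not part of PBS.
find_case_duplicates() {
    awk '
        # A node header line is non-empty, not indented, and has no "=".
        NF > 0 && $0 !~ /^[ \t]/ && $0 !~ /=/ {
            key = tolower($1)
            if (key in names)
                names[key] = names[key] " " $1
            else
                names[key] = $1
            count[key]++
        }
        END {
            # Print only the names that occurred more than once.
            for (k in count)
                if (count[k] > 1)
                    print names[k]
        }
    '
}

# Typical use on a live system (requires the PBS commands on PATH):
#   pbsnodes -av | find_case_duplicates
```

An empty result means no two node names collide when case is ignored. A non-empty result only flags candidates for a closer look; it does not by itself explain why a job stays queued.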

Thanks!

Update: I removed the nodes and recreated them with qmgr. Now pbsnodes -av gives me a single result with a free state:

sysadmin@Precision-7920-Tower:~/testVASP/newtest$ pbsnodes -av
Precision-7920-Tower
Mom = precision-7920-tower
ntype = PBS
state = free
pcpus = 20
resources_available.arch = linux
resources_available.host = precision-7920-tower
resources_available.mem = 97495476kb
resources_available.ncpus = 20
resources_available.vnode = Precision-7920-Tower
resources_assigned.accelerator_memory = 0kb
resources_assigned.hbmem = 0kb
resources_assigned.mem = 0kb
resources_assigned.naccelerators = 0
resources_assigned.ncpus = 0
resources_assigned.vmem = 0kb
resv_enable = True
sharing = default_shared
license = l
last_state_change_time = Wed Jul 20 15:38:04 2022

This still does not solve the issue with the job in queue. I will wait for answers.

Thanks!

Please share the output of the commands below:

  • qstat -answ1
  • qstat -fx
  • qstat -Bf

Sorry for the late answer. These are the outputs:
sysadmin@Precision-7920-Tower:~/testVASP/newtest$ qstat -answ

Precision-7920-Tower:
                                                                Req'd  Req'd   Elap
Job ID                    Username Queue Jobname SessID NDS TSK Memory Time  S Time
------------------------- -------- ----- ------- ------ --- --- ------ ----- - -----
4011.Precision-7920-Tower sysadmin workq testjob --       1  20 --     24:00 Q --

sysadmin@Precision-7920-Tower:~/testVASP/newtest$ qstat -fx
qstat: PBS is not configured to maintain job history

sysadmin@Precision-7920-Tower:~/testVASP/newtest$ qstat -Bf
Server: Precision-7920-Tower
server_state = Active
server_host = precision-7920-tower
scheduling = True
total_jobs = 1
state_count = Transit:0 Queued:1 Held:0 Waiting:0 Running:0 Exiting:0 Begun:0
managers = root@Precision-7920-Tower
default_queue = workq
log_events = 511
mailer = /usr/sbin/sendmail
mail_from = adm
query_other_jobs = True
resources_default.ncpus = 1
default_chunk.ncpus = 1
resources_max.mpiprocs = 20
resources_max.ncpus = 20
scheduler_iteration = 600
resv_enable = True
node_fail_requeue = 310
max_array_size = 10000
pbs_license_min = 0
pbs_license_max = 2147483647
pbs_license_linger_time = 31536000
license_count = Avail_Global:1000000 Avail_Local:1000000 Used:0 High_Use:0
pbs_version = 20.0.0
eligible_time_enable = False
max_concurrent_provision = 5
max_job_sequence_id = 9999999
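
When reading a long server dump like this, it can help to pull out single attributes such as scheduling or scheduler_iteration. Below is a small sketch under stated assumptions: server_attr is a hypothetical helper name, and it expects `qstat -Bf`-style lines of the form "name = value", with or without leading indentation.

```shell
#!/bin/sh
# Sketch: print the value of one attribute from `qstat -Bf`-style text on stdin.
# server_attr is a hypothetical helper name, not a PBS command.
server_attr() {
    awk -v want="$1" '
        {
            line = $0
            sub(/^[ \t]+/, "", line)        # drop leading indentation
            sep = index(line, " = ")        # position of the first " = "
            if (sep > 0 && substr(line, 1, sep - 1) == want) {
                print substr(line, sep + 3) # everything after " = "
                exit
            }
        }
    '
}

# Typical use on a live system (requires the PBS commands on PATH):
#   qstat -Bf | server_attr scheduling
```

The helper prints nothing when the attribute is absent from the dump. For example, job_history_enable does not appear above, which fits the "PBS is not configured to maintain job history" message from qstat -fx; treat that link as an assumption to verify against the OpenPBS documentation for your version.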

Thanks!

Please try this:

  1. qmgr -c "print nodes @default" > printnodes.txt
  2. qmgr -c "delete nodes @default"
  3. qmgr -c "create node precision-7920-tower"
  4. Then submit a test job
  5. qstat -answ1
  6. pbsnodes -av
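
The steps above can be sketched as a small script. This is only an illustration: DRY_RUN, NODE, run, and recreate_node are hypothetical names introduced here, not PBS settings or commands. Because deleting nodes is destructive, the sketch defaults to a dry run that only prints each command. Running it for real will likely need the account listed under managers in the qstat -Bf output.

```shell
#!/bin/sh
# Sketch of the node re-creation steps. DRY_RUN and NODE are illustrative
# variables, not PBS settings. DRY_RUN=1 (the default) only prints commands.
DRY_RUN=${DRY_RUN:-1}
NODE=${NODE:-precision-7920-tower}

run() {
    # Show the command, then execute it unless this is a dry run.
    printf '+ %s\n' "$*"
    [ "$DRY_RUN" = "1" ] || "$@"
}

recreate_node() {
    # Step 1: save the current node definitions before deleting anything.
    if [ "$DRY_RUN" = "1" ]; then
        echo '+ qmgr -c print nodes @default > printnodes.txt'
    else
        qmgr -c "print nodes @default" > printnodes.txt
    fi
    # Steps 2 and 3: remove all nodes, then create one with a lowercase name.
    run qmgr -c "delete nodes @default"
    run qmgr -c "create node $NODE"
    # Step 4 (submitting a test job) is left to you and your own job script.
    # Steps 5 and 6: inspect the queue and the node state afterwards.
    run qstat -answ1
    run pbsnodes -av
}

# Dry run:   recreate_node
# Real run:  DRY_RUN=0; recreate_node
```

Review the dry-run output first, and keep printnodes.txt so the previous node definitions can be restored if needed.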

Sorry for the delayed reply. Everything seems to work fine now, although it is not clear to me what happened. I will let you know if I encounter other issues. In the meantime, we can consider the issue solved.

Thank you again!
