Greetings,
I have installed OpenPBS on a workstation (a single machine runs everything, and it can be accessed remotely through ssh).
I was able to submit jobs without problems until a couple of months ago. I haven't used it for a while, and now that I'm back the jobs I submit are stuck in the queue. There are no other jobs running. If I delete the jobs, no error files are created. I checked the logs and everything seems fine.
I did notice one detail, though: when I run pbsnodes -av, I get two nodes whose names differ only in letter case and which have different states (Stale and free):
sysadmin@Precision-7920-Tower:~/testVASP/newtest$ pbsnodes -av
precision-7920-tower
    Mom = precision-7920-tower
    ntype = PBS
    state = Stale
    pcpus = 20
    resources_available.arch = linux
    resources_available.host = precision-7920-tower
    resources_available.mem = 97495476kb
    resources_available.ncpus = 20
    resources_available.vnode = precision-7920-tower
    resources_assigned.accelerator_memory = 0kb
    resources_assigned.hbmem = 0kb
    resources_assigned.mem = 0kb
    resources_assigned.naccelerators = 0
    resources_assigned.ncpus = 0
    resources_assigned.vmem = 0kb
    resv_enable = True
    sharing = default_shared
    license = l
    last_state_change_time = Wed Jul 20 14:45:18 2022
    last_used_time = Tue Jan 18 22:15:56 2022

Precision-7920-Tower
    Mom = precision-7920-tower
    ntype = PBS
    state = free
    pcpus = 20
    resources_available.arch = linux
    resources_available.host = precision-7920-tower
    resources_available.mem = 97495476kb
    resources_available.ncpus = 20
    resources_available.vnode = Precision-7920-Tower
    resources_assigned.accelerator_memory = 0kb
    resources_assigned.hbmem = 0kb
    resources_assigned.mem = 0kb
    resources_assigned.naccelerators = 0
    resources_assigned.ncpus = 0
    resources_assigned.vmem = 0kb
    resv_enable = True
    sharing = default_shared
    license = l
    last_state_change_time = Wed Jul 20 14:45:18 2022
    last_used_time = Wed Jul 20 13:16:42 2022

I wonder if this could be the issue.
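
For anyone who wants to check for this kind of duplicate automatically, here is a minimal sketch in Python. It assumes the text of pbsnodes -av has already been captured into a string; the function names parse_pbsnodes and case_duplicates are hypothetical helpers, not part of OpenPBS, and the sample is a shortened copy of the output above.

```python
from collections import defaultdict


def parse_pbsnodes(text):
    """Parse `pbsnodes -av` text into {vnode_name: {attribute: value}}."""
    nodes = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if " = " in line:
            # Attribute line: attach it to the most recent vnode name.
            if current is not None:
                key, value = line.split(" = ", 1)
                nodes[current][key] = value
        elif raw[0].isspace():
            # Indented line without " = ": treated as a wrapped value, skipped.
            continue
        else:
            # Unindented line without " = ": assumed to be a vnode name.
            current = line
            nodes[current] = {}
    return nodes


def case_duplicates(nodes):
    """Group vnode names that differ only by letter case."""
    groups = defaultdict(list)
    for name in nodes:
        groups[name.lower()].append(name)
    return [sorted(names) for names in groups.values() if len(names) > 1]


sample = """\
precision-7920-tower
    Mom = precision-7920-tower
    state = Stale
Precision-7920-Tower
    Mom = precision-7920-tower
    state = free
"""

nodes = parse_pbsnodes(sample)
for group in case_duplicates(nodes):
    for name in group:
        print(name, "->", nodes[name]["state"])
```

The sketch only reports names that collide when lowercased; whether such a duplicate is what keeps jobs queued is a separate question.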
Thanks!
Update: I removed the nodes and recreated them with qmgr. Now pbsnodes -av gives me a single result with a free state:
sysadmin@Precision-7920-Tower:~/testVASP/newtest$ pbsnodes -av
Precision-7920-Tower
    Mom = precision-7920-tower
    ntype = PBS
    state = free
    pcpus = 20
    resources_available.arch = linux
    resources_available.host = precision-7920-tower
    resources_available.mem = 97495476kb
    resources_available.ncpus = 20
    resources_available.vnode = Precision-7920-Tower
    resources_assigned.accelerator_memory = 0kb
    resources_assigned.hbmem = 0kb
    resources_assigned.mem = 0kb
    resources_assigned.naccelerators = 0
    resources_assigned.ncpus = 0
    resources_assigned.vmem = 0kb
    resv_enable = True
    sharing = default_shared
    license = l
    last_state_change_time = Wed Jul 20 15:38:04 2022

This still does not solve the issue with the job in queue. I will wait for answers.
Thanks!
adarsh
July 21, 2022, 6:37am
Please share the output of the commands below:
qstat -answ1
qstat -fx
qstat -Bf
Sorry for the late answer. These are the outputs:
sysadmin@Precision-7920-Tower:~/testVASP/newtest$ qstat -answ
Precision-7920-Tower:
                                                                      Req'd  Req'd   Elap
Job ID                    Username Queue    Jobname    SessID NDS TSK Memory Time  S Time
4011.Precision-7920-Tower sysadmin workq    testjob        --   1  20     -- 24:00 Q   --
   --
   --
sysadmin@Precision-7920-Tower:~/testVASP/newtest$ qstat -fx
qstat: PBS is not configured to maintain job history
sysadmin@Precision-7920-Tower:~/testVASP/newtest$ qstat -Bf
Server: Precision-7920-Tower
    server_state = Active
    server_host = precision-7920-tower
    scheduling = True
    total_jobs = 1
    state_count = Transit:0 Queued:1 Held:0 Waiting:0 Running:0 Exiting:0 Begun:0
    managers = root@Precision-7920-Tower
    default_queue = workq
    log_events = 511
    mailer = /usr/sbin/sendmail
    mail_from = adm
    query_other_jobs = True
    resources_default.ncpus = 1
    default_chunk.ncpus = 1
    resources_max.mpiprocs = 20
    resources_max.ncpus = 20
    scheduler_iteration = 600
    resv_enable = True
    node_fail_requeue = 310
    max_array_size = 10000
    pbs_license_min = 0
    pbs_license_max = 2147483647
    pbs_license_linger_time = 31536000
    license_count = Avail_Global:1000000 Avail_Local:1000000 Used:0 High_Use:0
    pbs_version = 20.0.0
    eligible_time_enable = False
    max_concurrent_provision = 5
    max_job_sequence_id = 9999999
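
In case it helps with reading this kind of output, here is a rough Python sketch that pulls the relevant fields out of qstat -Bf text. It assumes the text is already in a string and that a long value such as state_count may be wrapped onto a following line with no " = " in it; parse_server_attrs and state_counts are hypothetical helper names, not OpenPBS tools.

```python
def parse_server_attrs(text):
    """Parse `qstat -Bf` text into a dict, joining wrapped continuation lines."""
    attrs = {}
    last_key = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("Server:"):
            continue
        if " = " in line:
            last_key, value = line.split(" = ", 1)
            attrs[last_key] = value
        elif last_key is not None:
            # A line without " = " is assumed to continue the previous value.
            attrs[last_key] += line
    return attrs


def state_counts(attrs):
    """Turn 'Transit:0 Queued:1 ...' into {'Transit': 0, 'Queued': 1, ...}."""
    counts = {}
    for item in attrs.get("state_count", "").split():
        name, _, number = item.partition(":")
        counts[name] = int(number)
    return counts


sample = """\
Server: Precision-7920-Tower
    server_state = Active
    scheduling = True
    state_count = Transit:0 Queued:1 Held:0 Waiting:0 Running:0 Exiting:0 Begun
        :0
"""

attrs = parse_server_attrs(sample)
counts = state_counts(attrs)
if attrs.get("scheduling") == "True" and counts["Queued"] > 0 and counts["Running"] == 0:
    print("Scheduling is on, yet jobs are queued and none are running.")
```

This only summarizes what the server reports; it does not say why the scheduler leaves the job queued.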
Thanks!
Sorry for the delayed reply. Everything seems to work fine now, although it is not clear to me what happened. I will let you know if I encounter other issues. In the meantime, we can consider the issue solved.
Thank you again!