Hi,
We have 4 worker nodes, of which 2 are down. We have been running multiple cases on the 2 available nodes. However, all of a sudden the same job that ran earlier now just sits in the queue with "comment = Not Running: Not enough free nodes available".
I can see that pbsnodes -a still lists the old job IDs and reports the nodes as “state = job-busy”. I am unsure why the old job IDs are still shown when those jobs have already run and exited.
Each node has 48 CPUs, and the eight old jobs (exec_host slots 0 through 7) have each taken 6 CPUs on node01 and node03, so all 48 CPUs on both nodes are still assigned. Something is not releasing the resources of the old jobs back to the pool. How can we fix this issue?
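To help narrow this down, would checking the MoM side like this be the right approach? (The log paths assume the default PBS_HOME of /var/spool/pbs, and that the nodes use the standard pbs service unit; please correct me if that is not how it should be done.)

```
# Is pbs_mom still alive on the nodes that show the stale jobs?
ssh node01 'systemctl status pbs'
ssh node03 'ps -ef | grep pbs_mom'

# Today's MoM log on an execution host (default PBS_HOME assumed)
ssh node01 "tail -n 50 /var/spool/pbs/mom_logs/$(date +%Y%m%d)"

# Server log on the head node for the same day
tail -n 50 /var/spool/pbs/server_logs/$(date +%Y%m%d)
```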
```
[root@mgt1 pbs]# qstat -answ1
mgt1:
Req'd Req'd Elap
Job ID Username Queue Jobname SessID NDS TSK Memory Time S Time
------------------------------ --------------- --------------- --------------- -------- ---- ----- ------ ----- - -----
288.mgt1 hpcuser workq perl_job 14898 2 12 -- 00:30 E 00:00:01 node01/0*6+node03/0*6
Job run at Wed Sep 14 at 19:10 on (node01:ncpus=6)+(node03:ncpus=6)
289.mgt1 hpcuser workq perl_job 33051 2 12 -- 00:30 E 00:00:01 node01/1*6+node03/1*6
Job run at Thu Sep 15 at 10:16 on (node01:ncpus=6)+(node03:ncpus=6)
290.mgt1 hpcuser workq perl_job 33120 2 12 -- 00:30 E 00:00:01 node01/2*6+node03/2*6
Job run at Thu Sep 15 at 10:16 on (node01:ncpus=6)+(node03:ncpus=6)
291.mgt1 hpcuser workq perl_job 33206 2 12 -- 00:30 E 00:00:01 node01/3*6+node03/3*6
Job run at Thu Sep 15 at 10:18 on (node01:ncpus=6)+(node03:ncpus=6)
292.mgt1 hpcuser workq perl_job 33390 2 12 -- 00:30 E 00:00:01 node01/4*6+node03/4*6
Job run at Thu Sep 15 at 10:25 on (node01:ncpus=6)+(node03:ncpus=6)
293.mgt1 hpcuser workq perl_job 33503 2 12 -- 00:30 E 00:00:00 node01/5*6+node03/5*6
Job run at Thu Sep 15 at 10:28 on (node01:ncpus=6)+(node03:ncpus=6)
294.mgt1 hpcuser workq perl_job 33572 2 12 -- 00:30 E 00:00:01 node01/6*6+node03/6*6
Job run at Thu Sep 15 at 10:29 on (node01:ncpus=6)+(node03:ncpus=6)
295.mgt1 hpcuser workq perl_job 33641 2 12 -- 00:30 E 00:00:03 node01/7*6+node03/7*6
Job run at Thu Sep 15 at 10:30 on (node01:ncpus=6)+(node03:ncpus=6)
310.mgt1 hpcuser workq perl_job -- 2 12 -- 00:30 Q -- --
Not Running: Not enough free nodes available
(base) [root@mgt1 pbs]#
```
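For one of the jobs that is stuck in E, would this be the right way to see what it is still waiting on?

```
# Full attributes of a job stuck in the E (exiting) state
qstat -f 288.mgt1 | grep -E 'job_state|comment|exec_host|exec_vnode'

# Trace the job through the server and scheduler logs (look back 2 days)
tracejob -n 2 288
```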
```
[root@mgt1 ~]# pbsnodes -a
node01
Mom = node01
Port = 15002
pbs_version = 19.1.1
ntype = PBS
state = job-busy
pcpus = 48
jobs = 288.mgt1/0, 288.mgt1/1, 288.mgt1/2, 288.mgt1/3, 288.mgt1/4, 288.mgt1/5, 289.mgt1/6, 289.mgt1/7, 289.mgt1/8, 289.mgt1/9, 289.mgt1/10, 289.mgt1/11, 290.mgt1/12, 290.mgt1/13, 290.mgt1/14, 290.mgt1/15, 290.mgt1/16, 290.mgt1/17, 291.mgt1/18, 291.mgt1/19, 291.mgt1/20, 291.mgt1/21, 291.mgt1/22, 291.mgt1/23, 292.mgt1/24, 292.mgt1/25, 292.mgt1/26, 292.mgt1/27, 292.mgt1/28, 292.mgt1/29, 293.mgt1/30, 293.mgt1/31, 293.mgt1/32, 293.mgt1/33, 293.mgt1/34, 293.mgt1/35, 294.mgt1/36, 294.mgt1/37, 294.mgt1/38, 294.mgt1/39, 294.mgt1/40, 294.mgt1/41, 295.mgt1/42, 295.mgt1/43, 295.mgt1/44, 295.mgt1/45, 295.mgt1/46, 295.mgt1/47
resources_available.arch = linux
resources_available.host = node01
resources_available.mem = 65184884kb
resources_available.ncpus = 48
resources_available.vnode = node01
resources_assigned.accelerator_memory = 0kb
resources_assigned.hbmem = 0kb
resources_assigned.mem = 0kb
resources_assigned.naccelerators = 0
resources_assigned.ncpus = 48
resources_assigned.vmem = 0kb
resv_enable = True
sharing = default_shared
last_state_change_time = Thu Sep 15 12:57:01 2022
last_used_time = Mon Sep 5 17:43:15 2022
node02
Mom = node02
Port = 15002
pbs_version = 19.1.1
ntype = PBS
state = state-unknown,down
pcpus = 48
resources_available.host = node02
resources_available.ncpus = 48
resources_available.vnode = node02
resources_assigned.accelerator_memory = 0kb
resources_assigned.hbmem = 0kb
resources_assigned.mem = 0kb
resources_assigned.naccelerators = 0
resources_assigned.ncpus = 0
resources_assigned.vmem = 0kb
comment = node down: communication closed
resv_enable = True
sharing = default_shared
last_state_change_time = Thu Sep 15 12:57:01 2022
last_used_time = Fri Nov 20 12:59:37 2020
node03
Mom = node03
Port = 15002
pbs_version = 19.1.1
ntype = PBS
state = job-busy
pcpus = 48
jobs = 288.mgt1/0, 288.mgt1/1, 288.mgt1/2, 288.mgt1/3, 288.mgt1/4, 288.mgt1/5, 289.mgt1/6, 289.mgt1/7, 289.mgt1/8, 289.mgt1/9, 289.mgt1/10, 289.mgt1/11, 290.mgt1/12, 290.mgt1/13, 290.mgt1/14, 290.mgt1/15, 290.mgt1/16, 290.mgt1/17, 291.mgt1/18, 291.mgt1/19, 291.mgt1/20, 291.mgt1/21, 291.mgt1/22, 291.mgt1/23, 292.mgt1/24, 292.mgt1/25, 292.mgt1/26, 292.mgt1/27, 292.mgt1/28, 292.mgt1/29, 293.mgt1/30, 293.mgt1/31, 293.mgt1/32, 293.mgt1/33, 293.mgt1/34, 293.mgt1/35, 294.mgt1/36, 294.mgt1/37, 294.mgt1/38, 294.mgt1/39, 294.mgt1/40, 294.mgt1/41, 295.mgt1/42, 295.mgt1/43, 295.mgt1/44, 295.mgt1/45, 295.mgt1/46, 295.mgt1/47
resources_available.arch = linux
resources_available.host = node03
resources_available.mem = 65184884kb
resources_available.ncpus = 48
resources_available.vnode = node03
resources_assigned.accelerator_memory = 0kb
resources_assigned.hbmem = 0kb
resources_assigned.mem = 0kb
resources_assigned.naccelerators = 0
resources_assigned.ncpus = 48
resources_assigned.vmem = 0kb
resv_enable = True
sharing = default_shared
last_state_change_time = Thu Sep 15 12:57:01 2022
last_used_time = Mon Sep 5 17:31:19 2022
node04
Mom = node04
Port = 15002
pbs_version = 19.1.1
ntype = PBS
state = state-unknown,down
pcpus = 24
resources_available.host = node04
resources_available.ncpus = 24
resources_available.vnode = node04
resources_assigned.accelerator_memory = 0kb
resources_assigned.hbmem = 0kb
resources_assigned.mem = 0kb
resources_assigned.naccelerators = 0
resources_assigned.ncpus = 0
resources_assigned.vmem = 0kb
comment = node down: communication closed
resv_enable = True
sharing = default_shared
last_state_change_time = Thu Sep 15 12:57:01 2022
last_used_time = Sat Feb 15 10:25:30 2020
```
The queued job runs as soon as I manually delete one of the old jobs. Please let us know how the resources can be returned to the pool automatically when a job finishes, so that we do not have to delete old jobs by hand. The jobs that have run and exited are currently stuck in state E; it seems that if they moved to state F, their resources would be returned to the pool without any manual intervention.
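At the moment the only workaround we have is deleting the old jobs one by one. Would something along the lines below be a reasonable way to clear the currently stuck jobs and have finished jobs go to F automatically? These are only the commands we are considering (assuming qdel -W force, the job_history_* server attributes, and the pbs service unit on the nodes), so please correct anything that is wrong.

```
# Force-delete the jobs stuck in E so their CPUs are released
for j in 288 289 290 291 292 293 294 295; do
    qdel -W force "${j}.mgt1"
done

# Keep finished jobs in history so they show up as state F after they complete
qmgr -c "set server job_history_enable = True"
qmgr -c "set server job_history_duration = 24:00:00"

# Restart the MoM on the affected hosts in case it stopped reporting
# job exits to the server (nothing is actually running on these nodes)
ssh node01 'systemctl restart pbs'
ssh node03 'systemctl restart pbs'
```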
The PBS submission script is as below:
```
[hpcuser@mgt1 TEST_PBS]$ cat test.pl
foreach my $n(1..580000)
{
print "$n\n";
}
(base) [hpcuser@mgt1 TEST_PBS]$ cat pbs_script.pbs
#!/bin/bash
#PBS -l nodes=2:ncpus=6
#PBS -l walltime=00:30:00
#PBS -N perl_job
cd /nfsshare/home/hpcuser/MBTD_Genetics/wilson/TEST_PBS
perl /nfsshare/home/hpcuser/MBTD_Genetics/wilson/TEST_PBS/test.pl > ./Result
(base) [hpcuser@mgt1 TEST_PBS]$ qsub pbs_script.pbs
```
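We also wondered whether the resource request itself could be part of the problem: -l nodes=2:ncpus=6 is the old-style syntax, and we are not sure it requests what we intend on PBS 19.1.1. Would the select/place form below be the correct equivalent for asking for 6 CPUs on each of 2 separate hosts? (This is just our guess at the equivalent request, not something we have verified.)

```
#!/bin/bash
#PBS -l select=2:ncpus=6       # two chunks of 6 CPUs each
#PBS -l place=scatter          # place the chunks on different hosts
#PBS -l walltime=00:30:00
#PBS -N perl_job

# Same job body as before
cd /nfsshare/home/hpcuser/MBTD_Genetics/wilson/TEST_PBS
perl /nfsshare/home/hpcuser/MBTD_Genetics/wilson/TEST_PBS/test.pl > ./Result
```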
Regards,
Aniesh Mathew