I’m trying to divide the cluster into three parts, each containing all the nodes with the same type of hardware. I created a resource Qlist (just as the Administrator’s Guide says):
create resource Qlist
set resource Qlist type = string_array
set resource Qlist flag = h
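I can also list the resource back from qmgr as a sanity check (I assume list resource is available here, since create resource is):
qmgr -c "list resource Qlist"
which should report type = string_array and flag = h.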
One of the nodes, node20, is configured like this (output of qmgr -c print node node20):
create node node20 Mom=node20.localdomain
set node node20 state = job-busy
set node node20 resources_available.arch = linux
set node node20 resources_available.host = node20
set node node20 resources_available.mem = 97617396kb
set node node20 resources_available.ncpus = 24
set node node20 resources_available.Qlist = n24
set node node20 resources_available.Qlist += n24_96
set node node20 resources_available.vnode = node20
set node node20 resv_enable = True
set node node20 sharing = default_shared
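For what it’s worth, the value can also be checked on the node side with pbsnodes (another sanity check; I’m assuming the string_array prints comma-separated):
pbsnodes node20 | grep Qlist
which should show resources_available.Qlist = n24,n24_96.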
I have checked that Qlist does appear in the resources: line in /var/spool/pbs/sched_priv/sched_config.
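Concretely, that line looks roughly like this (the other names are just whatever the stock list already contained; the point is that Qlist is included):
resources: "ncpus, mem, arch, host, vnode, Qlist"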
After all this configuration, I submitted a job whose job script looks like:
#PBS -N test-hold-128
#PBS -l nodes=1:ppn=24
#PBS -l Qlist=n24_128
#PBS -q maintaince
##PBS -l Qqueue=maintaince
sleep 10000000
PBS then allocated that job onto node20. As you can see, node20 is only configured with Qlist = n24 and n24_96, while the job requests n24_128. In my mind this job should never be placed on node20. What am I doing wrong?
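(For reference, and as far as I understand select syntax, the per-chunk equivalent of the nodes/ppn request above would be the single directive below; I’m including it only in case the way the resource is requested matters here.)
#PBS -l select=1:ncpus=24:mpiprocs=24:Qlist=n24_128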
Here’s qstat -f of that job:
Job_Name = test-hold-128
Job_Owner = admin@node1
resources_used.cpupercent = 0
resources_used.cput = 00:00:00
resources_used.mem = 5836kb
resources_used.ncpus = 24
resources_used.vmem = 49812kb
resources_used.walltime = 00:14:43
job_state = R
queue = maintaince
server = wiz
Checkpoint = u
ctime = Mon May 21 19:59:26 2018
Error_Path = node1:/home/admin/test-hold-128.e7082
exec_host = node20/0*24
exec_vnode = (node20:ncpus=24)
Hold_Types = n
Join_Path = n
Keep_Files = n
Mail_Points = a
mtime = Mon May 21 20:09:21 2018
Output_Path = node1:/home/admin/test-hold-128.o7082
Priority = 0
qtime = Mon May 21 19:59:26 2018
Rerunable = True
Resource_List.mpiprocs = 24
Resource_List.ncpus = 24
Resource_List.nodect = 1
Resource_List.nodes = 1:ppn=24
Resource_List.place = scatter
Resource_List.preempt_targets = None
Resource_List.Qlist = n24_128
Resource_List.select = 1:ncpus=24:mpiprocs=24
stime = Mon May 21 20:09:21 2018
session_id = 90842
jobdir = /home/admin
substate = 42
Variable_List = ...
comment = Job run at Mon May 21 at 20:09 on (node20:ncpus=24)
etime = Mon May 21 19:59:26 2018
run_count = 1
Submit_arguments = hold.sh
project = _pbs_project_default
Here’s tracejob of that job:
tracejob 7082
Job: 7082.node1
05/21/2018 19:59:26 L Considering job to run
05/21/2018 19:59:26 L Insufficient amount of resource: ncpus
05/21/2018 19:59:26 S enqueuing into maintaince, state 1 hop 1
05/21/2018 19:59:26 S Job Queued at request of admin@node1, owner = admin@node1, job name = test-hold-128, queue = maintaince
05/21/2018 19:59:26 S Job Modified at request of Scheduler@node1
05/21/2018 19:59:26 A queue=maintaince
05/21/2018 19:59:27 L Considering job to run
05/21/2018 19:59:27 L Insufficient amount of resource: ncpus
05/21/2018 19:59:28 L Considering job to run
05/21/2018 19:59:28 L Insufficient amount of resource: ncpus
05/21/2018 19:59:28 L Considering job to run
05/21/2018 19:59:28 L Insufficient amount of resource: ncpus
05/21/2018 20:07:45 L Considering job to run
05/21/2018 20:07:45 L Insufficient amount of resource: ncpus
05/21/2018 20:09:21 L Considering job to run
05/21/2018 20:09:21 S Job Run at request of Scheduler@node1 on exec_vnode (node20:ncpus=24)
05/21/2018 20:09:21 S Job Modified at request of Scheduler@node1
05/21/2018 20:09:21 L Job run
05/21/2018 20:09:21 A user=admin group=users project=_pbs_project_default jobname=test-hold-128 queue=maintaince ctime=1526903966 qtime=1526903966 etime=1526903966 start=1526904561 exec_host=node20/0*24
exec_vnode=(node20:ncpus=24) Resource_List.mpiprocs=24 Resource_List.ncpus=24 Resource_List.nodect=1 Resource_List.nodes=1:ppn=24 Resource_List.place=scatter
Resource_List.preempt_targets=None Resource_List.Qlist=n24_128 Resource_List.select=1:ncpus=24:mpiprocs=24 resource_assigned.ncpus=24
(Please ignore the misspelling of “maintenance” in the queue name.)