I have created a series of vnodes from a host in order to more easily keep track of GPU allocation and the associated control CPUs. However, jobs submitted to the queue never run, and they carry an erroneous comment: "Not Running: Host set host=node0073 has too few free resources". The queue being used does not contain node0073. This cluster is running PBS Pro 14.2.
I believe my vnode configuration is correct, but here it is for scrutiny:
$configversion 2
node0115: resources_available.ncpus = 0
node0115: resources_available.mem = 0
node0115[0]: resources_available.ncpus = 4
node0115[0]: resources_available.mem = 125gb
node0115[0]: resources_available.ngpus = 1
node0115[0]: resources_available.gpu_id = gpu0
node0115[0]: sharing = default_excl
node0115[1]: resources_available.ncpus = 4
node0115[1]: resources_available.mem = 125gb
node0115[1]: resources_available.ngpus = 1
node0115[1]: resources_available.gpu_id = gpu1
node0115[1]: sharing = default_excl
node0115[2]: resources_available.ncpus = 4
node0115[2]: resources_available.mem = 125gb
node0115[2]: resources_available.ngpus = 1
node0115[2]: resources_available.gpu_id = gpu2
node0115[2]: sharing = default_excl
node0115[3]: resources_available.ncpus = 4
node0115[3]: resources_available.mem = 125gb
node0115[3]: resources_available.ngpus = 1
node0115[3]: resources_available.gpu_id = gpu3
node0115[3]: sharing = default_excl
node0115[4]: resources_available.ncpus = 4
node0115[4]: resources_available.mem = 125gb
node0115[4]: resources_available.ngpus = 1
node0115[4]: resources_available.gpu_id = gpu4
node0115[4]: sharing = default_excl
node0115[5]: resources_available.ncpus = 4
node0115[5]: resources_available.mem = 125gb
node0115[5]: resources_available.ngpus = 1
node0115[5]: resources_available.gpu_id = gpu5
node0115[5]: sharing = default_excl
node0115[6]: resources_available.ncpus = 4
node0115[6]: resources_available.mem = 125gb
node0115[6]: resources_available.ngpus = 1
node0115[6]: resources_available.gpu_id = gpu6
node0115[6]: sharing = default_excl
node0115[7]: resources_available.ncpus = 4
node0115[7]: resources_available.mem = 125gb
node0115[7]: resources_available.ngpus = 1
node0115[7]: resources_available.gpu_id = gpu7
node0115[7]: sharing = default_excl
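In case the loading procedure matters: the custom gpu_id resource and the definition file were installed roughly as below (from memory, so the script name and paths are approximate):

$ qmgr -c "create resource gpu_id type=string, flag=h"
$ /opt/pbs/sbin/pbs_mom -s insert gpu_vnodes /tmp/gpu_vnodes.def
$ kill -HUP $(cat /var/spool/pbs/mom_priv/mom.lock)

The qmgr command was run against the server; the latter two were run as root on node0115 itself so the MoM would re-read its configuration.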
Here is the pbsnodes query for one of the virtual nodes:
$ pbsnodes -v node0115[0]
node0115[0]
Mom = node0115.thunder.ccast
ntype = PBS
state = free
resources_available.arch = linux
resources_available.gpu_id = gpu0
resources_available.host = node0115
resources_available.mem = 125gb
resources_available.ncpus = 4
resources_available.ngpus = 1
resources_available.vnode = node0115[0]
resources_assigned.accelerator_memory = 0kb
resources_assigned.condo = 0
resources_assigned.gpuHost = 0
resources_assigned.mem = 0kb
resources_assigned.mic_cores = 0
resources_assigned.mic_density = 0kb
resources_assigned.mic_size = 0kb
resources_assigned.naccelerators = 0
resources_assigned.ncpus = 0
resources_assigned.netwins = 0
resources_assigned.ngpus = 0
resources_assigned.nmics = 0
resources_assigned.vmem = 0kb
resv_enable = True
sharing = default_excl
As far as I can tell, all the values are listed as expected.
And here is the host itself:
$ pbsnodes node0115
node0115
Mom = node0115.thunder.ccast
ntype = PBS
state = free
pcpus = 36
resv_enable = True
sharing =
resources_available.arch = linux
resources_available.host = node0115
resources_available.mem = 1048576000kb
resources_available.ncpus = 32
resources_available.plist = broadwell
resources_available.qlist = devGpuHost
resources_available.vnode =
resources_available.accelerator_memory = 0kb
resources_available.condo = 0
resources_available.gpuHost = 0
resources_available.mic_cores = 0
resources_available.mic_density = 0kb
resources_available.mic_size = 0kb
resources_available.naccelerators = 0
resources_available.netwins = 0
resources_available.ngpus = 8
resources_available.nmics = 0
resources_available.vmem = 0kb
resources_available.gpu_id =
resources_assigned.mem = 0kb
resources_assigned.ncpus = 0
resources_assigned.accelerator_memory = 0kb
resources_assigned.condo = 0
resources_assigned.gpuHost = 0
resources_assigned.mic_cores = 0
resources_assigned.mic_density = 0kb
resources_assigned.mic_size = 0kb
resources_assigned.naccelerators = 0
resources_assigned.netwins = 0
resources_assigned.ngpus = 0
resources_assigned.nmics = 0
resources_assigned.vmem = 0kb
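For anyone wanting to double-check the queue-membership claim, this is roughly how I would verify it, assuming gpu-devel is tied to its vnodes through the site's custom qlist resource (the awk simply carries along the unindented vnode names from the pbsnodes -a output):

$ qmgr -c "list queue gpu-devel"
$ pbsnodes -a | awk '/^[^ ]/ {name=$1} /qlist = devGpuHost/ {print name}'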
The PBS script header used in job submission:
#!/bin/bash
#PBS -q gpu-devel
#PBS -N c0-3g0
#PBS -j oe
#PBS -l select=1:mem=24gb:ncpus=4:ngpus=1
#PBS -l walltime=168:00:00
#PBS -m abe
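Each vnode advertises ncpus=4, ngpus=1, and mem=125gb, so the single chunk requested here should fit entirely within one vnode. For quicker iteration, an equivalent one-liner (short walltime and a trivial payload, just for testing) would be something like:

$ qsub -q gpu-devel -l select=1:mem=24gb:ncpus=4:ngpus=1 -l walltime=00:10:00 -- /bin/hostname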
The comment output from the submitted job:
$ qstat -s 234403
                                                            Req'd  Req'd   Elap
Job ID          Username Queue    Jobname    SessID NDS TSK Memory Time  S Time
--------------- -------- -------- ---------- ------ --- --- ------ ----- - -----
234403.bright01 user01   gpu-deve c0-3g0        --   1   4   24gb  168:0 Q   --
   Not Running: Host set host=node0073 has too few free resources
As mentioned, the node listed in the comment is not in the gpu-devel queue. Interestingly, it is the last node listed before the vnodes in the output of
$ pbsnodes -avSj
mem ncpus nmics ngpus
vnode state njobs run susp f/t f/t f/t f/t jobs
...
node0081 free 2 2 0 70gb/100gb 4/44 0/0 0/0 234646,234648
node0080 free 2 2 0 70gb/100gb 4/44 0/0 0/0 233059,234649
node0079 job-busy 1 1 0 80gb/100gb 0/44 0/0 0/0 232747
node0078 free 1 1 0 72gb/100gb 4/44 0/0 0/0 224919
node0065 job-busy 2 2 0 38gb/64gb 0/20 0/0 0/0 234292,234366
node0073 offline 1 1 0 43gb/63gb 0/20 0/0 0/0 233923
node0115[0] free 0 0 0 125gb/125gb 4/4 0/0 1/1 --
node0115[1] free 0 0 0 125gb/125gb 4/4 0/0 1/1 --
node0115[2] free 0 0 0 125gb/125gb 4/4 0/0 1/1 --
node0115[3] free 0 0 0 125gb/125gb 4/4 0/0 1/1 --
node0115[4] free 0 0 0 125gb/125gb 4/4 0/0 1/1 --
node0115[5] free 0 0 0 125gb/125gb 4/4 0/0 1/1 --
node0115[6] free 0 0 0 125gb/125gb 4/4 0/0 1/1 --
node0115[7] free 0 0 0 125gb/125gb 4/4 0/0 1/1 --
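One isolation test I have in mind is to pin the chunk to a specific vnode via the built-in vnode chunk resource; if that job also sits queued, the problem would be with the vnodes themselves rather than with host-set selection. Something like this (the select is quoted so the shell does not glob the brackets):

$ qsub -q gpu-devel -l select='1:vnode=node0115[0]:ncpus=4:ngpus=1:mem=24gb' -l walltime=00:10:00 -- /bin/hostname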
Has anyone experienced a similar error before, or does anyone have an idea why it is occurring?