Hi all,
I am getting this error when submitting a job to a queue. The job runs with a small number of cores, but when I request all the cores on a node it goes into the held state. The tracejob output is below. MoM is rejecting the job with CgroupProcessingError (‘Failed to assign resources’,). I have checked similar discussions in the community, but could not figure out what’s wrong in this case. Any help is much appreciated.
Job: 5.master01
08/17/2021 17:17:19 S enqueuing into workq, state 1 hop 1
08/17/2021 17:17:19 S Job Queued at request of vamshi@master01.cm.cluster, owner =
vamshi@master01.cm.cluster, job name = resource-request-job, queue = workq
08/17/2021 17:17:20 S send of job to node03 failed error = 15170 reject_msg=,Processing error in
pbs_cgroups handling execjob_begin event for job 5.master01:
CgroupProcessingError ('Failed to assign resources',)
08/17/2021 17:17:47 L Considering job to run
08/17/2021 17:17:47 S Job Run at request of Scheduler@master01.cm.cluster on exec_vnode
(node03:ncpus=40:ngpus=3:mem=1048576kb)
08/17/2021 17:17:47 L Job run
08/17/2021 17:17:48 S Unable to Run Job, MOM rejected
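To make the request in the exec_vnode line easier to read, here is a minimal Python sketch that splits it into a vnode name and its resources. The function names parse_exec_vnode and kb_to_gib are my own and not part of PBS, and it assumes the simple "(vnode:key=value:...)" form shown above, optionally with several chunks joined by "+".

```python
import re


def parse_exec_vnode(spec):
    """Split an exec_vnode string into (vnode, resources) pairs.

    Hypothetical helper, not a PBS API. Assumes chunks of the form
    '(vnode:key=value:key=value)', optionally joined by '+'.
    """
    result = []
    for chunk in re.findall(r"\(([^)]*)\)", spec):
        # A chunk may itself list several vnodes separated by '+'.
        for part in chunk.split("+"):
            name, *pairs = part.split(":")
            resources = dict(pair.split("=", 1) for pair in pairs)
            result.append((name, resources))
    return result


def kb_to_gib(value):
    """Convert a size such as '1048576kb' to GiB (assumes a 'kb' suffix)."""
    if value.endswith("kb"):
        value = value[:-2]
    return int(value) / (1024 * 1024)


if __name__ == "__main__":
    for vnode, res in parse_exec_vnode("(node03:ncpus=40:ngpus=3:mem=1048576kb)"):
        print(vnode, res, kb_to_gib(res["mem"]), "GiB")
```

For the job above this gives node03 with ncpus=40, ngpus=3 and mem of 1 GiB, i.e. the request is for every CPU and GPU that pbsnodes reports for node03.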
Here’s the output of pbsnodes -a:
node01
Mom = node01.cm.cluster
Port = 15002
pbs_version = 19.1.3
ntype = PBS
state = free
pcpus = 40
jobs = 16.master01/0, 16.master01/1, 16.master01/2, 16.master01/3, 16.master01/4, 16.master01/5, 16.master01/6, 16.master01/7, 16.master01/8, 16.master01/9, 16.master01/10, 16.master01/11, 16.master01/12, 16.master01/13, 16.master01/14, 16.master01/15, 16.master01/16, 16.master01/17, 16.master01/18, 16.master01/19
resources_available.arch = linux
resources_available.host = node01
resources_available.mem = 131826688kb
resources_available.ncpus = 40
resources_available.ngpus = 3
resources_available.vmem = 131761152kb
resources_available.vnode = node01
resources_assigned.accelerator_memory = 0kb
resources_assigned.hbmem = 0kb
resources_assigned.mem = 0kb
resources_assigned.naccelerators = 0
resources_assigned.ncpus = 20
resources_assigned.vmem = 0kb
queue = workq
resv_enable = True
sharing = default_shared
last_state_change_time = Tue Aug 17 23:09:26 2021
last_used_time = Tue Aug 17 20:00:35 2021
node02
Mom = node02.cm.cluster
Port = 15002
pbs_version = 19.1.3
ntype = PBS
state = free
pcpus = 40
resources_available.arch = linux
resources_available.host = node02
resources_available.mem = 131826688kb
resources_available.ncpus = 40
resources_available.ngpus = 3
resources_available.vmem = 131761152kb
resources_available.vnode = node02
resources_assigned.accelerator_memory = 0kb
resources_assigned.hbmem = 0kb
resources_assigned.mem = 0kb
resources_assigned.naccelerators = 0
resources_assigned.ncpus = 0
resources_assigned.vmem = 0kb
queue = workq
resv_enable = True
sharing = default_shared
last_state_change_time = Tue Aug 17 18:57:20 2021
last_used_time = Tue Aug 17 17:18:20 2021
node03
Mom = node03.cm.cluster
Port = 15002
pbs_version = 19.1.3
ntype = PBS
state = free
pcpus = 40
resources_available.arch = linux
resources_available.host = node03
resources_available.mem = 131826688kb
resources_available.ncpus = 40
resources_available.ngpus = 3
resources_available.vmem = 131761152kb
resources_available.vnode = node03
resources_assigned.accelerator_memory = 0kb
resources_assigned.hbmem = 0kb
resources_assigned.mem = 0kb
resources_assigned.naccelerators = 0
resources_assigned.ncpus = 0
resources_assigned.vmem = 0kb
queue = workq
resv_enable = True
sharing = default_shared
last_state_change_time = Tue Aug 17 18:57:20 2021
node04
Mom = node04.cm.cluster
Port = 15002
pbs_version = 19.1.3
ntype = PBS
state = free
pcpus = 40
resources_available.arch = linux
resources_available.host = node04
resources_available.mem = 131826688kb
resources_available.ncpus = 40
resources_available.ngpus = 3
resources_available.vmem = 131761152kb
resources_available.vnode = node04
resources_assigned.accelerator_memory = 0kb
resources_assigned.hbmem = 0kb
resources_assigned.mem = 0kb
resources_assigned.naccelerators = 0
resources_assigned.ncpus = 0
resources_assigned.vmem = 0kb
queue = workq
resv_enable = True
sharing = default_shared
last_state_change_time = Tue Aug 17 18:57:20 2021
node05
Mom = node05.cm.cluster
Port = 15002
pbs_version = 19.1.3
ntype = PBS
state = free
pcpus = 80
jobs = 22.master01/0, 22.master01/1, 22.master01/2, 22.master01/3, 22.master01/4, 22.master01/5, 22.master01/6, 22.master01/7, 22.master01/8, 22.master01/9, 22.master01/10, 22.master01/11, 22.master01/12, 22.master01/13, 22.master01/14, 22.master01/15, 22.master01/16, 22.master01/17, 22.master01/18, 22.master01/19, 25.master01/20, 25.master01/21, 25.master01/22, 25.master01/23, 25.master01/24, 25.master01/25, 25.master01/26, 25.master01/27, 25.master01/28, 25.master01/29, 25.master01/30, 25.master01/31, 25.master01/32, 25.master01/33, 25.master01/34, 25.master01/35, 25.master01/36, 25.master01/37, 25.master01/38, 25.master01/39
resources_available.arch = linux
resources_available.host = node05
resources_available.mem = 528192512kb
resources_available.ncpus = 80
resources_available.ngpus = 8
resources_available.vmem = 528125952kb
resources_available.vnode = node05
resources_assigned.accelerator_memory = 0kb
resources_assigned.hbmem = 0kb
resources_assigned.mem = 0kb
resources_assigned.naccelerators = 0
resources_assigned.ncpus = 40
resources_assigned.vmem = 0kb
queue = p100
resv_enable = True
sharing = default_shared
last_state_change_time = Tue Aug 17 18:57:20 2021
last_used_time = Tue Aug 17 21:32:30 2021
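In case it helps anyone compare the nodes, here is a rough Python sketch that reads pbsnodes -a output in the form pasted above and reports the unassigned CPUs per node. The names parse_pbsnodes and free_ncpus are my own, not PBS functions, and the sketch assumes that every attribute line contains " = ", that any other non-empty line is a node name, and that long lines are not wrapped.

```python
def parse_pbsnodes(text):
    """Parse `pbsnodes -a` style text into {node_name: {attribute: value}}.

    Hypothetical helper. Assumes attribute lines contain ' = ' and that
    any other non-empty line starts a new node.
    """
    nodes = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if " = " in line:
            if current is None:
                continue  # attribute before any node name; ignore it
            key, value = line.split(" = ", 1)
            nodes[current][key] = value
        else:
            current = line
            nodes[current] = {}
    return nodes


def free_ncpus(attrs):
    """CPUs reported as available minus CPUs currently assigned."""
    available = int(attrs.get("resources_available.ncpus", 0))
    assigned = int(attrs.get("resources_assigned.ncpus", 0))
    return available - assigned


if __name__ == "__main__":
    import sys

    # Usage: pbsnodes -a | python this_script.py
    for name, attrs in parse_pbsnodes(sys.stdin.read()).items():
        print(name, attrs.get("state"), free_ncpus(attrs))
```

On the output above this would report 20 free CPUs on node01, 40 each on node02, node03 and node04, and 40 on node05.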