Failed to assign resources to job

Hi all,
I am getting this error when submitting a job to a queue. The job runs fine with a small number of cores, but when I request all the cores it goes into the hold state. The following is the tracejob output. The MoM is rejecting the job with CgroupProcessingError ('Failed to assign resources',). I have checked similar discussions in the community but could not figure out what is wrong in this case. Any help is much appreciated.

Job: 5.master01

08/17/2021 17:17:19  S    enqueuing into workq, state 1 hop 1
08/17/2021 17:17:19  S    Job Queued at request of vamshi@master01.cm.cluster, owner =
                          vamshi@master01.cm.cluster, job name = resource-request-job, queue = workq
08/17/2021 17:17:20  S    send of job to node03 failed error = 15170 reject_msg=,Processing error in
                          pbs_cgroups handling execjob_begin event for job 5.master01:
                          CgroupProcessingError ('Failed to assign resources',)
08/17/2021 17:17:47  L    Considering job to run
08/17/2021 17:17:47  S    Job Run at request of Scheduler@master01.cm.cluster on exec_vnode
                          (node03:ncpus=40:ngpus=3:mem=1048576kb)
08/17/2021 17:17:47  L    Job run
08/17/2021 17:17:48  S    Unable to Run Job, MOM rejected
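
Since the rejection comes from the pbs_cgroups hook during the execjob_begin event, one thing worth checking is the hook's configuration file, e.g. whether it excludes a resource (ngpus, for instance) or a vnode. A sketch, assuming the standard hook export/import syntax described in the PBS Professional admin guide (file name pbs_cgroups.json is just a local choice):

```shell
# Export the current cgroups hook configuration to a file for inspection.
qmgr -c "export hook pbs_cgroups application/x-config default" > pbs_cgroups.json

# After inspecting/editing it, the config can be re-imported with:
#   qmgr -c "import hook pbs_cgroups application/x-config default pbs_cgroups.json"
```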

Here’s the output of pbsnodes -a

node01
     Mom = node01.cm.cluster
     Port = 15002
     pbs_version = 19.1.3
     ntype = PBS
     state = free
     pcpus = 40
     jobs = 16.master01/0, 16.master01/1, 16.master01/2, 16.master01/3, 16.master01/4, 16.master01/5, 16.master01/6, 16.master01/7, 16.master01/8, 16.master01/9, 16.master01/10, 16.master01/11, 16.master01/12, 16.master01/13, 16.master01/14, 16.master01/15, 16.master01/16, 16.master01/17, 16.master01/18, 16.master01/19
     resources_available.arch = linux
     resources_available.host = node01
     resources_available.mem = 131826688kb
     resources_available.ncpus = 40
     resources_available.ngpus = 3
     resources_available.vmem = 131761152kb
     resources_available.vnode = node01
     resources_assigned.accelerator_memory = 0kb
     resources_assigned.hbmem = 0kb
     resources_assigned.mem = 0kb
     resources_assigned.naccelerators = 0
     resources_assigned.ncpus = 20
     resources_assigned.vmem = 0kb
     queue = workq
     resv_enable = True
     sharing = default_shared
     last_state_change_time = Tue Aug 17 23:09:26 2021
     last_used_time = Tue Aug 17 20:00:35 2021

node02
     Mom = node02.cm.cluster
     Port = 15002
     pbs_version = 19.1.3
     ntype = PBS
     state = free
     pcpus = 40
     resources_available.arch = linux
     resources_available.host = node02
     resources_available.mem = 131826688kb
     resources_available.ncpus = 40
     resources_available.ngpus = 3
     resources_available.vmem = 131761152kb
     resources_available.vnode = node02
     resources_assigned.accelerator_memory = 0kb
     resources_assigned.hbmem = 0kb
     resources_assigned.mem = 0kb
     resources_assigned.naccelerators = 0
     resources_assigned.ncpus = 0
     resources_assigned.vmem = 0kb
     queue = workq
     resv_enable = True
     sharing = default_shared
     last_state_change_time = Tue Aug 17 18:57:20 2021
     last_used_time = Tue Aug 17 17:18:20 2021

node03
     Mom = node03.cm.cluster
     Port = 15002
     pbs_version = 19.1.3
     ntype = PBS
     state = free
     pcpus = 40
     resources_available.arch = linux
     resources_available.host = node03
     resources_available.mem = 131826688kb
     resources_available.ncpus = 40
     resources_available.ngpus = 3
     resources_available.vmem = 131761152kb
     resources_available.vnode = node03
     resources_assigned.accelerator_memory = 0kb
     resources_assigned.hbmem = 0kb
     resources_assigned.mem = 0kb
     resources_assigned.naccelerators = 0
     resources_assigned.ncpus = 0
     resources_assigned.vmem = 0kb
     queue = workq
     resv_enable = True
     sharing = default_shared
     last_state_change_time = Tue Aug 17 18:57:20 2021

node04
     Mom = node04.cm.cluster
     Port = 15002
     pbs_version = 19.1.3
     ntype = PBS
     state = free
     pcpus = 40
     resources_available.arch = linux
     resources_available.host = node04
     resources_available.mem = 131826688kb
     resources_available.ncpus = 40
     resources_available.ngpus = 3
     resources_available.vmem = 131761152kb
     resources_available.vnode = node04
     resources_assigned.accelerator_memory = 0kb
     resources_assigned.hbmem = 0kb
     resources_assigned.mem = 0kb
     resources_assigned.naccelerators = 0
     resources_assigned.ncpus = 0
     resources_assigned.vmem = 0kb
     queue = workq
     resv_enable = True
     sharing = default_shared
     last_state_change_time = Tue Aug 17 18:57:20 2021

node05
     Mom = node05.cm.cluster
     Port = 15002
     pbs_version = 19.1.3
     ntype = PBS
     state = free
     pcpus = 80
     jobs = 22.master01/0, 22.master01/1, 22.master01/2, 22.master01/3, 22.master01/4, 22.master01/5, 22.master01/6, 22.master01/7, 22.master01/8, 22.master01/9, 22.master01/10, 22.master01/11, 22.master01/12, 22.master01/13, 22.master01/14, 22.master01/15, 22.master01/16, 22.master01/17, 22.master01/18, 22.master01/19, 25.master01/20, 25.master01/21, 25.master01/22, 25.master01/23, 25.master01/24, 25.master01/25, 25.master01/26, 25.master01/27, 25.master01/28, 25.master01/29, 25.master01/30, 25.master01/31, 25.master01/32, 25.master01/33, 25.master01/34, 25.master01/35, 25.master01/36, 25.master01/37, 25.master01/38, 25.master01/39
     resources_available.arch = linux
     resources_available.host = node05
     resources_available.mem = 528192512kb
     resources_available.ncpus = 80
     resources_available.ngpus = 8
     resources_available.vmem = 528125952kb
     resources_available.vnode = node05
     resources_assigned.accelerator_memory = 0kb
     resources_assigned.hbmem = 0kb
     resources_assigned.mem = 0kb
     resources_assigned.naccelerators = 0
     resources_assigned.ncpus = 40
     resources_assigned.vmem = 0kb
     queue = p100
     resv_enable = True
     sharing = default_shared
     last_state_change_time = Tue Aug 17 18:57:20 2021
     last_used_time = Tue Aug 17 21:32:30 2021
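
As a side note, the free capacity per node in a listing like the above is simply resources_available.ncpus minus resources_assigned.ncpus. A small awk sketch over a saved copy of the output (the sample below is abbreviated, hypothetical data mirroring node01 and node02):

```shell
# Stand-in for saved `pbsnodes -a` output, trimmed to the ncpus lines.
cat > /tmp/pbsnodes.txt <<'EOF'
node01
     resources_available.ncpus = 40
     resources_assigned.ncpus = 20
node02
     resources_available.ncpus = 40
     resources_assigned.ncpus = 0
EOF

# Pair up available/assigned ncpus per node and print the difference.
awk '
/^[a-z]/                      { node = $1 }
/resources_available\.ncpus/  { avail[node] = $3 }
/resources_assigned\.ncpus/   { used[node]  = $3 }
END { for (n in avail) printf "%s free_ncpus=%d\n", n, avail[n] - used[n] }
' /tmp/pbsnodes.txt > /tmp/free.txt
cat /tmp/free.txt
```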

I hadn’t noticed this until now. I think the issue is the value of resources_assigned.ncpus; I do not remember ever assigning a value to it.
I tried to change it using qmgr as below, but it fails with the error “Cannot set attribute, read only or insufficient permission”. How can I change it back to 0? 0 means “all”, right? I am running qmgr as root, so I don’t understand why I get this error.

Qmgr: set node node01 resources_assigned.ncpus=40
qmgr obj=node01 svr=default: Cannot set attribute, read only or insufficient permission  resources_assigned.ncpus
qmgr: Error (15003) returned from server

Could you please disable the pbs_cgroups hook and try submitting jobs again?

  • qmgr -c "print hook @default"
  • qmgr -c "set hook pbs_cgroups enabled=false"
  • Also, resources_assigned.ncpus is automatically populated by the scheduler based on how many ncpus the jobs running on the compute node are using, so it cannot be set through qmgr.
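
After disabling, the hook state can be confirmed; a small sketch using qmgr's hook listing (attribute names per PBS Professional's qmgr interface):

```shell
# List the hook's attributes; `enabled = false` should appear in the output.
qmgr -c "list hook pbs_cgroups"
```

If jobs then run, the problem lies in the hook's configuration rather than in the node resources.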

Thank you @adarsh.
I have disabled the pbs_cgroups hook, but the jobs are still held.
Also, jobs never go to node03 and node04 in workq even though they are free. node01 and node02 are 50% occupied, but no additional jobs run there. How can I make sure all the resources are used?

Thank you @vamshi

Please follow the steps below:

  1. Run tracejob <job id> on the server.
  2. ssh to the execution node the job was sent to.
  3. source /etc/pbs.conf ; cd $PBS_HOME/mom_logs and open the file for that day (files are named YYYYMMDD).
  4. Check the log entries for that job.
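
The last step amounts to filtering that day's MoM log for the job id. A minimal sketch (the log lines below are fabricated stand-ins; on node03 the real file would be $PBS_HOME/mom_logs/20210817):

```shell
# Fabricated sample of a MoM log file, for illustration only.
cat > /tmp/20210817 <<'EOF'
08/17/2021 17:17:47;0008;pbs_mom;Job;5.master01;Started, pid = 12345
08/17/2021 17:17:47;0001;pbs_mom;Job;5.master01;Processing error in pbs_cgroups
08/17/2021 17:17:48;0008;pbs_mom;Job;6.master01;Started, pid = 12346
EOF

# Pull out only the entries for the job in question.
grep '5.master01' /tmp/20210817
```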

When a job is held:

  • The cause is usually system related: an authentication problem, a password not set, a missing home directory, or the user being unable to log on to the compute node.
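
The checks above can be run on the compute node with standard tools; a minimal sketch (shown for root so it runs anywhere — substitute the job owner, e.g. vamshi):

```shell
user=root   # replace with the job owner, e.g. vamshi

# Does the account resolve on this node (local passwd, LDAP, NSS)?
getent passwd "$user" >/dev/null && echo "account resolves"

# Does the user's home directory exist?
home=$(getent passwd "$user" | cut -d: -f6)
[ -d "$home" ] && echo "home directory present: $home"
```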

Please share the job script and the output of this command: $PBS_EXEC/unsupported/pbs_dtj