Failed to assign resources to job

Hi all,
I am getting this error when submitting a job to a queue. The job runs fine with a small number of cores, but when I request all the cores it goes into the hold state. The following is the tracejob output. The MoM is rejecting the job with CgroupProcessingError ('Failed to assign resources',). I have checked similar discussions in the community but could not figure out what is wrong in this case. Any help is much appreciated.

Job: 5.master01

08/17/2021 17:17:19  S    enqueuing into workq, state 1 hop 1
08/17/2021 17:17:19  S    Job Queued at request of vamshi@master01.cm.cluster, owner =
                          vamshi@master01.cm.cluster, job name = resource-request-job, queue = workq
08/17/2021 17:17:20  S    send of job to node03 failed error = 15170 reject_msg=,Processing error in
                          pbs_cgroups handling execjob_begin event for job 5.master01:
                          CgroupProcessingError ('Failed to assign resources',)
08/17/2021 17:17:47  L    Considering job to run
08/17/2021 17:17:47  S    Job Run at request of Scheduler@master01.cm.cluster on exec_vnode
                          (node03:ncpus=40:ngpus=3:mem=1048576kb)
08/17/2021 17:17:47  L    Job run
08/17/2021 17:17:48  S    Unable to Run Job, MOM rejected
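
Since the rejection comes from the pbs_cgroups hook during the execjob_begin event, one thing worth checking is the hook's configuration file, e.g. whether it excludes a resource (ngpus, for instance) or a vnode. A sketch, assuming the standard hook export/import syntax described in the PBS Professional admin guide (file name pbs_cgroups.json is just a local choice):

```shell
# Export the current cgroups hook configuration to a file for inspection.
qmgr -c "export hook pbs_cgroups application/x-config default" > pbs_cgroups.json

# After inspecting/editing it, the config can be re-imported with:
#   qmgr -c "import hook pbs_cgroups application/x-config default pbs_cgroups.json"
```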

Here’s the output of pbsnodes -a

node01
     Mom = node01.cm.cluster
     Port = 15002
     pbs_version = 19.1.3
     ntype = PBS
     state = free
     pcpus = 40
     jobs = 16.master01/0, 16.master01/1, 16.master01/2, 16.master01/3, 16.master01/4, 16.master01/5, 16.master01/6, 16.master01/7, 16.master01/8, 16.master01/9, 16.master01/10, 16.master01/11, 16.master01/12, 16.master01/13, 16.master01/14, 16.master01/15, 16.master01/16, 16.master01/17, 16.master01/18, 16.master01/19
     resources_available.arch = linux
     resources_available.host = node01
     resources_available.mem = 131826688kb
     resources_available.ncpus = 40
     resources_available.ngpus = 3
     resources_available.vmem = 131761152kb
     resources_available.vnode = node01
     resources_assigned.accelerator_memory = 0kb
     resources_assigned.hbmem = 0kb
     resources_assigned.mem = 0kb
     resources_assigned.naccelerators = 0
     resources_assigned.ncpus = 20
     resources_assigned.vmem = 0kb
     queue = workq
     resv_enable = True
     sharing = default_shared
     last_state_change_time = Tue Aug 17 23:09:26 2021
     last_used_time = Tue Aug 17 20:00:35 2021

node02
     Mom = node02.cm.cluster
     Port = 15002
     pbs_version = 19.1.3
     ntype = PBS
     state = free
     pcpus = 40
     resources_available.arch = linux
     resources_available.host = node02
     resources_available.mem = 131826688kb
     resources_available.ncpus = 40
     resources_available.ngpus = 3
     resources_available.vmem = 131761152kb
     resources_available.vnode = node02
     resources_assigned.accelerator_memory = 0kb
     resources_assigned.hbmem = 0kb
     resources_assigned.mem = 0kb
     resources_assigned.naccelerators = 0
     resources_assigned.ncpus = 0
     resources_assigned.vmem = 0kb
     queue = workq
     resv_enable = True
     sharing = default_shared
     last_state_change_time = Tue Aug 17 18:57:20 2021
     last_used_time = Tue Aug 17 17:18:20 2021

node03
     Mom = node03.cm.cluster
     Port = 15002
     pbs_version = 19.1.3
     ntype = PBS
     state = free
     pcpus = 40
     resources_available.arch = linux
     resources_available.host = node03
     resources_available.mem = 131826688kb
     resources_available.ncpus = 40
     resources_available.ngpus = 3
     resources_available.vmem = 131761152kb
     resources_available.vnode = node03
     resources_assigned.accelerator_memory = 0kb
     resources_assigned.hbmem = 0kb
     resources_assigned.mem = 0kb
     resources_assigned.naccelerators = 0
     resources_assigned.ncpus = 0
     resources_assigned.vmem = 0kb
     queue = workq
     resv_enable = True
     sharing = default_shared
     last_state_change_time = Tue Aug 17 18:57:20 2021

node04
     Mom = node04.cm.cluster
     Port = 15002
     pbs_version = 19.1.3
     ntype = PBS
     state = free
     pcpus = 40
     resources_available.arch = linux
     resources_available.host = node04
     resources_available.mem = 131826688kb
     resources_available.ncpus = 40
     resources_available.ngpus = 3
     resources_available.vmem = 131761152kb
     resources_available.vnode = node04
     resources_assigned.accelerator_memory = 0kb
     resources_assigned.hbmem = 0kb
     resources_assigned.mem = 0kb
     resources_assigned.naccelerators = 0
     resources_assigned.ncpus = 0
     resources_assigned.vmem = 0kb
     queue = workq
     resv_enable = True
     sharing = default_shared
     last_state_change_time = Tue Aug 17 18:57:20 2021

node05
     Mom = node05.cm.cluster
     Port = 15002
     pbs_version = 19.1.3
     ntype = PBS
     state = free
     pcpus = 80
     jobs = 22.master01/0, 22.master01/1, 22.master01/2, 22.master01/3, 22.master01/4, 22.master01/5, 22.master01/6, 22.master01/7, 22.master01/8, 22.master01/9, 22.master01/10, 22.master01/11, 22.master01/12, 22.master01/13, 22.master01/14, 22.master01/15, 22.master01/16, 22.master01/17, 22.master01/18, 22.master01/19, 25.master01/20, 25.master01/21, 25.master01/22, 25.master01/23, 25.master01/24, 25.master01/25, 25.master01/26, 25.master01/27, 25.master01/28, 25.master01/29, 25.master01/30, 25.master01/31, 25.master01/32, 25.master01/33, 25.master01/34, 25.master01/35, 25.master01/36, 25.master01/37, 25.master01/38, 25.master01/39
     resources_available.arch = linux
     resources_available.host = node05
     resources_available.mem = 528192512kb
     resources_available.ncpus = 80
     resources_available.ngpus = 8
     resources_available.vmem = 528125952kb
     resources_available.vnode = node05
     resources_assigned.accelerator_memory = 0kb
     resources_assigned.hbmem = 0kb
     resources_assigned.mem = 0kb
     resources_assigned.naccelerators = 0
     resources_assigned.ncpus = 40
     resources_assigned.vmem = 0kb
     queue = p100
     resv_enable = True
     sharing = default_shared
     last_state_change_time = Tue Aug 17 18:57:20 2021
     last_used_time = Tue Aug 17 21:32:30 2021
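
As a side note, the free capacity per node in a listing like the above is simply resources_available.ncpus minus resources_assigned.ncpus. A small awk sketch over a saved copy of the output (the sample below is abbreviated, hypothetical data mirroring node01 and node02):

```shell
# Stand-in for saved `pbsnodes -a` output, trimmed to the ncpus lines.
cat > /tmp/pbsnodes.txt <<'EOF'
node01
     resources_available.ncpus = 40
     resources_assigned.ncpus = 20
node02
     resources_available.ncpus = 40
     resources_assigned.ncpus = 0
EOF

# Pair up available/assigned ncpus per node and print the difference.
awk '
/^[a-z]/                      { node = $1 }
/resources_available\.ncpus/  { avail[node] = $3 }
/resources_assigned\.ncpus/   { used[node]  = $3 }
END { for (n in avail) printf "%s free_ncpus=%d\n", n, avail[n] - used[n] }
' /tmp/pbsnodes.txt > /tmp/free.txt
cat /tmp/free.txt
```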

I hadn’t noticed this until now. I think the issue is the value of resources_assigned.ncpus; I do not remember ever assigning a value to it.
I tried to change it using qmgr as below, but it fails with the error “Cannot set attribute, read only or insufficient permission”. How can I change it back to 0? 0 means “all”, right? I am running qmgr as root, so I don’t understand why I get this error.

Qmgr: set node node01 resources_assigned.ncpus=40
qmgr obj=node01 svr=default: Cannot set attribute, read only or insufficient permission  resources_assigned.ncpus
qmgr: Error (15003) returned from server

Could you please disable the pbs_cgroups hook and try submitting jobs again?

  • qmgr -c "print hook @default"
  • qmgr -c "set hook pbs_cgroups enabled=false"
  • Also, resources_assigned.ncpus is automatically populated by the scheduler based on how many ncpus the jobs running on the compute node are using, so it cannot be set through qmgr.
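
After disabling, the hook state can be confirmed; a small sketch using qmgr's hook listing (attribute names per PBS Professional's qmgr interface):

```shell
# List the hook's attributes; `enabled = false` should appear in the output.
qmgr -c "list hook pbs_cgroups"
```

If jobs then run, the problem lies in the hook's configuration rather than in the node resources.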

Thank you @adarsh.
I have disabled the pbs_cgroups hook, but the jobs are still held.
Also, jobs never go to node03 and node04 in workq even though they are free. node01 and node02 are 50% occupied, but no additional jobs run there. How can I make sure all the resources are used?

Thank you @vamshi

Please follow the steps below:

  1. Run tracejob <job id> on the server.
  2. ssh to the execution node the job was sent to.
  3. source /etc/pbs.conf ; cd $PBS_HOME/mom_logs and open the file for that day (files are named YYYYMMDD).
  4. Check the log entries for that job.
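
The last step amounts to filtering that day's MoM log for the job id. A minimal sketch (the log lines below are fabricated stand-ins; on node03 the real file would be $PBS_HOME/mom_logs/20210817):

```shell
# Fabricated sample of a MoM log file, for illustration only.
cat > /tmp/20210817 <<'EOF'
08/17/2021 17:17:47;0008;pbs_mom;Job;5.master01;Started, pid = 12345
08/17/2021 17:17:47;0001;pbs_mom;Job;5.master01;Processing error in pbs_cgroups
08/17/2021 17:17:48;0008;pbs_mom;Job;6.master01;Started, pid = 12346
EOF

# Pull out only the entries for the job in question.
grep '5.master01' /tmp/20210817
```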

When a job is held:

  • The cause is usually system related: an authentication problem, a password not set, a missing home directory, or the user being unable to log on to the compute node.
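
The checks above can be run on the compute node with standard tools; a minimal sketch (shown for root so it runs anywhere — substitute the job owner, e.g. vamshi):

```shell
user=root   # replace with the job owner, e.g. vamshi

# Does the account resolve on this node (local passwd, LDAP, NSS)?
getent passwd "$user" >/dev/null && echo "account resolves"

# Does the user's home directory exist?
home=$(getent passwd "$user" | cut -d: -f6)
[ -d "$home" ] && echo "home directory present: $home"
```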

Please share the job script and the output of this command: $PBS_EXEC/unsupported/pbs_dtj