CgroupProcessingError ('Failed to assign resources')

datakid · December 6, 2019, 3:01am

We are seeing a lot of errors that look like this in the mom logs:
pbs_mom; Processing error in pbs_cgroups handling execjob_begin event for job
CgroupProcessingError ('Failed to assign resources',)

We discovered them when a lot of jobs ended up in the H (held) state after repeatedly trying and failing to launch.

What’s going on here? Why is the scheduler giving jobs to machines that don’t have the resources and how can we stop it?

mkaro · December 6, 2019, 6:45pm

This can happen when the hook fails to clean up a cgroup for a previously running job. We refer to these as “orphaned” cgroups. There is code in the hook that attempts to identify and clean them up periodically. When an orphan is present, the server and scheduler see the resources as available, but when the hook on the mom node tries to assign resources to the job, it finds there aren’t enough because of the orphan. Please check within /sys/fs/cgroup/* for directories named “pbspro”. You’ll likely find subdirectories within them that correspond to job IDs. If you find an orphan, look at the mom log for that job ID to see if there were any corresponding errors.
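
If it helps, the check described above could be scripted roughly like this. It is only a sketch, not part of the hook: the function names are made up for illustration, it assumes the /sys/fs/cgroup/<controller>/pbspro/<job ID> layout mentioned above (a systemd-style layout with slices would need a different pattern), and you have to supply the list of job IDs that are really running on the node yourself.

```python
import glob
import os


def find_job_cgroups(cgroup_root="/sys/fs/cgroup"):
    """Map each subdirectory name under */pbspro to the paths where it appears.

    Assumes job cgroups live at <cgroup_root>/<controller>/pbspro/<job ID>;
    adjust the pattern if your node uses a different layout.
    """
    found = {}
    pattern = os.path.join(cgroup_root, "*", "pbspro", "*")
    for path in sorted(glob.glob(pattern)):
        # Plain files here are cgroup control files, not job directories.
        if os.path.isdir(path):
            found.setdefault(os.path.basename(path), []).append(path)
    return found


def find_orphans(active_job_ids, cgroup_root="/sys/fs/cgroup"):
    """Return job cgroup directories whose name is not in active_job_ids.

    active_job_ids is supplied by the caller (hypothetical input: the IDs
    of jobs currently running on this node, gathered however you prefer).
    """
    active = set(active_job_ids)
    return {
        job: paths
        for job, paths in find_job_cgroups(cgroup_root).items()
        if job not in active
    }
```

Calling find_orphans() with the set of job IDs currently running on the node returns a dict of leftover job IDs and the cgroup directories they still occupy. Those IDs are the ones to search for in the mom log.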

If this isn’t the underlying issue, there are other things we can explore.
