Jobs held because of cgroup processing error

ryan · June 11, 2024, 7:27pm

Hello,

We frequently see many jobs get held because of an error that appears in the MoM logs:

CgroupProcessingError ('Failed to assign resources',)

The resources these jobs request are in fact available when the job tries to run, but jobs continue to be held whenever they are assigned to the particular nodes that throw this error. Rebooting the node fixes the issue. Has anyone else seen this behavior from time to time in their environment? Constantly draining nodes that are holding jobs is not very efficient.

As far as we can tell, there is no commonality in the resources requested by the jobs that get held. Some are high-memory and low-CPU, some request GPUs, and so on.

Romana · June 24, 2024, 10:27am

Hello there

Having jobs held because of the cgroup processing error is unpleasant, though it is not common. The error usually means that, even though the resources are available when the job runs, something goes wrong when they are assigned to the job. Since rebooting the node clears the problem temporarily, the cause is probably specific to the node and related to its configuration. To narrow it down, look through the MoM logs on the affected nodes for patterns around the failed assignments: which nodes fail, which jobs were involved, and what ran on the node just before.
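As a starting point for that pattern check, here is a minimal Python sketch that counts the lines mentioning the error per node and collects the job IDs found on those lines. It assumes you have already gathered the MoM log lines for each node yourself; the function name `summarize_cgroup_errors`, the input layout, and the job ID pattern are my own assumptions for illustration, not anything provided by PBS.

```python
import re
from collections import Counter

ERROR_MARKER = "CgroupProcessingError"

# Assumption: job IDs look like "1234.server" or "1234[5].server".
# Adjust the pattern if your site's IDs differ.
JOB_ID_RE = re.compile(r"\b\d+(?:\[\d*\])?\.[A-Za-z][\w-]*")


def summarize_cgroup_errors(lines_by_node):
    """Count error lines per node and collect job IDs seen on those lines.

    lines_by_node maps a node name to an iterable of log lines
    (a hypothetical layout; load the lines however suits your site).
    """
    counts = Counter()
    jobs = {}
    for node, lines in lines_by_node.items():
        for line in lines:
            if ERROR_MARKER not in line:
                continue
            counts[node] += 1
            match = JOB_ID_RE.search(line)
            if match:
                jobs.setdefault(node, set()).add(match.group(0))
    return counts, jobs


if __name__ == "__main__":
    # Made-up sample lines, only to show the call; real MoM log lines look different.
    sample = {
        "node01": [
            "job 1234.pbs01: CgroupProcessingError ('Failed to assign resources',)",
            "job 1235.pbs01: started",
        ],
        "node02": ["job 1236.pbs01: started"],
    }
    counts, jobs = summarize_cgroup_errors(sample)
    for node, n in counts.most_common():
        print(node, n, sorted(jobs.get(node, [])))
```

If a handful of nodes account for most of the hits, comparing their configuration and recent job history against healthy nodes is probably more productive than looking at the held jobs themselves.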
