Jobs held because of cgroup processing error

Hello,

We frequently see many jobs being held because of an error that shows up in the mom logs:

CgroupProcessingError ('Failed to assign resources',)

The resources these jobs request are in fact available when the job tries to run, but jobs keep being held whenever they are assigned to the particular nodes that throw this error. Rebooting the node fixes the issue. Has anyone else noticed this behavior from time to time in their environment? Constantly draining nodes that are holding jobs is not very efficient.

As far as we can tell, the resource requests of the held jobs have nothing in common. Some are high memory and low CPU, some include GPUs, etc.


Hello there

Having jobs held because of the cgroup processing error is frustrating, but it is not common. The error usually means that the resources are available when the job runs, but something goes wrong when they are assigned to the job. Since rebooting a node fixes the problem only temporarily, the cause is probably specific to the node or its configuration. To narrow it down, look for patterns in the job assignment data, such as which nodes throw the error and which jobs are affected.
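One way to look for such patterns is to count how often the error appears per node. Here is a rough sketch in Python, not a tested tool. It assumes you have already collected the mom log lines for each node, and that job IDs look like `12345.server`. The function name, the regex, and the sample log lines are all made up for illustration, so adjust them to whatever your logs actually contain.

```python
import re
from collections import Counter

# The exception name as it appears in the error from the original post.
ERROR_MARKER = "CgroupProcessingError"

# ASSUMPTION: job IDs look like "12345.server" or "12345[7].server".
# This pattern is a guess and may need adjusting for your site.
JOB_ID_RE = re.compile(r"\b(\d+(?:\[\d*\])?\.[\w.-]+)")


def summarize_cgroup_errors(logs_by_node):
    """Count cgroup errors per node and collect the job IDs involved.

    logs_by_node maps a node name to an iterable of log lines.
    Returns (errors_per_node, jobs_per_node).
    """
    errors_per_node = Counter()
    jobs_per_node = {}
    for node, lines in logs_by_node.items():
        for line in lines:
            if ERROR_MARKER not in line:
                continue
            errors_per_node[node] += 1
            match = JOB_ID_RE.search(line)
            if match:
                jobs_per_node.setdefault(node, set()).add(match.group(1))
    return errors_per_node, jobs_per_node


if __name__ == "__main__":
    # Hypothetical log lines, invented only to exercise the function.
    sample = {
        "node01": [
            "Job;101.pbs01;CgroupProcessingError ('Failed to assign resources',)",
            "Job;102.pbs01;CgroupProcessingError ('Failed to assign resources',)",
        ],
        "node02": ["Job;103.pbs01;started"],
    }
    errors, jobs = summarize_cgroup_errors(sample)
    for node, count in errors.most_common():
        print(node, count, sorted(jobs.get(node, [])))
```

If a handful of nodes account for nearly all of the errors, that would support the idea that the problem is node-specific rather than tied to what the jobs request.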