Hello,
We frequently see many jobs being held because of the following error in the MoM logs:
CgroupProcessingError ('Failed to assign resources',)
The resources these jobs request are in fact available when the job tries to run, but jobs continue to be held whenever they are assigned to the particular nodes throwing this error. Rebooting the node fixes the issue, but has anyone else seen this behavior from time to time in their environment? Constantly draining nodes that are holding jobs is not very efficient.
As far as we can tell, the held jobs have nothing in common in the resources they request: some are high memory with few CPUs, some request GPUs, etc.