We are seeing a lot of errors that look like this in the mom logs:
pbs_mom; Processing error in pbs_cgroups handling execjob_begin event for job
CgroupProcessingError ('Failed to assign resources',)
We discovered them when a lot of jobs were put into H state after repeatedly trying and failing to launch.
What’s going on here? Why is the scheduler giving jobs to machines that don’t have the resources and how can we stop it?
This can happen when the hook fails to clean up a cgroup for a previously running job. We refer to these as “orphaned” cgroups. There is code in the hook that attempts to identify and clean them up periodically. When an orphan is present, the server and scheduler see the resources as available, but when the hook on the mom node tries to assign resources to the job, it finds there aren’t enough because the orphan still holds them. Please check within /sys/fs/cgroup/* for directories named “pbspro”. You’ll likely find subdirectories within them that correspond to job IDs. If you find an orphan (a job ID directory for a job that is no longer running on that node), look at the mom log to see if there were any corresponding errors.
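If it helps, here is a rough Python sketch of that check. It only lists what is on disk: it walks /sys/fs/cgroup/*/pbspro and prints each subdirectory it finds. The function name find_job_cgroups is made up for this example, and the assumption that each subdirectory is named after a job ID comes from the description above, so the layout may differ on your system.

```python
import glob
import os


def find_job_cgroups(root="/sys/fs/cgroup", parent_name="pbspro"):
    """Map each cgroup controller to the subdirectories of its pbspro directory.

    Assumes the layout <root>/<controller>/<parent_name>/<job id>/ as
    described above; adjust root or parent_name if your site differs.
    """
    found = {}
    for parent in sorted(glob.glob(os.path.join(root, "*", parent_name))):
        if not os.path.isdir(parent):
            continue
        controller = os.path.basename(os.path.dirname(parent))
        # Only directories are candidates; plain files are cgroup control files.
        found[controller] = sorted(
            entry
            for entry in os.listdir(parent)
            if os.path.isdir(os.path.join(parent, entry))
        )
    return found


if __name__ == "__main__":
    for controller, job_dirs in find_job_cgroups().items():
        for job_dir in job_dirs:
            print(f"{controller}: {job_dir}")
```

Run it on the affected mom node and compare the output with the jobs that are actually running there. Any directory left over for a job that has already finished is a candidate orphan. The script does not remove anything.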
If this isn’t the underlying issue, there are other things we can explore.