Job processes ending up in the wrong cgroup

AaronJanssen · September 9, 2024, 9:17pm

We have recently migrated our cluster to new versions of quite a few pieces of our software (PBS, RHEL, Cluster manager) and we are encountering a new issue with our cgroups.

Semi-randomly, one of the PBS cgroups meant for a single job (/pbspro.service/jobid/7423.server0/, for example) ends up containing processes from other jobs as well.

When the job that corresponds to that job ID finishes, it takes all of the other processes with it, so those other jobs get killed.
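For anyone trying to confirm the same symptom on a node, here is a rough Python sketch that lists the PIDs in a job's cgroup and flags the ones that look like they belong to a different job. It assumes that job processes still carry PBS_JOBID in their environment (child processes can unset it, so a missing value is skipped rather than flagged) and that you run it with enough privileges to read /proc/<pid>/environ. The function names are made up for illustration; they are not part of PBS.

```python
from pathlib import Path


def pids_in_cgroup(cgroup_dir):
    """Return the PIDs listed in <cgroup_dir>/cgroup.procs."""
    procs_file = Path(cgroup_dir) / "cgroup.procs"
    return [int(token) for token in procs_file.read_text().split()]


def job_id_of(pid, proc_root="/proc"):
    """Return PBS_JOBID from the process environment, or None if absent/unreadable."""
    try:
        raw = (Path(proc_root) / str(pid) / "environ").read_bytes()
    except OSError:
        return None  # process already exited, or we lack permission
    # environ is a NUL-separated list of KEY=VALUE entries
    for entry in raw.split(b"\0"):
        key, sep, value = entry.partition(b"=")
        if sep and key == b"PBS_JOBID":
            return value.decode(errors="replace")
    return None


def foreign_pids(cgroup_dir, expected_job_id, proc_root="/proc"):
    """Map PID -> job ID for processes whose PBS_JOBID differs from the expected one."""
    mismatches = {}
    for pid in pids_in_cgroup(cgroup_dir):
        job_id = job_id_of(pid, proc_root)
        # Assumption: processes without PBS_JOBID cannot be attributed, so skip them
        if job_id is not None and job_id != expected_job_id:
            mismatches[pid] = job_id
    return mismatches
```

Call foreign_pids() with the job's cgroup directory (wherever the cgroup hierarchy is mounted on your node, ending in pbspro.service/jobid/7423.server0) and the job ID "7423.server0". A non-empty result means processes from other jobs are sitting in that cgroup.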

From what I can tell, our issue seems to be exactly the same as this

I can provide additional information if needed. Has anyone encountered this issue before? From our tests this is pretty widespread and could affect any node that has multiple jobs on it, but it does not happen consistently.

Thank you,
Aaron

AaronJanssen · September 12, 2024, 4:17pm

The external application in our case was Bright Cluster Manager. I guess it was a known bug.

The fix we used was to update the Bright daemons on our compute nodes.
