We recently migrated our cluster to new versions of several pieces of software (PBS, RHEL, cluster manager), and we are now encountering a new issue with our cgroups.
Semi-randomly, one of the PBS cgroups that is meant for a single job (/pbspro.service/jobid/7423.server0/, for example) will contain processes from other jobs in addition to its own.
When the job that corresponds to that job ID finishes, it takes all of the other processes in the cgroup with it, which kills those other jobs.
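For anyone who wants to check their own nodes, here is a minimal sketch of one way to spot the stray processes. It rests on assumptions: that the job cgroup directory is named after the job ID and contains a cgroup.procs file, and that job processes still carry PBS_JOBID in their environment. Processes that lack the variable or have changed it will be missed. The function names are made up for illustration, and reading /proc/<pid>/environ for other users' processes needs root.

```python
from pathlib import Path


def read_pids(cgroup_dir):
    """Return the PIDs listed in <cgroup_dir>/cgroup.procs."""
    text = Path(cgroup_dir, "cgroup.procs").read_text()
    return [int(token) for token in text.split()]


def job_id_from_environ(raw):
    """Extract PBS_JOBID from the NUL-separated bytes of /proc/<pid>/environ."""
    for entry in raw.split(b"\0"):
        key, sep, value = entry.partition(b"=")
        if sep and key == b"PBS_JOBID":
            return value.decode(errors="replace")
    return None  # variable not set (or environment was overwritten)


def foreign_pids(cgroup_dir, proc_root="/proc"):
    """Return (pid, job_id) pairs whose PBS_JOBID differs from the cgroup name.

    Assumes the last path component of cgroup_dir is the job ID,
    e.g. .../pbspro.service/jobid/7423.server0
    """
    expected = Path(cgroup_dir).name
    mismatches = []
    for pid in read_pids(cgroup_dir):
        try:
            raw = Path(proc_root, str(pid), "environ").read_bytes()
        except OSError:
            continue  # process already exited, or we lack permission
        job_id = job_id_from_environ(raw)
        if job_id is not None and job_id != expected:
            mismatches.append((pid, job_id))
    return mismatches
```

Calling foreign_pids() with the full path to a job's cgroup directory (the mount point and controller prefix in front of /pbspro.service/jobid/7423.server0 depend on the node's cgroup setup) returns the PIDs in that cgroup that report a different job ID. An empty list means no mismatch was detected, not that the cgroup is guaranteed clean.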
From what I have found, our issue seems to be exactly the same as this
I can provide additional information if needed. Has anyone encountered this issue before? From our tests it is fairly widespread and could affect any node that is running multiple jobs, but it does not happen consistently.
Thank you,
Aaron