Cgroups hook not cleaning up slices appropriately?

Hello,

While looking through our /var/log/messages, we noticed that the job slices for jobs 421319 and 421332 were created but never removed.

Normally a kill_job is issued at about the same time we see the “PBS Pro job slice removed” notification.

However, when we look at node0089’s mom_log, we see:

01/21/2020 16:54:51;0080;pbs_python;Hook;pbs_python;Elapsed time: 0.0095

01/21/2020 16:54:51;0008;pbs_mom;Job;421318.bright01-thx;kill_job

01/21/2020 16:55:22;0080;pbs_mom;Job;421319.bright01-thx;task 00000001 terminated

01/21/2020 16:55:22;0008;pbs_mom;Job;421319.bright01-thx;Terminated

01/21/2020 16:55:22;0100;pbs_mom;Job;421319.bright01-thx;task 00000001 cput= 5:58:30

01/21/2020 16:55:22;0008;pbs_mom;Job;421319.bright01-thx;kill_job

01/21/2020 16:55:22;0100;pbs_mom;Job;421319.bright01-thx;node0089 cput= 5:41:08 mem=226152kb

01/21/2020 16:55:22;0001;pbs_mom;Svr;pbs_mom;Cannot allocate memory (12) in fork_me, fork failed

01/21/2020 16:55:46;0008;pbs_mom;Job;421319.bright01-thx;no active tasks

01/21/2020 16:55:46;0080;pbs_mom;Job;421332.bright01-thx;task 00000001 terminated

01/21/2020 16:55:46;0008;pbs_mom;Job;421332.bright01-thx;Terminated

01/21/2020 16:55:46;0100;pbs_mom;Job;421332.bright01-thx;task 00000001 cput= 0:40:58

01/21/2020 16:55:46;0008;pbs_mom;Job;421332.bright01-thx;kill_job

01/21/2020 16:55:46;0100;pbs_mom;Job;421332.bright01-thx;node0089 cput= 0:38:58 mem=77857508kb

01/21/2020 16:55:46;0100;pbs_mom;Job;421319.bright01-thx;task 00000001 cput= 5:58:30

01/21/2020 16:55:46;0008;pbs_mom;Job;421319.bright01-thx;kill_job

01/21/2020 16:55:46;0100;pbs_mom;Job;421319.bright01-thx;node0089 cput= 5:41:08 mem=226152kb
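To check how often this pattern occurs, here is a rough sketch of how the mom_log could be scanned for kill_job entries logged in the same second as a "fork failed" error. It assumes only the semicolon-separated layout visible in the lines above (timestamp;event code;daemon;object type;object id;message). The field names and the helper names are my own, not anything defined by PBS Pro.

```python
from collections import namedtuple

# Field names are hypothetical labels for the six semicolon-separated
# columns seen in the mom_log excerpt above.
MomLogEntry = namedtuple(
    "MomLogEntry", "timestamp event daemon obj_type obj_id message"
)


def parse_mom_log_line(line):
    """Split one mom_log line into its six fields, or return None."""
    # maxsplit=5 keeps any ';' inside the message text intact.
    parts = line.rstrip("\n").split(";", 5)
    if len(parts) != 6:
        return None
    return MomLogEntry(*parts)


def jobs_killed_near_fork_failure(lines):
    """Return sorted job IDs whose kill_job shares a timestamp with a fork failure."""
    entries = [e for e in map(parse_mom_log_line, lines) if e is not None]
    # Timestamps (one-second resolution) at which a fork failure was logged.
    failed_at = {e.timestamp for e in entries if "fork failed" in e.message}
    return sorted(
        {
            e.obj_id
            for e in entries
            if e.obj_type == "Job"
            and e.message == "kill_job"
            and e.timestamp in failed_at
        }
    )
```

Feeding it the lines of a mom_log file (for example via `open(path)`) gives the list of jobs worth cross-checking against /var/log/messages. Matching on the exact second is a crude heuristic, so a kill_job that lands a second before or after the failure would be missed.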

Is it possible that the “fork failed” error is preventing our Cgroups hook from cleaning up slices previously created?

That is, could errors such as these be a result of the leftover slices?

01/26/2020 06:49:40;0100;pbs_python;Hook;pbs_python;create_job: Creating directory /sys/fs/cgroup/cpuset/pbspro.slice/pbspro-433147.bright01\x2dthx.slice/

01/26/2020 06:49:40;0002;pbs_python;Hook;pbs_python;configure_job: WARNING: mem_avail > vmem_avail

01/26/2020 06:49:40;0002;pbs_python;Hook;pbs_python;configure_job: Check reserve_amount and reserve_percent

01/26/2020 06:49:40;0100;pbs_python;Hook;pbs_python;configure_job: vmem not requested, assigning 125829120k to cgroup

01/26/2020 06:49:40;0100;pbs_python;Hook;pbs_python;Failed to assign job resources

01/26/2020 06:49:40;0100;pbs_python;Hook;pbs_python;Resyncing local job data

01/26/2020 06:49:40;0100;pbs_python;Hook;pbs_python;Job not found: 433147.bright01-thx

01/26/2020 06:49:40;0100;pbs_python;Hook;pbs_python;Failed to assign job resources

01/26/2020 06:49:40;0100;pbs_python;Hook;pbs_python;Resyncing local job data

01/26/2020 06:49:40;0100;pbs_python;Hook;pbs_python;Job not found: 433147.bright01-thx

01/26/2020 06:49:40;0100;pbs_python;Hook;pbs_python;Requeuing job 433147.bright01-thx

01/26/2020 06:49:40;0100;pbs_python;Hook;pbs_python;Run count for job 433147.bright01-thx: 1

01/26/2020 06:49:40;0080;pbs_python;Hook;pbs_python;[‘Traceback (most recent call last):’, ’ File “”, line 4660, in main’, ’ File “”, line 728, in invoke_handler’, ’ File “”, line 758, in _execjob_begin_handler’, ’ File “”, line 3896, in configure_job’, ‘CgroupProcessingError: Failed to assign resources’]

01/26/2020 06:49:40;0001;pbs_python;Hook;pbs_python;Processing error in pbs_cgroups handling execjob_begin event for job 433147.bright01-thx: CgroupProcessingError (‘Failed to assign resources’,)

01/26/2020 06:49:40;0080;pbs_python;Hook;pbs_python;Elapsed time: 0.7217
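In case it helps anyone reproduce this, below is a minimal sketch of how leftover slices could be listed on a node. It assumes the cpuset directory layout shown in the create_job line above (`pbspro-<jobid>.slice`, with `-` in the job ID escaped as `\x2d`) and that the list of jobs currently running on the node is obtained separately. The function names are hypothetical, and other cgroup controllers may need the same check under their own paths.

```python
import os
import re

# Path taken from the create_job log line above; adjust for other controllers.
SLICE_ROOT = "/sys/fs/cgroup/cpuset/pbspro.slice"
SLICE_RE = re.compile(r"^pbspro-(?P<jobid>.+)\.slice$")


def slice_to_jobid(dirname):
    """Map 'pbspro-<escaped jobid>.slice' back to the job ID, or None."""
    match = SLICE_RE.match(dirname)
    if match is None:
        return None
    # Undo "\xNN" escaping, e.g. "\x2d" stands for "-".
    return re.sub(
        r"\\x([0-9a-fA-F]{2})",
        lambda h: chr(int(h.group(1), 16)),
        match.group("jobid"),
    )


def leftover_slices(dirnames, active_jobids):
    """Return slice directory names whose job is not in active_jobids."""
    active = set(active_jobids)
    leftovers = []
    for name in sorted(dirnames):
        jobid = slice_to_jobid(name)
        if jobid is not None and jobid not in active:
            leftovers.append(name)
    return leftovers


def list_slice_dirs(root=SLICE_ROOT):
    """List entries under the slice root, or an empty list if it is absent."""
    return os.listdir(root) if os.path.isdir(root) else []
```

Calling `leftover_slices(list_slice_dirs(), running_jobs)` with the job IDs known to be running on the node prints nothing useful by itself, but the returned names are the candidates for slices that were created and never removed.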