Hello,
While looking through our /var/log/messages, we noticed that the job slices for jobs 421319 and 421332 were created but never removed.
Normally a kill_job is issued at about the same time we see the “PBS Pro job slice removed” notification.
However, when we look at node0089’s mom_log, we see:
01/21/2020 16:54:51;0080;pbs_python;Hook;pbs_python;Elapsed time: 0.0095
01/21/2020 16:54:51;0008;pbs_mom;Job;421318.bright01-thx;kill_job
01/21/2020 16:55:22;0080;pbs_mom;Job;421319.bright01-thx;task 00000001 terminated
01/21/2020 16:55:22;0008;pbs_mom;Job;421319.bright01-thx;Terminated
01/21/2020 16:55:22;0100;pbs_mom;Job;421319.bright01-thx;task 00000001 cput= 5:58:30
01/21/2020 16:55:22;0008;pbs_mom;Job;421319.bright01-thx;kill_job
01/21/2020 16:55:22;0100;pbs_mom;Job;421319.bright01-thx;node0089 cput= 5:41:08 mem=226152kb
01/21/2020 16:55:22;0001;pbs_mom;Svr;pbs_mom;Cannot allocate memory (12) in fork_me, fork failed
01/21/2020 16:55:46;0008;pbs_mom;Job;421319.bright01-thx;no active tasks
01/21/2020 16:55:46;0080;pbs_mom;Job;421332.bright01-thx;task 00000001 terminated
01/21/2020 16:55:46;0008;pbs_mom;Job;421332.bright01-thx;Terminated
01/21/2020 16:55:46;0100;pbs_mom;Job;421332.bright01-thx;task 00000001 cput= 0:40:58
01/21/2020 16:55:46;0008;pbs_mom;Job;421332.bright01-thx;kill_job
01/21/2020 16:55:46;0100;pbs_mom;Job;421332.bright01-thx;node0089 cput= 0:38:58 mem=77857508kb
01/21/2020 16:55:46;0100;pbs_mom;Job;421319.bright01-thx;task 00000001 cput= 5:58:30
01/21/2020 16:55:46;0008;pbs_mom;Job;421319.bright01-thx;kill_job
01/21/2020 16:55:46;0100;pbs_mom;Job;421319.bright01-thx;node0089 cput= 5:41:08 mem=226152kb
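To line up fork failures with the affected jobs across a longer log, something like the following rough sketch could help. It assumes every mom_log record has the six semicolon-separated fields seen in the excerpt above (timestamp, event code, daemon, object type, object ID, message). The function names are made up for illustration and are not part of PBS Pro.

```python
from collections import defaultdict


def parse_mom_log_line(line):
    """Split one mom_log record into its six semicolon-separated fields.

    Returns None for lines that do not match the assumed layout.
    """
    parts = line.rstrip("\n").split(";", 5)
    if len(parts) != 6:
        return None
    keys = ("timestamp", "code", "daemon", "type", "id", "message")
    return dict(zip(keys, parts))


def jobs_near_fork_failures(lines):
    """Map each timestamp with a 'fork failed' record to the job IDs
    that were logged at that same timestamp (hypothetical helper)."""
    jobs_by_ts = defaultdict(set)
    failed_ts = []
    for line in lines:
        rec = parse_mom_log_line(line)
        if rec is None:
            continue
        if rec["type"] == "Job":
            jobs_by_ts[rec["timestamp"]].add(rec["id"])
        elif "fork failed" in rec["message"]:
            if rec["timestamp"] not in failed_ts:
                failed_ts.append(rec["timestamp"])
    return {ts: sorted(jobs_by_ts.get(ts, ())) for ts in failed_ts}
```

Fed the excerpt above, this would report job 421319.bright01-thx for the 16:55:22 failure. Matching on identical timestamps is only a heuristic; it does not prove that the failed fork belonged to that job.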
Is it possible that the “fork failed” error is preventing our cgroups hook from cleaning up previously created slices?
If so, could leftover slices lead to errors such as these:
01/26/2020 06:49:40;0100;pbs_python;Hook;pbs_python;create_job: Creating directory /sys/fs/cgroup/cpuset/pbspro.slice/pbspro-433147.bright01\x2dthx.slice/
01/26/2020 06:49:40;0002;pbs_python;Hook;pbs_python;configure_job: WARNING: mem_avail > vmem_avail
01/26/2020 06:49:40;0002;pbs_python;Hook;pbs_python;configure_job: Check reserve_amount and reserve_percent
01/26/2020 06:49:40;0100;pbs_python;Hook;pbs_python;configure_job: vmem not requested, assigning 125829120k to cgroup
01/26/2020 06:49:40;0100;pbs_python;Hook;pbs_python;Failed to assign job resources
01/26/2020 06:49:40;0100;pbs_python;Hook;pbs_python;Resyncing local job data
01/26/2020 06:49:40;0100;pbs_python;Hook;pbs_python;Job not found: 433147.bright01-thx
01/26/2020 06:49:40;0100;pbs_python;Hook;pbs_python;Failed to assign job resources
01/26/2020 06:49:40;0100;pbs_python;Hook;pbs_python;Resyncing local job data
01/26/2020 06:49:40;0100;pbs_python;Hook;pbs_python;Job not found: 433147.bright01-thx
01/26/2020 06:49:40;0100;pbs_python;Hook;pbs_python;Requeuing job 433147.bright01-thx
01/26/2020 06:49:40;0100;pbs_python;Hook;pbs_python;Run count for job 433147.bright01-thx: 1
01/26/2020 06:49:40;0080;pbs_python;Hook;pbs_python;[‘Traceback (most recent call last):’, ’ File “”, line 4660, in main’, ’ File “”, line 728, in invoke_handler’, ’ File “”, line 758, in _execjob_begin_handler’, ’ File “”, line 3896, in configure_job’, ‘CgroupProcessingError: Failed to assign resources’]
01/26/2020 06:49:40;0001;pbs_python;Hook;pbs_python;Processing error in pbs_cgroups handling execjob_begin event for job 433147.bright01-thx: CgroupProcessingError (‘Failed to assign resources’,)
01/26/2020 06:49:40;0080;pbs_python;Hook;pbs_python;Elapsed time: 0.7217
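In case it helps anyone checking for the same thing, here is a minimal sketch of how leftover slices could be listed on a node. It assumes the slice directories follow the naming seen in the create_job line above (`pbspro-<jobid>.slice`, with `-` in the job ID escaped as `\x2d`). The helper names are hypothetical, and the list of active job IDs has to come from somewhere else, such as qstat output.

```python
import re

# Matches escapes of the form \xNN, as seen in the slice directory name.
_ESCAPE = re.compile(r"\\x([0-9a-fA-F]{2})")


def slice_dir_to_job_id(name, prefix="pbspro-", suffix=".slice"):
    """Turn a slice directory name back into a job ID.

    Returns None if the name does not look like a job slice.
    """
    if not (name.startswith(prefix) and name.endswith(suffix)):
        return None
    core = name[len(prefix):-len(suffix)]
    return _ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), core)


def leftover_slices(dir_names, active_job_ids):
    """Return job IDs that still have a slice directory but are not active."""
    active = set(active_job_ids)
    leftovers = []
    for name in dir_names:
        job_id = slice_dir_to_job_id(name)
        if job_id is not None and job_id not in active:
            leftovers.append(job_id)
    return sorted(leftovers)
```

On a node, `dir_names` could be `os.listdir("/sys/fs/cgroup/cpuset/pbspro.slice")`, using the path from the log above. Other cgroup controllers may need the same check.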