Cgroups not cleaned up after job kill, subsequent jobs unable to assign resources

Hi,
We hit an issue after a job was killed. Its processes took longer than the kill_delay of 180s to exit, so a force exit was initiated.

06/26/2023 18:57:25;0008;pbs_mom;Job;16033.hpc-p-fea-pbs1;kill_job
06/26/2023 18:57:25;0080;pbs_mom;Job;16033.hpc-p-fea-pbs1;task 00000001 terminated


06/26/2023 19:00:27;0080;pbs_mom;Job;16033.hpc-p-fea-pbs1;task 00000001 force exited
06/26/2023 19:00:27;0008;pbs_mom;Job;16033.hpc-p-fea-pbs1;Terminated
06/26/2023 19:00:28;0100;pbs_mom;Job;16033.hpc-p-fea-pbs1;task 00000001 cput=23:52:59
06/26/2023 19:00:28;0008;pbs_mom;Job;16033.hpc-p-fea-pbs1;kill_job

For some unknown reason the processes survived the force exit.

06/26/2023 19:00:50;0100;pbs_python;Hook;pbs_python;_kill_tasks: PID 37446 survived: ['Name:\tstandard', 'State:\tD (disk sleep)', 'Uid:\t1570\t1570\t1570\t1570']
06/26/2023 19:00:50;0100;pbs_python;Hook;pbs_python;_kill_tasks: PID 37447 survived: ['Name:\tstandard', 'State:\tD (disk sleep)', 'Uid:\t1570\t1570\t1570\t1570']
06/26/2023 19:00:50;0002;pbs_python;Hook;pbs_python;cgroup still has 15 tasks: /sys/fs/cgroup/cpuset/pbs_jobs.service/jobid/16033.hpc-p-fea-pbs1
06/26/2023 19:00:50;0100;pbs_python;Hook;pbs_python;delete: Unable to delete cgroup for job 16033.hpc-p-fea-pbs1
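
For context, the "cgroup still has 15 tasks" message above seems to come from the hook counting the PIDs still listed in the cgroup's tasks file. Roughly this check (a sketch only, using the cgroup v1 paths from our logs, not the hook's actual code):

```python
# Rough sketch of the check behind "cgroup still has N tasks" (cgroup v1
# layout and paths taken from the logs above; not the hook's real code).
import os

def remaining_tasks(cgroup_dir):
    """Return the PIDs still listed in the cgroup's tasks file."""
    tasks_file = os.path.join(cgroup_dir, 'tasks')
    try:
        with open(tasks_file) as fd:
            return [int(line) for line in fd if line.strip()]
    except (IOError, OSError):
        return []

pids = remaining_tasks(
    '/sys/fs/cgroup/cpuset/pbs_jobs.service/jobid/16033.hpc-p-fea-pbs1')
print('cgroup still has %d tasks: %s' % (len(pids), pids))
```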

PBS attempted to take the node offline:
06/26/2023 19:00:50;0100;pbs_python;Hook;pbs_python;take_node_offline: Taking vnode(s) offline
06/26/2023 19:00:50;0100;pbs_python;Hook;pbs_python;take_node_offline: Hook pbs_cgroups: Unable to clean up one or more cgroups; offlining hpc-p-fea-n127
06/26/2023 19:00:50;0100;pbs_python;Hook;pbs_python;take_node_offline: Offline file already exists, not overwriting

It appears the processes must then have cleared:
06/26/2023 19:00:50;0100;pbs_python;Hook;pbs_python;Hook ended: pbs_cgroups, job ID 16033.hpc-p-fea-pbs1, event_type 512 (elapsed time: 10.7606)
06/26/2023 19:00:50;0008;pbs_mom;Job;16033.hpc-p-fea-pbs1;no active tasks

The node was then brought back online:

06/26/2023 19:02:25;0080;pbs_python;Hook;pbs_python;bring_node_online: Vnode hpc-p-fea-n127 will be brought back online

A new job then attempted to start on the node:
06/26/2023 19:06:28;0008;pbs_mom;Job;16033.hpc-p-fea-pbs1;no active tasks
06/26/2023 19:07:48;0100;pbs_mom;Req;;Type 1 request received from root@172.26.92.5:15001, sock=185
06/26/2023 19:07:49;0100;pbs_python;Hook;pbs_python;main: Event type is execjob_begin, job ID is 16041.hpc-p-fea-pbs1
06/26/2023 19:07:49;0100;pbs_python;Hook;pbs_python;create_job: Creating directory /sys/fs/cgroup/systemd/pbs_jobs.service/jobid/16041.hpc-p-fea-pbs1/
06/26/2023 19:07:49;0100;pbs_python;Hook;pbs_python;create_job: Creating directory /sys/fs/cgroup/cpu,cpuacct/pbs_jobs.service/jobid/16041.hpc-p-fea-pbs1/
06/26/2023 19:07:49;0100;pbs_python;Hook;pbs_python;create_job: Creating directory /sys/fs/cgroup/cpuset/pbs_jobs.service/jobid/16041.hpc-p-fea-pbs1/
06/26/2023 19:07:49;0100;pbs_python;Hook;pbs_python;create_job: Creating directory /sys/fs/cgroup/devices/pbs_jobs.service/jobid/16041.hpc-p-fea-pbs1/
06/26/2023 19:07:49;0080;pbs_python;Hook;pbs_python;_setup_subsys_devices: Entry not added to devices.allow: ['nvidiactl', 'rwm', '*']
06/26/2023 19:07:49;0080;pbs_python;Hook;pbs_python;_setup_subsys_devices: Entry not added to devices.allow: ['nvidia-uvm', 'rwm']
06/26/2023 19:07:49;0100;pbs_python;Hook;pbs_python;create_job: Creating directory /sys/fs/cgroup/memory/pbs_jobs.service/jobid/16041.hpc-p-fea-pbs1/
06/26/2023 19:07:49;0100;pbs_python;Hook;pbs_python;configure_job: mem not requested, assigning 134321225728 to cgroup
06/26/2023 19:07:49;0100;pbs_python;Hook;pbs_python;configure_job: vmem not requested, assigning 138756698112 to cgroup
06/26/2023 19:07:49;0100;pbs_python;Hook;pbs_python;configure_job: Failed to assign job resources
06/26/2023 19:07:49;0100;pbs_python;Hook;pbs_python;configure_job: Resyncing local job data
06/26/2023 19:07:49;0100;pbs_python;Hook;pbs_python;<function caller_name at 0x147721ae4950>: Assignment of resources failed for 16041.hpc-p-fea-pbs1, attempting cleanup
06/26/2023 19:07:49;0100;pbs_python;Hook;pbs_python;cleanup_orphans: Removing orphaned cgroup: /sys/fs/cgroup/systemd/pbs_jobs.service/jobid/16041.hpc-p-fea-pbs1.orphan
06/26/2023 19:07:49;0100;pbs_python;Hook;pbs_python;_remove_cgroup: Removing directory /sys/fs/cgroup/systemd/pbs_jobs.service/jobid/16041.hpc-p-fea-pbs1.orphan
06/26/2023 19:07:49;0100;pbs_python;Hook;pbs_python;cleanup_orphans: Removing orphaned cgroup: /sys/fs/cgroup/cpu,cpuacct/pbs_jobs.service/jobid/16041.hpc-p-fea-pbs1.orphan
06/26/2023 19:07:49;0100;pbs_python;Hook;pbs_python;_remove_cgroup: Removing directory /sys/fs/cgroup/cpu,cpuacct/pbs_jobs.service/jobid/16041.hpc-p-fea-pbs1.orphan
06/26/2023 19:07:49;0100;pbs_python;Hook;pbs_python;cleanup_orphans: Removing orphaned cgroup: /sys/fs/cgroup/devices/pbs_jobs.service/jobid/16041.hpc-p-fea-pbs1.orphan
06/26/2023 19:07:49;0100;pbs_python;Hook;pbs_python;_remove_cgroup: Removing directory /sys/fs/cgroup/devices/pbs_jobs.service/jobid/16041.hpc-p-fea-pbs1.orphan
06/26/2023 19:07:49;0100;pbs_python;Hook;pbs_python;cleanup_orphans: Removing orphaned cgroup: /sys/fs/cgroup/memory/pbs_jobs.service/jobid/16041.hpc-p-fea-pbs1.orphan
06/26/2023 19:07:49;0100;pbs_python;Hook;pbs_python;_remove_cgroup: Removing directory /sys/fs/cgroup/memory/pbs_jobs.service/jobid/16041.hpc-p-fea-pbs1.orphan
06/26/2023 19:07:49;0100;pbs_python;Hook;pbs_python;cleanup_orphans: Removing orphaned cgroup: /sys/fs/cgroup/cpuset/pbs_jobs.service/jobid/16041.hpc-p-fea-pbs1.orphan
06/26/2023 19:07:49;0100;pbs_python;Hook;pbs_python;_remove_cgroup: Removing directory /sys/fs/cgroup/cpuset/pbs_jobs.service/jobid/16041.hpc-p-fea-pbs1.orphan
06/26/2023 19:07:49;0100;pbs_python;Hook;pbs_python;configure_job: Requeuing job 16041.hpc-p-fea-pbs1
06/26/2023 19:07:49;0080;pbs_python;Hook;pbs_python;['Traceback (most recent call last):', ' File "", line 6165, in main', ' File "", line 1016, in invoke_handler', ' File "", line 1065, in _execjob_begin_handler', ' File "", line 5059, in configure_job', 'CgroupProcessingError: Failed to assign resources']
06/26/2023 19:07:49;0001;pbs_python;Hook;pbs_python;Processing error in pbs_cgroups handling execjob_begin event for job 16041.hpc-p-fea-pbs1: CgroupProcessingError ('Failed to assign resources',)
06/26/2023 19:07:49;0100;pbs_python;Hook;pbs_python;Hook ended: pbs_cgroups, job ID 16041.hpc-p-fea-pbs1, event_type 64 (elapsed time: 0.5809)
06/26/2023 19:07:49;0100;pbs_mom;Hook;pbs_cgroups;execjob_begin request rejected by 'pbs_cgroups'
06/26/2023 19:07:49;0008;pbs_mom;Job;16041.hpc-p-fea-pbs1;Processing error in pbs_cgroups handling execjob_begin event for job 16041.hpc-p-fea-pbs1: CgroupProcessingError ('Failed to assign resources',)
06/26/2023 19:07:49;0008;pbs_mom;Job;16033.hpc-p-fea-pbs1;no active tasks

The resource assignment failed because job 16033 still has cgroup directories on the node, e.g.:
/sys/fs/cgroup/systemd/pbs_jobs.service/jobid/16033.hpc-p-fea-pbs1/
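
A quick way to see which controllers still hold the stale directory (a rough sketch; the controller list and the pbs_jobs.service prefix are taken from the log paths above and may differ on other setups):

```python
# List leftover per-job cgroup directories for a finished job (cgroup v1).
import os

JOBID = '16033.hpc-p-fea-pbs1'
CONTROLLERS = ['systemd', 'cpu,cpuacct', 'cpuset', 'devices', 'memory']

for ctl in CONTROLLERS:
    path = os.path.join('/sys/fs/cgroup', ctl, 'pbs_jobs.service/jobid', JOBID)
    if os.path.isdir(path):
        print('leftover cgroup: %s' % path)
```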

Does anyone know the best way to clean this up? Our guess is to restart the PBS service on the node.
More importantly, how do we stop it happening again in the future?
Is there some method to keep the node offline while the job's cgroup directories have not been cleaned up?

Many thanks.

Modify the cgroup hook to move the cgroups to .orphan (cf. cleanup_orphans) in execjob_end when cleanup fails. That will keep the node offline until the orphaned cgroup can be cleaned up.
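Something along these lines in the failure path (a sketch of the idea only; orphan_cgroup and cgroup_path are illustrative names, not the hook's own):

```python
# Sketch: when the per-job cgroup cannot be removed at execjob_end, rename it
# to <jobid>.orphan so cleanup_orphans can retry removal on later hook events.
import os

def orphan_cgroup(cgroup_path):
    cgroup_path = cgroup_path.rstrip('/')
    try:
        os.rmdir(cgroup_path)      # succeeds only if the cgroup is empty
    except OSError:
        # cgroupfs allows renaming a cgroup directory in place, even while it
        # still holds unkillable tasks; the .orphan suffix matches what
        # cleanup_orphans already looks for
        os.rename(cgroup_path, cgroup_path + '.orphan')
```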

But the root cause is a process hung in the kernel doing I/O (the unkillable processes are in state D, uninterruptible sleep). Filesystem client kernel modules are supposed to leave tasks in an interruptible state after some time; this one apparently does not. There is not much PBS can do about that.
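You can confirm that on the node by listing the uninterruptible processes and the kernel function they are blocked in, for example (a rough diagnostic sketch; /proc/PID/wchan may only show 0 on some kernels):

```python
# Print D-state (uninterruptible sleep) processes and what they are blocked on,
# which usually points at the misbehaving filesystem client module.
import os

for pid in filter(str.isdigit, os.listdir('/proc')):
    try:
        with open('/proc/%s/stat' % pid) as fd:
            # state is the first field after the parenthesised command name
            state = fd.read().rsplit(')', 1)[1].split()[0]
        if state != 'D':
            continue
        with open('/proc/%s/wchan' % pid) as fd:
            wchan = fd.read().strip() or '?'
        print('PID %s blocked in %s' % (pid, wchan))
    except (IOError, OSError, IndexError):
        continue  # process exited or stat/wchan was unreadable
```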

There is a flag in the cgroup hook's config file that controls whether nodes should be brought online again by the hook if it was the hook that offlined them. Set that to false and you will be able to sanity-check a node before putting it online again (if you do, it is best to also remove the file the hook creates to mark that it offlined the node; it is not hard to find in the mom_priv tree).
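For reference, in the hook versions I have seen the flag is named online_offlined_nodes; please verify against your own exported config before relying on the exact name:

```
qmgr -c "export hook pbs_cgroups application/x-config default" > pbs_cgroups.json
# set "online_offlined_nodes" : false in pbs_cgroups.json, then re-import:
qmgr -c "import hook pbs_cgroups application/x-config default pbs_cgroups.json"
```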