From time to time, PBS offlines a node and leaves the note “offlined by hook ‘pbs_cgroups’ due to hook error” on it.
Running pbsnodes -c typically fixes it; sometimes I need to restart the pbs server on the affected node. The log there has entries like this:
12/02/2021 17:11:19;0001;pbs_mom;Svr;pbs_mom;run_hook, execv of /opt/pbs/bin/pbs_python resulted in nonzero exit status=-4

How do I properly debug this?
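
As a starting point for narrowing this down, failed hook runs like the one quoted can be pulled out of the log so their timestamps can be matched against jobs on the node. Below is a minimal Python sketch, not an official PBS tool. It assumes the semicolon-separated layout seen in the entry above; the field names and the function name parse_hook_failure are my own labels, chosen for illustration.

```python
import re

# Matches the message text of a failed hook run, as in the quoted log entry.
HOOK_FAIL = re.compile(
    r"run_hook, execv of (?P<path>\S+) resulted in nonzero exit status=(?P<status>-?\d+)"
)


def parse_hook_failure(line):
    """Return (timestamp, daemon, path, status) for a run_hook failure, else None."""
    # Assumed layout: five ';'-separated header fields, then the free-form message.
    parts = line.rstrip("\n").split(";", 5)
    if len(parts) < 6:
        return None
    timestamp, _event, daemon, _obj_type, _obj_name, message = parts
    match = HOOK_FAIL.search(message)
    if match is None:
        return None
    return timestamp, daemon, match.group("path"), int(match.group("status"))


# The entry quoted in this post, used as sample input.
sample = (
    "12/02/2021 17:11:19;0001;pbs_mom;Svr;pbs_mom;run_hook, execv of "
    "/opt/pbs/bin/pbs_python resulted in nonzero exit status=-4"
)
print(parse_hook_failure(sample))
```

Running this over every line of the log on the affected node gives a list of failure times and exit statuses to compare against job start and end times.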