We’ve been trying to use the cgroups hook to do resource management on our GPU nodes. However, since implementing the hook, the GPU nodes have been rejecting jobs with the comment “Not Running: PBS Error: Execution server rejected request”. Each job is then re-queued until the re-run limit is reached, at which point it is held.
The MoM logs for this job:
10/31/2019 14:15:18;0080;pbs_mom;Hook;pbs_cgroups.HK;copy hook-related file request received
10/31/2019 14:15:19;0100;pbs_python;Hook;pbs_python;_get_vnode_type: Failed to read vntype file /cm/local/apps/pbspro-ce/var/spool/mom_priv/vntype
10/31/2019 14:15:19;0100;pbs_python;Hook;pbs_python;_get_vnode_type: Could not determine vntype
10/31/2019 14:15:19;0080;pbs_python;Hook;pbs_python;Failed to open cgroup_jobs file.
10/31/2019 14:15:19;0080;pbs_python;Hook;pbs_python;Elapsed time: 0.0075
There is no “vntype” file at the location the hook is checking, but I’m not sure that is even what is causing this error, since the documentation seems to suggest the vntype parameter is only necessary for Cray nodes. Similarly, from reading the hook script, it seems that the cgroup_jobs file is written as a temporary store for job IDs, so it seems strange that the script is unable to open the file.
Does anyone know how this error is coming about? Any suggestions would be greatly appreciated.
I can provide more information as needed; I did not want to bombard the post with information that may be extraneous.
It looks like there may be some stray cgroups that have not been cleaned up. This can cause mom to reject incoming jobs on the basis of there being insufficient resources. Please check for the existence of /sys/fs/cgroup//pbspro/ directories when mom is stopped. These should be cleaned up manually (as root).
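If it helps, here is a rough, read-only sketch of that check in Python. It assumes the empty component in the path above stands for a per-controller directory (memory, cpuset, and so on), which is my reading and not something stated in the post. The function name and its parameters are made up for illustration. It only lists directories and removes nothing, so any cleanup is still a manual step as root.

```python
import os


def find_stray_pbs_cgroups(cgroup_root="/sys/fs/cgroup", name="pbspro"):
    """List <cgroup_root>/<controller>/<name> directories and their children.

    Assumption: one directory per cgroup controller sits directly under
    cgroup_root, and PBS creates a directory called `name` inside each.
    Returns a list of (path, [child directory names]) tuples.
    """
    found = []
    if not os.path.isdir(cgroup_root):
        return found
    for controller in sorted(os.listdir(cgroup_root)):
        candidate = os.path.join(cgroup_root, controller, name)
        # isdir() follows symlinks, so aliased controllers may show up twice.
        if os.path.isdir(candidate):
            children = sorted(
                entry for entry in os.listdir(candidate)
                if os.path.isdir(os.path.join(candidate, entry))
            )
            found.append((candidate, children))
    return found


if __name__ == "__main__":
    # Run this while mom is stopped; anything printed is a cleanup candidate.
    for path, children in find_stray_pbs_cgroups():
        print(path)
        for child in children:
            print("  leftover:", child)
```

An empty result with mom stopped would suggest stray cgroups are not the cause.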
It looks like you’re using the cgroups hook in conjunction with Bright Cluster Manager. There should be no problem with this. However, I’ve seen some issues with host naming that have some unexpected results. Perhaps @scc would care to comment?
Hello. First, you can ignore the vntype messages; they are not causing your problem.
I happened to notice that your “Failed to open cgroup_jobs file.” error message contains a period at the end. That period was removed in this change: https://github.com/PBSPro/pbspro/pull/678, which was a VERY significant improvement to support Linux systems running systemd. Since that change there have also been major fixes surrounding orphaned cgroup identification and removal (https://github.com/PBSPro/pbspro/pull/1148, though that is not yet in any tagged release).
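For anyone who wants to run the same check against their own MoM logs, here is a small Python sketch. It assumes each log line is six semicolon-separated fields, as in the excerpt above. The field names and function names are my own labels, not anything defined by PBS. A match only hints that the installed hook predates the change I mentioned; it does not identify an exact version.

```python
OLD_MESSAGE = "Failed to open cgroup_jobs file."  # note the trailing period


def parse_mom_log_line(line):
    """Split one MoM log line into labeled fields, or return None.

    Assumed layout: timestamp;code;daemon;type;name;message
    (the labels are illustrative only).
    """
    parts = line.rstrip("\n").split(";", 5)
    if len(parts) != 6:
        return None
    keys = ("timestamp", "code", "daemon", "type", "name", "message")
    return dict(zip(keys, parts))


def has_old_style_message(lines):
    """True if any line carries the message with the trailing period."""
    for line in lines:
        record = parse_mom_log_line(line)
        if record is not None and record["message"].strip() == OLD_MESSAGE:
            return True
    return False


if __name__ == "__main__":
    sample = [
        "10/31/2019 14:15:19;0080;pbs_python;Hook;pbs_python;"
        "Failed to open cgroup_jobs file.",
    ]
    print(has_old_style_message(sample))
```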
I can’t point to a specific problem in the old hook that might be causing your problem, but given the number and size of the changes to the cgroups implementation, I’d recommend upgrading to version 19.1.3. In addition, if you are feeling more adventurous, grabbing an even newer revision of pbs_cgroups.PY, .HK, and .CF might also help (specifically the second URL I cited above, but there are of course other improvements). This will be more difficult, though, as I am pretty sure the current master hook will not “just work” with 19.1.3 without modification, due to changes that handle newer hook events, Python 3 compatibility, etc.