Error while implementing cgroups hook - "Execution server rejected request"

Hello all,

We’ve been trying to use the cgroups hook to do resource management on our GPU nodes. However, since implementing the hook, the GPU nodes have been rejecting jobs with the comment “Not Running: PBS Error: Execution server rejected request”. Each job is then re-queued until the re-run limit is reached, at which point the job is held.

The MoM logs for this job:

10/31/2019 14:15:18;0080;pbs_mom;Hook;pbs_cgroups.HK;copy hook-related file request received
10/31/2019 14:15:19;0100;pbs_python;Hook;pbs_python;_get_vnode_type: Failed to read vntype file /cm/local/apps/pbspro-ce/var/spool/mom_priv/vntype
10/31/2019 14:15:19;0100;pbs_python;Hook;pbs_python;_get_vnode_type: Could not determine vntype
10/31/2019 14:15:19;0080;pbs_python;Hook;pbs_python;Failed to open cgroup_jobs file.
10/31/2019 14:15:19;0080;pbs_python;Hook;pbs_python;Elapsed time: 0.0075

A “vntype” file does not exist at the location the hook is looking in, but I’m not sure that is even what is causing this error, since the documentation seems to suggest the vntype parameter is only needed for Cray nodes. Similarly, from reading the hook script, the cgroup_jobs file appears to be written by the hook itself as a temporary store for job IDs, so it seems strange that the script is unable to open the file.
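
For reference, something like the following, run on the node as root, would show whether either file exists at all; the vntype path is the one from the log above, but the cgroup_jobs location is only my guess from skimming the hook, so it may not be where the hook actually keeps it:

    # Minimal sketch (not taken from the hook): confirm whether the vntype file
    # exists and whether the cgroup_jobs file is present and writable. The
    # vntype path is copied from the MoM log above; the cgroup_jobs directory
    # is a guess, so adjust it to wherever your copy of pbs_cgroups.PY
    # actually writes that file.
    import os

    mom_priv = "/cm/local/apps/pbspro-ce/var/spool/mom_priv"
    candidates = [
        os.path.join(mom_priv, "vntype"),
        # assumed location of the hook's temporary job-ID store
        os.path.join(mom_priv, "hooks", "hook_data", "cgroup_jobs"),
    ]

    for path in candidates:
        if os.path.exists(path):
            print("%s exists (readable=%s, writable=%s)"
                  % (path, os.access(path, os.R_OK), os.access(path, os.W_OK)))
        else:
            print("%s is missing" % path)

    print("mom_priv writable by this user: %s" % os.access(mom_priv, os.W_OK))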

Does anyone know what might be causing this error? Any suggestions would be greatly appreciated.

I can provide more information as needed; I did not want to bombard the post with information that may be extraneous.

After increasing the verbosity of the logging, I have the following MoM log:

10/31/2019 17:06:11;0100;pbs_mom;Req;;Type 1 request received from root@10.30.255.254:15001, sock=1
10/31/2019 17:06:11;0100;pbs_python;Hook;pbs_python;_get_vnode_type: Failed to read vntype file /cm/local/apps/pbspro-ce/var/spool/mom_priv/vntype
10/31/2019 17:06:11;0100;pbs_python;Hook;pbs_python;_get_vnode_type: Could not determine vntype
10/31/2019 17:06:15;0080;pbs_python;Hook;pbs_python;Failed to open cgroup_jobs file.
10/31/2019 17:06:15;0100;pbs_python;Hook;pbs_python;create_job: Creating directory /sys/fs/cgroup/memory/pbspro/323225.bright01-thx/
10/31/2019 17:06:15;0100;pbs_python;Hook;pbs_python;create_job: Creating directory /sys/fs/cgroup/cpuset/pbspro/323225.bright01-thx/
10/31/2019 17:06:15;0100;pbs_python;Hook;pbs_python;create_job: Creating directory /sys/fs/cgroup/cpu,cpuacct/pbspro/323225.bright01-thx/
10/31/2019 17:06:15;0100;pbs_python;Hook;pbs_python;create_job: Creating directory /sys/fs/cgroup/devices/pbspro/323225.bright01-thx/
10/31/2019 17:06:15;0080;pbs_python;Hook;pbs_python;_setup_subsys_devices: Entry not added to devices.allow: ['nvidia-uvm', 'rwm']
10/31/2019 17:06:15;0002;pbs_python;Hook;pbs_python;configure_job: WARNING: mem_avail > vmem_avail
10/31/2019 17:06:15;0002;pbs_python;Hook;pbs_python;configure_job: Check reserve_amount and reserve_percent
10/31/2019 17:06:15;0100;pbs_python;Hook;pbs_python;configure_job: vmem not requested, assigning 25165824k to cgroup
10/31/2019 17:06:15;0100;pbs_python;Hook;pbs_python;configure_job: WARNING: vmem is enabled in the hook configuration file and should also be listed in the resources line of the scheduler configuration file
10/31/2019 17:06:15;0100;pbs_python;Hook;pbs_python;Failed to assign job resources
10/31/2019 17:06:15;0100;pbs_python;Hook;pbs_python;Resyncing local job data
10/31/2019 17:06:15;0080;pbs_python;Hook;pbs_python;Failed to open cgroup_jobs file.
10/31/2019 17:06:15;0100;pbs_python;Hook;pbs_python;cleanup_orphans: Removing orphaned cgroup: /sys/fs/cgroup/memory/pbspro/323225.bright01-thx-orphan
10/31/2019 17:06:15;0100;pbs_python;Hook;pbs_python;_remove_cgroup: Removing directory /sys/fs/cgroup/memory/pbspro/323225.bright01-thx-orphan
10/31/2019 17:06:15;0100;pbs_python;Hook;pbs_python;cleanup_orphans: Removing orphaned cgroup: /sys/fs/cgroup/cpuset/pbspro/323225.bright01-thx-orphan
10/31/2019 17:06:15;0100;pbs_python;Hook;pbs_python;_remove_cgroup: Removing directory /sys/fs/cgroup/cpuset/pbspro/323225.bright01-thx-orphan
10/31/2019 17:06:15;0100;pbs_python;Hook;pbs_python;cleanup_orphans: Removing orphaned cgroup: /sys/fs/cgroup/cpu,cpuacct/pbspro/323225.bright01-thx-orphan
10/31/2019 17:06:15;0100;pbs_python;Hook;pbs_python;_remove_cgroup: Removing directory /sys/fs/cgroup/cpu,cpuacct/pbspro/323225.bright01-thx-orphan
10/31/2019 17:06:15;0100;pbs_python;Hook;pbs_python;cleanup_orphans: Removing orphaned cgroup: /sys/fs/cgroup/devices/pbspro/323225.bright01-thx-orphan
10/31/2019 17:06:15;0100;pbs_python;Hook;pbs_python;_remove_cgroup: Removing directory /sys/fs/cgroup/devices/pbspro/323225.bright01-thx-orphan
10/31/2019 17:06:16;0100;pbs_python;Hook;pbs_python;Failed to assign job resources
10/31/2019 17:06:16;0100;pbs_python;Hook;pbs_python;Resyncing local job data
10/31/2019 17:06:16;0080;pbs_python;Hook;pbs_python;Failed to open cgroup_jobs file.
10/31/2019 17:06:16;0100;pbs_python;Hook;pbs_python;Requeuing job 323225.bright01-thx
10/31/2019 17:06:16;0100;pbs_python;Hook;pbs_python;Run count for job 323225.bright01-thx: 22
10/31/2019 17:06:16;0080;pbs_python;Hook;pbs_python;Elapsed time: 4.8220
10/31/2019 17:06:16;0100;pbs_mom;Hook;pbs_cgroups;execjob_begin request rejected by 'pbs_cgroups'
10/31/2019 17:06:16;0008;pbs_mom;Job;323225.bright01-thx;Failed to assign resources
10/31/2019 17:06:16;0100;pbs_mom;Req;;Type 3 request received from root@10.30.255.254:15001, sock=1
10/31/2019 17:06:16;0080;pbs_mom;Req;req_reject;Reject reply code=15004, aux=0, type=3, from root@10.30.255.254:15001
10/31/2019 17:06:16;0100;pbs_mom;Req;;Type 5 request received from root@10.30.255.254:15001, sock=1
10/31/2019 17:06:16;0080;pbs_mom;Req;req_reject;Reject reply code=15001, aux=0, type=5, from root@10.30.255.254:15001

A couple of things stand out…

It looks like there may be some stray cgroups that have not been cleaned up. This can cause mom to reject incoming jobs on the basis of insufficient resources. With mom stopped, please check for leftover job directories under /sys/fs/cgroup/<subsystem>/pbspro/ (e.g. /sys/fs/cgroup/memory/pbspro/<jobid>). These should be cleaned up manually (as root).
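
Something along these lines (just a sketch, run as root with pbs_mom stopped) will list whatever is left behind and can optionally remove it; os.rmdir() on a cgroup directory only succeeds when it is empty, so nothing with live tasks can be removed by accident:

    # Sketch only: list, and optionally remove, leftover pbspro job cgroups.
    # Run as root with pbs_mom stopped. os.rmdir() on a cgroup directory only
    # succeeds when it has no tasks and no child cgroups, so live cgroups
    # cannot be removed by accident.
    import glob
    import os
    import sys

    remove = "--remove" in sys.argv[1:]

    # one pbspro tree per controller: memory, cpuset, cpu,cpuacct, devices, ...
    for jobdir in sorted(glob.glob("/sys/fs/cgroup/*/pbspro/*")):
        if not os.path.isdir(jobdir):
            continue
        print("leftover cgroup: %s" % jobdir)
        if remove:
            try:
                os.rmdir(jobdir)
                print("  removed")
            except OSError as err:
                print("  could not remove: %s" % err)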

It looks like you’re using the cgroups hook in conjunction with Bright Cluster Manager. There should be no problem with that, but I’ve seen some issues with host naming lead to unexpected results. Perhaps @scc would care to comment?

Thanks,

Mike

Hello, first, you can ignore the vntype messages, they are not causing your problem.

I happened to notice that your “Failed to open cgroup_jobs file.” error message contains a period at the end. That period was removed in this change: https://github.com/PBSPro/pbspro/pull/678, which was a VERY significant improvement to support Linux systems running systemd. Since that change there have also been major fixes around orphaned cgroup identification and removal (https://github.com/PBSPro/pbspro/pull/1148, though that is not yet in any tagged release).
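
If you want to confirm which vintage of the hook is actually installed, one rough way (assuming qmgr is on your PATH and you have manager privileges on the server host) is to export the installed hook body and look for that exact message:

    # Rough sketch: export the installed pbs_cgroups hook body via qmgr and
    # check whether it still contains the old message with the trailing
    # period. Assumes qmgr is on PATH and the caller has manager privileges.
    import subprocess

    out = subprocess.check_output(
        ["qmgr", "-c", "export hook pbs_cgroups application/x-python default"])
    text = out.decode("utf-8", "replace")

    if "Failed to open cgroup_jobs file." in text:
        print("Old hook: the message still has the trailing period")
    elif "Failed to open cgroup_jobs file" in text:
        print("Newer hook: the message is present without the trailing period")
    else:
        print("Message not found; the hook body may have changed substantially")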

I can’t point to a specific issue in the old hook that might be causing your problem, but given the large number of massive changes to the cgroups implementation I’d recommend upgrading to version 19.1.3. In addition, if you are feeling more adventurous, grabbing an even newer revision of pbs_cgroups.PY, .HK, and .CF might also help (specifically the second URL I cited above, though there are of course other improvements as well). This will be more difficult, though, as I am pretty sure the current master hook will not “just work” with 19.1.3 without modification, due to changes to handle newer hook events that have been added, Python 3 compatibility, etc.
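
Purely as a sketch of the import mechanics, and with assumed file names for whatever you save the newer hook and config as, the swap itself is done through qmgr on the server host (back up the currently installed hook body and config with “export hook” first):

    # Sketch of the import mechanics only; file names are assumptions.
    # Run on the server host with manager privileges, after backing up the
    # currently installed hook body and config with "export hook".
    import subprocess

    commands = [
        # new hook body
        "import hook pbs_cgroups application/x-python default pbs_cgroups.PY",
        # matching configuration file
        "import hook pbs_cgroups application/x-config default pbs_cgroups.CF",
    ]

    for cmd in commands:
        subprocess.check_call(["qmgr", "-c", cmd])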

I hope this helps.