Does anyone know of a way to track per-job GPU usage via the accounting records? I had hoped that would happen automatically with the cgroups hook, but I see nothing in the accounting records. If it should and I am just missing a config setting, I would love to hear about it, but I suspect it won't be that easy.
If not, has anyone written a hook to do this? If so, I am curious how you went about it. Right now I am thinking we create a custom resource like gpu_pct, run an exechost_periodic hook to gather the data via our tool of choice, and then set the value on each mom during the execjob_epilogue hook. Per section 5.2.4.12 of the 2022.1 manual, the numeric values should be summed over a multi-node job.
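To make the summing behaviour concrete, here is a small sketch of the arithmetic, assuming the per-mom values really are added together as the manual section is read above. The function names are made up for illustration and are not part of PBS.

```python
def accounted_total(per_mom_values):
    """Value the accounting record would show if each mom's number is summed."""
    return sum(per_mom_values)


def mean_per_node(total, node_count):
    """Recover a per-node average from the summed value during post-processing."""
    return total / node_count


# Two moms reporting 72 and 65 would give gpu_pct=137 under this assumption,
# so a percent-style resource needs dividing by the node count afterwards.
total = accounted_total([72, 65])
average = mean_per_node(total, 2)
```

If that reading is right, a summed percentage can exceed 100 on multi-node jobs, which is worth keeping in mind when choosing what each mom reports.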
Does that sound reasonable? Any better ideas?
We are using NVIDIA GPUs, so we will likely use nvidia-smi or DCGM. Does anybody know of a generic tool that works across all or most GPUs, to make this more general?
Altair does provide a DCGM hook that supposedly has this capability for NVIDIA devices, although we haven't had success getting it working on 2021.1.3. I am told better support for it is coming in 2023.
If I do it, job usage will be gpu percent utilization from nvidia-smi.
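A minimal sketch of pulling that number from nvidia-smi, assuming the `--query-gpu=utilization.gpu --format=csv,noheader,nounits` output form (one integer percent per GPU per line). The function names are hypothetical, and this samples every GPU on the node rather than only those assigned to a job.

```python
import subprocess

# nvidia-smi query that prints one utilization value per GPU, no header or units.
QUERY = ["nvidia-smi", "--query-gpu=utilization.gpu", "--format=csv,noheader,nounits"]


def parse_gpu_utilization(text):
    """Return one integer percent per GPU from the CSV output."""
    values = []
    for line in text.splitlines():
        line = line.strip()
        if line.isdigit():  # skips blank lines and entries such as "[N/A]"
            values.append(int(line))
    return values


def mean_gpu_pct(values):
    """Average utilization across GPUs, or 0 when nothing was parsed."""
    return round(sum(values) / len(values)) if values else 0


def sample_gpu_pct():
    """Run nvidia-smi once and return the node's average GPU utilization."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return mean_gpu_pct(parse_gpu_utilization(out))
```

A single sample only reflects the instant it was taken, so the periodic hook would presumably need to collect several of these and average them over the job's lifetime.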
If I understand the manual correctly, I can use the execjob_epilogue hook to set the resource I created (gpu_pct in my example), and it should be reported in the accounting record. So I do:
pbs.event().job.resources_used["gpu_pct"] = 72
I would then expect to see something like this in the accounting log:
resources_used.gpu_pct=72
I might not have the syntax exactly right, but you get the idea.
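For what it's worth, here is a rough sketch of how the epilogue side might be organised. It assumes that the job's resources_used attribute is what feeds the resources_used.gpu_pct entry in the accounting log, and that gpu_pct has already been created as a custom resource. The helper name record_gpu_pct is made up. It takes the event as an argument so it can be tried outside of PBS.

```python
JOB_RESOURCE = "gpu_pct"  # custom resource name used in this thread


def record_gpu_pct(event, value):
    """Store value on the job's resources_used and return what was stored."""
    event.job.resources_used[JOB_RESOURCE] = int(value)
    return event.job.resources_used[JOB_RESOURCE]


# Inside an actual execjob_epilogue hook the call would look roughly like:
#   import pbs
#   e = pbs.event()
#   record_gpu_pct(e, 72)
#   e.accept()
```

Keeping the assignment in a small function makes it easy to swap the hard-coded 72 for whatever the periodic sampling produces.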