Does anyone know of a way to track job usage of the GPUs via the accounting records? I had hoped that would happen automatically with the cgroups hook, but I see nothing in the accounting records. If it should and I am just missing a config setting, I would love to hear about that, but I suspect it won't be that easy.
If not, has anyone written a hook to do this? If so, I am curious how you went about it. Right now I am thinking we create a custom resource like gpu_pct, run an exechost_periodic hook to gather the data via our tool of choice, and then set the value on each MoM during execjob_epilogue. Per section 126.96.36.199 of the 2022.1 manual, the numeric values should be summed across the nodes of a multi-node job.
Does that sound reasonable? Any better ideas?
We are using NVIDIA GPUs, so we will likely use nvidia-smi or DCGM. Does anybody know of a generic tool that works across all or most GPUs, to make this more general?
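For the data-gathering step, a minimal sketch of what the periodic sampling could look like with nvidia-smi. The function names are hypothetical, and this reports utilization per GPU on the host, not per job; mapping GPUs to jobs is not covered here.

```python
import subprocess

# nvidia-smi query flags: prints one utilization percentage per GPU,
# one per line, with no header and no "%" unit.
NVIDIA_SMI_CMD = [
    "nvidia-smi",
    "--query-gpu=utilization.gpu",
    "--format=csv,noheader,nounits",
]


def parse_utilization(output):
    """Turn nvidia-smi output (one integer percent per line) into a list of ints."""
    values = []
    for line in output.splitlines():
        line = line.strip()
        if line.isdigit():  # skips blank lines and non-numeric values
            values.append(int(line))
    return values


def sample_gpus():
    """Run nvidia-smi once and return the per-GPU utilization percentages."""
    result = subprocess.run(
        NVIDIA_SMI_CMD, capture_output=True, text=True, check=True
    )
    return parse_utilization(result.stdout)
```

Keeping the parsing separate from the subprocess call means the parser can be checked on a machine without a GPU, and swapped out if a different tool is used later.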
For my understanding only:
Could you please explain what you mean by “job usage of the GPUs”?
* Is it the time a job spent running on GPU cards?
* GPU utilization (%) with respect to a job?
What kind of accounting information would you like to store in the accounting logs?
What kind of reporting would be helpful?
Altair does provide a DCGM hook that supposedly has this capability for NVIDIA devices (although we haven’t had success getting it working on 2021.1.3; I am told better support for it is coming in 2023).
If I do it, job usage will be gpu percent utilization from
If I understand the manual correctly, I can create a resource (gpu_pct in my example), set it in the execjob_epilogue hook, and it should then be reported in the accounting record. So I do:
pbs.event().job.Resource_List["gpu_pct"] = 72
I would then expect to see something like this in the accounting log:
I might not have the syntax exactly right, but you get the idea.
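To make the idea a bit more concrete, here is a rough sketch of the step that reduces the collected samples to the single number set in the epilogue. The helper names are hypothetical, and a plain dict stands in for the job's resource object so the snippet runs outside PBS; which job attribute actually ends up in the accounting record should be checked against the hooks guide.

```python
def mean_gpu_pct(samples):
    """Average utilization samples (percent) and round to an int; 0 if there are none."""
    if not samples:
        return 0
    return int(round(sum(samples) / len(samples)))


def record_gpu_pct(resources, samples, name="gpu_pct"):
    """Store the averaged value under `name` in a dict-like resource object."""
    resources[name] = mean_gpu_pct(samples)
    return resources[name]


# Stand-in for the hook call. Inside an execjob_epilogue hook the first
# argument would be the job's resource object (pbs.event().job.Resource_List
# in the line above) instead of a plain dict.
demo = {}
record_gpu_pct(demo, [60, 80, 76])  # demo is now {"gpu_pct": 72}
```

Each MoM would set its own averaged value, so a job-wide figure depends on the per-node numbers being summed as the manual describes.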
Is there any documentation about it? I don’t see it in $PBS_HOME/server_priv/hooks.