Report GPU usage via the accounting records?

Does anyone know of a way to track job usage of the GPUs via the accounting records? I had hoped that would happen automatically with the cgroups hook, but I see nothing in the accounting records. If it should and I am just missing a config setting, I would love to hear about that, but I suspect it won't be that easy.

If not, has anyone written a hook to do this? If so, I am curious how you went about it. Right now I am thinking we create a custom resource like gpu_pct, run an exechost_periodic hook to gather the data via our tool of choice, and then set the value on each MoM during execjob_epilogue. Per section of the 2022.1 manual, the numeric values should be summed over a multi-node job.

Does that sound reasonable? Any better ideas?

We are using NVIDIA GPUs, so we will likely use nvidia-smi or DCGM. Does anybody know of a generic tool that works across all or most GPUs, to make this more general?
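For the nvidia-smi route, a minimal sketch of the sampling step might look like the following. It assumes Python 3.7+ and that nvidia-smi is on the PATH of the execution host. The function names parse_gpu_utilization and sample_gpu_utilization are hypothetical and are not part of PBS or any NVIDIA library.

```python
import subprocess


def parse_gpu_utilization(csv_text):
    """Parse 'index, utilization' CSV lines into {gpu_index: percent}.

    Expects the output of:
      nvidia-smi --query-gpu=index,utilization.gpu --format=csv,noheader,nounits
    Lines whose utilization is not numeric (e.g. "[N/A]") are skipped.
    """
    result = {}
    for line in csv_text.strip().splitlines():
        fields = [field.strip() for field in line.split(",")]
        if len(fields) != 2:
            continue
        try:
            result[int(fields[0])] = float(fields[1])
        except ValueError:
            # Utilization not reported for this GPU; ignore it.
            continue
    return result


def sample_gpu_utilization():
    """Run nvidia-smi once and return {gpu_index: percent} for this host."""
    completed = subprocess.run(
        [
            "nvidia-smi",
            "--query-gpu=index,utilization.gpu",
            "--format=csv,noheader,nounits",
        ],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_gpu_utilization(completed.stdout)
```

Keeping the parsing separate from the subprocess call means the parser can be checked on a machine without a GPU. Note that this reports utilization for every GPU on the host, not only the ones assigned to a given job.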



For my understanding only:

  1. Could you please explain what “job usage of the GPUs” means?

          *  Is it the time a job spent running on GPU cards?
          *  Is it GPU utilization (%) with respect to a job?
  2. What kind of accounting information would you like to store in the accounting logs?

  3. What kind of reporting would be helpful?

Altair does provide a DCGM hook that supposedly has this capability for NVIDIA devices (although we haven’t had success getting it working on 2021.1.3; I am told better support for it is coming in 2023).

  1. If I do it, job usage will be the GPU percent utilization reported by nvidia-smi.

  2. If I understand the manual correctly, I can use the execjob_epilogue hook to set the resource I created (gpu_pct in my example), and it should then be reported in the accounting record. So I do:

pbs.event().job.Resource_List["gpu_pct"] = 72

I would then expect to see something like this in the accounting log:


I might not have the syntax exactly right, but you get the idea.
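One thing to watch if the values really are summed across nodes: a raw percentage from each MoM would add up to more than 100 on a multi-node job. A quantity that sums cleanly, such as GPU-busy seconds, may be easier to interpret. A rough sketch follows; gpu_busy_seconds, the sample layout, and the sampling interval are all hypothetical, and nothing here is taken from the PBS manual.

```python
def gpu_busy_seconds(samples, interval_s):
    """Convert periodic utilization samples into GPU-busy seconds.

    samples    : list of {gpu_index: percent} dicts, one per periodic sample
    interval_s : seconds between samples (assumed constant)

    Each sample contributes (percent / 100) * interval_s per GPU, so the
    result can be summed across nodes without exceeding a meaningful range.
    """
    total = 0.0
    for sample in samples:
        for percent in sample.values():
            total += (percent / 100.0) * interval_s
    return total


# In an execjob_epilogue hook, the result would then be assigned to the
# custom resource on the job object obtained from pbs.event().job.
# Assumption to verify against the hooks guide: whether the value belongs
# under Resource_List or resources_used for it to reach the accounting log.
```

With one GPU at 50% and another at 100% for a single 60-second sample, that sample contributes 90 GPU-busy seconds.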

Is there any documentation about it? I don’t see it in $PBS_HOME/server_priv/hooks.

You can find it here:

Thanks for the link!