Report GPU usage via the accounting records?

weallcock · August 17, 2022, 1:12pm

Does anyone know of a way to track job usage of the GPUs via the accounting records? I had hoped that would happen automatically with the cgroups hook, but I see nothing in the accounting records. If it should and I am just missing a config setting, I would love to hear about that, but I suspect it won't be that easy.

If not, has anyone written a hook to do this? If so, I am curious how you went about it. Right now I am thinking we create a custom resource like gpu_pct, run an exechost_periodic hook to gather the data via our tool of choice, and then set the value on each mom in the execjob_epilogue hook. Per section 5.2.4.12 of the 2022.1 manual, the numeric values should be summed over a multi-node job.

Does that sound reasonable? Any better ideas?

We are using NVIDIA GPUs, so we will likely use nvidia-smi or DCGM. Does anybody know of a generic tool that works across all or most GPUs, to make this more general?
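For the nvidia-smi route, a minimal sketch of the sampling step might look like the following. It assumes the `--query-gpu=index,utilization.gpu` and `--format=csv,noheader,nounits` options of nvidia-smi; the function names `parse_utilization` and `sample_gpus` are made up for illustration and are not part of PBS or any NVIDIA tool.

```python
import subprocess

# Command line assumed for this sketch: one CSV line per GPU, e.g. "0, 72".
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,utilization.gpu",
    "--format=csv,noheader,nounits",
]


def parse_utilization(csv_text):
    """Return {gpu_index: utilization_percent} parsed from nvidia-smi CSV text."""
    usage = {}
    for line in csv_text.splitlines():
        line = line.strip()
        if not line:
            continue
        try:
            index, util = (field.strip() for field in line.split(",", 1))
            usage[int(index)] = int(util)
        except ValueError:
            # Skip lines that do not carry a numeric value (e.g. "[N/A]").
            continue
    return usage


def sample_gpus():
    """Run nvidia-smi once and return the parsed per-GPU utilization."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_utilization(result.stdout)
```

An exechost_periodic hook could call something like `sample_gpus()` on each run and store the samples per job. Mapping GPUs to jobs is not covered here.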

Thanks,

Bill

adarsh · August 17, 2022, 1:58pm

For my understanding only:

Could you please explain what you mean by “job usage of the GPUs”?

      *  Is it the time a job spent running on GPU cards?
      *  Is it GPU utilization (%) with respect to a job?

What kind of accounting information would you like to store in the accounting logs?
What kind of reporting would be helpful?

https://nvidia.custhelp.com/app/answers/detail/a_id/3751/~/useful-nvidia-smi-queries

wilshire · August 17, 2022, 8:31pm

Altair does provide a dcgm hook that supposedly has this capability for NVIDIA devices (although we haven’t had success getting it working on 2021.1.3; I am told better support for it is coming in 2023).

weallcock · August 17, 2022, 9:02pm

If I do it, job usage will be GPU percent utilization from nvidia-smi.
If I understand the manual correctly, I can set the resource I created (gpu_pct in my example) in the execjob_epilogue hook, and it should be reported in the accounting record. So I do:

pbs.event().job.resources_used["gpu_pct"] = 72

I would then expect to see something like this in the accounting log:

resources_used.gpu_pct=72

I might not have the syntax exactly right, but you get the idea.
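To make the idea concrete, here is a rough sketch of what the epilogue side could look like. It assumes the value has to go into the job's `resources_used` (not `Resource_List`) to appear as `resources_used.gpu_pct`, and that gpu_pct has already been created as a custom resource. The helpers `mean_pct`, `load_samples` and `record_gpu_pct`, and the sample file path, are hypothetical names invented for this example.

```python
def mean_pct(samples):
    """Average a list of utilization samples; 0.0 when there are none."""
    return sum(samples) / len(samples) if samples else 0.0


def load_samples(path):
    """Read one integer sample per line from a file (hypothetical format)."""
    try:
        with open(path) as handle:
            return [int(line) for line in handle if line.strip()]
    except OSError:
        return []


def record_gpu_pct(job, samples, resource="gpu_pct"):
    """Store the rounded mean utilization in job.resources_used."""
    value = int(round(mean_pct(samples)))
    job.resources_used[resource] = value
    return value


try:
    import pbs  # only importable when run by PBS as a hook
except ImportError:
    pbs = None

if pbs is not None:
    event = pbs.event()
    # Hypothetical location where a periodic hook left this job's samples.
    samples = load_samples("/var/tmp/gpu_pct.%s" % event.job.id)
    record_gpu_pct(event.job, samples)
```

Since the per-mom values are summed for a multi-node job, the figure in the accounting log would be a total across nodes, not an average.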

weallcock · August 17, 2022, 9:46pm

Is there any documentation about the dcgm hook? I don’t see it in $PBS_HOME/server_priv/hooks.

wilshire · August 18, 2022, 4:10pm

You can find it here:

weallcock · August 18, 2022, 4:30pm

Thanks for the link!
