Nvidia MIG Support

From the NVIDIA documentation: “the new Multi-Instance GPU (MIG) feature allows the NVIDIA A100 GPU to be securely partitioned into up to seven separate GPU Instances for CUDA applications.”

We will support it through the cgroups hook.

Here’s the design: https://openpbs.atlassian.net/wiki/spaces/PD/pages/2313453569/Nvidia+MIG+Support

I added a bit more to the “how it works” section. @bayucan Since you have more experience with the cgroups hook, can you take a look?

@vstumpf It looks doable under the cgroups hook. I see that you also looked into CUDA_VISIBLE_DEVICES; you will likely need to add more to the “allow” devices list.
By the way, “MIG GPU” might be redundant in the doc as MIG is already defined as “MIG = Multi Instance GPU.”

Thanks Al! You’re right, MIG GPU is as redundant as PIN number or ATM machine. :slight_smile:

I’ll fix that.

Hey, I’ve updated how the hook specifies MIG device UUIDs in the CUDA_VISIBLE_DEVICES environment variable, since the old ‘tuple’ format is going out of support. The hook now gets the MIG UUIDs from the nvidia-smi -L command.
Here is the link to the PR
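
For anyone following along, here is a minimal sketch of the idea, not the code from the PR. It assumes that MIG device lines in the nvidia-smi -L output carry a “(UUID: MIG-...)” field. The function names and the sample listing are made up for illustration, and the exact listing layout may differ between driver versions.

```python
import re
import subprocess

# Matches the "(UUID: MIG-...)" field on a MIG device line.
# Assumption: physical GPU lines use a "GPU-" prefix and are skipped.
MIG_UUID_RE = re.compile(r"\(UUID:\s*(MIG-[^\s)]+)\)")


def parse_mig_uuids(listing):
    """Return the MIG device UUIDs in the order they appear in the listing."""
    uuids = []
    for line in listing.splitlines():
        match = MIG_UUID_RE.search(line)
        if match:
            uuids.append(match.group(1))
    return uuids


def cuda_visible_devices(uuids):
    """Build a comma-separated value for the CUDA_VISIBLE_DEVICES variable."""
    return ",".join(uuids)


def list_mig_uuids():
    """Run `nvidia-smi -L` on this host and parse its output (hypothetical helper)."""
    result = subprocess.run(
        ["nvidia-smi", "-L"], capture_output=True, text=True, check=True
    )
    return parse_mig_uuids(result.stdout)


# Illustrative listing with placeholder UUIDs, not captured from real hardware.
SAMPLE = """\
GPU 0: NVIDIA A100-SXM4-40GB (UUID: GPU-00000000-0000-0000-0000-000000000000)
  MIG 1g.5gb Device 0: (UUID: MIG-11111111-1111-1111-1111-111111111111)
  MIG 1g.5gb Device 1: (UUID: MIG-22222222-2222-2222-2222-222222222222)
"""

if __name__ == "__main__":
    print(cuda_visible_devices(parse_mig_uuids(SAMPLE)))
```

In a real hook, only the UUIDs of the MIG devices assigned to the job would be joined into the variable, rather than everything the listing reports.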