GPU mapping in PBS

I am running an MPI + CUDA HPC code on a system using Open PBS with multiple nodes, each node having 8 NVIDIA GPUs.

On a SLURM cluster, I can use “pinning” to bind each MPI rank to the CPU cores closest to the GPU it uses.

For simplicity, let’s assume we have a node with 4 GPUs and 16 CPUs (or cores), and we want to pin 4 MPI tasks such that each task is associated with one GPU and 4 cores that are closest to it. Here’s a simplified version of how I might go about doing it:

#SBATCH --nodes=1
#SBATCH --ntasks=4
#SBATCH --cpus-per-task=4
#SBATCH --gres=gpu:4
# --cpu-bind is an srun option, so the binding flags go on the launch line
srun --cpu-bind=cores --gpu-bind=map_gpu:0,1,2,3 ./my_application

An alternative approach (which however does not minimize CPU-GPU latency) is to use CUDA_VISIBLE_DEVICES, like so:

#SBATCH --ntasks=4
#SBATCH --gres=gpu:4

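# Set the variable inside each rank; $SLURM_LOCALID on the mpirun line is expanded once, by the batch shell
mpirun -np 4 bash -c 'export CUDA_VISIBLE_DEVICES=$OMPI_COMM_WORLD_LOCAL_RANK; exec ./my_application'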

How can I achieve the same CPU-GPU pinning with PBS, so that the code runs as fast as possible?

Use the cgroup hook and enable vnode_per_numa_node; this makes the scheduler aware of the node topology. If a job spans more than one socket, however, you still have to work out which process to pin where and which GPU each process should use.
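For the administrator side, a rough sketch of switching that option on could look like the following. It assumes the stock hook name pbs_cgroups and a config file that contains a "vnode_per_numa_node" : false entry; the function names and the pbs_cgroups*.json file names are made up for illustration, and the qmgr calls need manager rights on the server host.

```shell
#!/bin/sh
# Sketch for a PBS administrator; nothing runs until a function is called.

# Print the exported hook config with "vnode_per_numa_node" flipped to true.
enable_numa_vnodes() {
    sed 's/\("vnode_per_numa_node"[[:space:]]*:[[:space:]]*\)false/\1true/' "$1"
}

# Export the current config, patch it, import it back and enable the hook.
reconfigure_cgroup_hook() {
    qmgr -c "export hook pbs_cgroups application/x-config default" > pbs_cgroups.json &&
    enable_numa_vnodes pbs_cgroups.json > pbs_cgroups.new.json &&
    qmgr -c "import hook pbs_cgroups application/x-config default pbs_cgroups.new.json" &&
    qmgr -c "set hook pbs_cgroups enabled = true"
}

# Review the patched file first, then call reconfigure_cgroup_hook by hand.
```

After the import, the MoM on each node usually has to be restarted (or sent a HUP) before the per-NUMA vnodes show up in pbsnodes -av.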

Thank you for the information!
There are indeed 2 sockets per node, 4 GPUs per socket. Do you think you could maybe provide some sample code to make this clearer?
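
Not a definitive recipe, but here is a minimal sketch of a per-rank wrapper for that layout. It assumes Open MPI (which exports OMPI_COMM_WORLD_LOCAL_RANK), that numactl is installed, that each socket is exactly one NUMA node, and that GPUs 0-3 hang off socket 0 and GPUs 4-7 off socket 1; check the real layout with nvidia-smi topo -m. The script name gpu_bind.sh, the GPUS_PER_SOCKET variable and the numa_of_gpu helper are names I made up.

```shell
#!/bin/bash
# gpu_bind.sh: hypothetical wrapper, started once per MPI rank, e.g.
#   mpirun -np 8 --bind-to none ./gpu_bind.sh ./my_application
# Assumption: GPUs are numbered socket by socket (0-3 on NUMA node 0,
# 4-7 on NUMA node 1).
GPUS_PER_SOCKET=${GPUS_PER_SOCKET:-4}

# Print the NUMA node assumed to host the given GPU index.
numa_of_gpu() {
    echo $(( $1 / GPUS_PER_SOCKET ))
}

# Node-local rank as exported by Open MPI; falls back to 0 outside MPI.
local_rank=${OMPI_COMM_WORLD_LOCAL_RANK:-0}

# One GPU per rank: local rank N uses GPU N.
gpu=$local_rank
numa=$(numa_of_gpu "$gpu")

if [ "$#" -gt 0 ]; then
    export CUDA_VISIBLE_DEVICES=$gpu
    # Keep the rank's CPUs and memory on the socket that hosts its GPU.
    exec numactl --cpunodebind="$numa" --membind="$numa" "$@"
fi
```

In the job script you would request the whole node, something like #PBS -l select=1:ncpus=32:ngpus=8:mpiprocs=8 (ncpus=32 is a placeholder, and ngpus is only the conventional name for the site-defined GPU resource), and launch with the mpirun line from the header comment. The --bind-to none flag is there so that Open MPI's own binding does not conflict with numactl. If the cgroup hook already restricts the job to a subset of GPUs, the device indices seen inside the job may be renumbered, so treat the 0-7 mapping as an assumption to verify.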