I am running an MPI + CUDA HPC code on an OpenPBS system with multiple nodes, each of which has 8 NVIDIA GPUs. On a SLURM cluster I can use "pinning" to bind each MPI rank to the CPU cores that are closest to its GPU.
For simplicity, assume a node with 4 GPUs and 16 CPU cores, and that we want to pin 4 MPI tasks so that each task is associated with one GPU and the 4 cores closest to it. Here is a simplified version of how I would do it with SLURM:
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=4
#SBATCH --cpus-per-task=4
#SBATCH --gres=gpu:4

# Bind each task to 4 cores and map task i to GPU i
srun --cpu-bind=cores --gpu-bind=map_gpu:0,1,2,3 ./my_application
An alternative approach (which, however, does not minimize CPU-GPU latency) is to use CUDA_VISIBLE_DEVICES, like so:
#!/bin/bash
#SBATCH --ntasks=4
#SBATCH --gres=gpu:4
# Single quotes so the local rank is expanded per process, not by the batch shell
mpirun -np 4 bash -c 'CUDA_VISIBLE_DEVICES=$OMPI_COMM_WORLD_LOCAL_RANK ./my_application'
How can I achieve the same CPU-GPU pinning under PBS so as to maximise the performance of the code?
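For reference, this is the kind of PBS skeleton I would expect to start from. It is only a minimal sketch: the ngpus resource name and the chunk layout are assumptions that depend on how the cluster is configured, and what is missing is precisely the part that binds each rank to the cores nearest its GPU.
#!/bin/bash
#PBS -l select=1:ncpus=16:mpiprocs=4:ngpus=4
#PBS -l walltime=01:00:00

cd $PBS_O_WORKDIR

# Each rank selects a GPU from its Open MPI local rank (as in the SLURM example above),
# but nothing here pins the rank to the cores closest to that GPU.
mpirun -np 4 bash -c 'CUDA_VISIBLE_DEVICES=$OMPI_COMM_WORLD_LOCAL_RANK ./my_application'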