Singularity containers with PBS

Hi,

I’m running MPI applications across multiple nodes on an HPC cluster with InfiniBand, using SLURM and Singularity containers.

My typical SLURM command within an sbatch script is:

srun -n $SLURM_NTASKS --mpi=pmix singularity exec ...

This example uses PMIx (but it also works with PMI2) to handle non-ABI-compatible MPI variants between the host and container, relying solely on the container’s MPI and PMIx libraries for bitwise reproducibility (no host MPI bindings).
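For context, the full sbatch script follows this pattern (a minimal sketch; the node/task counts, the image name my_app.sif, and the application path are placeholders):

#!/bin/bash
#SBATCH --job-name=mpi_singularity
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=32

# Slurm provides the PMIx server on the host side; the container's own MPI
# (built with PMIx support) wires up through it, so no host MPI libraries
# are bind-mounted into the image
srun -n $SLURM_NTASKS --mpi=pmix \
    singularity exec my_app.sif /usr/local/bin/mpi_app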

What would be the equivalent command for PBS?

Additionally, do any of you already have experience with this, and if so, how did it perform compared to running the same MPI application directly on the host (without Singularity)?

Any insights, example scripts, or configuration tips (e.g., PBS directives, PMIx setup, InfiniBand optimization) would be greatly appreciated.

Thanks

Please check:

  • qsub -I (capital I, as in Ice; starts an interactive job) - check man qsub
  • pbsdsh (man page of pbsdsh below)
Intel MPI

#!/bin/bash
#PBS -N Intel_MPI
#PBS -l select=2:ncpus=1:mem=1gb:mpiprocs=1
#PBS -l place=scatter
#PBS -l walltime=10:00:00
cd $PBS_O_WORKDIR
# pbs.conf provides PBS_EXEC, used for the TM-based bootstrap below
source /etc/pbs.conf
# make Hydra start its remote helpers through PBS's rsh-compatible TM wrapper
export I_MPI_HYDRA_BOOTSTRAP=rsh
export I_MPI_HYDRA_BOOTSTRAP_EXEC=$PBS_EXEC/bin/pbs_tmrsh
export PATH=/intel/oneapi/mpi/2021.14/bin:$PATH
export LD_LIBRARY_PATH=/intel/oneapi/mpi/2021.14/lib:$LD_LIBRARY_PATH
# show the node file and its location
ls -l $PBS_NODEFILE
echo $PBS_NODEFILE
mpirun /bin/sleep 10
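For the container case in the original question, one possible (untested) adaptation is to make the Singularity invocation the mpirun payload; with a host-side Intel MPI launcher, the MPI inside the image would have to be ABI compatible with it. Image and application paths below are placeholders.

# host Intel MPI launches one container instance per rank;
# requires an ABI-compatible MPI inside the image
mpirun singularity exec --bind $PBS_O_WORKDIR my_app.sif /usr/local/bin/mpi_app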

OpenMPI (should be compiled from source with PBS TM libraries for tight integration)

#!/bin/bash
#PBS -N OpenMPI
#PBS -l select=2:ncpus=1:mpiprocs=1
#PBS -l place=scatter
cd $PBS_O_WORKDIR
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/openmpi507/lib

# with TM support compiled in, mpirun obtains the allocated vnodes from PBS directly
/openmpi507/bin/mpirun /bin/hostname
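For the container case, a hedged variant of the same script would again make the Singularity call the payload (image and application paths are placeholders). Whether an MPI inside the image that is not ABI compatible with the host Open MPI can still rendezvous with mpirun purely over PMIx, as with srun --mpi=pmix, is exactly the open question in this thread.

# TM-integrated mpirun reads the allocation from PBS and starts one
# container instance per rank
/openmpi507/bin/mpirun singularity exec --bind $PBS_O_WORKDIR my_app.sif /usr/local/bin/mpi_app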

pbsdsh(1B) - PBS Professional

NAME
       pbsdsh - distribute tasks to vnodes under PBS

SYNOPSIS
       pbsdsh [-c <copies>] [-s] [-v] [-o] -- <program> [<program args>]
       pbsdsh [-n <vnode index>] [-s] [-v] [-o] -- <program> [<program args>]
       pbsdsh --version

DESCRIPTION
       The pbsdsh command allows you to distribute and execute a task on each of the vnodes assigned to your job by executing (spawning) the application on each vnode.  The pbsdsh command uses the PBS Task
       Manager, or TM, to distribute the program on the allocated vnodes.

       When run without the -c or the -n option, pbsdsh will spawn the program on all vnodes allocated to the PBS job.  The spawns take place concurrently; all execute at (about) the same time.

       Note that the double dash must come after the options and before the program and arguments.  The double dash is only required for Linux.

       The pbsdsh command runs one task for each line in the $PBS_NODEFILE.  Each MPI rank gets a single line in the $PBS_NODEFILE, so if you are running multiple MPI ranks on the same host, you still  get
       multiple pbsdsh tasks on that host.

       Example
       The  following  example  shows  the pbsdsh command inside of a PBS batch job. The options indicate that the user wants pbsdsh to run the myapp program with one argument (app-arg1) on all four vnodes
       allocated to the job (i.e. the default behavior).

            #!/bin/sh
            #PBS -l select=4:ncpus=1
            #PBS -l walltime=1:00:00

            pbsdsh ./myapp app-arg1

OPTIONS
       -c copies
              The program is spawned copies times on the vnodes allocated, one per vnode, unless copies is greater than the number of vnodes.  If copies is greater than  the  number  of  vnodes,  it  wraps
              around, running multiple instances on some vnodes.  This option is mutually exclusive with -n.

       -n <vnode index>
               The program is spawned only on a single vnode: the <vnode index>-th vnode allocated.  This option is mutually exclusive with -c.

       -o     No obit request is made for spawned tasks.  The program does not wait for the tasks to finish.

       -s     The program is run in turn on each vnode, one after the other.

       -v     Produces verbose output about error conditions and task exit status.

       --version
               The pbsdsh command returns its PBS version information and exits.  This option can only be used alone.

OPERANDS
       program
               The first operand, program, is the program to execute.  The double dash must precede the program under Linux.

       program args
               Additional operands, program args, are passed as arguments to the program.

STANDARD ERROR
       The pbsdsh command writes a diagnostic message to standard error for each error occurrence.

SEE ALSO
       qsub(1B), tm(3).

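As an illustration of how pbsdsh could drive Singularity (image and application names are placeholders), the script below starts one container instance per vnode. Note that pbsdsh only spawns tasks through TM and does not set up a PMI/PMIx environment, so the instances run independently rather than as wired-up MPI ranks:

#!/bin/sh
#PBS -l select=4:ncpus=1
#PBS -l walltime=1:00:00

# one container per vnode, e.g. for per-node setup or preprocessing
pbsdsh -- singularity exec /path/to/my_app.sif /usr/local/bin/per_node_task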

Thanks @adarsh, but I do not have access to a machine with PBS at the moment. I am just trying to figure out how to “transpose” what I did with SLURM and containers so as to leverage the PMI (process management interface), if that is possible at all, and not merely submit massively parallel independent jobs: there has to be message passing between the processes.

Please try this.
The srun --mpi=pmix equivalent is mpiexec -np $NP singularity exec:

#PBS -N singularity_mpi
#PBS -l select=1:ncpus=32:mpiprocs=32:mem=32gb

# one MPI rank per line in $PBS_NODEFILE
NP=`cat $PBS_NODEFILE | wc -l`
mpiexec -np $NP singularity exec <image.sif> <mpi_program>
or
IMAGE=my.sif
CMD=/usr/local/bin/mpiprog
EXPORT=/myproject
mpirun --hostfile $PBS_NODEFILE -np $NP env LD_LIBRARY_PATH=$LD_LIBRARY_PATH PATH=$PATH singularity exec --bind $PBS_O_WORKDIR:$PBS_O_WORKDIR --bind $EXPORT:$EXPORT $IMAGE $CMD

If I misunderstood your question, please clarify.

Hi again,

With mpiexec -np $NP singularity exec ..., the MPI on the host and the MPI inside the container have to be ABI compatible, otherwise it fails. With srun --mpi=pmix singularity ..., this is not necessary: they only have to speak the same process management interface (PMI2, PMIx, etc.).
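A quick way to check which situation applies (reusing the image and binary names from the example above, which are placeholders) is to compare the two MPI stacks before launching anything:

# MPI on the host
mpiexec --version
# MPI inside the container
singularity exec my.sif mpiexec --version
# and the MPI library the containerized binary is actually linked against
singularity exec my.sif ldd /usr/local/bin/mpiprog | grep -i libmpi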

Does it make any sense?

Thank you @JeanI

OpenPBS uses mpiexec as the default launcher. There is no command-line interface similar to srun --mpi=pmix.

These are the options to look at.