Job not running

Hi all,

The job runs fine with 48 cores on 1 node, and also with 24 cores on 2 nodes.
Each node has 48 cores.
When the same job is submitted on 2 nodes with 48 cores each, it does not run and fails with the error below.


There are not enough slots available in the system to satisfy the 96
slots that were requested by the application:

CitcomSFull

Either request fewer slots for your application, or make more slots
available for use.

A “slot” is the Open MPI term for an allocatable unit where we can
launch a process. The number of slots available are defined by the
environment in which Open MPI processes are run:

  1. Hostfile, via “slots=N” clauses (N defaults to number of
    processor cores if not provided)
  2. The --host command line parameter, via a “:N” suffix on the
    hostname (N defaults to 1 if not provided)
  3. Resource manager (e.g., SLURM, PBS/Torque, LSF, etc.)
  4. If none of a hostfile, the --host command line parameter, or an
    RM is present, Open MPI defaults to the number of processor cores

In all the above cases, if you want Open MPI to default to the number
of hardware threads instead of the number of processor cores, use the
--use-hwthread-cpus option.

Alternatively, you can use the --oversubscribe option to ignore the
number of available slots when deciding the number of processes to
launch.


The job script is below.

#!/bin/bash
#PBS -N testjyo
#PBS -l nodes=2:ppn=48
#PBS -l walltime=72:00:00
#PBS -q qreg_3day_small
#PBS -l mem=6gb
#PBS -e error_reunion.log
#PBS -o output_reunion.log

cd $PBS_O_WORKDIR

module load libs/gmt-4.5.18
module load codes/citcoms-3.3.1
module load compilers/openmpi/4.1.1

mpirun -np 96 --oversubscribe CitcomSFull reunion_E15_CT200.input

Any help with this issue would be appreciated.

Please try this script:

#!/bin/bash
#PBS -N testjyo
#PBS -l select=2:ncpus=48:mpiprocs=48:mem=6gb
#PBS -l walltime=72:00:00
#PBS -q qreg_3day_small
#PBS -e error_reunion.log
#PBS -o output_reunion.log
cd $PBS_O_WORKDIR
module load libs/gmt-4.5.18
module load codes/citcoms-3.3.1
module load compilers/openmpi/4.1.1
mpirun -np 96 --hostfile "$PBS_NODEFILE" --oversubscribe CitcomSFull reunion_E15_CT200.input
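
If the error persists, it may help to check how many slots PBS actually handed to the job before mpirun starts. The sketch below assumes that $PBS_NODEFILE contains one line per allocated MPI slot, which is the usual behaviour but worth confirming on your cluster. The function name count_slots is only an illustrative helper, not part of PBS or Open MPI.

```shell
#!/bin/bash
# Hypothetical helper: print how many lines (slots) each host has in a node file.
# Assumption: PBS writes one line per allocated MPI slot into $PBS_NODEFILE.
count_slots() {
    sort "$1" | uniq -c | awk '{ printf "%s slots=%d\n", $2, $1 }'
}

# Inside a job, print the per-host counts and the total number of slots.
if [ -n "${PBS_NODEFILE:-}" ] && [ -r "$PBS_NODEFILE" ]; then
    count_slots "$PBS_NODEFILE"
    awk 'END { print "total slots: " NR }' "$PBS_NODEFILE"
fi
```

Placed just before the mpirun line, this should report two hosts with 48 slots each and a total of 96 if the allocation matches the request. A smaller total would suggest the scheduler granted fewer slots than the -np value asks for.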

Also, see the forum thread "Openmpi support" (post #7 by adarsh) and check the other Open MPI topics in the forum.