Struggling to submit an ORCA (computational chemistry) parallel job to PBS Pro

Dear colleagues, how are you?
We were granted access to a large cluster to run our calculations. It runs SUSE Linux with PBS Pro managing the jobs, but there is a catch: the IT staff offer no support whatsoever for configuring any software.
The cluster offers several OpenMPI builds that can be loaded with “module load <option>”:

henriquecsj@service1:~/Co/SP> module load openmpi
openmpi/1.10.2/2016  openmpi/2.1.2/2018   openmpi-gnu/3.0.0    openmpi-gnu/4.0.1    openmpi-intel/3.0.0  openmpi-intel/4.0.1
openmpi/1.10.2/2017  openmpi-gnu          openmpi-gnu/3.1.2    openmpi-intel        openmpi-intel/3.1.2
openmpi/2.1.2/2017   openmpi-gnu/2.1.1    openmpi-gnu/4.0.0    openmpi-intel/2.1.1  openmpi-intel/4.0.0
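
For completeness, the module used in the script below is openmpi-gnu/3.1.2; a quick way to see what a given module sets and which mpirun ends up on the PATH (the module name is just the one from the listing above) is:

module show openmpi-gnu/3.1.2   # prints the PATH/LD_LIBRARY_PATH changes the module makes
module load openmpi-gnu/3.1.2
which mpirun                    # confirm which mpirun is picked up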

Based on the example script offered here in the forum, I was able to submit jobs.

#!/bin/bash
#PBS -l select=2:ncpus=48:ompthreads=24
#PBS -j oe
#PBS -V
#PBS -N OpenMP

# Usage of this script:
#qsub -N jobname job-orca.sh, where jobname is the name of your ORCA input file (jobname.inp) without the .inp extension

# Jobname below is set automatically when using "qsub -N jobname job-orca.sh ". Can alternatively be set manually here. Should be the name of the inputfile without extension (.inp or whatever).
export job=$PBS_JOBNAME

#Loading OPENMPI here:
module load openmpi-gnu/3.1.2

# Here giving the path to the ORCA binaries and giving communication protocol
export orcadir=/home/users/henriquecsj/bin/orca
export RSH_COMMAND="/usr/bin/ssh -x"
export PATH=$orcadir:$PATH

# Creating local scratch folder for the user on the computing node. /scratch directory must exist. 
if [ ! -d /scratch/60002a/$USER ]
then
  mkdir -p /scratch/60002a/$USER
fi
tdir=$(mktemp -d /scratch/60002a/$USER/orcajob__$PBS_JOBID-XXXX)

# Copy only the necessary files for ORCA from submit directory to scratch directory: inputfile, xyz-files, GBW-file etc.
# Add more here if needed.
cp $PBS_O_WORKDIR/*.inp $tdir/
cp $PBS_O_WORKDIR/*.gbw $tdir/
cp $PBS_O_WORKDIR/*.xyz $tdir/
cp $PBS_O_WORKDIR/*.hess $tdir/
cp $PBS_O_WORKDIR/*.pc $tdir/


# Creating nodefile in scratch
cat ${PBS_NODEFILE} > $tdir/$job.nodes

# cd to scratch
cd $tdir

# Copy job and node info to beginning of outputfile
echo "Job execution start: $(date)" >> $PBS_O_WORKDIR/$job.out
echo "Shared library path: $LD_LIBRARY_PATH" >> $PBS_O_WORKDIR/$job.out
echo "PBS Job ID is: ${PBS_JOBID}" >> $PBS_O_WORKDIR/$job.out
echo "PBS Job name is: ${PBS_JOBNAME}" >> $PBS_O_WORKDIR/$job.out
cat $PBS_NODEFILE >> $PBS_O_WORKDIR/$job.out

#Start ORCA job. ORCA is started using full pathname (necessary for parallel execution). Output file is written directly to submit directory on frontnode.
$orcadir/orca $tdir/$job.inp >> $PBS_O_WORKDIR/$job.out

# ORCA has finished here. Now copy important stuff back (xyz files, GBW files etc.). Add more here if needed.
cp $tdir/*.gbw $PBS_O_WORKDIR
cp $tdir/*.engrad $PBS_O_WORKDIR
cp $tdir/*.xyz $PBS_O_WORKDIR
cp $tdir/*.loc $PBS_O_WORKDIR
cp $tdir/*.qro $PBS_O_WORKDIR
cp $tdir/*.uno $PBS_O_WORKDIR
cp $tdir/*.unso $PBS_O_WORKDIR
cp $tdir/*.uco $PBS_O_WORKDIR
cp $tdir/*.hess $PBS_O_WORKDIR
cp $tdir/*.cis $PBS_O_WORKDIR
cp $tdir/*.dat $PBS_O_WORKDIR
cp $tdir/*.mp2nat $PBS_O_WORKDIR
cp $tdir/*.nat $PBS_O_WORKDIR
cp $tdir/*.scfp_fod $PBS_O_WORKDIR
cp $tdir/*.scfp $PBS_O_WORKDIR
cp $tdir/*.scfr $PBS_O_WORKDIR
cp $tdir/*.nbo $PBS_O_WORKDIR
cp $tdir/FILE.47 $PBS_O_WORKDIR
cp $tdir/*_property.txt $PBS_O_WORKDIR
cp $tdir/*spin* $PBS_O_WORKDIR

I can see from the logs that the jobs are reaching the nodes and copying files, but then dying because mpirun cannot be run as needed. Here is the log from the nodes:

cp: cannot stat ‘/home/users/henriquecsj/Co/SP/*.hess’: No such file or directory
cp: cannot stat ‘/home/users/henriquecsj/Co/SP/*.pc’: No such file or directory
--------------------------------------------------------------------------
There are not enough slots available in the system to satisfy the 24 slots
that were requested by the application:
  /home/users/henriquecsj/bin/orca/orca_gtoint_mpi

Either request fewer slots for your application, or make more slots available
for use.
--------------------------------------------------------------------------
[file orca_tools/qcmsg.cpp, line 458]:
  .... aborting the run
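
(The machinefile that mpirun receives is just the copy of $PBS_NODEFILE made by the script above, so if it helps with the diagnosis, the number of slots it actually provides can be checked from inside the job with:)

wc -l < $tdir/$job.nodes           # total MPI slots PBS granted to the job
sort $tdir/$job.nodes | uniq -c    # slots per node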

And here is the main output:

Checking for AutoStart:
The File: /scratch/60002a/henriquecsj/orcajob__224634.service1-U8cY/CoL3.gbw exists
Trying to determine its content:
     ... Fine, the file contains calculation information
     ... Fine, the calculation information was read
     ... Fine, the file contains a basis set
     ... Fine, the basis set was read
     ... Fine, the file contains a geometry
     ... Fine, the geometry was read
     ... The file does not contain orbitals - skipping AutoStart

ORCA finished by error termination in GTOInt
Calling Command: mpirun -np 24 -machinefile /scratch/60002a/henriquecsj/orcajob__224634.service1-U8cY/CoL3.nodes /home/users/henriquecsj/bin/orca/orca_gtoint_mpi /scratch/60002a/henriquecsj/orcajob__224634.service1-U8cY/CoL3.int.tmp /scratch/60002a/henriquecsj/orcajob__224634.service1-U8cY/CoL3

[file orca_tools/qcmsg.cpp, line 458]:
  .... aborting the run

Since we are completely in the dark here, could anyone give us a clue about what is happening?

P.S.: My input asks for nprocs 48.
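i.e. a %pal block along these lines:

%pal
  nprocs 48
end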

  1. Please share the PBS version (qstat --version).
  2. Was OpenMPI compiled from source against the PBS TM API? These threads may help:
     TM api on pbspro does not work for me?
     Compile OpenMPI with PBSpro 14.1.10
  3. Could you please run a simple OpenMPI + PBS job first, before trying your example above, to make sure the basic setup works? (See the sketch right after this list.)
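
A minimal test could look like this (the module name, core counts, and walltime are placeholders; adjust them to your site):

#!/bin/bash
#PBS -l select=2:ncpus=24:mpiprocs=24
#PBS -l walltime=00:05:00
#PBS -j oe
#PBS -N mpitest

module load openmpi-gnu/3.1.2
cd $PBS_O_WORKDIR

# With a working OpenMPI/PBS integration this prints one line per granted
# slot (2 x 24 = 48 here), spread across both nodes, with no -np or machinefile.
mpirun hostname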

Use mpirun --oversubscribe, i.e.:

mpirun --oversubscribe -np 24 -machinefile /scratch/60002a/henriquecsj/orcajob__224634.service1-U8cY/CoL3.nodes /home/users/henriquecsj/bin/orca/orca_gtoint_mpi /scratch/60002a/henriquecsj/orcajob__224634.service1-U8cY/CoL3.int.tmp /scratch/60002a/henriquecsj/orcajob__224634.service1-U8cY/CoL3
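
If I remember correctly, ORCA builds the mpirun call itself, but extra mpirun flags can be passed as a quoted argument after the input file, so in the job script the ORCA call would become something like this (please double-check against the parallel-execution notes in the ORCA manual for your version):

$orcadir/orca $tdir/$job.inp "--oversubscribe" >> $PBS_O_WORKDIR/$job.out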

Thank you so much for your help, guys.
I was able to run my job using

#PBS -l select=1:ncpus=24:mpiprocs=24

Now I'm trying to figure out how to request more cores for my calculations. If I want to run ORCA with 48 processes (nprocs 48), which of these is the correct option? (My tentative reading of both is sketched after them.)

#PBS -l select=2:ncpus=48:mpiprocs=24

or

#PBS -l select=2:ncpus=24:mpiprocs=24
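
My tentative reading of the two, assuming the compute nodes have 24 cores each (which I have not verified), is:

# select=2:ncpus=48:mpiprocs=24 -> 2 chunks, 48 cores reserved per chunk, 24 MPI ranks per chunk
#                                  (96 cores reserved for 48 ranks; only fits nodes with at least 48 cores)
# select=2:ncpus=24:mpiprocs=24 -> 2 chunks, 24 cores per chunk, 24 MPI ranks per chunk
#                                  (48 cores for 48 ranks, matching nprocs 48 in the input)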

Thanks, guys.
