Struggling to submit an ORCA (computational chemistry) parallel job to PBS Pro

Dear colleagues, how are you?
We were granted access to a large cluster to run our calculations. It runs SUSE Linux with PBS Pro managing the jobs, but there is a catch: the IT staff offer no support whatsoever for configuring any software.
The cluster offers several OpenMPI builds that can be loaded with “module load <option>”:

henriquecsj@service1:~/Co/SP> module load openmpi
openmpi/1.10.2/2016  openmpi/2.1.2/2018   openmpi-gnu/3.0.0    openmpi-gnu/4.0.1    openmpi-intel/3.0.0  openmpi-intel/4.0.1
openmpi/1.10.2/2017  openmpi-gnu          openmpi-gnu/3.1.2    openmpi-intel        openmpi-intel/3.1.2
openmpi/2.1.2/2017   openmpi-gnu/2.1.1    openmpi-gnu/4.0.0    openmpi-intel/2.1.1  openmpi-intel/4.0.0
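
For completeness, the module used in the script below is openmpi-gnu/3.1.2; a quick way to see what a given module sets and which mpirun ends up on the PATH (the module name is just the one from the listing above) is:

module show openmpi-gnu/3.1.2   # prints the PATH/LD_LIBRARY_PATH changes the module makes
module load openmpi-gnu/3.1.2
which mpirun                    # confirm which mpirun is picked up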

Based on the example script offered here in the forum, I was able to submit jobs.

#!/bin/bash
#PBS -l select=2:ncpus=48:ompthreads=24
#PBS -j oe
#PBS -V
#PBS -N OpenMP

# Usage of this script:
#qsub -N jobname job-orca.sh, where jobname is the name of your ORCA input file (jobname.inp) without the .inp extension

# Jobname below is set automatically when using "qsub -N jobname job-orca.sh ". Can alternatively be set manually here. Should be the name of the inputfile without extension (.inp or whatever).
export job=$PBS_JOBNAME

#Loading OPENMPI here:
module load openmpi-gnu/3.1.2

# Here giving the path to the ORCA binaries and giving communication protocol
export orcadir=/home/users/henriquecsj/bin/orca
export RSH_COMMAND="/usr/bin/ssh -x"
export PATH=$orcadir:$PATH

# Creating local scratch folder for the user on the computing node. /scratch directory must exist. 
if [ ! -d /scratch/60002a/$USER ]
then
  mkdir -p /scratch/60002a/$USER
fi
tdir=$(mktemp -d /scratch/60002a/$USER/orcajob__$PBS_JOBID-XXXX)

# Copy only the necessary files for ORCA from submit directory to scratch directory: inputfile, xyz-files, GBW-file etc.
# Add more here if needed.
cp $PBS_O_WORKDIR/*.inp $tdir/
cp $PBS_O_WORKDIR/*.gbw $tdir/
cp $PBS_O_WORKDIR/*.xyz $tdir/
cp $PBS_O_WORKDIR/*.hess $tdir/
cp $PBS_O_WORKDIR/*.pc $tdir/


# Creating nodefile in scratch
cat ${PBS_NODEFILE} > $tdir/$job.nodes

# cd to scratch
cd $tdir

# Copy job and node info to beginning of outputfile
echo "Job execution start: $(date)" >> $PBS_O_WORKDIR/$job.out
echo "Shared library path: $LD_LIBRARY_PATH" >> $PBS_O_WORKDIR/$job.out
echo "PBS Job ID is: ${PBS_JOBID}" >> $PBS_O_WORKDIR/$job.out
echo "PBS Job name is: ${PBS_JOBNAME}" >> $PBS_O_WORKDIR/$job.out
cat $PBS_NODEFILE >> $PBS_O_WORKDIR/$job.out

#Start ORCA job. ORCA is started using full pathname (necessary for parallel execution). Output file is written directly to submit directory on frontnode.
$orcadir/orca $tdir/$job.inp >> $PBS_O_WORKDIR/$job.out

# ORCA has finished here. Now copy important stuff back (xyz files, GBW files etc.). Add more here if needed.
cp $tdir/*.gbw $PBS_O_WORKDIR
cp $tdir/*.engrad $PBS_O_WORKDIR
cp $tdir/*.xyz $PBS_O_WORKDIR
cp $tdir/*.loc $PBS_O_WORKDIR
cp $tdir/*.qro $PBS_O_WORKDIR
cp $tdir/*.uno $PBS_O_WORKDIR
cp $tdir/*.unso $PBS_O_WORKDIR
cp $tdir/*.uco $PBS_O_WORKDIR
cp $tdir/*.hess $PBS_O_WORKDIR
cp $tdir/*.cis $PBS_O_WORKDIR
cp $tdir/*.dat $PBS_O_WORKDIR
cp $tdir/*.mp2nat $PBS_O_WORKDIR
cp $tdir/*.nat $PBS_O_WORKDIR
cp $tdir/*.scfp_fod $PBS_O_WORKDIR
cp $tdir/*.scfp $PBS_O_WORKDIR
cp $tdir/*.scfr $PBS_O_WORKDIR
cp $tdir/*.nbo $PBS_O_WORKDIR
cp $tdir/FILE.47 $PBS_O_WORKDIR
cp $tdir/*_property.txt $PBS_O_WORKDIR
cp $tdir/*spin* $PBS_O_WORKDIR

I can see from the logs that the jobs are reaching the nodes and copying files, but then dying because mpirun cannot be run as needed. Here is the log from the nodes:

cp: cannot stat ‘/home/users/henriquecsj/Co/SP/*.hess’: No such file or directory
cp: cannot stat ‘/home/users/henriquecsj/Co/SP/*.pc’: No such file or directory
--------------------------------------------------------------------------
There are not enough slots available in the system to satisfy the 24 slots
that were requested by the application:
  /home/users/henriquecsj/bin/orca/orca_gtoint_mpi

Either request fewer slots for your application, or make more slots available
for use.
--------------------------------------------------------------------------
[file orca_tools/qcmsg.cpp, line 458]:
  .... aborting the run
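
(The machinefile that mpirun receives is just the copy of $PBS_NODEFILE made by the script above, so if it helps with the diagnosis, the number of slots it actually provides can be checked from inside the job with:)

wc -l < $tdir/$job.nodes           # total MPI slots PBS granted to the job
sort $tdir/$job.nodes | uniq -c    # slots per node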

And here is the main output:

Checking for AutoStart:
The File: /scratch/60002a/henriquecsj/orcajob__224634.service1-U8cY/CoL3.gbw exists
Trying to determine its content:
     ... Fine, the file contains calculation information
     ... Fine, the calculation information was read
     ... Fine, the file contains a basis set
     ... Fine, the basis set was read
     ... Fine, the file contains a geometry
     ... Fine, the geometry was read
     ... The file does not contain orbitals - skipping AutoStart

ORCA finished by error termination in GTOInt
Calling Command: mpirun -np 24 -machinefile /scratch/60002a/henriquecsj/orcajob__224634.service1-U8cY/CoL3.nodes /home/users/henriquecsj/bin/orca/orca_gtoint_mpi /scratch/60002a/henriquecsj/orcajob__224634.service1-U8cY/CoL3.int.tmp /scratch/60002a/henriquecsj/orcajob__224634.service1-U8cY/CoL3

[file orca_tools/qcmsg.cpp, line 458]:
  .... aborting the run

Since we are completely in the dark here, could anyone give us a clue about what is happening?

P.S.: My input asks for nprocs 48.
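i.e. a %pal block along these lines:

%pal
  nprocs 48
end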

  1. Please share the PBS version (qstat --version).
  2. Was OpenMPI compiled from source against the PBS TM API? These threads may help:
     TM api on pbspro does not work for me?
     Compile OpenMPI with PBSpro 14.1.10
  3. Could you please run a simple OpenMPI + PBS job first, before trying your example above, to make sure the basic setup works? (See the sketch right after this list.)
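
A minimal test could look like this (the module name, core counts, and walltime are placeholders; adjust them to your site):

#!/bin/bash
#PBS -l select=2:ncpus=24:mpiprocs=24
#PBS -l walltime=00:05:00
#PBS -j oe
#PBS -N mpitest

module load openmpi-gnu/3.1.2
cd $PBS_O_WORKDIR

# With a working OpenMPI/PBS integration this prints one line per granted
# slot (2 x 24 = 48 here), spread across both nodes, with no -np or machinefile.
mpirun hostname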

Use mpirun --oversubscribe, i.e.:

mpirun --oversubscribe -np 24 -machinefile /scratch/60002a/henriquecsj/orcajob__224634.service1-U8cY/CoL3.nodes /home/users/henriquecsj/bin/orca/orca_gtoint_mpi /scratch/60002a/henriquecsj/orcajob__224634.service1-U8cY/CoL3.int.tmp /scratch/60002a/henriquecsj/orcajob__224634.service1-U8cY/CoL3
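
If I remember correctly, ORCA builds the mpirun call itself, but extra mpirun flags can be passed as a quoted argument after the input file, so in the job script the ORCA call would become something like this (please double-check against the parallel-execution notes in the ORCA manual for your version):

$orcadir/orca $tdir/$job.inp "--oversubscribe" >> $PBS_O_WORKDIR/$job.out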

Thank you so much for your help, guys.
I was able to run my job using

#PBS -l select=1:ncpus=24:mpiprocs=24

Now I'm trying to figure out how to request more cores for my calculations. If I want to run ORCA with 48 processes (nprocs 48), which of these is the correct option? (My tentative reading of both is sketched after them.)

#PBS -l select=2:ncpus=48:mpiprocs=24

or

#PBS -l select=2:ncpus=24:mpiprocs=24
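
My tentative reading of the two, assuming the compute nodes have 24 cores each (which I have not verified), is:

# select=2:ncpus=48:mpiprocs=24 -> 2 chunks, 48 cores reserved per chunk, 24 MPI ranks per chunk
#                                  (96 cores reserved for 48 ranks; only fits nodes with at least 48 cores)
# select=2:ncpus=24:mpiprocs=24 -> 2 chunks, 24 cores per chunk, 24 MPI ranks per chunk
#                                  (48 cores for 48 ranks, matching nprocs 48 in the input)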

Thanks, guys.
