I’m trying to run a quantum chemistry (ORCA) job on my cluster, which has one management node and four compute nodes.
I read through the following very informative topic and successfully ran the stress script from it, shown below.
#!/bin/sh
#PBS -q large
#PBS -l select=2:ncpus=16:mpiprocs=16
#PBS -l place=scatter
cd $PBS_O_WORKDIR
total_cores=`cat $PBS_NODEFILE | wc -l `
echo "total_cores=$total_cores"
total_hosts=`cat $PBS_NODEFILE | uniq | wc -l`
echo "total_hosts=$total_hosts"
cores_per_host=$((total_cores / total_hosts))
echo "cores_per_host=$cores_per_host"
echo "running stress"
echo "/usr/local/bin/pbsdsh -- stress --cpu $cores_per_host --timeout 60s"
/usr/local/bin/pbsdsh -- stress --cpu $cores_per_host --timeout 60s
echo "ending stress"
Here is the output. It looks fine: 32 stress instances were dispatched, with 16 PIDs in each of two distinct ranges, presumably 16 per node.
total_cores=32
total_hosts=2
cores_per_host=16
running stress
/usr/local/bin/pbsdsh -- stress --cpu 16 --timeout 60s
stress: info: [12621] dispatching hogs: 16 cpu, 0 io, 0 vm, 0 hdd
stress: info: [12622] dispatching hogs: 16 cpu, 0 io, 0 vm, 0 hdd
stress: info: [12626] dispatching hogs: 16 cpu, 0 io, 0 vm, 0 hdd
stress: info: [12640] dispatching hogs: 16 cpu, 0 io, 0 vm, 0 hdd
stress: info: [12656] dispatching hogs: 16 cpu, 0 io, 0 vm, 0 hdd
stress: info: [12671] dispatching hogs: 16 cpu, 0 io, 0 vm, 0 hdd
stress: info: [12688] dispatching hogs: 16 cpu, 0 io, 0 vm, 0 hdd
stress: info: [12704] dispatching hogs: 16 cpu, 0 io, 0 vm, 0 hdd
stress: info: [12719] dispatching hogs: 16 cpu, 0 io, 0 vm, 0 hdd
stress: info: [12734] dispatching hogs: 16 cpu, 0 io, 0 vm, 0 hdd
stress: info: [12750] dispatching hogs: 16 cpu, 0 io, 0 vm, 0 hdd
stress: info: [12769] dispatching hogs: 16 cpu, 0 io, 0 vm, 0 hdd
stress: info: [12779] dispatching hogs: 16 cpu, 0 io, 0 vm, 0 hdd
stress: info: [12827] dispatching hogs: 16 cpu, 0 io, 0 vm, 0 hdd
stress: info: [151607] dispatching hogs: 16 cpu, 0 io, 0 vm, 0 hdd
stress: info: [151608] dispatching hogs: 16 cpu, 0 io, 0 vm, 0 hdd
stress: info: [12844] dispatching hogs: 16 cpu, 0 io, 0 vm, 0 hdd
stress: info: [151609] dispatching hogs: 16 cpu, 0 io, 0 vm, 0 hdd
stress: info: [12777] dispatching hogs: 16 cpu, 0 io, 0 vm, 0 hdd
stress: info: [151623] dispatching hogs: 16 cpu, 0 io, 0 vm, 0 hdd
stress: info: [151644] dispatching hogs: 16 cpu, 0 io, 0 vm, 0 hdd
stress: info: [151660] dispatching hogs: 16 cpu, 0 io, 0 vm, 0 hdd
stress: info: [151677] dispatching hogs: 16 cpu, 0 io, 0 vm, 0 hdd
stress: info: [151694] dispatching hogs: 16 cpu, 0 io, 0 vm, 0 hdd
stress: info: [151711] dispatching hogs: 16 cpu, 0 io, 0 vm, 0 hdd
stress: info: [151745] dispatching hogs: 16 cpu, 0 io, 0 vm, 0 hdd
stress: info: [151724] dispatching hogs: 16 cpu, 0 io, 0 vm, 0 hdd
stress: info: [151762] dispatching hogs: 16 cpu, 0 io, 0 vm, 0 hdd
stress: info: [151795] dispatching hogs: 16 cpu, 0 io, 0 vm, 0 hdd
stress: info: [151812] dispatching hogs: 16 cpu, 0 io, 0 vm, 0 hdd
stress: info: [151829] dispatching hogs: 16 cpu, 0 io, 0 vm, 0 hdd
stress: info: [151846] dispatching hogs: 16 cpu, 0 io, 0 vm, 0 hdd
stress: info: [12640] successful run completed in 60s
stress: info: [12656] successful run completed in 60s
stress: info: [12704] successful run completed in 60s
stress: info: [12622] successful run completed in 60s
stress: info: [12671] successful run completed in 60s
stress: info: [12621] successful run completed in 60s
stress: info: [12688] successful run completed in 60s
stress: info: [12719] successful run completed in 60s
stress: info: [12626] successful run completed in 60s
stress: info: [12734] successful run completed in 60s
stress: info: [12779] successful run completed in 60s
stress: info: [12827] successful run completed in 60s
stress: info: [151608] successful run completed in 60s
stress: info: [151677] successful run completed in 60s
stress: info: [151694] successful run completed in 60s
stress: info: [151607] successful run completed in 60s
stress: info: [151623] successful run completed in 60s
stress: info: [151609] successful run completed in 60s
stress: info: [151644] successful run completed in 60s
stress: info: [151660] successful run completed in 60s
stress: info: [12844] successful run completed in 60s
stress: info: [12777] successful run completed in 60s
stress: info: [12750] successful run completed in 60s
stress: info: [12769] successful run completed in 60s
stress: info: [151711] successful run completed in 60s
stress: info: [151745] successful run completed in 60s
stress: info: [151724] successful run completed in 60s
stress: info: [151762] successful run completed in 60s
stress: info: [151795] successful run completed in 60s
stress: info: [151812] successful run completed in 60s
stress: info: [151829] successful run completed in 60s
stress: info: [151846] successful run completed in 60s
ending stress
Next, I ran a single-node ORCA job with the job script below, and it completed successfully. (A sketch of the kind of input file it expects follows the script.)
#!/bin/sh
#PBS -q small
#PBS -l select=1:ncpus=16:mpiprocs=16
# Start ORCA Settings
ORCA=/path/to/orca
PATH=$ORCA:$PATH
LD_LIBRARY_PATH=$ORCA:$LD_LIBRARY_PATH
export PATH LD_LIBRARY_PATH
# End ORCA Settings
# Copy the submission directory to node-local scratch and run the job there
DIRNAME=`basename $PBS_O_WORKDIR`
WORKDIR=/tmp/$USER/$PBS_JOBID
mkdir -p $WORKDIR
cp -raf $PBS_O_WORKDIR $WORKDIR
cd $WORKDIR/$DIRNAME
export JOBNAME=job
/path/to/orca/orca $JOBNAME.inp > $JOBNAME.out
# Leave scratch, copy the results back beside the submission directory, and clean up only if the copy succeeded
cd
if cp -raf $WORKDIR/$DIRNAME $PBS_O_WORKDIR/.. ; then
rm -rf $WORKDIR
fi
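For reference, ORCA takes the number of parallel MPI processes from the %pal block (or a !PALn keyword) in the input file rather than from the PBS resource request, so job.inp itself asks for 16 processes. My real input is not shown here; a minimal sketch of a 16-process input (hypothetical molecule and keywords, not my actual job.inp) could be written like this:
cat > job.inp <<'EOF'
! B3LYP def2-SVP
%pal
  nprocs 16
end
* xyz 0 1
O   0.000000   0.000000   0.000000
H   0.000000   0.757000   0.587000
H   0.000000  -0.757000   0.587000
*
EOF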
Then I tried to run an ORCA job across multiple nodes with the job script below.
#!/bin/sh
#PBS -q large
#PBS -l select=2:ncpus=16:mpiprocs=16
#PBS -l place=scatter
# Start ORCA Settings
ORCA=/path/to/orca
PATH=$ORCA:$PATH
LD_LIBRARY_PATH=$ORCA:$LD_LIBRARY_PATH
export PATH LD_LIBRARY_PATH
# End ORCA Settings
DIRNAME=`basename $PBS_O_WORKDIR`
WORKDIR=/tmp/$USER/$PBS_JOBID
mkdir -p $WORKDIR
cp -raf $PBS_O_WORKDIR $WORKDIR
cd $WORKDIR/$DIRNAME
export JOBNAME=FePc
# Extra quoted arguments such as "--oversubscribe" are passed through to mpirun by ORCA (see the sketch after this script)
/path/to/orca/orca $JOBNAME.inp "--oversubscribe" > $JOBNAME.out
cd
if cp -raf $WORKDIR/$DIRNAME $PBS_O_WORKDIR/.. ; then
rm -rf $WORKDIR
fi
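Since ORCA hands the quoted argument through to mpirun, an explicit OpenMPI hostfile derived from $PBS_NODEFILE could in principle be passed the same way. This is only an untested sketch (the mynodes file name is arbitrary), included to show what I mean:
# Untested sketch: build an OpenMPI hostfile from the PBS node list
# ($PBS_NODEFILE repeats each host once per mpiproc) and hand it to mpirun.
HOSTFILE=$WORKDIR/mynodes
sort $PBS_NODEFILE | uniq -c | awk '{print $2, "slots=" $1}' > $HOSTFILE
/path/to/orca/orca $JOBNAME.inp "--hostfile $HOSTFILE" > $JOBNAME.out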
Here is the job status during the run (qstat -f output):
Job Id: 2095.peach
Job_Name = job.sh
resources_used.cpupercent = 1582
resources_used.cput = 02:39:14
resources_used.mem = 21201068kb
resources_used.ncpus = 32
resources_used.vmem = 47555576kb
resources_used.walltime = 00:09:59
job_state = R
queue = large
server = peach
Checkpoint = u
ctime = Thu Mar 14 10:56:01 2024
exec_host = peach02/0*16+peach01/0*16
exec_vnode = (peach02:ncpus=16)+(peach01:ncpus=16)
Hold_Types = n
Join_Path = n
Keep_Files = n
Mail_Points = a
mtime = Thu Mar 14 11:06:03 2024
Priority = 0
qtime = Thu Mar 14 10:56:01 2024
Rerunable = True
Resource_List.mpiprocs = 32
Resource_List.ncpus = 32
Resource_List.nodect = 2
Resource_List.nodes = 4
Resource_List.place = scatter
Resource_List.select = 2:ncpus=16:mpiprocs=16
stime = Thu Mar 14 10:56:01 2024
session_id = 12898
substate = 42
Variable_List = PBS_O_HOME=/home/user,PBS_O_LANG=ja_JP.UTF-8,
PBS_O_LOGNAME=user,
PBS_O_PATH=/home/user/bin:/home/user/bin/orca:/usr/local/bin:/usr/bin:/bin:/usr/local/games:/usr/games:/usr/local/bin,
PBS_O_SHELL=/bin/bash,PBS_O_WORKDIR=/home/user/orca/test/FePc2,
PBS_O_SYSTEM=Linux,PBS_O_QUEUE=large,
comment = Job run at Thu Mar 14 at 01:56 on (peach02:ncpus=16)+(peach01:ncpus=16)
etime = Thu Mar 14 10:56:01 2024
run_count = 1
Submit_arguments = job.sh
project = _pbs_project_default
However, it looks like all 32 processes are running on peach02 and none are distributed to peach01.
This may be an ORCA-specific issue rather than a PBS one, but I would appreciate any help.
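In case it is useful, this is roughly the kind of check that shows where the ORCA processes end up while the job is running (an untested sketch; it assumes passwordless ssh between the compute nodes, and the 'orca' pattern is meant to match the orca_* worker binaries):
# Rough placement check, run from inside the job while ORCA is working
for host in $(sort -u $PBS_NODEFILE); do
  printf '%s: ' "$host"
  ssh "$host" "pgrep -c orca"
done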