I am trying to run an “embarrassingly parallel” job in my organization’s HPC that uses PBS Pro as the job scheduler.
rpm -qa | grep pbs
pbspro-devel-2022.1.3.20230614134139-0.el8.x86_64
pbspro-client-2022.1.3.20230614134139-0.el8.x86_64
One node has 128 cores, and I almost always need more than one node to get meaningful results. The job requires sending one job per core, each with a slightly modified input file (a different starting random number). So, if I need 256 jobs to run, I will have 256 input files prepared beforehand in the directory.
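For context, the input files are generated beforehand with something along these lines. This is only a rough sketch: "template.inp" and the "SEED = ..." line are placeholders, and the real files differ only in the starting random number.

# Rough sketch only; template.inp and the SEED line are placeholders.
for ((i=1; i<=256; i++)); do
    xx=$(printf "%02d" $i)
    sed "s/^SEED = .*/SEED = $RANDOM/" template.inp > "myjob_${xx}.inp"
done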
The submission script is this.
#!/bin/bash
if [ $# -ne 4 ]; then
    echo "Usage: run.sh jobname NumberOfTasks NumberOfCycles walltime(hh:mm:ss)"
    exit 1
fi
jobname=$1
n_files=$2
cycle=$3
walltime=$4
cpus_per_node=128
n_nodes=$(((${n_files}+cpus_per_node-1)/cpus_per_node))
Batchf="${jobname}.batch"
cat <<EOF >$Batchf
#!/bin/bash
#PBS -A SomeName
#PBS -M email
#PBS -m abe
#PBS -l select=${n_nodes}:ncpus=${cpus_per_node}
#PBS -l walltime=${walltime}
#PBS -N ${jobname}
cd $PWD
EOF
for ((i=1; i<=$n_files; i++)); do
    xx=$(printf "%02d" $i)
    input="${jobname}_${xx}.inp"
    if [ ! -e "$input" ]; then
        echo "Input file $input does not exist!"
        rm "$Batchf"
        exit 1
    fi
    echo "myexecutable -M ${cycle} ${input} &> ${jobname}_${xx}.stderrout &" >> "$Batchf"
done
echo "wait" >> "$Batchf"
qsub "$Batchf"
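For illustration, a hypothetical call like run.sh test 2 1000 01:00:00 (values made up) would generate a test.batch along these lines, with one backgrounded executable line per input file and a final wait:

#!/bin/bash
#PBS -A SomeName
#PBS -M email
#PBS -m abe
#PBS -l select=1:ncpus=128
#PBS -l walltime=01:00:00
#PBS -N test
cd /path/to/submit/dir
myexecutable -M 1000 test_01.inp &> test_01.stderrout &
myexecutable -M 1000 test_02.inp &> test_02.stderrout &
wait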
Take a case where I want to run 256 jobs on 2 nodes, but this could very well be 128*n jobs distributed over n nodes. In my input I can specify the run time, and the jobs always exit cleanly within that time. In fact, with the above script, 256 jobs run well and exit cleanly if I use a run time and wall time of 3 hours or less. If I ask for, say, 5 hours in my input, the jobs exceed the time limit and are killed even if I ask for 7 hours of wall time. I tried this with just one specific input, so I am not sure if it applies to other jobs as well.
For a 256-core job with a run time of 5 hours and a wall time of 7 hours, qstat shows this.
I was told that I have to ssh into each node and submit the jobs, because the way I am submitting them now will result in all 256 jobs being assigned to the first node, and that could possibly be the reason for the job kill. I ran pbsnodes on i820, which is one of the nodes in the qstat output, and I got this. There are 128 occurrences of the job name in the output for both nodes; I removed several repetitive lines to make it easier to read.
i820
Mom = i820.xxx
ntype = PBS
state = job-exclusive
pcpus = 128
jobs = 200160.imgt1/0, 200160.imgt1/0, 200160.imgt1/0, 200160.imgt1/0,
... repeated, several lines removed ...
200160.imgt1/0, 200160.imgt1/0, 200160.imgt1/0, 200160.imgt1/0, 200160.imgt1/0,
200160.imgt1/0, 200160.imgt1/0, 200160.imgt1/0, 200160.imgt1/0
resources_available.arch = linux
resources_available.host = i820
resources_available.mem = 1055890236kb
resources_available.ncpus = 128
resources_available.switch = r14u32
resources_available.vnode = i820
resources_assigned.accelerator_memory = 0kb
resources_assigned.hbmem = 0kb
resources_assigned.mem = 0kb
resources_assigned.naccelerators = 0
resources_assigned.ncpus = 128
resources_assigned.vmem = 0kb
queue = bigmem
resv_enable = True
sharing = force_exclhost
license = l
last_state_change_time = Thu Jun 27 10:19:42 2024
last_used_time = Wed Jun 26 12:59:17 2024
Each node has 2x AMD EPYC 7713 64-core processors, i.e. 128 cores per node. General-queue nodes have 2 GB per core, but I also tried, unsuccessfully, with the large-memory queue (8 GB per core) as well.
I have used this technique with Slurm for the past 8 years and never had any of these troubles, and I know one of those clusters had 4 GB/core of memory.
I would very much appreciate help and hints on the following.
- Will my above script submit jobs across nodes, one job per core, and will it do so for more than one node? If the answer is yes, is there something that could prevent this, say some local configuration put in place by the system administrators? If not, what is the best way to utilize all cores, one job per core, over several nodes? (See the first sketch after this list for what I imagine this might look like.)
- Do I have to ssh into each node to submit the jobs? If yes, may I have a few hints on how to do it? (The first sketch below is also my guess at this.)
- I would like to investigate why my longer jobs exceed the wall time and get killed while the short ones run and exit cleanly. How can I do this? Where can I see why this is happening? (See the second sketch below.)
- How can I check that the jobs are indeed running on all the nodes? Will a simple 'ssh node' tell me? (Edit: ssh and top?) (See the third sketch below.)
- Why does the same technique work with Slurm but not with PBS?
- I read about pbsdsh and job arrays. Are those better options for my job? If yes, may I have some hints on how to use them as well? (See the last sketch below for my job-array guess.)
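First sketch, for the questions about spreading one job per core across nodes and about ssh-ing into each node: this is the kind of batch-file body I imagine, completely untested, assuming passwordless ssh between compute nodes and that $PBS_NODEFILE lists every allocated chunk. Variable names follow my run.sh, and myexecutable is my code.

# Untested sketch: distribute jobs over the allocated nodes via ssh,
# 128 per node, instead of backgrounding everything on the first node.
mapfile -t nodes < <(sort -u "$PBS_NODEFILE")
cpus_per_node=128
for ((i=1; i<=n_files; i++)); do
    xx=$(printf "%02d" $i)
    node=${nodes[(i-1)/cpus_per_node]}
    ssh "$node" "cd $PWD && myexecutable -M ${cycle} ${jobname}_${xx}.inp > ${jobname}_${xx}.stderrout 2>&1" &
done
wait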
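Second sketch, for investigating why the 5-hour runs get killed: are these the right commands to look at? I am assuming job history is enabled on the server and that tracejob is accessible to ordinary users; 200160.imgt1 is the job from the pbsnodes output above.

# Full record of a finished job: resources used, exit status, scheduler comment
qstat -x -f 200160.imgt1 | grep -Ei 'exit|resources_used|comment|walltime'
# Log trace for the job (may need to be run on the server host or by an admin)
tracejob 200160.imgt1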
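Third sketch, for checking that the jobs really run on all the nodes, from a login node. The node name i820 is taken from the pbsnodes output; I would hope the pgrep count comes back as 128 on each node.

# Which nodes did the job actually get?
qstat -f 200160.imgt1 | grep -A2 exec_host
# Count running copies of the executable on one of those nodes
ssh i820 "pgrep -c myexecutable"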
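Last sketch, for the job-array question: is something like this what is meant? The select line and the hard-coded cycle count are placeholders, and I am not sure how 256 one-core subjobs would interact with the force_exclhost sharing setting on these nodes.

#!/bin/bash
#PBS -A SomeName
#PBS -N myjob
#PBS -l select=1:ncpus=1
#PBS -l walltime=05:00:00
#PBS -J 1-256
cd $PBS_O_WORKDIR
xx=$(printf "%02d" $PBS_ARRAY_INDEX)
myexecutable -M 1000 myjob_${xx}.inp &> myjob_${xx}.stderrout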
Many thanks, all help is greatly appreciated.