Jobs may be running on one node, possible reason for getting killed

I am trying to run an “embarrassingly parallel” job on my organization’s HPC cluster, which uses PBS Pro as the job scheduler.

rpm -qa | grep pbs
pbspro-devel-2022.1.3.20230614134139-0.el8.x86_64
pbspro-client-2022.1.3.20230614134139-0.el8.x86_64

One node has 128 cores, and I almost always need more than one node to get meaningful results. The job requires running one task per core, each with a slightly modified input file (a different starting random number). So, if I need 256 tasks to run, I will have 256 input files pre-prepared in the directory.
The submission script is this.

#!/bin/bash
if [ $# -ne 4 ]; then
    echo "Usage: run.sh jobname n_tasks n_cycles walltime(hh:mm:ss)"
    exit 1
fi
jobname=$1
n_files=$2
cycle=$3
walltime=$4
cpus_per_node=128
n_nodes=$(((${n_files}+cpus_per_node-1)/cpus_per_node))

Batchf="${jobname}.batch"

cat <<EOF >$Batchf
#!/bin/bash
#PBS -A SomeName
#PBS -M email
#PBS -m abe
#PBS -l select=${n_nodes}:ncpus=${cpus_per_node}
#PBS -l walltime=${walltime}
#PBS -N ${jobname}
cd $PWD
EOF

for ((i=1; i<=n_files; i++)); do
    xx=$(printf "%02d" $i)
    input="${jobname}_${xx}.inp"
    if [ ! -e "$input" ]; then
        echo "Input file $input does not exist!"
        rm $Batchf
        exit 1
    fi
    echo "myexecutable -M ${cycle} ${input} &> ${jobname}_${xx}.stderrout &" >> $Batchf
done
echo "wait" >> $Batchf
qsub $Batchf

Take a case where I want to run 256 jobs on 2 nodes, but this could very well be 128*n jobs distributed over n nodes. In my input I can specify the run time, and the jobs always exit cleanly within that time. In fact, with the above script, 256 jobs run well and exit cleanly if I use a run time and wall time of 3 hours or less. If I ask for, say, 5 hours in my input, the jobs exceed the time limit and are killed even if I ask for 7 hours of wall time. I tried this with just one specific input, so I am not sure if it applies to other jobs as well.
For a 256-core job with a run time of 5 hours and a wall time of 7 hours, qstat shows this.

I was told that I have to ssh into each node and submit the jobs, because the way I am submitting it now will result in all 256 jobs being assigned to the first node, and that could possibly be the reason for the job kill. I did a pbsnodes on i820, which is one of the nodes in the qstat output, and I got this. There are 128 occurrences of the job name in the outputs for both nodes. I removed several repetitive lines to make it easier to read.

i820
     Mom = i820.xxx
     ntype = PBS
     state = job-exclusive
     pcpus = 128
     jobs = 200160.imgt1/0, 200160.imgt1/0, 200160.imgt1/0, 200160.imgt1/0, 
………………………………………………… repeated, several lines removed……………………………………..
200160.imgt1/0, 200160.imgt1/0, 200160.imgt1/0, 200160.imgt1/0, 200160.imgt1/0, 
200160.imgt1/0, 200160.imgt1/0, 200160.imgt1/0, 200160.imgt1/0
     resources_available.arch = linux
     resources_available.host = i820
     resources_available.mem = 1055890236kb
     resources_available.ncpus = 128
     resources_available.switch = r14u32
     resources_available.vnode = i820
     resources_assigned.accelerator_memory = 0kb
     resources_assigned.hbmem = 0kb
     resources_assigned.mem = 0kb
     resources_assigned.naccelerators = 0
     resources_assigned.ncpus = 128
     resources_assigned.vmem = 0kb
     queue = bigmem
     resv_enable = True
     sharing = force_exclhost
     license = l
     last_state_change_time = Thu Jun 27 10:19:42 2024
     last_used_time = Wed Jun 26 12:59:17 2024

Each node has 2x AMD EPYC 7713 64-core processors, giving 128 cores per node. General queue nodes have 2 GB per core, but I also tried, unsuccessfully, with the large memory queue (8 GB per core).
I have used this technique with Slurm for the past 8 years and never had any of these troubles, and I know one of those clusters had 4 GB/core of memory.

I very much appreciate help and hints for the following.

  1. Will my above script submit jobs across nodes, one job per core, and will it do so for more than one node? If the answer is yes, is there something that will prevent this, say some local configuration put in place by the system administrators? If not, what’s the best way to utilize all cores, one job per core, over several nodes?
  2. Do I have to ssh into each node to submit the job? If yes, may I have a few hints on how to do it?
  3. I would like to investigate why my longer jobs are exceeding the wall time and getting killed while the short ones run and exit cleanly. How can I do this? Where can I see why this is happening?
  4. How can I check that the jobs are indeed running on all the nodes? Will a simple ‘ssh node’ tell me? (Edit: ssh and top?)
  5. Why is the same technique working with Slurm but not with PBS?
  6. I read about pbsdsh and job arrays. Are those better options for my job? If yes, may I have some hints on how to do this as well?

Many thanks, all help is greatly appreciated.

Please try this

example:
#Creating 256 input files
for i in {1..256};do echo $i > inputfile_$i.txt ; done

Try this job array script, which reads the inputfile_* files created above:

cat pbsjobarray.sh

#!/bin/bash
#PBS -N jobarray_256_core_job
#PBS -l select=1:ncpus=1
#PBS -J 1-256
hostname
env
cd $PBS_O_WORKDIR
/bin/echo inputfile_$PBS_ARRAY_INDEX
exit 0

qsub pbsjobarray.sh
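
For the original workflow, each subjob can pick up its own pre-prepared input file from the array index. A minimal sketch, assuming the zero-padded naming from the run.sh script above (the jobname “myjob”, the 7-hour walltime, and the cycle count 100 are placeholders):

#!/bin/bash
#PBS -N myjob_array
#PBS -l select=1:ncpus=1
#PBS -l walltime=07:00:00
#PBS -J 1-256
cd $PBS_O_WORKDIR
# zero-pad the array index so it matches input files such as myjob_01.inp
xx=$(printf "%02d" $PBS_ARRAY_INDEX)
myexecutable -M 100 myjob_${xx}.inp &> myjob_${xx}.stderrout

PBS then schedules the 256 single-core subjobs wherever cores are free, on one node or several.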

Q1: To run one core per job, the qsub statement should be

qsub -l select=1:ncpus=1:mem=10mb -- /bin/sleep 100

You can either run 256 one-core jobs like this:

for i in {1..256}; do qsub -l select=1:ncpus=1:mem=10mb -- /bin/echo inputfilename_$i ; done

or you can use a job array submission as in the example above.

Q2: No, you do not have to ssh into each node to submit the job. You can submit the job from the head node or master node, or from any host that has the PBS server running.

Q3: Please check the mom logs and, within them, trace the job id that got killed for exceeding the walltime (ssh computenode; source /etc/pbs.conf; cd $PBS_HOME/mom_logs).
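
For example, to trace the job from this thread (job id 200160.imgt1 on node i820, both taken from the output above; the date is only a guess), something along these lines could be run on the compute node, since PBS Pro daemon logs are plain-text files named by date (YYYYMMDD):

ssh i820
source /etc/pbs.conf
cd $PBS_HOME/mom_logs
# look for walltime and kill messages for that job on the relevant day
grep 200160 20240626 | grep -i walltime
grep 200160 20240626 | grep -i kill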

Q4: You can check by running these commands:

qstat -answ1
qstat -t     (in the case of a job array)
pbsnodes -aSjv
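
To confirm that processes are really running on every allocated node, a quick interactive check also works; a sketch, assuming ssh to the compute nodes is permitted, using node names as reported by qstat -answ1 (i820 plus a hypothetical second node i821):

for host in i820 i821; do
    echo "== $host =="
    ssh $host "pgrep -c -u $USER myexecutable"    # count of your running processes on that node
done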

Q5: The same technique should work with all the supported workload managers.
Q6: Job arrays would best suit your requirement.

Please check the documentation for more information: https://help.altair.com/2022.1.0/PBS%20Professional/PBS2022.1.pdf

The walltime syntax is HH:MM:SS.
To submit a job with a walltime of 10 minutes:

#PBS -l walltime=00:10:00

To submit a job with a walltime of 10 hours:

#PBS -l walltime=10:00:00

To submit a job with a walltime of 10 seconds:

#PBS -l walltime=00:00:10

If you submit a job with the specification below,
#PBS -l walltime=10

then you are requesting a walltime of 10 seconds:

 qstat -f | grep -i walltime
    resources_used.walltime = 00:00:00
    Resource_List.walltime = 00:00:10
    Submit_arguments = -l select=1:ncpus=1 -l walltime=10 -- /bin/sleep 100

If you request walltime with the specification below,
#PBS -l walltime=100

you are requesting a walltime of 1 minute and 40 seconds for the job:

qstat -f | grep -i walltime
    Resource_List.walltime = 00:01:40
    Submit_arguments = -l select=1:ncpus=1 -l walltime=100 -- /bin/sleep 100

Dear Adarsh
Thank you very much for your prompt and detailed reply. I will try and let you know.
Cheers, Sunil

Dear Adarsh,
The wall time is already read as hh:mm:ss, as input from the terminal.

Hi Adarsh,
The array technique works, but if I use 128 subjobs I exceed the maximum number of jobs allowed per user (100), because each subjob is counted as one job.

The second technique (the answer to Q1) meets the same fate, since each qsub statement submits one job per node, and this also exceeds the maximum job limit. Also, to the best of my knowledge, it submits only one job per node because of the way the allocation is set up, in nodes and not in cores.
What I was looking for is one job submission with multiple child jobs spread across cores on one or more nodes. The system should see it as one job.

Currently the only way I have been successful is by creating one batch file per node, each with 128 execute statements. This runs 128 tasks per node, but each node's job is counted only as one.
Let me know if I am missing something.
Thanks
Sunil

Perhaps GNU Parallel would help. You submit one job, but within that job you use GNU Parallel to spread your work out to multiple cores/nodes.

https://www.gnu.org/software/parallel/

Hi @dtalcott
Thanks for the suggestion. However, GNU Parallel is not installed (it's my organization's HPC) and I don't think they will install it for me. I am constrained to find a solution with PBS.
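
One pattern that stays within plain PBS and keeps everything inside a single multi-node job (so it counts as one job against the per-user limit) is to launch the per-core tasks on the allocated hosts from inside the job script, using the hostnames listed in $PBS_NODEFILE. A rough sketch, assuming pbs_tmrsh (PBS's TM-based remote shell) or plain ssh to the allocated nodes is allowed on the cluster, and reusing placeholder names (myexecutable, myjob, cycle count 100) from earlier in the thread:

#!/bin/bash
#PBS -l select=2:ncpus=128
#PBS -l walltime=07:00:00
cd $PBS_O_WORKDIR

jobname=myjob          # placeholder; input files are ${jobname}_NN.inp
n_files=256
cycle=100              # placeholder cycle count
cpus_per_node=128
# $PBS_NODEFILE lists the allocated hosts; sort -u gives each node once
nodes=($(sort -u $PBS_NODEFILE))

i=1
for node in "${nodes[@]}"; do
    for ((c=0; c<cpus_per_node && i<=n_files; c++, i++)); do
        xx=$(printf "%02d" $i)
        # start the task on the target node; pbs_tmrsh keeps it under PBS's control,
        # and ssh can be substituted if the site allows it
        pbs_tmrsh $node /bin/bash -c "cd $PBS_O_WORKDIR && myexecutable -M $cycle ${jobname}_${xx}.inp &> ${jobname}_${xx}.stderrout" &
    done
done
wait

PBS sees a single two-node job, much like the per-node batch files, but it is submitted once and the tasks land on both nodes.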