Job not getting distributed among nodes

I am testing PBS. Currently the setup is 1 head node and 2 compute nodes.
CentOS 7
PBS 20.0.1

In the PBS script, when I give the parameter
#PBS -l nodes=2:ppn=1
I assume it means 1 process on each of the two nodes.

After submitting the job, when I run tracejob, it also shows it properly:
Job Run at request of Scheduler@pb0 on exec_vnode (pb1:ncpus=1:mem=524288kb)+(pb2:ncpus=1:mem=524288kb)

But in reality, it only runs on one node. What could be the issue?

In $PBS_HOME/sched_priv/sched_config
I have kept -
load_balancing: true ALL

Also, when adding a node, if I give the attribute, for example,

create node nodename ntype=time-shared

it does not take the ntype value.

Please try this

#PBS -l select=2:ncpus=1
#PBS -l place=scatter


Hi Adarsh, thanks for your reply.

Below is the error when I try to submit the job with those options:

qsub: "-lresource=" cannot be used with "select" or "place", resource is: mem

So I removed the memory parameter from the script below and submitted it again.

My script is -

#!/bin/bash

#Lines starting with # are comments and #PBS is a PBS directive

#Give a name to your Job
#PBS -N mpiruns
#Give the output file name
#PBS -o mpiruns.o.txt
#Give the error file name
#PBS -e mpiruns.e.txt
#Give the Queue name.
#PBS -q all
#Give the nodes. This is default; it will submit across 10 nodes as per the cores asked for. Per node is 16 cores.
#PBS -l select=2:ncpus=1
#PBS -l place=scatter
#Load default environment
#PBS -V
#Mention your email to get notified once the job is done.
#PBS -m abe
#PBS -M myemai@abc.com
#Memory needed for your code
#PBS -l mem=1024mb

#Specify the time required for your runs. The example below is 3 minutes of CPU time. Please don't block a job for over 24 hours.

#PBS -l cput=00:03:00

#Give your run as below

mpirun $HOME/mpi


Tracejob output -

Considering job to run
05/20/2022 02:07:43 S Job Queued at request of vinay@pb0, owner = vinay@pb0, job name = mpiruns, queue = all
05/20/2022 02:07:43 S Job Run at request of Scheduler@pb0 on exec_vnode (pb1:ncpus=1)+(pb2:ncpus=1)
05/20/2022 02:07:43 L Job run


But it still ran only on pb1 (that is, only on one node).


I am running a basic MPI program which just prints the hostname.


MPI code -

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size, h_len;
    char hostname[MPI_MAX_PROCESSOR_NAME];
    MPI_Init(&argc, &argv);

    // get rank of this process
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    // get total number of processes
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    MPI_Get_processor_name(hostname, &h_len);
    printf("Start! rank:%d size: %d at %s\n", rank, size, hostname);
    // do something
    printf("Done!  rank:%d size: %d at %s\n", rank, size, hostname);

    MPI_Finalize();
    return 0;
}


The output I am getting is:

Start! rank:0 size: 8 at pb1
Done! rank:0 size: 8 at pb1
Start! rank:1 size: 8 at pb1
Done! rank:1 size: 8 at pb1
Start! rank:2 size: 8 at pb1
Done! rank:2 size: 8 at pb1
Start! rank:3 size: 8 at pb1
Done! rank:3 size: 8 at pb1
Start! rank:4 size: 8 at pb1
Done! rank:4 size: 8 at pb1
Start! rank:5 size: 8 at pb1
Done! rank:5 size: 8 at pb1
Start! rank:6 size: 8 at pb1
Done! rank:6 size: 8 at pb1
Start! rank:7 size: 8 at pb1
Done! rank:7 size: 8 at pb1


It should also show pb2. Also, ncpus=1 means it should take only one core per chunk, but it runs 8 ranks, as you can see in the output.


Is there any configuration on the head node that I am missing?


pbsnodes -a output –

pb1
Mom = pb1
ntype = PBS
state = free
pcpus = 8
resources_available.arch = linux
resources_available.host = pb1
resources_available.mem = 3970436kb
resources_available.ncpus = 8
resources_available.vnode = pb1
resources_assigned.accelerator_memory = 0kb
resources_assigned.hbmem = 0kb
resources_assigned.mem = 0kb
resources_assigned.naccelerators = 0
resources_assigned.ncpus = 0
resources_assigned.vmem = 0kb
resv_enable = True
sharing = default_shared
last_state_change_time = Wed May 4 03:58:53 2022
last_used_time = Fri May 20 02:07:45 2022

pb2
Mom = pb2
ntype = PBS
state = free
pcpus = 8
resources_available.arch = linux
resources_available.host = pb2
resources_available.mem = 3970436kb
resources_available.ncpus = 8
resources_available.vnode = pb2
resources_assigned.accelerator_memory = 0kb
resources_assigned.hbmem = 0kb
resources_assigned.mem = 0kb
resources_assigned.naccelerators = 0
resources_assigned.ncpus = 0
resources_assigned.vmem = 0kb
resv_enable = True
sharing = default_shared
last_state_change_time = Wed May 4 03:58:53 2022
last_used_time = Fri May 20 02:07:45 2022

Memory is a host-level consumable resource, so it should be part of the chunk or select statement.
Please try this block

#PBS -N mpiruns
#PBS -o mpiruns.o.txt
#PBS -e mpiruns.e.txt
#PBS -q all
#PBS -l select=2:ncpus=1:mem=1024mb
#PBS -l place=scatter
#PBS -V
#PBS -m abe
#PBS -M myemai@abc.com

cd  $PBS_O_WORKDIR
total_cores=`cat $PBS_NODEFILE | wc -l `
/absolute/path/to/mpirun   -np  $total_cores -hosts  $PBS_NODEFILE   /path/to/your/application <arguments if it needs any>

  1. DNS and /etc/hosts should resolve the hosts
  2. password-less SSH for the user(s) should work seamlessly: head node to compute nodes, compute nodes to compute nodes, and compute nodes to head node (a quick sanity check is sketched below)
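
For example, a minimal check of both items from the head node (just a sketch, assuming the pb0/pb1/pb2 hostnames used in this thread, run as the job user):

for h in pb0 pb1 pb2; do
  getent hosts "$h"                                                              # 1. name resolution
  ssh -o BatchMode=yes "$h" hostname || echo "passwordless ssh to $h failed"     # 2. key-based ssh
done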

Please let us know the flavour of MPI you are using ( Intel MPI or OpenMPI or HP-MPI)

Hi Adarsh,

/etc/hosts is correct and password-less SSH is working between the compute nodes and the head node.

I am using OpenMPI.

Even if I run a plain hostname command in a script and submit it via qsub to run on both nodes, the output shows only one node.

It never distributes.

Installation steps -

I built the RPMs from the OpenPBS 20 source code.

Installed the server RPM on the head node,
and the execution RPM on the compute nodes.

Then did the normal configuration we do for PBS:

enabled load_balancing: true,

added a queue, and added both nodes.

Is there any step missing, or do you have any document that can be referred to?

I have tried whatever guides are available on the internet.

I found that the ntype value can be set to time-shared,

but I was never able to do it.

Hi Vinay,

I hope there is a shared folder that is common to both the nodes.
Can you run the MPI jobs on the two nodes without using PBS?
Please check the link below, FAQs 3 to 7:
https://www.open-mpi.org/faq/?category=running

If this works, and if you have compiled OpenMPI with the PBS TM libraries, then there should not be any issue running across any number of nodes.
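
One way to verify whether your OpenMPI build includes the PBS TM components (a quick sketch; the exact component names can vary by OpenMPI version):

ompi_info | grep -i tm
# expect lines such as "MCA ras: tm ..." and "MCA plm: tm ..." if OpenMPI was configured with --with-tm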

Hi Adarsh,

Thank you for replying. I understand what you are saying. The issue here is not about MPI.

For your question: yes, I am able to run an MPI job on both nodes without PBS.

What I have been saying from the start is that jobs are not getting distributed among nodes. I never said they are not running on the nodes.

So, to make it clearer: if I remove node 1 (delete node pb1 via qmgr) so that only one node is available, i.e. pb2,

and I submit a job, it runs fine on pb2 too.

So the question is not whether PBS is able to push jobs to different nodes.

If you look at my question carefully:

headnode - pb0
Compute nodes - pb1 , pb2
This
#PBS -l nodes=2:ppn=1
Or
This
#PBS -l select=1:ncpus=4
#PBS -l place=scatter


It never goes to another node. Whichever node it finds first in pbsnodes, it submits to that one, despite pbsnodes -a showing all the added nodes as free, with proper CPU resources.


Both my nodes have 8 cores. Even if I increase the job resource request beyond 8 cores, e.g. ncpus=12,
the job will never run.

qsub is never able to submit to, or see, more than one node.


I hope you understand the issue I am facing right now.

It is not related to MPI.


Could you please submit this qsub request and let me know the qstat -answ1 output?

qsub -l select=2:ncpus=1 -l place=scatter -- /bin/sleep 1000

Create a script with the below contents:

pbs1.sh

#PBS  -l select=2:ncpus=1
#PBS -l place=scatter
echo $PBS_NODEFILE
cat $PBS_NODEFILE
env

qsub pbs1.sh
cat .o

and another script
pbs2.sh

#PBS  -l select=2:ncpus=2:mpiprocs=2
#PBS -l place=scatter
echo $PBS_NODEFILE
cat $PBS_NODEFILE
env

qsub pbs2.sh
cat .o

Job not getting distributed among nodes:

qsub -l select=2:ncpus=1
#you are requesting 2 chunks, each with a request of 1 core; this might run on one node if 2 cores are free, or it might run on two nodes

If you have two nodes
n1 = 4 cores
n2 = 4 cores
with the above request, the job will run on n1 (if none of the cores of n1 and n2 are used up, and n1, n2 is the order of the pbsnodes -av output)

qsub -l select=2:ncpus=1 -l place=scatter
#here we are requesting the same, but making sure the chunks are scattered; that means the 2 chunks should not run on one node

If you have two nodes
n1 = 4 cores
n2 = 4 cores
with the above request, the job will use 1 core from n1 and 1 core from n2.
If n1 has all its cores used up by some other job(s), then this request will remain in the queue until resources are available.
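
To confirm where the chunks actually landed, one way (a sketch, assuming a job id like 28.pb0 from this setup) is:

qstat -f 28.pb0 | grep -E "exec_host|exec_vnode"   # placement chosen by the scheduler
cat /var/spool/pbs/aux/28.pb0                      # the job's $PBS_NODEFILE, present on the first execution host while the job runs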

Dear Adarsh,

I have shared the output of both scripts as requested by you. Please check:

Dear Adarsh,

The issue is quite strange. If you can spare half an hour, we can have a Google Meet session where I can share my screen with you.

As these are virtual machines, I can reset them to a basic state, reinstall PBS, submit a job, and test, which should not take more than 15 minutes.

Only if you are free, please let me know the time and I will share the meeting link.

Thank you Vinay.

The jobs you ran look good to me; there is nothing wrong in the scheduler configuration or job submission.

/var/spool/pbs/aux/28.pb0
pb2
pb1

/var/spool/pbs/aux/29.pb0
pb2
pb2
pb1
pb1

#PBS -N mpiruns
#PBS -o mpiruns.o.txt
#PBS -e mpiruns.e.txt
#PBS -q all
#PBS -l select=2:ncpus=1:mem=1024mb
#PBS -l place=scatter
#PBS -V
#PBS -m abe
#PBS -M myemai@abc.com

cd  $PBS_O_WORKDIR
total_cores=`cat $PBS_NODEFILE | wc -l `
/absolute/path/to/mpirun   -np  2 -hosts  $PBS_NODEFILE  /bin/hostname

Hi Adarsh,

Thank you for replying. That is exactly the issue: in tracejob I also see that it has been distributed.
But actually, the job gets executed only on the 1st node.

Even if my job is just running hostname, or an MPI job, or any Python job. We have tried various things.

But the job runs only on one node.

Even when we write place=scatter

PBS Pro will allocate the resources (compute nodes: CPU, memory, etc.) for a job and dynamically create the host file ($PBS_NODEFILE). It is then up to the underlying application to use this host information to run across the multiple cores of a single machine or the cores of multiple nodes.

  • If the MPI is tightly integrated with PBS Pro, i.e., compiled from source using the PBS Pro TM libraries, then there will be proper accounting and management of the processes spawned by MPI.
  • Otherwise, if a plain MPI binary is used directly, then there might be some zombie processes left that need cleaning after the job has run. If the application has an inbuilt mechanism for killing the zombies, then well and good.

Please share with us the batch command line that is used to run across 2 machines without using PBS. If this is not working, then it would not work using PBS Pro either.
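
For reference, a standalone command of that shape (just a sketch, assuming OpenMPI and a hand-written host file; mpi_hello stands in for your compiled MPI binary) might look like:

cat > hosts.txt <<EOF
pb1
pb2
EOF
mpirun -np 2 --hostfile hosts.txt ./mpi_hello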

Scatter is a way of splitting the chunks of resources so that they are taken from separate compute nodes in the PBS complex.

For example, if you would like to run a 100-core job, but use 10 cores each from 10 compute nodes:
#PBS -l select=10:ncpus=10
#PBS -l place=scatter

Otherwise, if you have 1 compute node with 100 cores, then the below directive without place=scatter will run on one system:
#PBS -l select=10:ncpus=10

Hi Adarsh,

Let's leave MPI aside. But I don't agree that a system-installed OpenMPI would change the way PBS places jobs.

Otherwise, there are a lot of users who compile a lot of things and use their own versions; PBS would behave differently for everyone.

Let's concentrate on this issue rather than going in circles.

The documentation, as well as you, says place=scatter should distribute the chunks.

To make it clearer, I will not use MPI, to keep things simple:

I am submitting a stress command, which will create 4 stress workers for 200 seconds; you can check it in top/htop while it is running.

stress --cpu 4 --timeout 200

Below is my PBS script:

#!/bin/bash
#PBS -N ZIS_scon
#PBS -q all
#PBS -l select=2:ncpus=4
#PBS -l place=scatter
#PBS -V
#PBS -o mpiruns.o.txt
#Give the error file name
#PBS -e mpiruns.e.txt

cd $PBS_O_WORKDIR

cat $PBS_NODEFILE > pbs_nodes

echo Working directory is $PBS_O_WORKDIR

NPROCS=`wc -l < $PBS_NODEFILE`
NNODES=`uniq $PBS_NODEFILE | wc -l`

# Display the job context
echo Running on host `hostname`
echo Time is `date`
echo Directory is `pwd`
echo Using ${NPROCS} processors across ${NNODES} nodes

stress --cpu 4 --timeout 200


After running this, going by what you are saying, since we are using scatter,
node1 should show stress processes running, as well as node2.


But it doesn't.

Am I misunderstanding the concept of PBS, or am I just not able to make you understand? We both are just going around the same topic.

Job submission -

node 1 -

node2 -

As you can see, there is nothing running on node 2. Even a simple hostname command, which prints the hostname to the output file, clearly shows it runs only on one node.

Tracejob output -

It clearly shows the job being assigned to both nodes. So why don't I see it running on node 2?

I hope you understand it this time.

PBS Pro is a scheduler (workload manager); it will not parallelize the application or batch command line to run on multiple nodes.

stress --cpu 4 --timeout 200 is a serial batch application, not a distributed parallel application.

Hence PBS Pro will not run
stress --cpu 4 --timeout 200 on compute node 1
stress --cpu 4 --timeout 200 on compute node 2

Instead, it will assign compute node 1 and compute node 2 to the job with the requested resources (via the qsub).

PBS Pro schedules and assigns the requested resources to the job (job-wide). If the application is distributed-parallel, it is given information in the form of $PBS_NODEFILE (with or without mpiprocs), which the underlying application must use to run across the assigned nodes.

The stress application itself would have to read the hosts assigned to the job by PBS Pro and divide and distribute its work across the hosts dynamically provided in $PBS_NODEFILE.

FYI: Batch environment: PBS - User Documentation - ECMWF Confluence Wiki

Assuming that "stress" is an MPI-enabled application (calling MPI_Init, etc.) and compiled with mpicc so that it is linked properly to the MPI libraries, I think the piece you're missing is "mpirun". For example…

mpirun -np 4 --hostfile $PBS_NODEFILE stress --cpu 4 --timeout 200

mpirun man page is here: mpirun(1) man page (version 4.1.3)


Please try this script (kind of induced parallelism)

#cat stress.sh

#PBS -N stress
#PBS -l select=2:ncpus=4:mpiprocs=4
cd $PBS_O_WORKDIR
total_cores=`cat $PBS_NODEFILE | wc -l `
echo "total_cores=$total_cores"
total_hosts=`cat $PBS_NODEFILE | uniq | wc -l`
echo "total_hosts=$total_hosts"
cores_per_host=$((total_cores / total_hosts))
echo "cores_per_host=$cores_per_host"
echo "running stress"
echo "/opt/pbs/bin/pbsdsh -- stress --cpu $cores_per_host  --timeout 100s"
/opt/pbs/bin/pbsdsh -- stress --cpu $cores_per_host  --timeout 100s
echo "ending stress"

#qsub stress.sh

Dear Adarsh,

It still runs on one node.

Below is stress.sh:

#PBS -q all
#PBS -N stress
#PBS -l select=2:ncpus=4:mpiprocs=4
#PBS -o stress.o.txt
##PBS -e stress.e.txt

cd $PBS_O_WORKDIR

total_cores=`cat $PBS_NODEFILE | wc -l`
echo "total_cores=$total_cores"
total_hosts=`cat $PBS_NODEFILE | uniq | wc -l`
echo "total_hosts=$total_hosts"
cores_per_host=$((total_cores / total_hosts))
echo "cores_per_host=$cores_per_host"
echo "running stress"
echo "/opt/pbs/bin/pbsdsh -- stress --cpu $cores_per_host --timeout 100s"
/opt/pbs/bin/pbsdsh -- stress --cpu $cores_per_host --timeout 100s
echo "ending stress"


TRACEJOB -

HTOP -

Node 1 -

Node 2 -


As you can see, it is running on a single node, but the tracejob does show that it is splitting.

I have tested the same script and it runs on two systems in parallel.
I am not sure whether you have StrictHostkeyCheck turned on on the second node; it might be causing issues. Please disable it in the sched_config on all the systems in the PBS complex.
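
If this ends up being the SSH client option StrictHostKeyChecking (an assumption), one per-user way to relax it on each node would be a sketch like:

# ~/.ssh/config for the job user, on every node in the complex (hostnames are the ones used in this thread)
Host pb0 pb1 pb2
    StrictHostKeyChecking no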

Hi Adarsh,

I can't find the option "StrictHostkeyCheck" in the /var/spool/pbs/sched_priv/sched_config file on the head node.

My configuration is default, with no changes. I have only edited /etc/pbs.conf on the head node and the compute nodes,

and /var/spool/pbs/mom_priv/config, in which I have added the head node hostname.

The pbs-server RPM is installed on the head node, and the pbs-execution RPM is installed on the compute nodes.

sched_config file for your reference: sched_config (ufile.io)