Job not getting distributed among nodes

I am testing PBS. Currently the setup is 1 head node and 2 compute nodes.
CentOS 7
PBS 20.0.1

In the PBS script, when I give the parameter
#PBS -l nodes=2:ppn=1
I assume it means 1 process on each of the two nodes.

After submitting the job, when I run tracejob, it also shows it properly:
Job Run at request of Scheduler@pb0 on exec_vnode (pb1:ncpus=1:mem=524288kb)+(pb2:ncpus=1:mem=524288kb)

But in reality, it only runs on one node. What could be the issue?

In $PBS_HOME/sched_priv/sched_config
I have kept -
load_balancing: true ALL

Also, when adding a node, if I give the attribute, for example,

create node nodename ntype=time-shared

it does not take the ntype value.

Please try this

#PBS -l select=2:ncpus=1
#PBS -l place=scatter


Hi Adarsh, thanks for your reply.

Below is the error when I try to submit the job with those options:

qsub: "-lresource=" cannot be used with "select" or "place", resource is: mem

So I removed the memory parameter from the script below and submitted it again.

My script is -

#!/bin/bash

#Lines starting with # are comments and #PBS is a PBS directive

#Give a name to your Job
#PBS -N mpiruns
#Give the output file name
#PBS -o mpiruns.o.txt
#Give the error file name
#PBS -e mpiruns.e.txt
#Give the Queue name.
#PBS -q all
#Give the nodes. This is default; it will submit across 10 nodes as per the cores asked for. Per node is 16 cores.
#PBS -l select=2:ncpus=1
#PBS -l place=scatter
#Load default environment
#PBS -V
#Mention your email to get notified once the job is done.
#PBS -m abe
#PBS -M myemai@abc.com
#Memory needed for your code
#PBS -l mem=1024mb

#Specify the time required for your runs. The example below is 3 minutes of CPU time. Please don't block a job for over 24 hours.

#PBS -l cput=00:03:00

#Give your run as below

mpirun $HOME/mpi


Tracejob output -

Considering job to run
05/20/2022 02:07:43 S Job Queued at request of vinay@pb0, owner = vinay@pb0, job name = mpiruns, queue = all
05/20/2022 02:07:43 S Job Run at request of Scheduler@pb0 on exec_vnode (pb1:ncpus=1)+(pb2:ncpus=1)
05/20/2022 02:07:43 L Job run


But it still ran only on pb1 (that is, only on one node).


I am running a basic MPI program which just prints the hostname.


MPI code -

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size, h_len;
    char hostname[MPI_MAX_PROCESSOR_NAME];
    MPI_Init(&argc, &argv);

    // get rank of this process
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    // get total number of processes
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    MPI_Get_processor_name(hostname, &h_len);
    printf("Start! rank:%d size: %d at %s\n", rank, size, hostname);
    // do something
    printf("Done!  rank:%d size: %d at %s\n", rank, size, hostname);

    MPI_Finalize();
    return 0;
}


The output I am getting is:

Start! rank:0 size: 8 at pb1
Done! rank:0 size: 8 at pb1
Start! rank:1 size: 8 at pb1
Done! rank:1 size: 8 at pb1
Start! rank:2 size: 8 at pb1
Done! rank:2 size: 8 at pb1
Start! rank:3 size: 8 at pb1
Done! rank:3 size: 8 at pb1
Start! rank:4 size: 8 at pb1
Done! rank:4 size: 8 at pb1
Start! rank:5 size: 8 at pb1
Done! rank:5 size: 8 at pb1
Start! rank:6 size: 8 at pb1
Done! rank:6 size: 8 at pb1
Start! rank:7 size: 8 at pb1
Done! rank:7 size: 8 at pb1


It should also show pb2. Also, ncpus=1 means it should take only one core per chunk, but it runs 8 ranks, as you can see in the output.


Is there any configuration on the head node that I am missing?


pbsnodes -a output –

pb1
Mom = pb1
ntype = PBS
state = free
pcpus = 8
resources_available.arch = linux
resources_available.host = pb1
resources_available.mem = 3970436kb
resources_available.ncpus = 8
resources_available.vnode = pb1
resources_assigned.accelerator_memory = 0kb
resources_assigned.hbmem = 0kb
resources_assigned.mem = 0kb
resources_assigned.naccelerators = 0
resources_assigned.ncpus = 0
resources_assigned.vmem = 0kb
resv_enable = True
sharing = default_shared
last_state_change_time = Wed May 4 03:58:53 2022
last_used_time = Fri May 20 02:07:45 2022

pb2
Mom = pb2
ntype = PBS
state = free
pcpus = 8
resources_available.arch = linux
resources_available.host = pb2
resources_available.mem = 3970436kb
resources_available.ncpus = 8
resources_available.vnode = pb2
resources_assigned.accelerator_memory = 0kb
resources_assigned.hbmem = 0kb
resources_assigned.mem = 0kb
resources_assigned.naccelerators = 0
resources_assigned.ncpus = 0
resources_assigned.vmem = 0kb
resv_enable = True
sharing = default_shared
last_state_change_time = Wed May 4 03:58:53 2022
last_used_time = Fri May 20 02:07:45 2022

Memory is a host-level consumable resource, so it should be part of the chunk or select statement.
Please try this block

#PBS -N mpiruns
#PBS -o mpiruns.o.txt
#PBS -e mpiruns.e.txt
#PBS -q all
#PBS -l select=2:ncpus=1:mem=1024mb
#PBS -l place=scatter
#PBS -V
#PBS -m abe
#PBS -M myemai@abc.com

cd  $PBS_O_WORKDIR
total_cores=`cat $PBS_NODEFILE | wc -l `
/absolute/path/to/mpirun   -np  $total_cores -hosts  $PBS_NODEFILE   /path/to/your/application <arguments if it needs any>

  1. DNS and /etc/hosts should resolve the hosts
  2. password-less SSH for the user(s) should work seamlessly: head node to compute nodes, compute nodes to compute nodes, and compute nodes to head node (a quick sanity check is sketched below)
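
For example, a minimal check of both items from the head node (just a sketch, assuming the pb0/pb1/pb2 hostnames used in this thread, run as the job user):

for h in pb0 pb1 pb2; do
  getent hosts "$h"                                                              # 1. name resolution
  ssh -o BatchMode=yes "$h" hostname || echo "passwordless ssh to $h failed"     # 2. key-based ssh
done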

Please let us know the flavour of MPI you are using ( Intel MPI or OpenMPI or HP-MPI)

Hi Adarsh,

/etc/hosts is correct and password-less SSH is working between the compute nodes and the head node.

I am using OpenMPI.

Even if I run a plain hostname command in a script and submit it via qsub to run on both nodes, the output shows only one node.

It never distributes.

Installation steps -

I built the RPMs from the OpenPBS 20 source code.

Installed the server RPM on the head node,
and the execution RPM on the compute nodes.

Then did the normal configuration we do for PBS:

enabled load_balancing: true,

added a queue, and added both nodes.

Is there any step missing, or do you have any document that can be referred to?

I have tried whatever guides are available on the internet.

I found that the ntype value can be set to time-shared,

but I was never able to do it.

Hi Vinay,

I hope there is a shared folder that is common to both the nodes.
Can you run the MPI jobs on the two nodes without using PBS?
Please check the link below, FAQs 3 to 7:
https://www.open-mpi.org/faq/?category=running

If this works, and if you have compiled OpenMPI with the PBS TM libraries, then there should not be any issue running across any number of nodes.
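
One way to verify whether your OpenMPI build includes the PBS TM components (a quick sketch; the exact component names can vary by OpenMPI version):

ompi_info | grep -i tm
# expect lines such as "MCA ras: tm ..." and "MCA plm: tm ..." if OpenMPI was configured with --with-tm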

Hi Adarsh,

Thank you for replying. I understand what you are saying. The issue here is not about MPI.

For your question: yes, I am able to run an MPI job on both nodes without PBS.

What I have been saying from the start is that jobs are not getting distributed among nodes. I never said they are not running on the nodes.

So, to make it clearer: if I remove node 1 (delete node pb1 via qmgr) so that only one node is available, i.e. pb2,

and I submit a job, it runs fine on pb2 too.

So the question is not whether PBS is able to push jobs to different nodes.

If you look at my question carefully:

headnode - pb0
Compute nodes - pb1 , pb2
This
#PBS -l nodes=2:ppn=1
Or
This
#PBS -l select=1:ncpus=4
#PBS -l place=scatter


It never goes to another node. Whichever node it finds first in pbsnodes, it submits to that one, despite pbsnodes -a showing all the added nodes as free, with proper CPU resources.


Both my nodes have 8 cores. Even if I increase the job resource request beyond 8 cores, e.g. ncpus=12,
the job will never run.

qsub is never able to submit to, or see, more than one node.


I hope you understand the issue I am facing right now.

It is not related to MPI.


Could you please submit this qsub request and let me know the qstat -answ1 output?

qsub -l select=2:ncpus=1 -l place=scatter -- /bin/sleep 1000

Create a script with the below contents:

pbs1.sh

#PBS  -l select=2:ncpus=1
#PBS -l place=scatter
echo $PBS_NODEFILE
cat $PBS_NODEFILE
env

qsub pbs1.sh
cat .o

and another script
pbs2.sh

#PBS  -l select=2:ncpus=2:mpiprocs=2
#PBS -l place=scatter
echo $PBS_NODEFILE
cat $PBS_NODEFILE
env

qsub pbs2.sh
cat .o

Job not getting distributed among nodes:

qsub -l select=2:ncpus=1
#you are requesting 2 chunks, each with a request of 1 core; this might run on one node if 2 cores are free, or it might run on two nodes

If you have two nodes
n1 = 4 cores
n2 = 4 cores
with the above request, the job will run on n1 (if none of the cores of n1 and n2 are used up, and n1, n2 is the order of the pbsnodes -av output)

qsub -l select=2:ncpus=1 -l place=scatter
#here we are requesting the same, but making sure the chunks are scattered; that means the 2 chunks should not run on one node

If you have two nodes
n1 = 4 cores
n2 = 4 cores
with the above request, the job will use 1 core from n1 and 1 core from n2.
If n1 has all its cores used up by some other job(s), then this request will remain in the queue until resources are available.
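
To confirm where the chunks actually landed, one way (a sketch, assuming a job id like 28.pb0 from this setup) is:

qstat -f 28.pb0 | grep -E "exec_host|exec_vnode"   # placement chosen by the scheduler
cat /var/spool/pbs/aux/28.pb0                      # the job's $PBS_NODEFILE, present on the first execution host while the job runs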

Dear Adarsh,

I have shared the output of both scripts as requested by you. Please check:

Dear Adarsh,

The issue is quite strange. If you can spare half an hour, we can have a Google Meet session where I can share my screen with you.

As these are virtual machines, I can reset them to a basic state, reinstall PBS, submit a job, and test, which should not take more than 15 minutes.

Only if you are free, please let me know the time and I will share the meeting link.

Thank you Vinay.

The jobs you ran look good to me; there is nothing wrong in the scheduler configuration or job submission.

/var/spool/pbs/aux/28.pb0
pb2
pb1

/var/spool/pbs/aux/29.pb0
pb2
pb2
pb1
pb1

#PBS -N mpiruns
#PBS -o mpiruns.o.txt
#PBS -e mpiruns.e.txt
#PBS -q all
#PBS -l select=2:ncpus=1:mem=1024mb
#PBS -l place=scatter
#PBS -V
#PBS -m abe
#PBS -M myemai@abc.com

cd  $PBS_O_WORKDIR
total_cores=`cat $PBS_NODEFILE | wc -l `
/absolute/path/to/mpirun   -np  2 -hosts  $PBS_NODEFILE  /bin/hostname

Hi Adarsh,

Thank you for replying. That is exactly the issue: in tracejob I also see that it has been distributed.
But actually, the job gets executed only on the 1st node.

Even if my job is just running hostname, or an MPI job, or any Python job. We have tried various things.

But the job runs only on one node.

Even when we write place=scatter

PBS Pro will allocate the resources (compute nodes: CPU, memory, etc.) for a job and dynamically create the host file ($PBS_NODEFILE). It is then up to the underlying application to use this host information to run across the multiple cores of a single machine or the cores of multiple nodes.

  • If the MPI is tightly integrated with PBS Pro, i.e., compiled from source using the PBS Pro TM libraries, then there will be proper accounting and management of the processes spawned by MPI.
  • Otherwise, if a plain MPI binary is used directly, then there might be some zombie processes left that need cleaning after the job has run. If the application has an inbuilt mechanism for killing the zombies, then well and good.

Please share with us the batch command line that is used to run across 2 machines without using PBS. If this is not working, then it would not work using PBS Pro either.
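
For reference, a standalone command of that shape (just a sketch, assuming OpenMPI and a hand-written host file; mpi_hello stands in for your compiled MPI binary) might look like:

cat > hosts.txt <<EOF
pb1
pb2
EOF
mpirun -np 2 --hostfile hosts.txt ./mpi_hello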

Scatter is a way of splitting the chunks of resources so that they are taken from separate compute nodes in the PBS complex.

For example, if you would like to run a 100-core job, but use 10 cores each from 10 compute nodes:
#PBS -l select=10:ncpus=10
#PBS -l place=scatter

Otherwise, if you have 1 compute node with 100 cores, then the below directive without place=scatter will run on one system:
#PBS -l select=10:ncpus=10

Hi Adarsh,

Let's leave MPI aside. But I don't agree that a system-installed OpenMPI would change the way PBS places jobs.

Otherwise, there are a lot of users who compile a lot of things and use their own versions; PBS would behave differently for everyone.

Let's concentrate on this issue rather than going in circles.

The documentation, as well as you, says place=scatter should distribute the chunks.

To make it clearer, I will not use MPI, to keep things simple:

I am submitting a stress command, which will create 4 stress workers for 200 seconds; you can check it in top/htop while it is running.

stress --cpu 4 --timeout 200

Below is my PBS script:

#!/bin/bash
#PBS -N ZIS_scon
#PBS -q all
#PBS -l select=2:ncpus=4
#PBS -l place=scatter
#PBS -V
#PBS -o mpiruns.o.txt
#Give the error file name
#PBS -e mpiruns.e.txt

cd $PBS_O_WORKDIR

cat $PBS_NODEFILE > pbs_nodes

echo Working directory is $PBS_O_WORKDIR

NPROCS=`wc -l < $PBS_NODEFILE`
NNODES=`uniq $PBS_NODEFILE | wc -l`

# Display the job context
echo Running on host `hostname`
echo Time is `date`
echo Directory is `pwd`
echo Using ${NPROCS} processors across ${NNODES} nodes

stress --cpu 4 --timeout 200


After running this, going by what you are saying, since we are using scatter,
node1 should show stress processes running, as well as node2.


But it doesn't.

Am I misunderstanding the concept of PBS, or am I just not able to make you understand? We both are just going around the same topic.

Job submission -

node 1 -

node2 -

As you can see, there is nothing running on node 2. Even a simple hostname command, which prints the hostname to the output file, clearly shows it runs only on one node.

Tracejob output -

It clearly shows the job being assigned to both nodes. So why don't I see it running on node 2?

I hope you understand it this time.

PBS Pro is a scheduler (workload manager); it will not parallelize the application or batch command line to run on multiple nodes.

stress --cpu 4 --timeout 200 is a serial batch application, not a distributed parallel application.

Hence PBS Pro will not run
stress --cpu 4 --timeout 200 on compute node 1
stress --cpu 4 --timeout 200 on compute node 2

Instead, it will assign compute node 1 and compute node 2 to the job with the requested resources (via the qsub).

PBS Pro schedules and assigns the requested resources to the job (job-wide). If the application is distributed-parallel, it is given information in the form of $PBS_NODEFILE (with or without mpiprocs), which the underlying application must use to run across the assigned nodes.

The stress application itself would have to read the hosts assigned to the job by PBS Pro and divide and distribute its work across the hosts dynamically provided in $PBS_NODEFILE.

FYI: Batch environment: PBS - User Documentation - ECMWF Confluence Wiki

Assuming that "stress" is an MPI-enabled application (calling MPI_Init, etc.) and compiled with mpicc so that it is linked properly to the MPI libraries, I think the piece you're missing is "mpirun". For example…

mpirun -np 4 --hostfile $PBS_NODEFILE stress --cpu 4 --timeout 200

mpirun man page is here: mpirun(1) man page (version 4.1.3)


Please try this script (kind of induced parallelism)

#cat stress.sh

#PBS -N stress
#PBS -l select=2:ncpus=4:mpiprocs=4
cd $PBS_O_WORKDIR
total_cores=`cat $PBS_NODEFILE | wc -l `
echo "total_cores=$total_cores"
total_hosts=`cat $PBS_NODEFILE | uniq | wc -l`
echo "total_hosts=$total_hosts"
cores_per_host=$((total_cores / total_hosts))
echo "cores_per_host=$cores_per_host"
echo "running stress"
echo "/opt/pbs/bin/pbsdsh -- stress --cpu $cores_per_host  --timeout 100s"
/opt/pbs/bin/pbsdsh -- stress --cpu $cores_per_host  --timeout 100s
echo "ending stress"

#qsub stress.sh

Dear Adarsh,

It still runs on one node.

Below is stress.sh:

#PBS -q all
#PBS -N stress
#PBS -l select=2:ncpus=4:mpiprocs=4
#PBS -o stress.o.txt
##PBS -e stress.e.txt

cd $PBS_O_WORKDIR

total_cores=`cat $PBS_NODEFILE | wc -l`
echo "total_cores=$total_cores"
total_hosts=`cat $PBS_NODEFILE | uniq | wc -l`
echo "total_hosts=$total_hosts"
cores_per_host=$((total_cores / total_hosts))
echo "cores_per_host=$cores_per_host"
echo "running stress"
echo "/opt/pbs/bin/pbsdsh -- stress --cpu $cores_per_host --timeout 100s"
/opt/pbs/bin/pbsdsh -- stress --cpu $cores_per_host --timeout 100s
echo "ending stress"


TRACEJOB -

HTOP -

Node 1 -

Node 2 -


As you can see, it is running on a single node, but the tracejob does show that it is splitting.

I have tested the same script and it runs on two systems in parallel.
I am not sure whether you have StrictHostkeyCheck turned on on the second node; it might be causing issues. Please disable it in the sched_config on all the systems in the PBS complex.
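
If this ends up being the SSH client option StrictHostKeyChecking (an assumption), one per-user way to relax it on each node would be a sketch like:

# ~/.ssh/config for the job user, on every node in the complex (hostnames are the ones used in this thread)
Host pb0 pb1 pb2
    StrictHostKeyChecking no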

Hi Adarsh,

I can't find the option "StrictHostkeyCheck" in the /var/spool/pbs/sched_priv/sched_config file on the head node.

My configuration is default, with no changes. I have only edited /etc/pbs.conf on the head node and the compute nodes,

and /var/spool/pbs/mom_priv/config, in which I have added the head node hostname.

The pbs-server RPM is installed on the head node, and the pbs-execution RPM is installed on the compute nodes.

sched_config file for your reference: sched_config (ufile.io)