Configure OpenMPI with PBS Torque

Thank you for looking into this issue. I have a small cluster with a headnode and two compute nodes. All servers are running CentOS 7. I have been asked to install and configure Open MPI with PBS Torque. PBS Torque itself works fine; we are able to submit jobs to the compute nodes.

The Open MPI version is 4.0.5, and I installed it from source.

  • Operating system/version: CentOS 7
  • Computer hardware: Headnode Intel - Compute nodes AMD
  • Network type: TCP - NFS

Details of the problem

I installed and configured Open MPI with the following options: "--prefix=/opt/openmpi-4.0.5 --with-tm --enable-orterun-prefix-by-default". After the installation I ran "ompi_info | grep tm" on all the servers to make sure that "tm" support was built in.

$ ompi_info |  grep tm
  Configure command line: '--prefix=/opt/openmpi-4.0.5' '--with-tm' '--enable-orterun-prefix-by-default'
                 MCA ess: tm (MCA v2.1.0, API v3.0.0, Component v4.0.5)
                 MCA plm: tm (MCA v2.1.0, API v2.0.0, Component v4.0.5)
                 MCA ras: tm (MCA v2.1.0, API v2.0.0, Component v4.0.5)
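
For completeness, the build itself was a standard source build, roughly along these lines (the exact commands here are a reconstruction from the configure options above):

$ tar xzf openmpi-4.0.5.tar.gz
$ cd openmpi-4.0.5
# configure Open MPI 4.0.5 with Torque (tm) support and install under /opt
$ ./configure --prefix=/opt/openmpi-4.0.5 --with-tm --enable-orterun-prefix-by-default
$ make all install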

I can run simple "mpirun" tests like the ones in the Open MPI examples directory.

$  mpirun --hostfile /opt/hostfile -np 64 --map-by node hostname
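
In case it matters, /opt/hostfile is just a plain list of the two compute nodes with their slot counts, something like this (node names as they appear in the output further down):

# one line per compute node, 32 slots each
tq3 slots=32
tq4 slots=32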

I even ran a "hello_c" program with an infinite loop in the code so I could run "htop" on all the servers and see the allocated CPUs.

I also ran the ring_c program which sends a simple MPI message around a ring 10 times.

$ mpirun  --hostfile /opt/hostfile -np 64 /opt/openmpi-4.0.5/examples/ring
Warning: Permanently added 'tq3,10.112.0.14' (ECDSA) to the list of known hosts.
Warning: Permanently added 'tq4,10.112.0.16' (ECDSA) to the list of known hosts.
Process 0 sending 10 to 1, tag 201 (64 processes in ring)
Process 0 sent to 1
Process 0 decremented value: 9
Process 0 decremented value: 8
Process 0 decremented value: 7
Process 0 decremented value: 6
Process 0 decremented value: 5
Process 0 decremented value: 4
Process 0 decremented value: 3
Process 0 decremented value: 2
Process 0 decremented value: 1
Process 0 decremented value: 0
Process 0 exiting
Process 1 exiting
etc..............

However, the issue I am having is that when I submit an "mpirun" job through PBS Torque, only 1 core per server is utilized, even though I specifically asked for all the cores in the cluster (64 cores total, 32 per compute node).

Here is my shell script:

#!/bin/bash
#PBS -k o
#PBS -l nodes=2:ppn=32,walltime=12:00:00
#PBS -W x=nmatchpolicy:exactnode
#PBS -N mpi_test_tq3_4

mpirun   /opt/openmpi-4.0.5/examples/ring
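
For what it's worth, a quick way to see what Torque actually hands to the job would be to print the node file from inside the script, e.g.:

echo "Node list from Torque:"
# each host should appear once per requested core (ppn)
sort $PBS_NODEFILE | uniq -c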

This is how I submit the job:

$ qsub -l nodes=2:ppn=32 -W x=nmatchpolicy:exactnode mpi_test.sh
49.mt-manager
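
While the job is still running, something like the following should show which hosts and cores Torque actually assigned (49 being the job id from the submission above):

$ qstat -f 49 | grep -i exec_host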

The output file shows that only 2 processes were launched:

$ cat  mpi_test_tq3_4.o49
Process 0 sending 10 to 1, tag 201 (2 processes in ring)
Process 0 sent to 1
Process 0 decremented value: 9
Process 0 decremented value: 8
Process 0 decremented value: 7
Process 0 decremented value: 6
Process 0 decremented value: 5
Process 0 decremented value: 4
Process 0 decremented value: 3
Process 0 decremented value: 2
Process 0 decremented value: 1
Process 0 decremented value: 0
Process 0 exiting
Process 1 exiting

We can see from "Process 0 sending 10 to 1, tag 201 (2 processes in ring)" that only 2 processes (1 per node) were started.
Did I do something wrong when I configured Open MPI with "--with-tm"?
Should I have included the path to the PBS Torque install directory (--with-tm=/path/to/torque/directory)? An example of what I mean is shown after these questions.
Am I submitting the job with the proper "#PBS" and "qsub" options?
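
In other words, should the configure line have looked more like this? (The Torque prefix below is only a guess at a typical install location.)

$ ./configure --prefix=/opt/openmpi-4.0.5 --with-tm=/usr/local --enable-orterun-prefix-by-default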

Thank you in advance for your help in this issue.

Eric

OpenPBS and Torque are different projects. Just in case you were not aware, this is the PBS OSS / OpenPBS forum.

For compiling Open MPI with OpenPBS, this forum discussion might be helpful.

Thank you.

Thank you for the history course, I appreciate it. Always nice to learn new things.
I was actually searching the internet for a forum where users had the same issue that I do.
I came across OpenPBS and thought that OpenPBS was "somewhat" similar to Torque and that members could point me in the right direction, which you did. I will look at the forum discussion you mentioned.

Thank you again for your help.
