Thank you for looking into this issue. I have a small cluster with a headnode and two compute nodes, all running CentOS 7. I have been asked to install and configure Open MPI with PBS Torque. PBS Torque works fine; we are able to submit jobs to the compute nodes.
The Open MPI version is 4.0.5, installed from source.
- Operating system/version: CentOS 7
- Computer hardware: Headnode Intel - Compute nodes AMD
- Network type: TCP - NFS
Details of the problem
I configured Open MPI with the following options: "--prefix=/opt/openmpi-4.0.5 --with-tm --enable-orterun-prefix-by-default". After the installation I ran "ompi_info | grep tm" on all the servers to make sure that the "tm" components were built.
$ ompi_info | grep tm
Configure command line: '--prefix=/opt/openmpi-4.0.5' '--with-tm' '--enable-orterun-prefix-by-default'
MCA ess: tm (MCA v2.1.0, API v3.0.0, Component v4.0.5)
MCA plm: tm (MCA v2.1.0, API v2.0.0, Component v4.0.5)
MCA ras: tm (MCA v2.1.0, API v2.0.0, Component v4.0.5)
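For reference, the build on each server followed the usual configure/make flow; roughly this (reconstructed from my notes, not a verbatim shell session):
$ ./configure --prefix=/opt/openmpi-4.0.5 --with-tm --enable-orterun-prefix-by-default
$ make -j 8            # parallel build; the job count was arbitrary
$ sudo make install    # installs into /opt/openmpi-4.0.5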
I can run simple mpirun tests like the ones in the Open MPI examples directory.
$ mpirun --hostfile /opt/hostfile -np 64 --map-by node hostname
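The hostfile just lists the two compute nodes with 32 slots each; from memory it looks approximately like this (slot counts may not be verbatim):
# /opt/hostfile
tq3 slots=32
tq4 slots=32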
I even ran a "hello_c" program with an infinite loop added to the code so I could run "htop" on all the servers and watch the allocated CPUs.
I also ran the ring_c program, which sends a simple MPI message around a ring 10 times.
$ mpirun --hostfile /opt/hostfile -np 64 /opt/openmpi-4.0.5/examples/ring
Warning: Permanently added 'tq3,10.112.0.14' (ECDSA) to the list of known hosts.
Warning: Permanently added 'tq4,10.112.0.16' (ECDSA) to the list of known hosts.
Process 0 sending 10 to 1, tag 201 (64 processes in ring)
Process 0 sent to 1
Process 0 decremented value: 9
Process 0 decremented value: 8
Process 0 decremented value: 7
Process 0 decremented value: 6
Process 0 decremented value: 5
Process 0 decremented value: 4
Process 0 decremented value: 3
Process 0 decremented value: 2
Process 0 decremented value: 1
Process 0 decremented value: 0
Process 0 exiting
Process 1 exiting
etc..............
However, the issue I am having is that when I submit an mpirun job to PBS Torque, only 1 core per server is utilized, even though I specifically asked for all the cores in the cluster (64 cores total, 32 per compute node).
Here is my shell script:
#!/bin/bash
#PBS -k o
#PBS -l nodes=2:ppn=32,walltime=12:00:00
#PBS -W x=nmatchpolicy:exactnode
#PBS -N mpi_test_tq3_4
mpirun /opt/openmpi-4.0.5/examples/ring
This is how I submit the job:
$ qsub -l nodes=2:ppn=32 -W x=nmatchpolicy:exactnode mpi_test.sh
49.mt-manager
The output file shows that only 2 processes ran (1 per node).
$ cat mpi_test_tq3_4.o49
Process 0 sending 10 to 1, tag 201 (2 processes in ring)
Process 0 sent to 1
Process 0 decremented value: 9
Process 0 decremented value: 8
Process 0 decremented value: 7
Process 0 decremented value: 6
Process 0 decremented value: 5
Process 0 decremented value: 4
Process 0 decremented value: 3
Process 0 decremented value: 2
Process 0 decremented value: 1
Process 0 decremented value: 0
Process 0 exiting
Process 1 exiting
We can see that only 1 process per node was launched: "Process 0 sending 10 to 1, tag 201 (2 processes in ring)".
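As a next debugging step I was going to dump what Torque actually hands to the job before calling mpirun, along these lines (a sketch only; I have not captured this output yet):
#!/bin/bash
#PBS -l nodes=2:ppn=32,walltime=00:05:00
cat $PBS_NODEFILE        # should list each compute node 32 times
wc -l < $PBS_NODEFILE    # should print 64 if the full allocation is there
mpirun /opt/openmpi-4.0.5/examples/ring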
Did I do something wrong when I configured Open MPI with "--with-tm"?
Should I have included the path to the PBS Torque install directory (--with-tm=/path/to/torque/directory)?
Am I submitting the job with the proper #PBS and qsub options?
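If it helps, I can also re-run the job with more verbose launcher output and attach it, something like the following (MCA verbosity flags as I understand them; I have not run this on the cluster yet):
mpirun --mca plm_base_verbose 10 --mca ras_base_verbose 10 /opt/openmpi-4.0.5/examples/ring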
Thank you in advance for your help in this issue.
Eric