Thanks a lot, @adarsh and @mkaro, for the detailed info.
I tried both of your approaches, and each one built Open MPI with TM support successfully.
[test@pbspro-server bin]$ ./ompi_info | grep ras
MCA ras: loadleveler (MCA v2.0.0, API v2.0.0, Component v1.10.7)
MCA ras: simulator (MCA v2.0.0, API v2.0.0, Component v1.10.7)
MCA ras: slurm (MCA v2.0.0, API v2.0.0, Component v1.10.7)
MCA ras: tm (MCA v2.0.0, API v2.0.0, Component v1.10.7)
The problem is that the PBS jobs I submitted to test the PBS and Open MPI integration hang.
The steps are as follows:
- the job scripts:
[test@pbspro-server ~]$ cat test.sh
#!/bin/bash
#PBS -N pbs-openmpi-sh
#PBS -l select=2
#PBS -l place=scatter
cat $PBS_NODEFILE
hostnumber=$(cat $PBS_NODEFILE | wc -l)
/opt/openmpi/1.10.7/bin/mpirun hostname
[test@pbspro-server ~]$ cat test3.sh
#PBS -l select=2
#PBS -j oe
/opt/openmpi/1.10.7/bin/mpirun ~/hello_mpi
- the job submission commands:
[test@pbspro-server ~]$ qsub test.sh
8.pbspro-server
[test@pbspro-server ~]$ qsub test3.sh
9.pbspro-server
- the job status:
[test@pbspro-server ~]$ qstat -a
pbspro-server:
                                                         Req'd  Req'd   Elap
Job ID          Username Queue Jobname    SessID NDS TSK Memory Time  S Time
8.pbspro-server opc      workq pbs-openmp 14047  2   2   --     --    R 00:14
9.pbspro-server opc      workq test3.sh   14202  2   2   --     --    R 00:06
And there are errors in the mom_logs:
01/15/2019 13:11:28;0008;pbs_mom;Job;8.pbspro-server;JOIN_JOB as node 1
01/15/2019 13:17:56;0001;pbs_mom;Svr;pbs_mom;Connection timed out (110) in open_demux, open_demux: connect 10.0.16.3:34297
01/15/2019 13:17:56;0001;pbs_mom;Job;8.pbspro-server;task not started, Failure orted -2
01/15/2019 13:17:56;0008;pbs_mom;Job;8.pbspro-server;no active tasks
01/15/2019 13:19:35;0008;pbs_mom;Job;9.pbspro-server;JOIN_JOB as node 1
01/15/2019 13:26:03;0001;pbs_mom;Svr;pbs_mom;Connection timed out (110) in open_demux, open_demux: connect 10.0.16.3:57329
01/15/2019 13:26:03;0001;pbs_mom;Job;9.pbspro-server;task not started, Failure orted -2
01/15/2019 13:26:03;0008;pbs_mom;Job;8.pbspro-server;no active tasks
01/15/2019 13:26:03;0008;pbs_mom;Job;9.pbspro-server;no active tasks
10.0.16.3 is one of the PBS worker nodes. I have configured passwordless access from the server to the worker nodes, between the worker nodes, and from the worker nodes to the server.
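In case it helps anyone reproduce this, here is a minimal sketch of how the endpoints that open_demux failed to reach can be pulled out of the mom log. It assumes the log line format shown above; the function name demux_failures is just something I made up for illustration, and the log path in the usage comment is an assumption about a default install, so adjust it to your PBS_HOME.

```shell
#!/bin/bash
# Print the unique host:port endpoints from "open_demux: connect" lines
# in pbs_mom log text read on stdin.
# Assumes the log format shown in this post; demux_failures is a
# hypothetical helper name, not part of PBS.
demux_failures() {
    grep 'open_demux: connect' \
        | sed -E 's/.*open_demux: connect ([0-9.]+:[0-9]+).*/\1/' \
        | sort -u
}

# Usage (the log path is an assumption, adjust to your PBS_HOME):
#   demux_failures < /var/spool/pbs/mom_logs/20190115
```

On the log excerpt above this lists the two endpoints on 10.0.16.3, one per job. The ports differ between the two jobs, so they appear to be chosen per job rather than fixed.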
Could you please help me check why these two jobs hang? Thanks a lot!