Mpirun not working

Dear Team,

I need your help with two issues.
Point 1

mpirun (Intel oneAPI) is not working with pbs-server-2021.1.3, but mpiexec.hydra works fine.

Please find the error below.

[mpiexec@node3] check_exit_codes (…/…/…/…/…/src/pm/i_hydra/libhydra/demux/hydra_demux_poll.c:117): unable to run bstrap_proxy on node4.head.cm.ibdc.res.in (pid 379277, exit code 256)
[mpiexec@node3] poll_for_event (…/…/…/…/…/src/pm/i_hydra/libhydra/demux/hydra_demux_poll.c:159): check exit codes error
[mpiexec@node3] HYD_dmx_poll_wait_for_proxy_event (…/…/…/…/…/src/pm/i_hydra/libhydra/demux/hydra_demux_poll.c:212): poll for event error
[mpiexec@node3] HYD_bstrap_setup (…/…/…/…/…/src/pm/i_hydra/libhydra/bstrap/src/intel/i_hydra_bstrap.c:1061): error waiting for event
[mpiexec@node3] HYD_print_bstrap_setup_error_message (…/…/…/…/…/src/pm/i_hydra/mpiexec/intel/i_mpiexec.c:1027): error setting up the bootstrap proxies
[mpiexec@node3] Possible reasons:
[mpiexec@node3] 1. Host is unavailable. Please check that all hosts are available.
[mpiexec@node3] 2. Cannot launch hydra_bstrap_proxy or it crashed on one of the hosts. Make sure hydra_bstrap_proxy is available on all hosts and it has right permissions.
[mpiexec@node3] 3. Firewall refused connection. Check that enough ports are allowed in the firewall and specify them with the I_MPI_PORT_RANGE

Point 2
After job submission, the output file is not generated while the job is running; it only appears after the job completes.

Regards
Narayan Pradhan
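
On Point 2: by default PBS spools a job's stdout and stderr on the execution host and copies them to the -o/-e files only when the job finishes, so Test.out appearing after completion is expected. A common workaround, sketched below, is to redirect the program's output inside the script to a file in the working directory. The log file name and the run_app stand-in are hypothetical; in the real script the mpirun line would take its place.

```shell
#!/bin/bash
# Fall back to the current directory when run outside PBS (assumption for this sketch).
cd "${PBS_O_WORKDIR:-.}" || exit 1

# Hypothetical log name; PBS_JOBID keeps the logs of different jobs apart.
LOG="run.${PBS_JOBID:-manual}.log"

# Stand-in for the real mpirun command line.
run_app() {
    echo "step 1"
    echo "warning" >&2
    echo "step 2"
}

# Send stdout and stderr to the log while the job runs,
# instead of waiting for PBS to deliver Test.out at job end.
run_app > "$LOG" 2>&1
```

While the job runs, the log can be followed with tail -f, assuming the working directory is on a filesystem shared with the node you are logged in to.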

Please share the PBS script. Please check whether the script sets all the environment variables required for Intel MPI. Please also let us know how you tested with mpiexec.hydra.

Hello Adarsh,

Please find the script below.

#!/bin/bash
#PBS -S /bin/bash
#PBS -N Test
#PBS -o Test.out
#PBS -e Test.err
#PBS -l select=5:ncpus=64:mpiprocs=64
#PBS -l walltime=24:00:00
#PBS -q workq
#PBS -joe
#PBS -V

#export I_MPI_FABRICS=shm:ofi
#export I_MPI_OFI_LIBRARY_INTERNAL=0
#export FI_MR_CACHE_MONITOR=memhooks
#export I_MPI_MALLOC=0

module load oneapi/2022

cd $PBS_O_WORKDIR
cat $PBS_NODEFILE > pbs_nodes

echo Working directory is $PBS_O_WORKDIR
NPROCS=`wc -l < $PBS_NODEFILE`
NNODES=`uniq $PBS_NODEFILE | wc -l`

mpirun -genv I_MPI_DEBUG=6 -np $NPROCS -f $PBS_NODEFILE a.out

#####################################

I have tested with mpiexec.hydra, and it works fine.

Thank you @narayan
Please try the script below and let us know the result (update the absolute paths to mpirun and a.out).
Please also share the .o and .e files.

#!/bin/bash
#PBS -S /bin/bash
#PBS -N Test
#PBS -o Test.out
#PBS -e Test.err
#PBS -l select=5:ncpus=64:mpiprocs=64
#PBS -l place=scatter
#PBS -l walltime=24:00:00
#PBS -q workq
#PBS -joe
#PBS -V

#export I_MPI_FABRICS=shm:ofi
#export I_MPI_OFI_LIBRARY_INTERNAL=0
#export FI_MR_CACHE_MONITOR=memhooks
#export I_MPI_MALLOC=0

module load oneapi/2022

source /etc/pbs.conf
export PATH=$PBS_EXEC/bin:$PATH
I_MPI_HYDRA_BOOTSTRAP=rsh
I_MPI_HYDRA_BOOTSTRAP_EXEC=pbs_tmrsh

cd $PBS_O_WORKDIR
cat $PBS_NODEFILE > pbs_nodes

echo "Working directory is $PBS_O_WORKDIR"
NPROCS=wc -l < $PBS_NODEFILE
NNODES=uniq $PBS_NODEFILE | wc -l

/absolute/path/to/mpirun -genv I_MPI_DEBUG=6 -np $NPROCS -f $PBS_NODEFILE /absolute/path/to/a.out

Hi Adarsh,

Thank you.

I ran the script as suggested, but I am getting errors.

[ test-na]$ cat Test.out
/var/spool/PBS/mom_priv/jobs/281.brahm-login.SC: line 29: -l: command not found
/var/spool/PBS/mom_priv/jobs/281.brahm-login.SC: line 30: /var/spool/PBS/aux/281.brahm-login: Permission denied
0
[mpiexec@node3] i_np_fn (…/…/…/…/…/src/pm/i_hydra/mpiexec/intel/i_mpiexec_params.h:942): process count should be > 0
[mpiexec@node3] match_arg (…/…/…/…/…/src/pm/i_hydra/libhydra/arg/hydra_arg.c:83): match handler returned error
[mpiexec@node3] HYD_arg_parse_array (…/…/…/…/…/src/pm/i_hydra/libhydra/arg/hydra_arg.c:128): argument matching returned error
[mpiexec@node3] mpiexec_get_parameters (…/…/…/…/…/src/pm/i_hydra/mpiexec/mpiexec_params.c:1359): error parsing input array
[mpiexec@node3] main (…/…/…/…/…/src/pm/i_hydra/mpiexec/mpiexec.c:1783): error parsing parameters.


Please find the modified script below.

#!/bin/bash
#PBS -S /bin/bash
#PBS -N Test
#PBS -o Test.out
#PBS -e Test.err
#PBS -l select=5:ncpus=64:mpiprocs=64
#PBS -l place=scatter
#PBS -l walltime=24:00:00
#PBS -q workq
#PBS -joe
#PBS -V

#export I_MPI_FABRICS=shm:ofi
#export I_MPI_OFI_LIBRARY_INTERNAL=0
#export FI_MR_CACHE_MONITOR=memhooks
#export I_MPI_MALLOC=0

module load oneapi/2022

source /etc/pbs.conf
export PATH=$PBS_EXEC/bin:$PATH
I_MPI_HYDRA_BOOTSTRAP=rsh
I_MPI_HYDRA_BOOTSTRAP_EXEC=pbs_tmrsh

cd $PBS_O_WORKDIR
cat $PBS_NODEFILE > pbs_nodes

echo "Working directory is $PBS_O_WORKDIR"
NPROCS=wc -l < $PBS_NODEFILE
NNODES=uniq $PBS_NODEFILE | wc -l

/opt/apps/oneapi/mpi/2021.6.0/bin/mpirun -genv I_MPI_DEBUG=6 -np $NPROCS -f $PBS_NODEFILE /ibdc-hpc/locuztest/test-na/a.out

The lines below should have been written as follows; the command substitution (backticks or $(...)) was missing:

NPROCS=$(wc -l < $PBS_NODEFILE)
NNODES=$(uniq $PBS_NODEFILE | wc -l)
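
This also explains the messages in Test.out. Without command substitution, bash treats NPROCS=wc as a temporary environment assignment and tries to run -l as a command ("-l: command not found"). On the next line it tries to execute the node file itself ("Permission denied"), and wc -l prints 0 for the empty pipe. NPROCS therefore stays empty, which is presumably why mpirun then reports "process count should be > 0". A minimal sketch, using a temporary file as a stand-in for $PBS_NODEFILE so that it runs outside a job:

```shell
#!/bin/bash
# Stand-in for $PBS_NODEFILE (assumption: one line per MPI rank, as PBS writes it).
NODEFILE=$(mktemp)
printf 'node3\nnode3\nnode4\nnode4\n' > "$NODEFILE"

# Broken form, kept as a comment: bash would set NPROCS=wc only for a
# command named "-l", which does not exist.
#   NPROCS=wc -l < "$NODEFILE"

# Correct form: capture the command's output with $(...).
NPROCS=$(wc -l < "$NODEFILE")            # total number of ranks
NNODES=$(sort -u "$NODEFILE" | wc -l)    # distinct hosts; sort -u also handles non-adjacent duplicates

echo "ranks=$NPROCS nodes=$NNODES"
rm -f "$NODEFILE"
```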

Hi Adarsh,

Even after adding the command substitution, the job still fails. Please find the errors below.

[mpiexec@node3] check_exit_codes (…/…/…/…/…/src/pm/i_hydra/libhydra/demux/hydra_demux_poll.c:117): unable to run bstrap_proxy on node4.head.cm.ibdc.res.in (pid 414640, exit code 256)
[mpiexec@node3] poll_for_event (…/…/…/…/…/src/pm/i_hydra/libhydra/demux/hydra_demux_poll.c:159): check exit codes error
[mpiexec@node3] HYD_dmx_poll_wait_for_proxy_event (…/…/…/…/…/src/pm/i_hydra/libhydra/demux/hydra_demux_poll.c:212): poll for event error
[mpiexec@node3] HYD_bstrap_setup (…/…/…/…/…/src/pm/i_hydra/libhydra/bstrap/src/intel/i_hydra_bstrap.c:1061): error waiting for event
[mpiexec@node3] HYD_print_bstrap_setup_error_message (…/…/…/…/…/src/pm/i_hydra/mpiexec/intel/i_mpiexec.c:1027): error setting up the bootstrap proxies
[mpiexec@node3] Possible reasons:
[mpiexec@node3] 1. Host is unavailable. Please check that all hosts are available.
[mpiexec@node3] 2. Cannot launch hydra_bstrap_proxy or it crashed on one of the hosts. Make sure hydra_bstrap_proxy is available on all hosts and it has right permissions.
[mpiexec@node3] 3. Firewall refused connection. Check that enough ports are allowed in the firewall and specify them with the I_MPI_PORT_RANGE variable.
[mpiexec@node3] 4. pbs bootstrap cannot launch processes on remote host. You may try using -bootstrap option to select alternative launcher.
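
Reasons 1, 2 and 4 in the list above can be narrowed down by trying to run a trivial command on every host of the job with the launcher in question. The helper below is a hypothetical sketch, not part of PBS or Intel MPI: check_hosts is an invented name, and the launcher command is passed in as an argument.

```shell
#!/bin/bash
# check_hosts NODEFILE LAUNCHER [ARGS...]
# Runs "true" on every distinct host listed in NODEFILE through LAUNCHER
# and reports which hosts fail. The return code is the number of failures.
check_hosts() {
    local nodefile=$1
    shift
    local failed=0 host
    while read -r host; do
        # </dev/null stops launchers such as ssh from swallowing the host list.
        if "$@" "$host" true < /dev/null; then
            echo "ok      $host"
        else
            echo "FAILED  $host"
            failed=$((failed + 1))
        fi
    done < <(sort -u "$nodefile")
    return "$failed"
}

# Possible calls inside a PBS job (assumed usage, not taken from the thread):
#   check_hosts "$PBS_NODEFILE" pbs_tmrsh
#   check_hosts "$PBS_NODEFILE" ssh -o BatchMode=yes
```

If a host fails with one launcher but not with another, that points at the bootstrap mechanism rather than at the host itself.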


Modified Script

#!/bin/bash
#PBS -S /bin/bash
#PBS -N Test
#PBS -o Test.out
#PBS -e Test.err
#PBS -l select=5:ncpus=64:mpiprocs=64
#PBS -l place=scatter
#PBS -l walltime=24:00:00
#PBS -q workq
#PBS -joe
#PBS -V

#export I_MPI_FABRICS=shm:ofi
#export I_MPI_OFI_LIBRARY_INTERNAL=0
#export FI_MR_CACHE_MONITOR=memhooks
#export I_MPI_MALLOC=0

module load oneapi/2022

source /etc/pbs.conf
export PATH=$PBS_EXEC/bin:$PATH
I_MPI_HYDRA_BOOTSTRAP=rsh
I_MPI_HYDRA_BOOTSTRAP_EXEC=pbs_tmrsh

cd $PBS_O_WORKDIR
cat $PBS_NODEFILE > pbs_nodes

echo "Working directory is $PBS_O_WORKDIR"
NPROCS=$(wc -l < $PBS_NODEFILE)
NNODES=$(uniq $PBS_NODEFILE | wc -l)

/opt/apps/oneapi/mpi/2021.6.0/bin/mpirun -genv I_MPI_DEBUG=6 -np $NPROCS -f $PBS_NODEFILE /ibdc-hpc/locuztest/test-na/a.out
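
One detail in the script above may be worth checking, although the thread does not confirm it was the cause: I_MPI_HYDRA_BOOTSTRAP and I_MPI_HYDRA_BOOTSTRAP_EXEC are assigned without export, so they are plain shell variables and child processes such as mpirun do not inherit them. A small sketch of the difference, using bash -c as a stand-in for mpirun:

```shell
#!/bin/bash
# Start from a clean state so the sketch does not depend on the caller's environment.
unset I_MPI_HYDRA_BOOTSTRAP I_MPI_HYDRA_BOOTSTRAP_EXEC

# Plain assignment: visible in this shell only, not in child processes.
I_MPI_HYDRA_BOOTSTRAP=rsh
child_sees=$(bash -c 'echo "${I_MPI_HYDRA_BOOTSTRAP:-unset}"')
echo "without export: $child_sees"

# Exported: inherited by child processes such as mpirun.
export I_MPI_HYDRA_BOOTSTRAP=rsh
export I_MPI_HYDRA_BOOTSTRAP_EXEC=pbs_tmrsh
child_sees_exported=$(bash -c 'echo "${I_MPI_HYDRA_BOOTSTRAP:-unset}"')
echo "with export: $child_sees_exported"
```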

Please check this link: Solved: Setting up the Intel® oneAPI MPI Library on a Linux cluster - Intel Communities

Dear Adarsh,

Thank you for your help.
The shared link helped me find the solution.
