Cannot run job on multiple nodes

Dear All,

I am running a Quantum ESPRESSO job and am unable to run the code on multiple nodes. A job submitted with 1 node (16 processors) executes without any issues, but the same job with 2 or more nodes gets terminated.

#!/bin/bash
#PBS -N testjob
#PBS -l nodes=2:ppn=16
#PBS -q qreg_1day_small
#PBS -l mem=6gb
#PBS -o outtest.log
#PBS -e Error.log

NPROCS=$(wc -l < $PBS_NODEFILE)                                  # total number of MPI ranks from the PBS node file
HOSTS=$(cat $PBS_NODEFILE | uniq | tr '\n' "," | sed 's|,$||')   # comma-separated list of unique hosts

module load codes/QuantEspre-7.0

TMPDIR=/localscratch/sscnesta/test
mkdir $TMPDIR

cd $TMPDIR

cat > band_calc.in <<EOF
&control
calculation = 'bands'
verbosity = 'high'
prefix = 'ws2_bilayer'
outdir = './tmp'
pseudo_dir = '/home/phd/19/sscnesta/My_PhD_Work/Pseudo'
restart_mode = 'restart'
! max_seconds = 7000
! etot_conv_thr = 1.0D-08
! forc_conv_thr = 1.0D-06
nstep = 150
/
&SYSTEM
ibrav = 0
A = 12.85600
nat = 12
ntyp = 2
ecutwfc = 55
ecutrho = 550
occupations = 'smearing'
smearing = 'mv'
! nbnd = 100
degauss = 0.001
lspinorb = .TRUE
noncolin = .TRUE
/
&electrons
mixing_mode = 'local-TF'
mixing_beta = 0.2
electron_maxstep = 400
! conv_thr = 1.0D-12
/
CELL_PARAMETERS {alat}
1.900000000000000 0.000000000000000 0.000000000000000
0.000000000000000 0.250910080896080 0.000000000000000
-0.173156793373163 0.000000000000000 0.409082710525738
ATOMIC_SPECIES
S 32.06750 S.rel-pbe-nl-rrkjus_psl.1.0.0.UPF
W 183.84000 W.rel-pbe-spn-rrkjus_psl.1.0.0.UPF
ATOMIC_POSITIONS {crystal}
W 0.391721052631579 0.500000000000000 0.795540000000000
W 0.134594736842105 0.500000000000000 0.204460000000000
W 0.128563157894737 0.000000000000000 0.795540000000000
W 0.397752631578947 0.000000000000000 0.204460000000000
S 0.452789473684211 0.000000000000000 0.679600000000000
S 0.073526315789474 -0.000000000000000 0.320400000000000
S 0.189631578947368 0.500000000000000 0.679600000000000
S 0.336684210526316 0.500000000000000 0.320400000000000
S 0.318421052631579 0.000000000000000 0.787200000000000
S 0.207894736842105 0.000000000000000 0.212800000000000
S 0.055263157894737 0.500000000000000 0.787200000000000
S 0.471052631578947 0.500000000000000 0.212800000000000
K_POINTS crystal_b
5
0.00000 0.00000 0.00000 50
0.85754 0.00000 -0.51435 50
0.85754 0.50000 -0.51435 50
0.00000 0.50000 -0.00000 50
0.00000 0.00000 0.00000 50
EOF

mpirun -machinefile $PBS_NODEFILE -np 32 pw.x < band_calc.in >> band_calc.out

cat > bands.in << EOF
&BANDS
prefix = 'ws2_bilayer'
outdir = './tmp'
filband = 'ws2_bilayer-band'
/
EOF
#mpirun -machinefile $PBS_NODEFILE -np 16 bands.x < bands.in >> bands.out

cp band $PBS_O_WORKDIR

#cd $PBS_O_WORKDIR
_____________________________________________________________________

The following error appears in the error file:


WARNING: There was an error initializing an OpenFabrics device.

Local host: node4
Local device: mlx5_0

[node4:120093] PMIX ERROR: UNREACHABLE in file server/pmix_server.c at line 2198
[node4:120093] PMIX ERROR: UNREACHABLE in file server/pmix_server.c at line 2198
[node4:120093] PMIX ERROR: UNREACHABLE in file server/pmix_server.c at line 2198
[node4:120093] PMIX ERROR: UNREACHABLE in file server/pmix_server.c at line 2198
[node4:120093] PMIX ERROR: UNREACHABLE in file server/pmix_server.c at line 2198
[node4:120093] PMIX ERROR: UNREACHABLE in file server/pmix_server.c at line 2198
[node4:120093] PMIX ERROR: UNREACHABLE in file server/pmix_server.c at line 2198
[node4:120093] PMIX ERROR: UNREACHABLE in file server/pmix_server.c at line 2198
[node4:120093] PMIX ERROR: UNREACHABLE in file server/pmix_server.c at line 2198
[node4:120093] PMIX ERROR: UNREACHABLE in file server/pmix_server.c at line 2198
[node4:120093] PMIX ERROR: UNREACHABLE in file server/pmix_server.c at line 2198
[node4:120093] PMIX ERROR: UNREACHABLE in file server/pmix_server.c at line 2198
[node4:120093] PMIX ERROR: UNREACHABLE in file server/pmix_server.c at line 2198
[node4:120093] PMIX ERROR: UNREACHABLE in file server/pmix_server.c at line 2198
[node4:120093] PMIX ERROR: UNREACHABLE in file server/pmix_server.c at line 2198
[node4:120093] PMIX ERROR: UNREACHABLE in file server/pmix_server.c at line 2198

mpirun was unable to find the specified executable file, and therefore
did not launch the job. This error was first reported for process
rank 16; it may have occurred for other processes as well.

NOTE: A common cause for this error is misspelling a mpirun command
line parameter option (remember that mpirun interprets the first
unrecognized command line token as the executable).

Node: node20
Executable: /apps/codes/qe-7.0/bin/pw.x

[node4:120093] 15 more processes have sent help message help-mpi-btl-openib.txt / error in device init
[node4:120093] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[node4:120093] PMIX ERROR: NO-PERMISSIONS in file dstore_base.c at line 237
[node4:120093] PMIX ERROR: NO-PERMISSIONS in file dstore_base.c at line 246

     -----------------------------------------------------------------------------------------------------------------

Apart from the error messages in the error file, the output file also prints "16 processes failed to start". Kindly let me know if there is anything specific that needs to be included in the job submission script to run QE on multiple nodes.

nodes is replaced with select
ppn is replaced with ncpus
mem is a host-level resource
Having said that, this is not what is causing the problem; the error mentioned above is related to your network fabric/software. You can still update the PBS directives as shown below.
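For example, a minimal sketch of the select-style equivalents of the directives in your script (same 2 x 16-core request; note that with select, mem is requested per chunk, i.e. per host, rather than once for the whole job, and place=scatter is optional, it just forces the two chunks onto different hosts):

#PBS -N testjob
#PBS -l select=2:ncpus=16:mem=6gb
#PBS -l place=scatter
#PBS -q qreg_1day_small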
Note: Please first try to run your multi-node job with a populated hosts.txt (sketched below), to make sure it runs fine without the scheduler, and then submit it as a job to the scheduler.
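Roughly along these lines (a sketch only; node4 and node20 are just taken from the logs above, substitute two compute nodes you can actually reach over ssh):

cat > hosts.txt <<EOF
node4 slots=16
node20 slots=16
EOF

# quick sanity check: should print the hostname of both nodes
mpirun -hostfile hosts.txt -np 32 hostname

# then the actual run, outside the scheduler
mpirun -hostfile hosts.txt -np 32 pw.x < band_calc.in > band_calc.out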

Hi Adarsh,
I got the same error.

[ssc@rha bands]$ cat Error.log
mkdir: cannot create directory ‘/localscratch/sscnesta/test’: File exists

WARNING: There is at least non-excluded one OpenFabrics device found,
but there are no active ports detected (or Open MPI was unable to use
them). This is most certainly not what you wanted. Check your
cables, subnet manager configuration, etc. The openib BTL will be
ignored for this job.

Local host: node23

[node23:174920] [[38906,0],0] ORTE_ERROR_LOG: Data unpack would read past end of buffer in file util/show_help.c at line 501

MPI_ABORT was invoked on rank 12 in communicator MPI_COMM_WORLD
with errorcode 1.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.

[node23:174920] [[38906,0],0] ORTE_ERROR_LOG: Data unpack would read past end of buffer in file util/show_help.c at line 501
[node23:174920] PMIX ERROR: UNREACHABLE in file server/pmix_server.c at line 2198
[node23:174920] PMIX ERROR: UNREACHABLE in file server/pmix_server.c at line 2198
[node23:174920] PMIX ERROR: UNREACHABLE in file server/pmix_server.c at line 2198
[node23:174920] PMIX ERROR: UNREACHABLE in file server/pmix_server.c at line 2198
[node23:174920] PMIX ERROR: UNREACHABLE in file server/pmix_server.c at line 2198
[node23:174920] PMIX ERROR: UNREACHABLE in file server/pmix_server.c at line 2198
[node23:174920] PMIX ERROR: UNREACHABLE in file server/pmix_server.c at line 2198
[node23:174920] PMIX ERROR: UNREACHABLE in file server/pmix_server.c at line 2198
[node23:174920] PMIX ERROR: UNREACHABLE in file server/pmix_server.c at line 2198
[node23:174920] PMIX ERROR: UNREACHABLE in file server/pmix_server.c at line 2198
[node23:174920] PMIX ERROR: UNREACHABLE in file server/pmix_server.c at line 2198
[node23:174920] PMIX ERROR: UNREACHABLE in file server/pmix_server.c at line 2198
[node23:174920] PMIX ERROR: UNREACHABLE in file server/pmix_server.c at line 2198
[node23:174920] PMIX ERROR: UNREACHABLE in file server/pmix_server.c at line 2198
[node23:174920] PMIX ERROR: NO-PERMISSIONS in file dstore_base.c at line 237
[node23:174920] 30 more processes have sent help message help-mpi-btl-openib.txt / no active ports found
[node23:174920] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[node23:174920] 30 more processes have sent help message help-mpi-api.txt / mpi-abort


And I got the following error in the output file:

[ssc@rha bands]$ cat band_calc.out

16 total processes failed to start
16 total processes failed to start
16 total processes failed to start

 Program PWSCF v.7.0 starts on 28Jun2022 at 14:57:37

 This program is part of the open-source Quantum ESPRESSO suite
 for quantum simulation of materials; please cite
     "P. Giannozzi et al., J. Phys.:Condens. Matter 21 395502 (2009);
     "P. Giannozzi et al., J. Phys.:Condens. Matter 29 465901 (2017);
     "P. Giannozzi et al., J. Chem. Phys. 152 154105 (2020);
      URL http://www.quantum-espresso.org",
 in publications or presentations arising from this work. More details at
 http://www.quantum-espresso.org/quote

 Parallel version (MPI), running on    32 processors

 MPI processes distributed on     1 nodes
 R & G space division:  proc/nbgrp/npool/nimage =      32
 136803 MiB available memory on the printing compute node when the environment starts

 Waiting for input...
 Reading input from standard input

 Current dimensions of program PWSCF are:
 Max number of different atomic species (ntypx) = 10
 Max number of k-points (npk) =  40000
 Max angular momentum in pseudopotentials (lmaxx) =  4
 Message from routine input:
 WARNING: "startingwfc" set to atomic+random may spoil restart
 Message from routine iosys:
 restart disabled: needed files not found
 Message from routine qexsd_readschema :
 xml data file ./tmp/ws2_bilayer.save/data-file-schema.xml not found

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
Error in routine setup (1):
problem reading ef from file ./tmp/ws2_bilayer.save/data-file-schema.xml
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

 stopping ...

Are you able to run it on a single node?

And what's the output of pbsnodes -aSj? Do all the nodes show up, and are they free?

Thank you @AJITH. Until this is fixed, you will keep hitting the same issue/error when running your jobs through OpenPBS. Could you please try making MPI use TCP by adding:
mpirun --mca btl tcp,self
Reference: FAQ: Tuning the run-time characteristics of MPI TCP communications
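Applied to the pw.x line in the script above, that would look roughly like this (a sketch; depending on your Open MPI version you may also want to keep a shared-memory BTL such as vader in the list so that ranks on the same node do not fall back to TCP loopback):

# force Open MPI onto TCP instead of the OpenFabrics/InfiniBand transport
mpirun --mca btl tcp,self -machinefile $PBS_NODEFILE -np 32 pw.x < band_calc.in >> band_calc.out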

+1. As @vinay mentioned, please check whether you can run on one compute node.

Yes, I am able to run a job on a single node with 16 processors without any error. If I go with two nodes, I get an error.