Dear All,
I am running a Quantum ESPRESSO job, but I am unable to run the code on multiple nodes. The job runs without any issues when submitted with 1 node (16 processors), but the same job is terminated when submitted with 2 or more nodes. The submission script is:
#!/bin/bash
#PBS -N testjob
#PBS -l nodes=2:ppn=16
#PBS -q qreg_1day_small
#PBS -l mem=6gb
#PBS -o outtest.log
#PBS -e Error.log
NPROCS=$(wc -l < $PBS_NODEFILE)
HOSTS=$(cat $PBS_NODEFILE | uniq | tr '\n' "," | sed 's|,$||')
module load codes/QuantEspre-7.0
TMPDIR=/localscratch/sscnesta/test
mkdir -p $TMPDIR
cd $TMPDIR
cat > band_calc.in <<EOF
&control
calculation = 'bands'
verbosity = 'high'
prefix = 'ws2_bilayer'
outdir = './tmp'
pseudo_dir = '/home/phd/19/sscnesta/My_PhD_Work/Pseudo'
restart_mode = 'restart'
! max_seconds = 7000
! etot_conv_thr = 1.0D-08
! forc_conv_thr = 1.0D-06
nstep = 150
/
&SYSTEM
ibrav = 0
A = 12.85600
nat = 12
ntyp = 2
ecutwfc = 55
ecutrho = 550
occupations = 'smearing'
smearing = 'mv'
! nbnd = 100
degauss = 0.001
lspinorb = .TRUE.
noncolin = .TRUE.
/
&electrons
mixing_mode = 'local-TF'
mixing_beta = 0.2
electron_maxstep = 400
! conv_thr = 1.0D-12
/
CELL_PARAMETERS {alat}
1.900000000000000 0.000000000000000 0.000000000000000
0.000000000000000 0.250910080896080 0.000000000000000
-0.173156793373163 0.000000000000000 0.409082710525738
ATOMIC_SPECIES
S 32.06750 S.rel-pbe-nl-rrkjus_psl.1.0.0.UPF
W 183.84000 W.rel-pbe-spn-rrkjus_psl.1.0.0.UPF
ATOMIC_POSITIONS {crystal}
W 0.391721052631579 0.500000000000000 0.795540000000000
W 0.134594736842105 0.500000000000000 0.204460000000000
W 0.128563157894737 0.000000000000000 0.795540000000000
W 0.397752631578947 0.000000000000000 0.204460000000000
S 0.452789473684211 0.000000000000000 0.679600000000000
S 0.073526315789474 -0.000000000000000 0.320400000000000
S 0.189631578947368 0.500000000000000 0.679600000000000
S 0.336684210526316 0.500000000000000 0.320400000000000
S 0.318421052631579 0.000000000000000 0.787200000000000
S 0.207894736842105 0.000000000000000 0.212800000000000
S 0.055263157894737 0.500000000000000 0.787200000000000
S 0.471052631578947 0.500000000000000 0.212800000000000
K_POINTS crystal_b
5
0.00000 0.00000 0.00000 50
0.85754 0.00000 -0.51435 50
0.85754 0.50000 -0.51435 50
0.00000 0.50000 -0.00000 50
0.00000 0.00000 0.00000 50
EOF
mpirun -machinefile $PBS_NODEFILE -np 32 pw.x < band_calc.in >> band_calc.out
cat > bands.in << EOF
&BANDS
prefix = 'ws2_bilayer'
outdir = './tmp'
filband = 'ws2_bilayer-band'
/
EOF
#mpirun -machinefile $PBS_NODEFILE -np 16 bands.x < bands.in >> bands.out
cp band $PBS_O_WORKDIR
#cd $PBS_O_WORKDIR
_____________________________________________________________________
The following error appears in the error file:
WARNING: There was an error initializing an OpenFabrics device.
Local host: node4
Local device: mlx5_0
[node4:120093] PMIX ERROR: UNREACHABLE in file server/pmix_server.c at line 2198
[node4:120093] PMIX ERROR: UNREACHABLE in file server/pmix_server.c at line 2198
[node4:120093] PMIX ERROR: UNREACHABLE in file server/pmix_server.c at line 2198
[node4:120093] PMIX ERROR: UNREACHABLE in file server/pmix_server.c at line 2198
[node4:120093] PMIX ERROR: UNREACHABLE in file server/pmix_server.c at line 2198
[node4:120093] PMIX ERROR: UNREACHABLE in file server/pmix_server.c at line 2198
[node4:120093] PMIX ERROR: UNREACHABLE in file server/pmix_server.c at line 2198
[node4:120093] PMIX ERROR: UNREACHABLE in file server/pmix_server.c at line 2198
[node4:120093] PMIX ERROR: UNREACHABLE in file server/pmix_server.c at line 2198
[node4:120093] PMIX ERROR: UNREACHABLE in file server/pmix_server.c at line 2198
[node4:120093] PMIX ERROR: UNREACHABLE in file server/pmix_server.c at line 2198
[node4:120093] PMIX ERROR: UNREACHABLE in file server/pmix_server.c at line 2198
[node4:120093] PMIX ERROR: UNREACHABLE in file server/pmix_server.c at line 2198
[node4:120093] PMIX ERROR: UNREACHABLE in file server/pmix_server.c at line 2198
[node4:120093] PMIX ERROR: UNREACHABLE in file server/pmix_server.c at line 2198
[node4:120093] PMIX ERROR: UNREACHABLE in file server/pmix_server.c at line 2198
mpirun was unable to find the specified executable file, and therefore
did not launch the job. This error was first reported for process
rank 16; it may have occurred for other processes as well.
NOTE: A common cause for this error is misspelling a mpirun command
line parameter option (remember that mpirun interprets the first
unrecognized command line token as the executable).
Node: node20
Executable: /apps/codes/qe-7.0/bin/pw.x
[node4:120093] 15 more processes have sent help message help-mpi-btl-openib.txt / error in device init
[node4:120093] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[node4:120093] PMIX ERROR: NO-PERMISSIONS in file dstore_base.c at line 237
[node4:120093] PMIX ERROR: NO-PERMISSIONS in file dstore_base.c at line 246
-----------------------------------------------------------------------------------------------------------------
Apart from the error message in the error file, the output file also prints "16 processes failed to start". Kindly let me know whether anything specific needs to be included in the job submission script to run QE on multiple nodes.
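Since the mpirun message names node20 and /apps/codes/qe-7.0/bin/pw.x, one check that might help narrow this down is whether the executable is visible from every allocated node. The following is only a minimal sketch: it assumes passwordless ssh between compute nodes (which may not be enabled on this cluster), and the function names unique_hosts and check_exe_on_hosts are illustrative, not part of PBS or QE.

```shell
#!/bin/bash
# List each allocated host once, preserving order.
# (PBS writes one line per core, so each host appears ppn times.)
unique_hosts() {
    awk '!seen[$0]++' "$1"
}

# Report whether an executable exists and is executable on each host.
# ASSUMPTION: passwordless ssh between compute nodes is available.
check_exe_on_hosts() {
    local nodefile=$1 exe=$2 host
    for host in $(unique_hosts "$nodefile"); do
        if ssh -n -o BatchMode=yes "$host" test -x "$exe"; then
            echo "$host: OK"
        else
            echo "$host: MISSING $exe"
        fi
    done
}

# Possible use inside the job script, before the mpirun line:
# check_exe_on_hosts "$PBS_NODEFILE" /apps/codes/qe-7.0/bin/pw.x
```

If a node reports MISSING, that would suggest the /apps path is not mounted there (or the module is set up differently on that node), rather than a problem in the QE input itself.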