@wgy
I check that the Azure drivers are working by running the following pingpong test:
mpirun -ppn 1 -n 2 -hostfile /home/$USER/nodenames.txt -env I_MPI_FABRICS=dapl -env I_MPI_DAPL_PROVIDER=ofa-v2-ib0 -env I_MPI_DYNAMIC_CONNECTION=0 IMB-MPI1 pingpong
This works as expected and reports the data transfer speeds between the two nodes (nodenames.txt contains two machine names).
I totally agree that I should test the PBS scheduler directly from the head node, with a dedicated test script and without the Fluent interface, so I'll do that. I'm getting the feeling that this is an ANSYS problem: it may not be passing the correct parameters to the PBS job. I'll look into that as well.
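A dedicated test job of this kind could look roughly like the sketch below. It is only an illustration, not something I have run on this cluster: the job name pbs_smoke_test, the select=2:ncpus=1 request, the helper summarize_nodefile and the file name test_job.sh are all assumptions made for the example. It only uses variables that PBS sets for a running job (PBS_NODEFILE, PBS_O_WORKDIR, PBS_JOBID).

```shell
#!/bin/bash
#PBS -N pbs_smoke_test
#PBS -l select=2:ncpus=1
#PBS -j oe

# Print one "host count" line for each distinct host in a node file.
# (Hypothetical helper, named for this example only.)
summarize_nodefile() {
    sort "$1" | uniq -c | awk '{ print $2, $1 }'
}

# Only do the work when PBS has actually started the job;
# PBS_NODEFILE is set by PBS inside a running job.
if [ -n "${PBS_NODEFILE:-}" ]; then
    cd "${PBS_O_WORKDIR:-.}" || exit 1
    echo "Job ${PBS_JOBID:-unknown} started on $(hostname)"
    summarize_nodefile "$PBS_NODEFILE"
fi
```

Submitted from the head node with qsub test_job.sh, the joined output file should list the hosts PBS assigned, which would show whether the scheduler works without Fluent in the picture.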
@sgombosi
Output of qstat and qstat -f 3:
Job id Name User Time Use S Queue
---------------- ---------------- ---------------- -------- - -----
3.kees06pvxjbox fluent nlass.hpc 00:00:00 R workq
nlass.hpc@kees06pvxjbox:/> qstat -f 3
Job Id: 3.kees06pvxjbox
Job_Name = fluent
Job_Owner = nlass.hpc@10.224.57.39
resources_used.cpupercent = 0
resources_used.cput = 00:00:00
resources_used.mem = 0kb
resources_used.ncpus = 30
resources_used.vmem = 0kb
resources_used.walltime = 00:00:00
job_state = R
queue = workq
server = kees06pvxjbox
Checkpoint = u
ctime = Mon Dec 19 12:56:03 2016
Error_Path = 10.224.57.39:/mnt/nfsshare/fluent.e3
exec_host = kees06pvx000002/0+kees06pvx000002/1+kees06pvx000002/2+kees06pvx
000002/3+kees06pvx000002/4+kees06pvx000002/5+kees06pvx000002/6+kees06pv
x000002/7+kees06pvx000002/8+kees06pvx000002/9+kees06pvx000002/10+kees06
pvx000002/11+kees06pvx000002/12+kees06pvx000002/13+kees06pvx000002/14+k
ees06pvx000002/15+kees06pvx000003/0+kees06pvx000003/1+kees06pvx000003/2
+kees06pvx000003/3+kees06pvx000003/4+kees06pvx000003/5+kees06pvx000003/
6+kees06pvx000003/7+kees06pvx000003/8+kees06pvx000003/9+kees06pvx000003
/10+kees06pvx000003/11+kees06pvx000003/12+kees06pvx000003/13
exec_vnode = (kees06pvx000002:ncpus=1)+(kees06pvx000002:ncpus=1)+(kees06pvx
000002:ncpus=1)+(kees06pvx000002:ncpus=1)+(kees06pvx000002:ncpus=1)+(ke
es06pvx000002:ncpus=1)+(kees06pvx000002:ncpus=1)+(kees06pvx000002:ncpus
=1)+(kees06pvx000002:ncpus=1)+(kees06pvx000002:ncpus=1)+(kees06pvx00000
2:ncpus=1)+(kees06pvx000002:ncpus=1)+(kees06pvx000002:ncpus=1)+(kees06p
vx000002:ncpus=1)+(kees06pvx000002:ncpus=1)+(kees06pvx000002:ncpus=1)+(
kees06pvx000003:ncpus=1)+(kees06pvx000003:ncpus=1)+(kees06pvx000003:ncp
us=1)+(kees06pvx000003:ncpus=1)+(kees06pvx000003:ncpus=1)+(kees06pvx000
003:ncpus=1)+(kees06pvx000003:ncpus=1)+(kees06pvx000003:ncpus=1)+(kees0
6pvx000003:ncpus=1)+(kees06pvx000003:ncpus=1)+(kees06pvx000003:ncpus=1)
+(kees06pvx000003:ncpus=1)+(kees06pvx000003:ncpus=1)+(kees06pvx000003:n
cpus=1)
Hold_Types = n
Join_Path = oe
Keep_Files = n
Mail_Points = a
mtime = Mon Dec 19 12:56:04 2016
Output_Path = 10.224.57.39:/mnt/nfsshare/fluent.o3
Priority = 0
qtime = Mon Dec 19 12:56:03 2016
Rerunable = True
Resource_List.ncpus = 30
Resource_List.nodect = 30
Resource_List.place = free
Resource_List.select = 30
stime = Mon Dec 19 12:56:04 2016
session_id = 3454
jobdir = /home/nlass.hpc
substate = 42
Variable_List = PBS_O_HOME=/home/nlass.hpc,PBS_O_LANG=en_US.UTF-8,
PBS_O_LOGNAME=nlass.hpc,
PBS_O_PATH=/mnt/resource/ansys_fluent/v172/fluent/contrib/lnamd64:/mnt
/resource/ansys_fluent/v172/fluent/bin:/mnt/resource/ansys_fluent/v172/
fluent/bin:/mnt/resource/ansys_fluent/v172/fluent/bin:/opt/intel/impi/5
.0.3.048/bin64:/home/nlass.hpc/bin:/usr/local/bin:/usr/bin:/bin:/usr/ga
mes:/usr/lib/mit/bin:/opt/pbs/bin,PBS_O_MAIL=/var/mail/nlass.hpc,
PBS_O_SHELL=/bin/bash,PBS_O_WORKDIR=/mnt/nfsshare,PBS_O_SYSTEM=Linux,
FLUENT_INC=/mnt/resource/ansys_fluent/v172/fluent,
DISPLAY=kees06pvxjbox:0.0,LM_PBS_GUI=1,
LM_PBS_ARGS=-r17.2.0 3ddp -pinfiniband -mpi=intel -node -t30 -mport 10
.224.57.18:10.224.57.39:49533:0 -g,PBS_O_QUEUE=workq,
PBS_O_HOST=10.224.57.39
comment = Job run at Mon Dec 19 at 12:56 on (kees06pvx000002:ncpus=1)+(kees
06pvx000002:ncpus=1)+(kees06pvx000002:ncpus=1)+(kees06pvx000002:ncpus=1
)+(kees06pvx000002:ncpus=1)+(kees06pvx000002:ncpus=1)+(kees06pvx000002:
ncpus=1)+(kees06pvx000002:ncpus=1)+(kees06pvx...
etime = Mon Dec 19 12:56:03 2016
run_count = 1
Submit_arguments = -l select=30 -j oe -v FLUENT_INC=/mnt/resource/ansys_flu
ent/v172/fluent,DISPLAY=kees06pvxjbox:0.0,LM_PBS_GUI=1,
LM_PBS_ARGS=-r17.2.0 3ddp -pinfiniband -mpi=intel -node -t30 -mport 10
.224.57.18:10.224.57.39:49533:0 -g /mnt/resource/ansys_fluent/v172/flu
ent/fluent17.2.0/bin/fluent
project = _pbs_project_default
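Since the wrapped exec_host value above is hard to read, here is a small sketch that counts the slots per host. The function name count_exec_hosts is made up for this example, and it assumes the value is passed in as one string of host/slot entries joined by "+", possibly still containing the line breaks from qstat's wrapping.

```shell
#!/bin/sh
# Count how many job slots PBS placed on each host, given an exec_host value.
# (Hypothetical helper, named for this example only.)
count_exec_hosts() {
    # Undo qstat's line wrapping by dropping all whitespace.
    flat=$(printf '%s' "$1" | tr -d ' \t\n')
    # Split into one "host/slot" entry per line, then count entries per host.
    printf '%s\n' "$flat" |
        tr '+' '\n' |
        awk -F/ 'NF { n[$1]++ } END { for (h in n) print h, n[h] }' |
        sort
}
```

Counting by hand, the exec_host value of job 3 has 16 slots on kees06pvx000002 and 14 on kees06pvx000003, so the 30 requested chunks were spread over the two nodes.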
The output of tracejob (tracejob -n 10 3) is below:
12/19/2016 12:56:03 S enqueuing into workq, state 1 hop 1
12/19/2016 12:56:04 L Considering job to run
12/19/2016 12:56:04 S Job Queued at request of nlass.hpc@10.224.57.39, owner = nlass.hpc@10.224.57.39, job name = fluent, queue = workq
12/19/2016 12:56:04 S Job Run at request of Scheduler@10.224.57.39 on exec_vnode
(kees06pvx000002:ncpus=1)+(kees06pvx000002:ncpus=1)+(kees06pvx000002:ncpus=1)+(kees06pvx000002:ncpus=1)+(kees06pvx000002:ncpus=1)+(kees06pvx000002:ncpus=1)+(kees06pvx000002:ncpus=1)+(kees06pvx000002:ncpus=1)+(kees06pvx000002:ncpus=1)+(kees06pvx000002:ncpus=1)+(kees06pvx000002:ncpus=1)+(kees06pvx000002:ncpus=1)+(kees06pvx000002:ncpus=1)+(kees06pvx000002:ncpus=1)+(kees06pvx000002:ncpus=1)+(kees06pvx000002:ncpus=1)+(kees06pvx000003:ncpus=1)+(kees06pvx000003:ncpus=1)+(kees06pvx000003:ncpus=1)+(kees06pvx000003:ncpus=1)+(kees06pvx000003:ncpus=1)+(kees06pvx000003:ncpus=1)+(kees06pvx000003:ncpus=1)+(kees06pvx000003:ncpus=1)+(kees06pvx000003:ncpus=1)+(kees06pvx000003:ncpus=1)+(kees06pvx000003:ncpus=1)+(kees06pvx000003:ncpus=1)+(kees06pvx000003:ncpus=1)+(kees06pvx000003:ncpus=1)
12/19/2016 12:56:04 S Job Modified at request of Scheduler@10.224.57.39
12/19/2016 12:56:04 L Job run
I don’t know where to find the job script itself. Is it required by PBS? I think ANSYS issues a command to the scheduler that is translated into a qsub command (point 2 in my initial post), so I don’t think there’s a script involved.
The qsub command is:
qsub -l select=30 -j oe -v "FLUENT_INC=/mnt/resource/ansys_fluent/v172/fluent,DISPLAY=kees06pvxjbox:0.0,LM_PBS_GUI=1,LM_PBS_ARGS=-r17.2.0 3ddp -pinfiniband -mpi=intel -node -t30 -mport 10.224.57.18:10.224.57.39:49533:0 -g " /mnt/resource/ansys_fluent/v172/fluent/fluent17.2.0/bin/fluent
3.kees06pvxjbox
(Port numbers in the command above may differ from those in the qstat output, since it comes from a different job.)
Kind regards,
Kees