Parallel job not starting correctly

Dear community,

I am facing an issue with OpenPBS v20.0 that I am struggling to understand, so I will describe the symptoms.

We are using a commercial code, ANSYS CFX, which can perform both distributed and shared-memory parallel calculations. This tool officially supports PBS.

I set up a PBS server with 3 nodes. I verified that ANSYS CFX runs correctly with its shared-memory option on each node. It works perfectly on every node.

When I launch exactly the same case via PBS, the calculation starts correctly up until the moment when the parallel solver should be initiated. At that point, the code freezes without any particular output. PBS then waits until the walltime is exceeded and simply kills the process…

No particular message can be found in the mom_logs:
04/14/2022 16:46:13;0100;pbs_mom;Req;;Type 1 request received from root@xxx.xxx.xxx.xxx:15001, sock=10
04/14/2022 16:46:13;0100;pbs_mom;Req;;Type 3 request received from root@xxx.xxx.xxx.xxx:15001, sock=10
04/14/2022 16:46:13;0100;pbs_mom;Req;;Type 5 request received from root@xxx.xxx.xxx.xxx:15001, sock=10
04/14/2022 16:46:13;0008;pbs_mom;Job;194.Server1;Started, pid = 1342717
04/14/2022 16:57:41;0008;pbs_mom;Job;194.Server1;walltime 688 exceeded limit 600
04/14/2022 16:57:41;0008;pbs_mom;Job;194.Server1;kill_job
04/14/2022 16:57:41;0080;pbs_mom;Job;194.Server1;task 00000001 terminated
04/14/2022 16:57:51;0008;pbs_mom;Job;194.Server1;kill_job
04/14/2022 16:57:51;0080;pbs_mom;Job;194.Server1;task 00000001 force exited
04/14/2022 16:57:51;0008;pbs_mom;Job;194.Server1;Terminated
04/14/2022 16:57:51;0100;pbs_mom;Job;194.Server1;task 00000001 cput=00:01:21
04/14/2022 16:57:51;0008;pbs_mom;Job;194.Server1;kill_job
04/14/2022 16:57:51;0100;pbs_mom;Job;194.Server1;Server2 cput=00:01:21 mem=3028896kb
04/14/2022 16:57:51;0100;pbs_mom;Job;194.Server1:Obit sent

Other tools work perfectly… I am facing this issue only with ANSYS CFX.

Have you already observed such behavior? Can you help me understand the issue?

Thank you in advance.
Best regards.

Please check this line: your job was killed because the requested walltime was insufficient.
It seems the job requested a walltime of 600 seconds. Please submit the job with a larger walltime, or with no walltime at all (if walltime is not defined, it defaults to 5 years).
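
For example (values illustrative), the limit can be raised on the qsub command line or with a directive in the job script:

qsub -l walltime=24:00:00 jobscript.sh        # on the command line
#PBS -l walltime=24:00:00                     # or as a directive inside jobscript.sh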

Dear adarsh,

Thank you for your reply, but as written in my original post, my job freezes well before reaching the walltime value. The job is only terminated by PBS because the specified walltime limit is exceeded…

Running exactly the same case (requesting the same number of cores, …) on a node without going through PBS takes about half of the requested walltime.

Do you have any other idea?

Best regards.

Thank you @NomisTuo .

  1. Please check and share your job script.
  2. Can you run your script without using PBS, and does it run or hang?
  3. Can you check the stdout and stderr files of the job that was killed by the walltime limit?
  4. When the job is scheduled on the compute node, do you see the processes related to this job? (See the example commands below.)
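
If it helps, point 4 can be checked with commands like these (job ID and node name taken from your log above; adjust to your job):

qstat -f 194.Server1                  # shows the execution host(s) and chunks assigned to the job
ssh Server2 'ps -ef | grep -i cfx'    # look for cfx/solver processes on the execution host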

Dear adarsh,

Here is the requested information:

  1. My job script:
    #!/bin/bash -l

#PBS -l walltime=00:10:00
#PBS -l nodes=1:ppn=36
#PBS -q TH
#PBS -o PBS.out
#PBS -e PBS.err

export LD_LIBRARY_PATH=/nfs/soft/ansys_inc/v211/licensingclient/linx64/lib/usr/lib64:/opt/pbs/lib

cd /nfs/users/test/IEP_P3-karnak2-72over96cpus

/nfs/soft/ansys_inc/v211/CFX/bin/cfx5solve -batch -def "CFX_25.def" -fullname "IEP_P3_PBS" -size 1.2 -parallel -part 36 -start-method "Open MPI Local Parallel"

  2. The script works perfectly when it is used “interactively” on each node.

  3. The *out and *err files are found in my “home” folder. The error file contains this message:
    =>> PBS: job killed: walltime 688 exceeded limit 600
    A fatal error has occurred in cfx5solve:

cfx5solve was killed by the user.

  4. The job starts normally on the compute node. It freezes at the moment when it should initiate its parallel computation. (The first part of the tool runs purely in scalar mode.)

I hope this information helps you understand the issue… I have also contacted the software vendor. They suspect that an “environment variable” is interfering.

Thank you in advance.
Best regards

Can you try the script below? Update the license paths, and make sure the input file and this PBS script are in a shared location accessible by all the nodes.

#PBS -N pbs-cfx-script
#PBS -l select=1:ncpus=36:mpiprocs=36
#PBS -q TH
#PBS -o PBS.out
#PBS -e PBS.err

cd $PBS_O_WORKDIR
env
echo "======================"
input="CFX_25.def"
export LD_LIBRARY_PATH=/nfs/soft/ansys_inc/v211/licensingclient/linx64/lib/usr/lib64:/opt/pbs/lib
export PATH=/nfs/soft/ansys_inc/v211/CFX/bin
export SERVER=7252@license-server
export ANSYSLI_SERVERS=7253@license-server

hostlist=cat $PBS_NODEFILE | awk 'BEGIN {getline; printf "g"$1}; {printf ",g"$1}; END {printf "\n"}'

nParts=$(cat $PBS_NODEFILE|wc -w)
hostlist=$(cat $PBS_NODEFILE|tr '\n' ',')
time /nfs/soft/ansys_inc/v211/CFX/bin/cfx5solve -batch -def ${input} -part $nParts -par-dist ${hostlist} -start-method "HP MPI Distributed Parallel"
status=$?
exit $status

Thank you for sharing the details. Please try the above and check the stdout and stderr files.
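
For completeness, the script would then be submitted from the shared directory that holds the .def file, e.g. (script name illustrative):

cd /nfs/users/test/IEP_P3-karnak2-72over96cpus
qsub pbs-cfx-script.sh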

Dear adarsh,

Thank you for this submission script. I had to update just one line in order to get things “working”:

export PATH=/nfs/soft/ansys_inc/v211/CFX/bin

changed into

export PATH=/nfs/soft/ansys_inc/v211/CFX/bin:$PATH

in order to have all the conventional tools easily accessible (awk, cat, …). I also added a walltime limit to avoid any “forever” job issue.

Unfortunately, the behaviour of the job is the same as before. The CFX run starts normally until it reaches the point where the parallel processes need to start. At that moment CFX just freezes…

Here are the contents of the files:
stdout:

PBS_ENVIRONMENT=PBS_BATCH
LD_LIBRARY_PATH=/nfs/soft/ansys_inc/v211/licensingclient/linx64/lib/usr/lib64:/opt/pbs/lib
PBS_O_LANG=en_US.UTF-8
MODULES_RUN_QUARANTINE=LD_LIBRARY_PATH LD_PRELOAD
LANG=en_US.UTF-8
HISTCONTROL=ignoredups
HOSTNAME=SERVER2
OLDPWD=/nfs/users/verdebs
PBS_O_HOME=/nfs/users/verdebs
PBS_JOBID=207.SERVER1
ENVIRONMENT=BATCH
PBS_JOBNAME=pbs-cfx-script
NCPUS=36
PBS_O_PATH=/usr/share/Modules/bin:.:/opt/pbs/bin:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/opt/pbs/bin:/nfs/users/verdebs/.local/bin:/nfs/users/verdebs/bin
which_declare=declare -f
MODULES_CMD=/usr/share/Modules/libexec/modulecmd.tcl
PBS_O_WORKDIR=/nfs/users/verdebs/IEP_P3-72over96cpus
USER=verdebs
AWP_ROOT211=/nfs/soft/ansys_inc/v211/
PBS_NODEFILE=/var/spool/pbs/aux/207.SERVER1
PBS_TASKNUM=1
PWD=/nfs/users/verdebs/IEP_P3-72over96cpus
SSH_ASKPASS=/usr/libexec/openssh/gnome-ssh-askpass
HOME=/nfs/users/verdebs
PBS_MOMPORT=15003
XDG_DATA_DIRS=/nfs/users/verdebs/.local/share/flatpak/exports/share:/var/lib/flatpak/exports/share:/usr/local/share:/usr/share
PBS_JOBCOOKIE=7E4DDBBD716F47971EB1A94072A4AD35
PBS_O_SHELL=/bin/bash
TMPDIR=/var/tmp/pbs.207.SERVER1
LOADEDMODULES=
PBS_O_QUEUE=TH
MAIL=/var/spool/mail/verdebs
SHELL=/bin/bash
SHLVL=2
PBS_O_HOST=server1.d10.tes.local
PBS_O_SYSTEM=Linux
MANPATH=::/opt/pbs/share/man
PBS_O_LOGNAME=verdebs
PBS_NODENUM=0
MODULEPATH=/usr/share/Modules/modulefiles:/etc/modulefiles:/usr/share/modulefiles
PBS_JOBDIR=/nfs/users/verdebs
LOGNAME=verdebs
MODULEPATH_modshare=/usr/share/modulefiles:1:/usr/share/Modules/modulefiles:1:/etc/modulefiles:1
PATH=/usr/share/Modules/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/opt/pbs/bin:/nfs/users/verdebs/.local/bin:/nfs/users/verdebs/bin
PBS_QUEUE=TH
MODULESHOME=/usr/share/Modules
HISTSIZE=1000
PBS_O_MAIL=/var/spool/mail/verdebs
OMP_NUM_THREADS=36
LESSOPEN=||/usr/bin/lesspipe.sh %s
BASH_FUNC_which%%=() {  ( alias;
 eval ${which_declare} ) | /usr/bin/which --tty-only --read-alias --read-functions --show-tilde --show-dot $@
}
BASH_FUNC_module%%=() {  unset _mlshdbg;
 if [ "${MODULES_SILENT_SHELL_DEBUG:-0}" = '1' ]; then
 case "$-" in
 *v*x*)
 set +vx;
 _mlshdbg='vx'
 ;;
 *v*)
 set +v;
 _mlshdbg='v'
 ;;
 *x*)
 set +x;
 _mlshdbg='x'
 ;;
 *)
 _mlshdbg=''
 ;;
 esac;
 fi;
 unset _mlre _mlIFS;
 if [ -n "${IFS+x}" ]; then
 _mlIFS=$IFS;
 fi;
 IFS=' ';
 for _mlv in ${MODULES_RUN_QUARANTINE:-};
 do
 if [ "${_mlv}" = "${_mlv##*[!A-Za-z0-9_]}" -a "${_mlv}" = "${_mlv#[0-9]}" ]; then
 if [ -n "`eval 'echo ${'$_mlv'+x}'`" ]; then
 _mlre="${_mlre:-}${_mlv}_modquar='`eval 'echo ${'$_mlv'}'`' ";
 fi;
 _mlrv="MODULES_RUNENV_${_mlv}";
 _mlre="${_mlre:-}${_mlv}='`eval 'echo ${'$_mlrv':-}'`' ";
 fi;
 done;
 if [ -n "${_mlre:-}" ]; then
 eval `eval ${_mlre} /usr/bin/tclsh /usr/share/Modules/libexec/modulecmd.tcl bash '"$@"'`;
 else
 eval `/usr/bin/tclsh /usr/share/Modules/libexec/modulecmd.tcl bash "$@"`;
 fi;
 _mlstatus=$?;
 if [ -n "${_mlIFS+x}" ]; then
 IFS=$_mlIFS;
 else
 unset IFS;
 fi;
 unset _mlre _mlv _mlrv _mlIFS;
 if [ -n "${_mlshdbg:-}" ]; then
 set -$_mlshdbg;
 fi;
 unset _mlshdbg;
 return $_mlstatus
}
BASH_FUNC_switchml%%=() {  typeset swfound=1;
 if [ "${MODULES_USE_COMPAT_VERSION:-0}" = '1' ]; then
 typeset swname='main';
 if [ -e /usr/share/Modules/libexec/modulecmd.tcl ]; then
 typeset swfound=0;
 unset MODULES_USE_COMPAT_VERSION;
 fi;
 else
 typeset swname='compatibility';
 if [ -e /usr/share/Modules/libexec/modulecmd-compat ]; then
 typeset swfound=0;
 MODULES_USE_COMPAT_VERSION=1;
 export MODULES_USE_COMPAT_VERSION;
 fi;
 fi;
 if [ $swfound -eq 0 ]; then
 echo "Switching to Modules $swname version";
 source /usr/share/Modules/init/bash;
 else
 echo "Cannot switch to Modules $swname version, command not found";
 return 1;
 fi
}
BASH_FUNC_ml%%=() {  module ml "$@"
}
_=/bin/env
======================
g

And the stderr file:

/var/spool/pbs/mom_priv/jobs/207.SERVER1.SC: line 16: /var/spool/pbs/aux/207.SERVER1: Permission denied
A fatal error has occurred in cfx5solve:

cfx5solve was killed by the user.

I hope it will help you to understand the issue.
Thanks a lot for your efforts!

Have a nice day.

Thank you @NomisTuo for this information and analysis.
It seems the issue is more related to CFX than to the workload manager.
It would be better if you contact the vendor support with the above information and test the cfx batch command line without using PBS Pro (do not use the interactive GUI here, as it would set up the environment variables automatically, which might be missed while running a batch job). They might be able to guide you through this easily.

This looks like an application launch issue; it might have crashed soon after it was launched in batch mode.

Hello adarsh,

Thank you for your reply, but CFX runs perfectly when I try to launch it interactively on each node…

How can it be that CFX works perfectly without PBS, but stops behaving correctly when it is started by PBS?

Best regards.

Dear adarsh,

I keep trying to understand the issue I am facing… Using an “interactive” job, I just noticed that the $PBS_NODEFILE shows exactly the same value on each line:

[verdebs@SERVER2 ~]$ cat $PBS_NODEFILE
SERVER2
SERVER2

I would have expected something more like:

[verdebs@SERVER2 ~]$ cat $PBS_NODEFILE
SERVER2*01
SERVER2*02

This seems odd to me but maybe I am wrong?

Best regards.

> qsub -l select=2:ncpus=2  -l place=free -I 
> cat $PBS_NODEFILE
> server2
> server2
> 
>  qsub -l select=2:ncpus=2:mpiprocs=2  -l place=free -I 
> cat $PBS_NODEFILE
> server2
> server2
> server2
> server2
> 
> #if in case you have two nodes server1 and server2
> qsub -l select=2:ncpus=2:mpiprocs=2   -l place=scatter -I 
> cat $PBS_NODEFILE
> server1
> server1
> server2
> server2
> 
> You can re-write the hostfile to your requirement and use it in the script
> total_cores=`cat $PBS_NODEFILE | wc -l `
> total_chunks=`cat $PBS_NODEFILE | uniq | wc -l `
> proc_per_host=$(expr $total_cores /  $total_chunks)
> for i in `cat $PBS_NODEFILE | uniq ` ; do echo $i:$proc_per_host >> hosts.txt ; done 
> #Point your -host argument to the hosts.txt instead of $PBS_NODEFILE

Dear adarsh,

Thank you for your reply.

Then I think that the PBS_NODEFILE is adequately populated by the system. Just one last question regarding this PBS_NODEFILE and my PBS configuration.

My server is composed of 3 nodes, each following this definition template (where X = 1, 2 or 3):
Mom = SERVERX.domain.localdomain
ntype = PBS
state = free
pcpus = 96
resources_available.arch = linux
resources_available.host = serverX
resources_available.mem = 394561752kb
resources_available.ncpus = 96
resources_available.vnode = SERVERX
resources_assigned.accelerator_memory = 0kb
resources_assigned.hbmem = 0kb
resources_assigned.mem = 0kb
resources_assigned.naccelerators = 0
resources_assigned.ncpus = 0
resources_assigned.vmem = 0kb
resv_enable = True
sharing = default_shared
license = l
last_state_change_time = Mon Mar 28 17:55:34 2022
last_used_time = Thu Apr 14 16:26:34 2022

Within my PBS_NODEFILE, I get the list containing SERVERX.domain.localdomain repeated the adequate number of times. Is this normal, or should I instead get either the “host” (i.e. serverX) or the “vnode” (i.e. SERVERX) repeated?

I am just wondering whether CFX is having problems because of CFX itself or because of my PBS setup… Is there some typical sanity check that should be performed on a newly set up PBS installation?

Thank you in advance.
Best regards.

If your /etc/hosts and DNS resolve serverX.domain.localdomain correctly to a static IP address, which is the IP of the compute node(s), then it is not a problem. Otherwise, you can re-write PBS_NODEFILE with short hostnames into a hosts.txt file and use hosts.txt.

for i in $(cat $PBS_NODEFILE); do echo $i | cut -d'.' -f1 >> hosts.txt ; done
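
A sketch of how the resulting hosts.txt could then be consumed, mirroring the earlier submission script (it assumes hosts.txt contains one short hostname per requested process and reuses the same start method):

nParts=$(cat hosts.txt | wc -l)
hostlist=$(cat hosts.txt | tr '\n' ',')
cfx5solve -batch -def CFX_25.def -part $nParts -par-dist ${hostlist} -start-method "HP MPI Distributed Parallel"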

It is a common situation: "PBS is a messenger, do not shoot the messenger", stuck between application and system issues. We always recommend testing the batch command line, along with the environment variables required to run the application, without using PBS first. If it runs successfully on the compute nodes, then integrate that application into a PBS script.

For any issues with PBS Pro itself, you can increase the log level of the server, scheduler and mom logs and look for the issue in those log files. For any system-related checks, look in /var/log/messages on the respective hosts.

In this case, CFX runs perfectly using the batch command line. The issue only appears when PBS is used to submit the job…

How may I increase the log level of the different players (server, scheduler and mom)?

Thank you in advance for your help

Please share the .o and .e files that are created after the job has failed, and share the application log file.
Please share the batch command line and environment variables used to run the cfx job successfully without using PBS.

server logs: qmgr: set server log_events=2047 # default is 511
sched logs: qmgr: set sched log_events=4095 # default is 767
mom logs: edit $PBS_HOME/mom_priv/config and add $logevent 0xffffffff (then kill -HUP <pid of pbs_mom>)
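
Spelled out as commands (a sketch; $PBS_HOME is /var/spool/pbs on this setup, and it assumes your PBS version exposes the log_events attribute on the sched object):

qmgr -c "set server log_events = 2047"
qmgr -c "set sched log_events = 4095"
echo '$logevent 0xffffffff' >> /var/spool/pbs/mom_priv/config   # on each execution host
kill -HUP $(pgrep -x pbs_mom)                                    # make pbs_mom re-read its config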

Please note that you have to revert back to the default log levels afterwards, otherwise the hard disk space will be consumed in a short amount of time.

Dear adarsh,

Here is the .e file created by the job (as mentioned before, it is killed because the walltime is exceeded):

=>> PBS: job killed: walltime 151 exceeded limit 120
A fatal error has occurred in cfx5solve:

cfx5solve was killed by the user.

The application log file (in this case, the CFX out file) is completely normal. It is just that the process stops when it reaches the “parallel” phase:


 +--------------------------------------------------------------------+
 |                CPU Time Requirements of Partitioner                |
 +--------------------------------------------------------------------+

 Preparations                   1.21E-02    2.7 %
 Low-level Mesh Partitioning    1.65E-03    0.4 %
 File Reading                   1.21E-01   27.3 %
 Partition Smoothing            6.82E-03    1.5 %
 Topology - Domain Interface    9.00E-06    0.0 %
 Topology - Global              4.25E-04    0.1 %
 Topology - Element/Face/Patch  1.16E-03    0.3 %
 Topology - Vertex              1.03E-04    0.0 %
 Data Compression               7.60E-05    0.0 %
 Variable Updates               2.05E-03    0.5 %
 File Writing                   3.66E-03    0.8 %
 Miscellaneous                  2.93E-01   66.3 %
                                --------
 Total                          4.42E-01

 +--------------------------------------------------------------------+
 |                   Job Information at End of Run                    |
 +--------------------------------------------------------------------+

 Host computer:  SERVER2 (PID:2175371)

 Job finished:   Wed Apr 27 08:55:20 2022

 Total wall clock time: 8.980E-01 seconds
             or: (          0:         0:         0:     0.898 )
                 (       Days:     Hours:   Minutes:   Seconds )


 +--------------------------------------------------------------------+
 |                                                                    |
 |                               Solver                               |
 |                                                                    |
 +--------------------------------------------------------------------+


 +--------------------------------------------------------------------+
 |              A fatal error has occurred in cfx5solve:              |
 |                                                                    |
 | cfx5solve was killed by the user.                                  |
 +--------------------------------------------------------------------+

Here are the environment variables used in batch:

LD_LIBRARY_PATH=/nfs/soft/ansys_inc/v211/licensingclient/linx64/lib/usr/lib64:/opt/pbs/lib
LS_COLORS=rs=0:di=01;34:ln=01;36:mh=00:pi=40;33:so=01;35:do=01;35:bd=40;33;01:cd=40;33;01:or=40;31;01:mi=01;05;37;41:su=37;41:sg=30;43:ca=30;41:tw=30;42:ow=34;42:st=37;44:ex=01;32:*.tar=01;31:*.tgz=01;31:*.arc=01;31:*.arj=01;31:*.taz=01;31:*.lha=01;31:*.lz4=01;31:*.lzh=01;31:*.lzma=01;31:*.tlz=01;31:*.txz=01;31:*.tzo=01;31:*.t7z=01;31:*.zip=01;31:*.z=01;31:*.dz=01;31:*.gz=01;31:*.lrz=01;31:*.lz=01;31:*.lzo=01;31:*.xz=01;31:*.zst=01;31:*.tzst=01;31:*.bz2=01;31:*.bz=01;31:*.tbz=01;31:*.tbz2=01;31:*.tz=01;31:*.deb=01;31:*.rpm=01;31:*.jar=01;31:*.war=01;31:*.ear=01;31:*.sar=01;31:*.rar=01;31:*.alz=01;31:*.ace=01;31:*.zoo=01;31:*.cpio=01;31:*.7z=01;31:*.rz=01;31:*.cab=01;31:*.wim=01;31:*.swm=01;31:*.dwm=01;31:*.esd=01;31:*.jpg=01;35:*.jpeg=01;35:*.mjpg=01;35:*.mjpeg=01;35:*.gif=01;35:*.bmp=01;35:*.pbm=01;35:*.pgm=01;35:*.ppm=01;35:*.tga=01;35:*.xbm=01;35:*.xpm=01;35:*.tif=01;35:*.tiff=01;35:*.png=01;35:*.svg=01;35:*.svgz=01;35:*.mng=01;35:*.pcx=01;35:*.mov=01;35:*.mpg=01;35:*.mpeg=01;35:*.m2v=01;35:*.mkv=01;35:*.webm=01;35:*.ogm=01;35:*.mp4=01;35:*.m4v=01;35:*.mp4v=01;35:*.vob=01;35:*.qt=01;35:*.nuv=01;35:*.wmv=01;35:*.asf=01;35:*.rm=01;35:*.rmvb=01;35:*.flc=01;35:*.avi=01;35:*.fli=01;35:*.flv=01;35:*.gl=01;35:*.dl=01;35:*.xcf=01;35:*.xwd=01;35:*.yuv=01;35:*.cgm=01;35:*.emf=01;35:*.ogv=01;35:*.ogx=01;35:*.aac=01;36:*.au=01;36:*.flac=01;36:*.m4a=01;36:*.mid=01;36:*.midi=01;36:*.mka=01;36:*.mp3=01;36:*.mpc=01;36:*.ogg=01;36:*.ra=01;36:*.wav=01;36:*.oga=01;36:*.opus=01;36:*.spx=01;36:*.xspf=01;36:
SSH_CONNECTION=10.42.10.163 33084 10.42.10.164 22
MODULES_RUN_QUARANTINE=LD_LIBRARY_PATH LD_PRELOAD
LANG=en_US.UTF-8
HISTCONTROL=ignoredups
HOSTNAME=SERVER2
OLDPWD=/nfs/users/verdebs
which_declare=declare -f
XDG_SESSION_ID=473
MODULES_CMD=/usr/share/Modules/libexec/modulecmd.tcl
USER=verdebs
AWP_ROOT211=/nfs/soft/ansys_inc/v211/
PWD=/nfs/users/verdebs/CFXtst
SSH_ASKPASS=/usr/libexec/openssh/gnome-ssh-askpass
HOME=/nfs/users/verdebs
SSH_CLIENT=10.42.10.163 33084 22
XDG_DATA_DIRS=/nfs/users/verdebs/.local/share/flatpak/exports/share:/var/lib/flatpak/exports/share:/usr/local/share:/usr/share
LOADEDMODULES=
SSH_TTY=/dev/pts/0
MAIL=/var/spool/mail/verdebs
TERM=xterm
SHELL=/bin/bash
SHLVL=1
MANPATH=::/opt/pbs/share/man
MODULEPATH=/usr/share/Modules/modulefiles:/etc/modulefiles:/usr/share/modulefiles
LOGNAME=verdebs
DBUS_SESSION_BUS_ADDRESS=unix:path=/run/user/2000/bus
XDG_RUNTIME_DIR=/run/user/2000
MODULEPATH_modshare=/usr/share/modulefiles:1:/usr/share/Modules/modulefiles:1:/etc/modulefiles:1
PATH=/usr/share/Modules/bin:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/opt/pbs/bin:/nfs/users/verdebs/.local/bin:/nfs/users/verdebs/bin
MODULESHOME=/usr/share/Modules
HISTSIZE=1000
LESSOPEN=||/usr/bin/lesspipe.sh %s
_=/usr/bin/env

Here is the command used

cfx5solve -example StaticMixer -parallel -part 2 -par-local  -start-method "Open MPI Local Parallel"

If it can help, here are the environment variables during the PBS run:

PBS_ENVIRONMENT=PBS_INTERACTIVE
LS_COLORS=rs=0:di=01;34:ln=01;36:mh=00:pi=40;33:so=01;35:do=01;35:bd=40;33;01:cd=40;33;01:or=40;31;01:mi=01;05;37;41:su=37;41:sg=30;43:ca=30;41:tw=30;42:ow=34;42:st=37;44:ex=01;32:*.tar=01;31:*.tgz=01;31:*.arc=01;31:*.arj=01;31:*.taz=01;31:*.lha=01;31:*.lz4=01;31:*.lzh=01;31:*.lzma=01;31:*.tlz=01;31:*.txz=01;31:*.tzo=01;31:*.t7z=01;31:*.zip=01;31:*.z=01;31:*.dz=01;31:*.gz=01;31:*.lrz=01;31:*.lz=01;31:*.lzo=01;31:*.xz=01;31:*.zst=01;31:*.tzst=01;31:*.bz2=01;31:*.bz=01;31:*.tbz=01;31:*.tbz2=01;31:*.tz=01;31:*.deb=01;31:*.rpm=01;31:*.jar=01;31:*.war=01;31:*.ear=01;31:*.sar=01;31:*.rar=01;31:*.alz=01;31:*.ace=01;31:*.zoo=01;31:*.cpio=01;31:*.7z=01;31:*.rz=01;31:*.cab=01;31:*.wim=01;31:*.swm=01;31:*.dwm=01;31:*.esd=01;31:*.jpg=01;35:*.jpeg=01;35:*.mjpg=01;35:*.mjpeg=01;35:*.gif=01;35:*.bmp=01;35:*.pbm=01;35:*.pgm=01;35:*.ppm=01;35:*.tga=01;35:*.xbm=01;35:*.xpm=01;35:*.tif=01;35:*.tiff=01;35:*.png=01;35:*.svg=01;35:*.svgz=01;35:*.mng=01;35:*.pcx=01;35:*.mov=01;35:*.mpg=01;35:*.mpeg=01;35:*.m2v=01;35:*.mkv=01;35:*.webm=01;35:*.ogm=01;35:*.mp4=01;35:*.m4v=01;35:*.mp4v=01;35:*.vob=01;35:*.qt=01;35:*.nuv=01;35:*.wmv=01;35:*.asf=01;35:*.rm=01;35:*.rmvb=01;35:*.flc=01;35:*.avi=01;35:*.fli=01;35:*.flv=01;35:*.gl=01;35:*.dl=01;35:*.xcf=01;35:*.xwd=01;35:*.yuv=01;35:*.cgm=01;35:*.emf=01;35:*.ogv=01;35:*.ogx=01;35:*.aac=01;36:*.au=01;36:*.flac=01;36:*.m4a=01;36:*.mid=01;36:*.midi=01;36:*.mka=01;36:*.mp3=01;36:*.mpc=01;36:*.ogg=01;36:*.ra=01;36:*.wav=01;36:*.oga=01;36:*.opus=01;36:*.spx=01;36:*.xspf=01;36:
LD_LIBRARY_PATH=/nfs/soft/ansys_inc/v211/licensingclient/linx64/lib/usr/lib64:/opt/pbs/lib
PBS_O_LANG=en_US.UTF-8
SSH_CONNECTION=10.42.47.245 65371 10.42.10.163 22
MODULES_RUN_QUARANTINE=LD_LIBRARY_PATH LD_PRELOAD
LANG=en_US.UTF-8
HISTCONTROL=ignoredups
DISPLAY=localhost:11.0
HOSTNAME=SERVER2
OLDPWD=/nfs/users/verdebs
PBS_O_HOME=/nfs/users/verdebs
PBS_JOBID=239.SERVER1
PBS_JOBNAME=STDIN
NCPUS=2
PBS_O_PATH=/usr/share/Modules/bin:.:/opt/pbs/bin:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/opt/pbs/bin:/nfs/users/verdebs/.local/bin:/nfs/users/verdebs/bin
which_declare=declare -f
XDG_SESSION_ID=3070
MODULES_CMD=/usr/share/Modules/libexec/modulecmd.tcl
PBS_O_WORKDIR=/nfs/users/verdebs/CFXtst
USER=verdebs
AWP_ROOT211=/nfs/soft/ansys_inc/v211/
PBS_NODEFILE=/var/spool/pbs/aux/239.SERVER1
PBS_TASKNUM=1
PWD=/nfs/users/verdebs/CFXtst
SSH_ASKPASS=/usr/libexec/openssh/gnome-ssh-askpass
HOME=/nfs/users/verdebs
SSH_CLIENT=10.42.47.245 65371 22
PBS_MOMPORT=15003
XDG_DATA_DIRS=/nfs/users/verdebs/.local/share/flatpak/exports/share:/var/lib/flatpak/exports/share:/usr/local/share:/usr/share
PBS_JOBCOOKIE=418777613DB694897DEB82F50D9D6593
PBS_O_SHELL=/bin/bash
TMPDIR=/var/tmp/pbs.239.SERVER1
LOADEDMODULES=
SSH_TTY=/dev/pts/1
PBS_O_QUEUE=TH
MAIL=/var/spool/mail/verdebs
SHELL=/bin/bash
TERM=xterm
SHLVL=2
PBS_O_HOST=server1.tes.local
PBS_O_SYSTEM=Linux
MANPATH=::/opt/pbs/share/man:/opt/pbs/share/man
PBS_O_LOGNAME=verdebs
PBS_NODENUM=0
MODULEPATH=/usr/share/Modules/modulefiles:/etc/modulefiles:/usr/share/modulefiles
GDK_BACKEND=x11
PBS_JOBDIR=/nfs/users/verdebs
LOGNAME=verdebs
DBUS_SESSION_BUS_ADDRESS=unix:abstract=/tmp/dbus-1sdDq9P79D,guid=fcb6389845d24d7ebaf0350462678f96
XDG_RUNTIME_DIR=/run/user/2000
MODULEPATH_modshare=/usr/share/modulefiles:1:/usr/share/Modules/modulefiles:1:/etc/modulefiles:1
PATH=/usr/share/Modules/bin:.:/opt/pbs/bin:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/opt/pbs/bin:/nfs/users/verdebs/.local/bin:/nfs/users/verdebs/bin:/opt/pbs/bin:/nfs/users/verdebs/.local/bin:/nfs/users/verdebs/bin
PBS_QUEUE=TH
MODULESHOME=/usr/share/Modules
HISTSIZE=1000
OMP_NUM_THREADS=2
LESSOPEN=||/usr/bin/lesspipe.sh %s
PBS_O_MAIL=/var/spool/mail/verdebs
_=/usr/bin/env

Once the job starts, I do see on SERVER2 the process named 239.SERVER1.SC. I also see the usual ANSYS subprocesses kick in. But when the “parallel” solving processes should appear, only 1 mpirun process starts instead of the 2 requested…

I do not know if this can be helpful, but we are running PBS on CentOS Stream 8. Could it be that the system prevents PBS from initiating some processes?

I will try to increase the log_events values as suggested and look at the various log files, trying to detect something unusual…

Thank you in advance.

So you run the below command by opening a terminal on the compute node (server2), and it works:

cfx5solve -example StaticMixer -parallel -part 2 -par-local  -start-method "Open MPI Local Parallel"

As you can see, PBS Pro executes the batch command lines mentioned in the submitted script. In the end, it is cfx that solves the problem and PBS just waits for its exit status (once the cfx batch command is executed, it is cfx that controls and spawns the processes), so PBS does not come into the picture. PBS provides the chunks/cores allocated to this job via $PBS_NODEFILE for the cfx command to use.

There are many sites using cfx with PBS Pro, for example: https://www.usq.edu.au/-/media/usq/current-students/academic/research/conducting-research/eresearch-services/hpc/pbs-ansys-examples.ashx?la=en&hash=8C8DE80D7F89273F3DB0160ACBE1354B

You would need to contact CFX support team and get their help.