GPU queue not running jobs

Hi,
I have an OpenHPC (ohpc) cluster with 10 nodes and 80 CPUs, which works fine with PBS Pro. I recently added a node with 2 GPUs (stateless Rocky 8.5), following the NVIDIA recipe for CentOS 8. The node works fine when I connect via ssh.
The following is the output of deviceQuery from the CUDA samples.
CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 2 CUDA Capable device(s)

Device 0: “Quadro P4000”
CUDA Driver Version / Runtime Version 11.6 / 11.5
CUDA Capability Major/Minor version number: 6.1
Total amount of global memory: 8120 MBytes (8514043904 bytes)
(014) Multiprocessors, (128) CUDA Cores/MP: 1792 CUDA Cores
GPU Max Clock rate: 1480 MHz (1.48 GHz)
Memory Clock rate: 3802 Mhz
Memory Bus Width: 256-bit
L2 Cache Size: 2097152 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
Maximum Layered 1D Texture Size, (num) layers 1D=(32768), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(32768, 32768), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total shared memory per multiprocessor: 98304 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 2 copy engine(s)
Run time limit on kernels: No
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Device supports Unified Addressing (UVA): Yes
Device supports Managed Memory: Yes
Device supports Compute Preemption: Yes
Supports Cooperative Kernel Launch: Yes
Supports MultiDevice Co-op Kernel Launch: Yes
Device PCI Domain ID / Bus ID / location ID: 0 / 101 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

Device 1: “Quadro P4000”
CUDA Driver Version / Runtime Version 11.6 / 11.5
CUDA Capability Major/Minor version number: 6.1
Total amount of global memory: 8120 MBytes (8514043904 bytes)
(014) Multiprocessors, (128) CUDA Cores/MP: 1792 CUDA Cores
GPU Max Clock rate: 1480 MHz (1.48 GHz)
Memory Clock rate: 3802 Mhz
Memory Bus Width: 256-bit
L2 Cache Size: 2097152 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
Maximum Layered 1D Texture Size, (num) layers 1D=(32768), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(32768, 32768), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total shared memory per multiprocessor: 98304 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 2 copy engine(s)
Run time limit on kernels: No
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Device supports Unified Addressing (UVA): Yes
Device supports Managed Memory: Yes
Device supports Compute Preemption: Yes
Supports Cooperative Kernel Launch: Yes
Supports MultiDevice Co-op Kernel Launch: Yes
Device PCI Domain ID / Bus ID / location ID: 0 / 179 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

Peer access from Quadro P4000 (GPU0) → Quadro P4000 (GPU1) : Yes
Peer access from Quadro P4000 (GPU1) → Quadro P4000 (GPU0) : Yes

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 11.6, CUDA Runtime Version = 11.5, NumDevs = 2
Result = PASS
[root@argo-c10 deviceQuery]#

I have added a cuda module on the host and set the binary and library paths as follows:

export PATH=/usr/local/cuda-11.6/bin${PATH:+:${PATH}}
export LD_LIBRARY_PATH=/usr/local/cuda-11.6/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
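
A short aside on the `${VAR:+:${VAR}}` idiom used in these exports: it appends the previous value after a colon only when the variable is already set and non-empty, so an unset variable does not leave a stray trailing colon. The sketch below shows this with a scratch variable; `DEMO_PATH` and `/opt/example/lib` are purely illustrative names, not part of this cluster's setup.

```shell
# Demonstrate the ${VAR:+:${VAR}} idiom with a scratch variable.
# DEMO_PATH and /opt/example/lib are hypothetical, for illustration only.
unset DEMO_PATH

# First prepend: DEMO_PATH is unset, so nothing is appended after the new entry.
DEMO_PATH=/usr/local/cuda-11.6/lib64${DEMO_PATH:+:${DEMO_PATH}}
first=$DEMO_PATH

# Second prepend: DEMO_PATH is now non-empty, so the old value follows a colon.
DEMO_PATH=/opt/example/lib${DEMO_PATH:+:${DEMO_PATH}}
second=$DEMO_PATH

echo "$first"
echo "$second"
```

The first line printed has no trailing colon; the second shows the new directory placed in front of the existing value.
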
I have also set up a queue, GPUq, and the PBS node argo-c10:

 Mom = argo-c10.cluster
 ntype = PBS
 state = free
 pcpus = 20
 resources_available.arch = linux
 resources_available.host = argo-c10
 resources_available.mem = 32468180kb
 resources_available.ncpus = 20
 resources_available.ngpus = 2
 resources_available.vnode = argo-c10
 resources_assigned.accelerator_memory = 0kb
 resources_assigned.hbmem = 0kb
 resources_assigned.mem = 0kb
 resources_assigned.naccelerators = 0
 resources_assigned.ncpus = 0
 resources_assigned.vmem = 0kb
 queue = GPUq
 resv_enable = True
 sharing = default_shared
 last_state_change_time = Sun Jan 23 10:28:43 2022

Output from pbsnodes -avSj
                                                             mem   ncpus   nmics   ngpus
vnode           state            njobs   run   susp          f/t     f/t     f/t     f/t jobs
--------------- --------------- ------ ----- ------ ------------ ------- ------- ------- -------
argo-c1         free                 0     0      0    16gb/16gb     8/8     0/0     0/0 --
argo-c2         free                 0     0      0    16gb/16gb     8/8     0/0     0/0 --
argo-c3         free                 0     0      0    16gb/16gb     8/8     0/0     0/0 --
argo-c4         free                 0     0      0    16gb/16gb     8/8     0/0     0/0 --
argo-c0         free                 0     0      0    16gb/16gb     8/8     0/0     0/0 --
argo-c7         free                 0     0      0    16gb/16gb     8/8     0/0     0/0 --
argo-c6         free                 0     0      0    16gb/16gb     8/8     0/0     0/0 --
argo-c8         free                 0     0      0    16gb/16gb     8/8     0/0     0/0 --
argo-c9         free                 0     0      0    16gb/16gb     8/8     0/0     0/0 --
argo-c5         free                 0     0      0    16gb/16gb     8/8     0/0     0/0 --
argo-c10        free                 0     0      0    31gb/31gb   20/20     0/0     2/2 --

Output from qmgr -c "print queue GPUq"
create queue GPUq
set queue GPUq queue_type = Execution
set queue GPUq acl_host_enable = False
set queue GPUq resources_max.walltime = 40:00:00
set queue GPUq resources_min.walltime = 00:00:00
set queue GPUq resources_available.ngpus = 2
set queue GPUq enabled = True
set queue GPUq started = True
[root@argo ~]#

But when I try to run a job, it stays queued, and qstat reports "Not enough free nodes available":
Job ID Username Queue Jobname SessID NDS TSK Memory Time S Time


2209.argo test GPUq myGPUJob – 1 10 – 00:01 Q –
Not Running: Not enough free nodes available

Any ideas would be very helpful!

Could you please share the output of the following commands, run as the root user:

  1. qstat -Bf
  2. qstat -fx 2209
  3. pbsnodes -av

Also, you can remove the node-to-queue association and use Qlist instead (see "Node affinity in PBS Pro", #2 by adarsh):

qmgr -c "unset node argo-c10 queue"
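
For reference, the Qlist approach roughly looks like the sketch below. Treat it as an outline under assumptions rather than a verified recipe for this cluster: `Qlist` is the custom resource name used in the referenced post, the tag value `GPUq` is simply chosen to match the queue name, and the `resources:` line shown in the comment is a typical default that may differ from what is in your `sched_config`.

```shell
# Sketch only: run as root on the PBS server host.

# 1. Create a host-level custom resource that can hold several tags.
qmgr -c "create resource Qlist type=string_array,flag=h"

# 2. Let the scheduler consider the new resource: append Qlist to the
#    "resources:" line in $PBS_HOME/sched_priv/sched_config, for example
#      resources: "ncpus, mem, arch, host, vnode, aoe, eoe, Qlist"
#    and then make pbs_sched re-read its configuration:
#      kill -HUP <pid of pbs_sched>

# 3. Drop the hard node-to-queue binding and tag the node instead.
qmgr -c "unset node argo-c10 queue"
qmgr -c "set node argo-c10 resources_available.Qlist = GPUq"

# 4. Make every chunk of a job submitted to GPUq request that tag.
qmgr -c "set queue GPUq default_chunk.Qlist = GPUq"
```

With this in place, jobs submitted to GPUq are matched only to vnodes whose `resources_available.Qlist` contains `GPUq`, and a node can carry more than one tag if it should serve several queues.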

Thanks, I will try it, but at the moment the node is out of order because we have a power problem due to a snowstorm.

Dear adarsh,
sorry for the late response.
I don't know whether it is relevant, but before the problem showed up, a colleague of mine set a root password for the MySQL database. After that (stateless nodes plus the power loss) I had to adjust the Warewulf (ww) datastore DB and reboot all nodes. Now none of my nodes (CPU or GPU) can run jobs under PBS; all jobs just end up in qstat.

The output of # qstat -Bf is:
Server: argo
server_state = Active
server_host = argo
scheduling = True
total_jobs = 8
state_count = Transit:0 Queued:0 Held:0 Waiting:0 Running:0 Exiting:0 Begun:0
acl_roots = username@*
operators = username@*
default_queue = N10C80
log_events = 511
mail_from = adm
query_other_jobs = True
resources_default.ncpus = 1
resources_default.place = scatter
default_chunk.ncpus = 1
resources_assigned.mpiprocs = 0
resources_assigned.ncpus = 0
resources_assigned.nodect = 0
scheduler_iteration = 600
flatuid = True
FLicenses = 20000000
resv_enable = True
node_fail_requeue = 310
max_array_size = 10000
default_qsub_arguments = -V
pbs_license_min = 0
pbs_license_max = 2147483647
pbs_license_linger_time = 31536000
license_count = Avail_Global:10000000 Avail_Local:10000000 Used:0 High_Use:0
pbs_version = 20.0.1
eligible_time_enable = False
job_history_enable = True
max_concurrent_provision = 5
power_provisioning = False
max_job_sequence_id = 9999999

qstat -fx 2239 output is:
[test@argo simple_add1GPUOmp]$ qstat -fx 2239
Job Id: 2239.argo
Job_Name = myGPUJob
Job_Owner = test@argo
resources_used.cpupercent = 0
resources_used.cput = 00:00:00
resources_used.mem = 0kb
resources_used.ncpus = 10
resources_used.vmem = 0kb
resources_used.walltime = 00:00:00
job_state = E
queue = GPUq
server = argo
Checkpoint = u
ctime = Sun Jan 30 23:02:12 2022
Error_Path = argo:/home/test/transfer/CUDA-Demo/simple_add1GPUOmp/myGPUJob.e2239
exec_host = argo-c10/0*10
exec_vnode = (argo-c10:ncpus=10:ngpus=1)
Hold_Types = n
Join_Path = n
Keep_Files = n
Mail_Points = a
mtime = Sun Jan 30 23:02:12 2022
Output_Path = argo:/home/test/transfer/CUDA-Demo/simple_add1GPUOmp/myGPUJob.o2239
Priority = 0
qtime = Sun Jan 30 23:02:12 2022
Rerunable = True
Resource_List.ncpus = 10
Resource_List.ngpus = 1
Resource_List.nodect = 1
Resource_List.place = excl
Resource_List.select = 1:ncpus=10:ompthreads=10:ngpus=1
Resource_List.walltime = 00:01:00
stime = Sun Jan 30 23:02:12 2022
session_id = 20299
jobdir = /home/test
substate = 51
Variable_List = PBS_O_HOME=/home/test,PBS_O_LANG=en_US.UTF-8,
PBS_O_LOGNAME=test,
PBS_O_PATH=/usr/local/cuda-11.6/bin:/home/test/.local/bin:/home/test/b
in:/opt/ohpc/pub/mpi/libfabric/1.13.0/bin:/opt/ohpc/pub/mpi/ucx-ohpc/1.
11.2/bin:/opt/ohpc/pub/libs/hwloc/bin:/opt/ohpc/pub/mpi/openmpi4-gnu9/4
.1.1/bin:/opt/ohpc/pub/compiler/gcc/9.4.0/bin:/opt/ohpc/pub/utils/prun/
2.2:/opt/ohpc/pub/utils/autotools/bin:/opt/ohpc/pub/bin:/usr/condabin:/
usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/opt/pbs/bin,
PBS_O_MAIL=/var/spool/mail/test,PBS_O_SHELL=/bin/bash,
PBS_O_WORKDIR=/home/test/transfer/CUDA-Demo/simple_add1GPUOmp,
PBS_O_SYSTEM=Linux,CONDA_SHLVL=0,
UCX_BIN=/opt/ohpc/pub/mpi/ucx-ohpc/1.11.2/bin,
LD_LIBRARY_PATH=/usr/local/cuda-11.6/lib64:/opt/ohpc/pub/mpi/libfabric
/1.13.0/lib:/opt/ohpc/pub/mpi/ucx-ohpc/1.11.2/lib:/opt/ohpc/pub/libs/hw
loc/lib:/opt/ohpc/pub/mpi/openmpi4-gnu9/4.1.1/lib:/opt/ohpc/pub/compile
r/gcc/9.4.0/lib64,
LS_COLORS=rs=0:di=38;5;33:ln=38;5;51:mh=00:pi=40;38;5;11:so=38;5;13:do
=38;5;5:bd=48;5;232;38;5;11:cd=48;5;232;38;5;3:or=48;5;232;38;5;9:mi=01
;05;37;41:su=48;5;196;38;5;15:sg=48;5;11;38;5;16:ca=48;5;196;38;5;226:t
w=48;5;10;38;5;16:ow=48;5;10;38;5;21:st=48;5;21;38;5;15:ex=38;5;40:
.ta
r=38;5;9:.tgz=38;5;9:.arc=38;5;9:.arj=38;5;9:.taz=38;5;9:.lha=38;5
;9:
.lz4=38;5;9:.lzh=38;5;9:.lzma=38;5;9:.tlz=38;5;9:.txz=38;5;9:.
tzo=38;5;9:
.t7z=38;5;9:.zip=38;5;9:.z=38;5;9:.dz=38;5;9:.gz=38;5;9
:.lrz=38;5;9:.lz=38;5;9:.lzo=38;5;9:.xz=38;5;9:.zst=38;5;9:.tzst=
38;5;9:.bz2=38;5;9:.bz=38;5;9:.tbz=38;5;9:.tbz2=38;5;9:.tz=38;5;9:
.deb=38;5;9:.rpm=38;5;9:
.jar=38;5;9:.war=38;5;9:.ear=38;5;9:.sar=
38;5;9:
.rar=38;5;9:.alz=38;5;9:.ace=38;5;9:.zoo=38;5;9:.cpio=38;5;
9:.7z=38;5;9:.rz=38;5;9:.cab=38;5;9:.wim=38;5;9:.swm=38;5;9:.dwm=
38;5;9:.esd=38;5;9:.jpg=38;5;13:.jpeg=38;5;13:.mjpg=38;5;13:.mjpeg
=38;5;13:
.gif=38;5;13:.bmp=38;5;13:.pbm=38;5;13:.pgm=38;5;13:.ppm=
38;5;13:.tga=38;5;13:.xbm=38;5;13:.xpm=38;5;13:.tif=38;5;13:.tiff=
38;5;13:
.png=38;5;13:.svg=38;5;13:.svgz=38;5;13:.mng=38;5;13:.pcx=
38;5;13:.mov=38;5;13:.mpg=38;5;13:.mpeg=38;5;13:.m2v=38;5;13:.mkv=
38;5;13:
.webm=38;5;13:.ogm=38;5;13:.mp4=38;5;13:.m4v=38;5;13:.mp4v
=38;5;13:.vob=38;5;13:.qt=38;5;13:.nuv=38;5;13:.wmv=38;5;13:.asf=3
8;5;13:
.rm=38;5;13:.rmvb=38;5;13:.flc=38;5;13:.avi=38;5;13:.fli=38
;5;13:.flv=38;5;13:.gl=38;5;13:.dl=38;5;13:.xcf=38;5;13:.xwd=38;5;
13:
.yuv=38;5;13:.cgm=38;5;13:.emf=38;5;13:.ogv=38;5;13:.ogx=38;5;1
3:.aac=38;5;45:.au=38;5;45:.flac=38;5;45:.m4a=38;5;45:.mid=38;5;45
:
.midi=38;5;45:.mka=38;5;45:.mp3=38;5;45:.mpc=38;5;45:.ogg=38;5;45
:.ra=38;5;45:.wav=38;5;45:.oga=38;5;45:.opus=38;5;45:*.spx=38;5;45:
*.xspf=38;5;45:,LIBFABRIC_DIR=/opt/ohpc/pub/mpi/libfabric/1.13.0,
__LMOD_REF_COUNT_PATH=/usr/local/cuda-11.6/bin:1;/home/test/.local/bin
:1;/home/test/bin:1;/opt/ohpc/pub/mpi/libfabric/1.13.0/bin:1;/opt/ohpc/
pub/mpi/ucx-ohpc/1.11.2/bin:1;/opt/ohpc/pub/libs/hwloc/bin:1;/opt/ohpc/
pub/mpi/openmpi4-gnu9/4.1.1/bin:1;/opt/ohpc/pub/compiler/gcc/9.4.0/bin:
1;/opt/ohpc/pub/utils/prun/2.2:1;/opt/ohpc/pub/utils/autotools/bin:1;/o
pt/ohpc/pub/bin:1;/usr/condabin:1;/usr/local/bin:1;/usr/bin:1;/usr/loca
l/sbin:1;/usr/sbin:1;/opt/pbs/bin:1,
ModuleTable002=MS42IiwKZnVsbE5hbWUgPSAiY3VkYS8xMS42IiwKbG9hZE9yZGVyI
D0gOSwKcHJvcFQgPSB7fSwKc3RhY2tEZXB0aCA9IDAsCnN0YXR1cyA9ICJhY3RpdmUiLAp1
c2VyTmFtZSA9ICJjdWRhIiwKd1YgPSAiMDAwMDAwMDExLjAwMDAwMDAwNi4qemZpbmFsIiw
KfSwKZ251OSA9IHsKZm4gPSAiL29wdC9vaHBjL3B1Yi9tb2R1bGVmaWxlcy9nbnU5LzkuNC
4wIiwKZnVsbE5hbWUgPSAiZ251OS85LjQuMCIsCmxvYWRPcmRlciA9IDMsCnByb3BUID0ge
30sCnN0YWNrRGVwdGggPSAxLApzdGF0dXMgPSAiYWN0aXZlIiwKdXNlck5hbWUgPSAiZ251
OSIsCndWID0gIjAwMDAwMDAwOS4wMDAwMDAwMDQuKnpmaW5hbCIsCn0sCmh3bG9jID0gewp
mbiA9ICIvb3B0L29ocGMvcHViL21vZHVs,
INCLUDE=/usr/local/cuda-11.6/include:/opt/ohpc/pub/compiler/gcc/9.4.0/
include,UCX_LIB=/opt/ohpc/pub/mpi/ucx-ohpc/1.11.2/lib,
LMOD_FAMILY_MPI_VERSION=4.1.1,LANG=en_US.UTF-8,HISTCONTROL=ignoredups,
LMOD_FAMILY_COMPILER_VERSION=9.4.0,HOSTNAME=argo,
OLDPWD=/home/test/transfer/CUDA-Demo,UCX_WARN_UNUSED_ENV_VARS=N,
_LMOD_REF_COUNT__LMFILES=/opt/ohpc/pub/modulefiles/autotools:1;/opt/
ohpc/pub/modulefiles/prun/2.2:1;/opt/ohpc/pub/modulefiles/gnu9/9.4.0:1;
/opt/ohpc/pub/modulefiles/hwloc/2.5.0:1;/opt/ohpc/pub/modulefiles/ucx/1
.11.2:1;/opt/ohpc/pub/modulefiles/libfabric/1.13.0:1;/opt/ohpc/pub/modu
ledeps/gnu9/openmpi4/4.1.1:1;/opt/ohpc/pub/modulefiles/ohpc:1;/opt/ohpc
/pub/modulefiles/cuda/11.6:1,HWLOC_BIN=/opt/ohpc/pub/libs/hwloc/bin,
__LMOD_REF_COUNT_INCLUDE=/usr/local/cuda-11.6/include:1;/opt/ohpc/pub/
compiler/gcc/9.4.0/include:1,
__LMOD_REF_COUNT_LD_LIBRARY_PATH=/usr/local/cuda-11.6/lib64:1;/opt/ohp
c/pub/mpi/libfabric/1.13.0/lib:1;/opt/ohpc/pub/mpi/ucx-ohpc/1.11.2/lib:
1;/opt/ohpc/pub/libs/hwloc/lib:1;/opt/ohpc/pub/mpi/openmpi4-gnu9/4.1.1/
lib:1;/opt/ohpc/pub/compiler/gcc/9.4.0/lib64:1,
__LMOD_REF_COUNT_PKG_CONFIG_PATH=/opt/ohpc/pub/mpi/libfabric/1.13.0/li
b/pkgconfig:1;/opt/ohpc/pub/mpi/ucx-ohpc/1.11.2/lib/pkgconfig:1;/opt/oh
pc/pub/mpi/openmpi4-gnu9/4.1.1/lib/pkgconfig:1,
ModuleTable004=ICIwMDAwMDAwMDEuMDAwMDAwMDEzLip6ZmluYWwiLAp9LApvaHBjI
D0gewpmbiA9ICIvb3B0L29ocGMvcHViL21vZHVsZWZpbGVzL29ocGMiLApmdWxsTmFtZSA9
ICJvaHBjIiwKbG9hZE9yZGVyID0gOCwKcHJvcFQgPSB7fSwKc3RhY2tEZXB0aCA9IDAsCnN
0YXR1cyA9ICJhY3RpdmUiLAp1c2VyTmFtZSA9ICJvaHBjIiwKd1YgPSAiTS4qemZpbmFsIi
wKfSwKb3Blbm1waTQgPSB7CmZuID0gIi9vcHQvb2hwYy9wdWIvbW9kdWxlZGVwcy9nbnU5L
29wZW5tcGk0LzQuMS4xIiwKZnVsbE5hbWUgPSAib3Blbm1waTQvNC4xLjEiLApsb2FkT3Jk
ZXIgPSA3LApwcm9wVCA9IHt9LApzdGFja0RlcHRoID0gMSwKc3RhdHVzID0gImFjdGl2ZSI
sCnVzZXJOYW1lID0gIm9wZW5tcGk0IiwK,S_COLORS=auto,
which_declare=declare -f,UCX_DIR=/opt/ohpc/pub/mpi/ucx-ohpc/1.11.2,
HWLOC_DIR=/opt/ohpc/pub/libs/hwloc,USER=test,
__LMOD_REF_COUNT_MODULEPATH=/opt/ohpc/pub/moduledeps/gnu9-openmpi4:1;/
opt/ohpc/pub/moduledeps/gnu9:1;/opt/ohpc/pub/modulefiles:1,
__LMOD_REF_COUNT_LOADEDMODULES=autotools:1;prun/2.2:1;gnu9/9.4.0:1;hwl
oc/2.5.0:1;ucx/1.11.2:1;libfabric/1.13.0:1;openmpi4/4.1.1:1;ohpc:1;cuda
/11.6:1,UCX_INC=/opt/ohpc/pub/mpi/ucx-ohpc/1.11.2/include,
LIBFABRIC_BIN=/opt/ohpc/pub/mpi/libfabric/1.13.0/bin,
PWD=/home/test/transfer/CUDA-Demo/simple_add1GPUOmp,
SSH_ASKPASS=/usr/libexec/openssh/gnome-ssh-askpass,HOME=/home/test,
LMOD_COLORIZE=no,LMOD_VERSION=8.5.22,LMOD_SETTARG_CMD=:,
BASH_ENV=/opt/ohpc/admin/lmod/lmod/init/bash,
XDG_DATA_DIRS=/home/test/.local/share/flatpak/exports/share:/var/lib/f
latpak/exports/share:/usr/local/share:/usr/share,
LIBFABRIC_INC=/opt/ohpc/pub/mpi/libfabric/1.13.0/include,
HWLOC_INC=/opt/ohpc/pub/libs/hwloc/include,
ModuleTable001=X01vZHVsZVRhYmxlXyA9IHsKTVR2ZXJzaW9uID0gMywKY19yZWJ1a
WxkVGltZSA9IGZhbHNlLApjX3Nob3J0VGltZSA9IGZhbHNlLApkZXB0aFQgPSB7fSwKZmFt
aWx5ID0gewpNUEkgPSAib3Blbm1waTQiLApjb21waWxlciA9ICJnbnU5IiwKfSwKbVQgPSB
7CmF1dG90b29scyA9IHsKZm4gPSAiL29wdC9vaHBjL3B1Yi9tb2R1bGVmaWxlcy9hdXRvdG
9vbHMiLApmdWxsTmFtZSA9ICJhdXRvdG9vbHMiLApsb2FkT3JkZXIgPSAxLApwcm9wVCA9I
Ht9LApzdGFja0RlcHRoID0gMSwKc3RhdHVzID0gImFjdGl2ZSIsCnVzZXJOYW1lID0gImF1
dG90b29scyIsCndWID0gIk0uKnpmaW5hbCIsCn0sCmN1ZGEgPSB7CmZuID0gIi9vcHQvb2h
wYy9wdWIvbW9kdWxlZmlsZXMvY3VkYS8x,
LOADEDMODULES=autotools:prun/2.2:gnu9/9.4.0:hwloc/2.5.0:ucx/1.11.2:lib
fabric/1.13.0:openmpi4/4.1.1:ohpc:cuda/11.6,
ModuleTable006=dXMgPSAiYWN0aXZlIiwKdXNlck5hbWUgPSAidWN4IiwKd1YgPSAiM
DAwMDAwMDAxLjAwMDAwMDAxMS4wMDAwMDAwMDIuKnpmaW5hbCIsCn0sCn0sCm1wYXRoQSA9
IHsKIi9vcHQvb2hwYy9wdWIvbW9kdWxlZGVwcy9nbnU5LW9wZW5tcGk0IiwgIi9vcHQvb2h
wYy9wdWIvbW9kdWxlZGVwcy9nbnU5IiwgIi9vcHQvb2hwYy9wdWIvbW9kdWxlZmlsZXMiLA
p9LApzeXN0ZW1CYXNlTVBBVEggPSAiL29wdC9vaHBjL3B1Yi9tb2R1bGVmaWxlcyIsCn0K,
__LMOD_REF_COUNT_MANPATH=/opt/ohpc/pub/mpi/libfabric/1.13.0/share/man:
1;/opt/ohpc/pub/libs/hwloc/man:1;/opt/ohpc/pub/mpi/openmpi4-gnu9/4.1.1/
share/man:1;/opt/ohpc/pub/compiler/gcc/9.4.0/share/man:1;/opt/ohpc/pub/
utils/autotools/share/man:1;/usr/local/share/man:1;/usr/share/man/overr
ides:1;/usr/share/man/en:1;/usr/share/man:1,
ModuleTable003=ZWZpbGVzL2h3bG9jLzIuNS4wIiwKZnVsbE5hbWUgPSAiaHdsb2MvM
i41LjAiLApsb2FkT3JkZXIgPSA0LApwcm9wVCA9IHt9LApyZWZfY291bnQgPSAxLApzdGFj
a0RlcHRoID0gMiwKc3RhdHVzID0gImFjdGl2ZSIsCnVzZXJOYW1lID0gImh3bG9jIiwKd1Y
gPSAiMDAwMDAwMDAyLjAwMDAwMDAwNS4qemZpbmFsIiwKfSwKbGliZmFicmljID0gewpmbi
A9ICIvb3B0L29ocGMvcHViL21vZHVsZWZpbGVzL2xpYmZhYnJpYy8xLjEzLjAiLApmdWxsT
mFtZSA9ICJsaWJmYWJyaWMvMS4xMy4wIiwKbG9hZE9yZGVyID0gNiwKcHJvcFQgPSB7fSwK
cmVmX2NvdW50ID0gMSwKc3RhY2tEZXB0aCA9IDIsCnN0YXR1cyA9ICJhY3RpdmUiLAp1c2V
yTmFtZSA9ICJsaWJmYWJyaWMiLAp3ViA9,LMOD_ROOT=/opt/ohpc/admin/lmod,
MAIL=/var/spool/mail/test,HWLOC_LIB=/opt/ohpc/pub/libs/hwloc/lib,
SHELL=/bin/bash,TERM=xterm-256color,ModuleTable_Sz=6,
TC_LIB_DIR=/usr/lib64/tc,LMOD_FAMILY_COMPILER=gnu9,SHLVL=1,
MANPATH=/opt/ohpc/pub/mpi/libfabric/1.13.0/share/man:/opt/ohpc/pub/lib
s/hwloc/man:/opt/ohpc/pub/mpi/openmpi4-gnu9/4.1.1/share/man:/opt/ohpc/p
ub/compiler/gcc/9.4.0/share/man:/opt/ohpc/pub/utils/autotools/share/man
:/usr/local/share/man:/usr/share/man/overrides:/usr/share/man/en:/usr/s
hare/man:/opt/pbs/share/man,LMOD_PREPEND_BLOCK=normal,
MODULEPATH=/opt/ohpc/pub/moduledeps/gnu9-openmpi4:/opt/ohpc/pub/module
deps/gnu9:/opt/ohpc/pub/modulefiles,
MPI_DIR=/opt/ohpc/pub/mpi/openmpi4-gnu9/4.1.1,LOGNAME=test,
PATH=/usr/local/cuda-11.6/bin:/home/test/.local/bin:/home/test/bin:/op
t/ohpc/pub/mpi/libfabric/1.13.0/bin:/opt/ohpc/pub/mpi/ucx-ohpc/1.11.2/b
in:/opt/ohpc/pub/libs/hwloc/bin:/opt/ohpc/pub/mpi/openmpi4-gnu9/4.1.1/b
in:/opt/ohpc/pub/compiler/gcc/9.4.0/bin:/opt/ohpc/pub/utils/prun/2.2:/o
pt/ohpc/pub/utils/autotools/bin:/opt/ohpc/pub/bin:/usr/condabin:/usr/lo
cal/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/opt/pbs/bin,
LMFILES=/opt/ohpc/pub/modulefiles/autotools:/opt/ohpc/pub/modulefile
s/prun/2.2:/opt/ohpc/pub/modulefiles/gnu9/9.4.0:/opt/ohpc/pub/modulefil
es/hwloc/2.5.0:/opt/ohpc/pub/modulefiles/ucx/1.11.2:/opt/ohpc/pub/modul
efiles/libfabric/1.13.0:/opt/ohpc/pub/moduledeps/gnu9/openmpi4/4.1.1:/o
pt/ohpc/pub/modulefiles/ohpc:/opt/ohpc/pub/modulefiles/cuda/11.6,
MODULESHOME=/opt/ohpc/admin/lmod/lmod,
PKG_CONFIG_PATH=/opt/ohpc/pub/mpi/libfabric/1.13.0/lib/pkgconfig:/opt/
ohpc/pub/mpi/ucx-ohpc/1.11.2/lib/pkgconfig:/opt/ohpc/pub/mpi/openmpi4-g
nu9/4.1.1/lib/pkgconfig,LMOD_SETTARG_FULL_SUPPORT=no,HISTSIZE=1000,
LMOD_PKG=/opt/ohpc/admin/lmod/lmod,
ModuleTable005=d1YgPSAiMDAwMDAwMDA0LjAwMDAwMDAwMS4wMDAwMDAwMDEuKnpma
W5hbCIsCn0sCnBydW4gPSB7CmZuID0gIi9vcHQvb2hwYy9wdWIvbW9kdWxlZmlsZXMvcHJ1
bi8yLjIiLApmdWxsTmFtZSA9ICJwcnVuLzIuMiIsCmxvYWRPcmRlciA9IDIsCnByb3BUID0
ge30sCnN0YWNrRGVwdGggPSAxLApzdGF0dXMgPSAiYWN0aXZlIiwKdXNlck5hbWUgPSAicH
J1biIsCndWID0gIjAwMDAwMDAwMi4wMDAwMDAwMDIuKnpmaW5hbCIsCn0sCnVjeCA9IHsKZ
m4gPSAiL29wdC9vaHBjL3B1Yi9tb2R1bGVmaWxlcy91Y3gvMS4xMS4yIiwKZnVsbE5hbWUg
PSAidWN4LzEuMTEuMiIsCmxvYWRPcmRlciA9IDUsCnByb3BUID0ge30sCnJlZl9jb3VudCA
9IDEsCnN0YWNrRGVwdGggPSAyLApzdGF0,
LMOD_CMD=/opt/ohpc/admin/lmod/lmod/libexec/lmod,
LIBFABRIC_LIB=/opt/ohpc/pub/mpi/libfabric/1.13.0/lib,
LESSOPEN=||/usr/bin/lesspipe.sh %s,LMOD_FULL_SETTARG_SUPPORT=no,
LMOD_DIR=/opt/ohpc/admin/lmod/lmod/libexec,LMOD_FAMILY_MPI=openmpi4,
BASH_FUNC_which%%=() { ( alias; eval ${which_declare} ) | /usr/bin/wh
ich --tty-only --read-alias --read-functions --show-tilde --show-dot "
$@"
},
BASH_FUNC_module%%=() { eval $($LMOD_CMD bash “$@”) && eval $(${LMO
D_SETTARG_CMD:-:} -s sh)
},
BASH_FUNC_ml%%=() { eval $($LMOD_DIR/ml_cmd “$@”)
},
_=/opt/pbs/bin/qsub,PBS_O_QUEUE=GPUq,PBS_O_HOST=argo
comment = Job run at Sun Jan 30 at 23:02 on (argo-c10:ncpus=10:ngpus=1)
etime = Sun Jan 30 23:02:12 2022
run_count = 1
Exit_status = 127
Submit_arguments = myPBSScript.sh
project = _pbs_project_default
Submit_Host = argo

and pbsnodes -av output is:
[test@argo simple_add1GPUOmp]$ pbsnodes -av
argo-c0
Mom = argo-c0.cluster
ntype = PBS
state = free
pcpus = 8
resources_available.arch = linux
resources_available.host = argo-c0
resources_available.mem = 16403104kb
resources_available.ncpus = 8
resources_available.vnode = argo-c0
resources_assigned.accelerator_memory = 0kb
resources_assigned.hbmem = 0kb
resources_assigned.mem = 0kb
resources_assigned.naccelerators = 0
resources_assigned.ncpus = 0
resources_assigned.vmem = 0kb
queue = N10C80
resv_enable = True
sharing = default_shared
last_state_change_time = Sun Jan 30 21:37:17 2022

argo-c1
Mom = argo-c1.cluster
ntype = PBS
state = free
pcpus = 8
resources_available.arch = linux
resources_available.host = argo-c1
resources_available.mem = 16403088kb
resources_available.ncpus = 8
resources_available.vnode = argo-c1
resources_assigned.accelerator_memory = 0kb
resources_assigned.hbmem = 0kb
resources_assigned.mem = 0kb
resources_assigned.naccelerators = 0
resources_assigned.ncpus = 0
resources_assigned.vmem = 0kb
queue = N10C80
resv_enable = True
sharing = default_shared
last_state_change_time = Sun Jan 30 21:37:17 2022

argo-c2
Mom = argo-c2.cluster
ntype = PBS
state = free
pcpus = 8
resources_available.arch = linux
resources_available.host = argo-c2
resources_available.mem = 16403088kb
resources_available.ncpus = 8
resources_available.vnode = argo-c2
resources_assigned.accelerator_memory = 0kb
resources_assigned.hbmem = 0kb
resources_assigned.mem = 0kb
resources_assigned.naccelerators = 0
resources_assigned.ncpus = 0
resources_assigned.vmem = 0kb
queue = N10C80
resv_enable = True
sharing = default_shared
last_state_change_time = Sun Jan 30 21:37:17 2022

argo-c3
Mom = argo-c3.cluster
ntype = PBS
state = free
pcpus = 8
resources_available.arch = linux
resources_available.host = argo-c3
resources_available.mem = 16403088kb
resources_available.ncpus = 8
resources_available.vnode = argo-c3
resources_assigned.accelerator_memory = 0kb
resources_assigned.hbmem = 0kb
resources_assigned.mem = 0kb
resources_assigned.naccelerators = 0
resources_assigned.ncpus = 0
resources_assigned.vmem = 0kb
queue = N10C80
resv_enable = True
sharing = default_shared
last_state_change_time = Sun Jan 30 21:37:17 2022

argo-c4
Mom = argo-c4.cluster
ntype = PBS
state = free
pcpus = 8
resources_available.arch = linux
resources_available.host = argo-c4
resources_available.mem = 16403104kb
resources_available.ncpus = 8
resources_available.vnode = argo-c4
resources_assigned.accelerator_memory = 0kb
resources_assigned.hbmem = 0kb
resources_assigned.mem = 0kb
resources_assigned.naccelerators = 0
resources_assigned.ncpus = 0
resources_assigned.vmem = 0kb
queue = N10C80
resv_enable = True
sharing = default_shared
last_state_change_time = Sun Jan 30 21:36:33 2022

argo-c5
Mom = argo-c5.cluster
ntype = PBS
state = free
pcpus = 8
resources_available.arch = linux
resources_available.host = argo-c5
resources_available.mem = 16403088kb
resources_available.ncpus = 8
resources_available.vnode = argo-c5
resources_assigned.accelerator_memory = 0kb
resources_assigned.hbmem = 0kb
resources_assigned.mem = 0kb
resources_assigned.naccelerators = 0
resources_assigned.ncpus = 0
resources_assigned.vmem = 0kb
queue = N10C80
resv_enable = True
sharing = default_shared
last_state_change_time = Sun Jan 30 21:36:33 2022

argo-c6
Mom = argo-c6.cluster
ntype = PBS
state = free
pcpus = 8
resources_available.arch = linux
resources_available.host = argo-c6
resources_available.mem = 16403088kb
resources_available.ncpus = 8
resources_available.vnode = argo-c6
resources_assigned.accelerator_memory = 0kb
resources_assigned.hbmem = 0kb
resources_assigned.mem = 0kb
resources_assigned.naccelerators = 0
resources_assigned.ncpus = 0
resources_assigned.vmem = 0kb
queue = N10C80
resv_enable = True
sharing = default_shared
last_state_change_time = Sun Jan 30 21:36:33 2022

argo-c7
Mom = argo-c7.cluster
ntype = PBS
state = free
pcpus = 8
resources_available.arch = linux
resources_available.host = argo-c7
resources_available.mem = 16403104kb
resources_available.ncpus = 8
resources_available.vnode = argo-c7
resources_assigned.accelerator_memory = 0kb
resources_assigned.hbmem = 0kb
resources_assigned.mem = 0kb
resources_assigned.naccelerators = 0
resources_assigned.ncpus = 0
resources_assigned.vmem = 0kb
queue = N10C80
resv_enable = True
sharing = default_shared
last_state_change_time = Sun Jan 30 21:36:33 2022

argo-c8
Mom = argo-c8.cluster
ntype = PBS
state = free
pcpus = 8
resources_available.arch = linux
resources_available.host = argo-c8
resources_available.mem = 16403104kb
resources_available.ncpus = 8
resources_available.vnode = argo-c8
resources_assigned.accelerator_memory = 0kb
resources_assigned.hbmem = 0kb
resources_assigned.mem = 0kb
resources_assigned.naccelerators = 0
resources_assigned.ncpus = 0
resources_assigned.vmem = 0kb
queue = N10C80
resv_enable = True
sharing = default_shared
last_state_change_time = Sun Jan 30 21:36:33 2022

argo-c10
Mom = argo-c10.cluster
ntype = PBS
state = free
pcpus = 20
resources_available.arch = linux
resources_available.host = argo-c10
resources_available.mem = 32468176kb
resources_available.ncpus = 20
resources_available.ngpus = 2
resources_available.vnode = argo-c10
resources_assigned.accelerator_memory = 0kb
resources_assigned.hbmem = 0kb
resources_assigned.mem = 0kb
resources_assigned.naccelerators = 0
resources_assigned.ncpus = 0
resources_assigned.ngpus = 0
resources_assigned.vmem = 0kb
queue = GPUq
resv_enable = True
sharing = default_shared
last_state_change_time = Sun Jan 30 23:03:16 2022
last_used_time = Sun Jan 30 23:03:16 2022

argo-c9
Mom = argo-c9.cluster
ntype = PBS
state = free
pcpus = 8
resources_available.arch = linux
resources_available.host = argo-c9
resources_available.mem = 16403088kb
resources_available.ncpus = 8
resources_available.vnode = argo-c9
resources_assigned.accelerator_memory = 0kb
resources_assigned.hbmem = 0kb
resources_assigned.mem = 0kb
resources_assigned.naccelerators = 0
resources_assigned.ncpus = 0
resources_assigned.vmem = 0kb
queue = N10C80
resv_enable = True
sharing = default_shared
last_state_change_time = Sun Jan 30 21:36:33 2022

qstat -asx 2239

argo:
Req’d Req’d Elap
Job ID Username Queue Jobname SessID NDS TSK Memory Time S Time


2239.argo test GPUq myGPUJob 20299 1 10 – 00:01 F 00:00
Job run at Sun Jan 30 at 23:02 on (argo-c10:ncpus=10:ngpus=1) and failed

Thank you @msideris for the above details.
It seems the job ran but failed, as per this snippet.
Could you please check the .o and .e files? The MoM logs on the GPU node could have some details.
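
A note on the `Exit_status = 127` shown in the qstat -fx output: assuming PBS is reporting the job shell's own exit code here (which is the usual meaning of values below 128), 127 is what a POSIX shell returns when it cannot find or start the command it was asked to run. So a plausible but unconfirmed cause is that something invoked by myPBSScript.sh is not available on the GPU node. The sketch below only reproduces that code with a deliberately nonexistent command; `no_such_command_xyz` is a made-up name.

```shell
# Run a deliberately missing command in a child shell and capture its status.
# "no_such_command_xyz" is a hypothetical name chosen so that it cannot exist.
status=0
sh -c 'no_such_command_xyz' 2>/dev/null || status=$?

echo "exit status: $status"
```

If the job script calls a binary by a relative name, checking `command -v <name>` from an ssh session on argo-c10 with the same modules loaded is a quick way to confirm or rule this out.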

Hello,
I can't find myGPUJob 20299.o* or .e* files; actually, no job output has been created at all!

The MoM log is:
01/30/2022 21:36:27;0002;pbs_mom;Svr;Log;Log opened
01/30/2022 21:36:27;0002;pbs_mom;Svr;pbs_mom;pbs_version=20.0.1
01/30/2022 21:36:27;0002;pbs_mom;Svr;pbs_mom;pbs_build=mach=N/A:security=N/A:configure_args=N/A
01/30/2022 21:36:27;0002;pbs_mom;Svr;pbs_mom;hostname=argo;pbs_leaf_name=argo;pbs_mom_node_name=N/A
01/30/2022 21:36:27;0002;pbs_mom;Svr;pbs_mom;ipv4 interface lo: localhost
01/30/2022 21:36:27;0002;pbs_mom;Svr;pbs_mom;ipv4 interface enp2s0: argo-rbs.cloud.iasa.gr
01/30/2022 21:36:27;0002;pbs_mom;Svr;pbs_mom;ipv4 interface enp0s31f6: argo
01/30/2022 21:36:27;0002;pbs_mom;Svr;pbs_mom;ipv6 interface lo: argo
01/30/2022 21:36:27;0100;pbs_mom;Svr;parse_config;file config
01/30/2022 21:36:27;0002;pbs_mom;Svr;pbs_mom;Adding IP address 192.168.122.1 as authorized
01/30/2022 21:36:27;0002;pbs_mom;n/a;set_restrict_user_maxsys;setting 999
01/30/2022 21:36:27;0002;pbs_mom;n/a;read_config;max_check_poll = 120, min_check_poll = 10
01/30/2022 21:36:27;0d80;pbs_mom;TPP;pbs_mom(Main Thread);TPP authentication method = resvport
01/30/2022 21:36:27;0c06;pbs_mom;TPP;pbs_mom(Main Thread);TPP leaf node names = argo:15003
01/30/2022 21:36:27;0d80;pbs_mom;TPP;pbs_mom(Main Thread);Initializing TPP transport Layer
01/30/2022 21:36:27;0d80;pbs_mom;TPP;pbs_mom(Main Thread);Max files allowed = 16384
01/30/2022 21:36:27;0d80;pbs_mom;TPP;pbs_mom(Main Thread);TPP initialization done
01/30/2022 21:36:27;0c06;pbs_mom;TPP;pbs_mom(Main Thread);Single pbs_comm configured, TPP Fault tolerant mode disabled
01/30/2022 21:36:27;0d80;pbs_mom;TPP;pbs_mom(Main Thread);Connecting to pbs_comm argo:17001
01/30/2022 21:36:27;0c06;pbs_mom;TPP;pbs_mom(Thread 0);Thread ready
01/30/2022 21:36:27;0c06;pbs_mom;TPP;pbs_mom(Thread 0);Registering address 192.168.122.1:15003 to pbs_comm
01/30/2022 21:36:27;0c06;pbs_mom;TPP;leaf_pkt_postsend_handler(Thread 0);Connected to pbs_comm argo:17001
01/30/2022 21:36:27;0002;pbs_mom;Svr;pbs_mom;Adding IP address 127.0.0.1 as authorized
01/30/2022 21:36:27;0002;pbs_mom;Svr;set_checkpoint_path;Using default checkpoint path.
01/30/2022 21:36:27;0002;pbs_mom;Svr;set_checkpoint_path;Setting checkpoint path to /var/spool/pbs/checkpoint/
01/30/2022 21:36:27;0002;pbs_mom;n/a;ncpus;hyperthreading disabled
01/30/2022 21:36:27;0002;pbs_mom;n/a;initialize;pcpus=4, OS reports 4 cpu(s)
01/30/2022 21:36:27;0006;pbs_mom;Fil;pbs_mom;Version 20.0.1, started, initialization type = 0
01/30/2022 21:36:27;0002;pbs_mom;Svr;pbs_mom;Mom pid = 19901 ready, using ports Server:15001 MOM:15002 RM:15003
01/30/2022 21:36:27;0d80;pbs_mom;TPP;pbs_mom(Main Thread);net restore handler called
01/30/2022 21:36:27;0002;pbs_mom;Svr;pbs_mom;Restart sent to server at argo:15001
01/30/2022 21:36:27;0d80;pbs_mom;TPP;pbs_mom(Thread 0);sd 0, Received noroute to dest 192.168.122.1:15001, msg=“tfd=16, pbs_comm:192.168.122.1:17001: Dest not found”
01/30/2022 21:36:27;0d80;pbs_mom;TPP;pbs_mom(Thread 0);sd 0, Received noroute to dest 192.168.122.1:15001, msg=“tfd=16, pbs_comm:192.168.122.1:17001: Dest not found”

Regarding the message "Received noroute to dest 192.168.122.1:15001, msg="tfd=16, pbs_comm:192.168.122.1:17001: Dest not found"": 192.168.122.1 is actually my host's IP, but shouldn't my node's IP address (192.168.122.20) appear there instead, or am I wrong?

I also saw that the scheduler (port 15004) and the vnode (port 15002) have different port numbers. Is that a problem?

Qmgr: list sched
Sched default
sched_host = argo
pbs_version = 20.0.1
sched_cycle_length = 00:20:00
sched_port = 15004
sched_priv = /var/spool/pbs/sched_priv
sched_log = /var/spool/pbs/sched_logs
scheduling = True
scheduler_iteration = 600
state = idle
preempt_queue_prio = 150
preempt_prio = express_queue, normal_jobs
preempt_order = SCR
preempt_sort = min_time_since_start
log_events = 767
server_dyn_res_alarm = 30

Node argo-c10
Mom = argo-c10.cluster
Port = 15002
pbs_version = 20.0.1
ntype = PBS
state = free
pcpus = 20
resources_available.arch = linux
resources_available.host = argo-c10
resources_available.mem = 32468176kb
resources_available.ncpus = 20
resources_available.ngpus = 2
resources_available.vnode = argo-c10
resources_assigned.accelerator_memory = 0kb
resources_assigned.hbmem = 0kb
resources_assigned.mem = 0kb
resources_assigned.naccelerators = 0
resources_assigned.ncpus = 0
resources_assigned.vmem = 0kb
queue = GPUq
resv_enable = True
sharing = default_shared
last_state_change_time = 1643607114

The ports used are as follows:
15001 – pbs_server
15002 – pbs_mom
15003 – pbs_resmon
15004 – pbs_sched
15007 – pbs_datastore

Please make sure that DNS is working and that /etc/hosts is populated with static IP addresses and hostnames, and that /etc/hosts is the same on all participating systems in the cluster. If you have multiple NICs, you can set PBS_LEAF_NAME in /etc/pbs.conf; it should refer to the hostname of the interface over which you want OpenPBS to communicate.
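
One way to compare hosts files across nodes is sketched below. It is a minimal helper under assumptions, not an official PBS tool: `lookup_host` is a hypothetical function name, and the scratch file only mirrors addresses quoted in this thread (`argo.cluster` and `argo-c11` are invented for the demo).

```shell
# lookup_host FILE NAME: print the IP address(es) that FILE (in /etc/hosts
# format) assigns to NAME, ignoring comments. Prints nothing if NAME is absent.
lookup_host() {
    awk -v name="$2" '
        { sub(/#.*/, "") }                                      # strip comments
        { for (i = 2; i <= NF; i++) if ($i == name) print $1 }  # match any alias
    ' "$1"
}

# Demo on a scratch file; contents are hypothetical.
hosts_file=$(mktemp)
cat > "$hosts_file" <<'EOF'
127.0.0.1      localhost
192.168.122.1  argo.cluster argo        # head node, cluster-facing interface
192.168.122.20 argo-c10.cluster argo-c10
EOF

report=$(for h in argo argo-c10 argo-c11; do
    ip=$(lookup_host "$hosts_file" "$h")
    echo "$h -> ${ip:-MISSING}"
done)
echo "$report"
rm -f "$hosts_file"
```

On a real system you would call `lookup_host /etc/hosts argo-c10` on the server and on each compute node and check that the answers agree; `getent hosts argo-c10` additionally shows what the resolver actually returns once DNS is taken into account.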

Hello adarsh,
DNS is working, /etc/pbs.conf is fine, and the leaf name is also good.
I checked the comm logs:
02/03/2022 00:31:43;0002;Comm@argo;Svr;Comm@argo;Exiting
02/03/2022 00:31:43;0002;Comm@argo;Svr;Log;Log closed
02/03/2022 00:31:52;0002;Comm@argo;Svr;Log;Log opened
02/03/2022 00:31:52;0002;Comm@argo;Svr;Comm@argo;pbs_version=20.0.1
02/03/2022 00:31:52;0002;Comm@argo;Svr;Comm@argo;pbs_build=mach=N/A:security=N/A:configure_args=N/A
02/03/2022 00:31:52;0002;Comm@argo;Svr;Comm@argo;hostname=argo;pbs_leaf_name=argo;pbs_mom_node_name=N/A
02/03/2022 00:31:52;0002;Comm@argo;Svr;Comm@argo;ipv4 interface lo: localhost
02/03/2022 00:31:52;0002;Comm@argo;Svr;Comm@argo;ipv4 interface enp2s0: argo.cloud.iasa.gr
02/03/2022 00:31:52;0002;Comm@argo;Svr;Comm@argo;ipv4 interface enp0s31f6: argo
02/03/2022 00:31:52;0002;Comm@argo;Svr;Comm@argo;ipv6 interface lo: argo
02/03/2022 00:31:52;0002;Comm@argo;Svr;Comm@argo;/opt/pbs/sbin/pbs_comm ready (pid=734842), Proxy Name:argo:17001, Threads:4
02/03/2022 00:31:52;0000;Comm@argo;Svr;Comm@argo;Supported authentication method: resvport
02/03/2022 00:31:52;0c06;Comm@argo;TPP;Comm@argo(Thread 2);Thread ready
02/03/2022 00:31:52;0c06;Comm@argo;TPP;Comm@argo(Thread 1);Thread ready
02/03/2022 00:31:52;0c06;Comm@argo;TPP;Comm@argo(Thread 0);Thread ready
02/03/2022 00:31:52;0c06;Comm@argo;TPP;Comm@argo(Thread 3);Thread ready
02/03/2022 00:31:52;0c06;Comm@argo;TPP;Comm@argo(Thread 1);tfd=16, Leaf registered address 192.168.122.1:15004
02/03/2022 00:31:55;0c06;Comm@argo;TPP;Comm@argo(Thread 2);tfd=17, Leaf registered address 192.168.122.19:15003
02/03/2022 00:31:55;0c06;Comm@argo;TPP;Comm@argo(Thread 3);tfd=18, Leaf registered address 192.168.122.13:15003
02/03/2022 00:31:55;0c06;Comm@argo;TPP;Comm@argo(Thread 3);tfd=21, Leaf registered address 192.168.122.18:15003
02/03/2022 00:31:55;0c06;Comm@argo;TPP;Comm@argo(Thread 1);tfd=19, Leaf registered address 192.168.122.10:15003
02/03/2022 00:31:55;0c06;Comm@argo;TPP;Comm@argo(Thread 1);tfd=22, Leaf registered address 192.168.122.12:15003
02/03/2022 00:31:55;0c06;Comm@argo;TPP;Comm@argo(Thread 1);tfd=25, Leaf registered address 192.168.122.11:15003
02/03/2022 00:31:55;0c06;Comm@argo;TPP;Comm@argo(Thread 2);tfd=20, Leaf registered address 192.168.122.14:15003
02/03/2022 00:31:55;0c06;Comm@argo;TPP;Comm@argo(Thread 2);tfd=23, Leaf registered address 192.168.122.16:15003
02/03/2022 00:31:55;0c06;Comm@argo;TPP;Comm@argo(Thread 3);tfd=24, Leaf registered address 192.168.122.15:15003
02/03/2022 00:31:55;0c06;Comm@argo;TPP;Comm@argo(Thread 3);tfd=27, Leaf registered address 192.168.122.20:15003
02/03/2022 00:31:55;0c06;Comm@argo;TPP;Comm@argo(Thread 2);tfd=26, Leaf registered address 192.168.122.17:15003
02/03/2022 00:31:58;0c06;Comm@argo;TPP;Comm@argo(Thread 1);tfd=28, Leaf registered address 192.168.122.1:15001
After I deleted the datastore, I got the plain nodes (10 nodes x 8 CPUs) working with the standard queue (workq), but when I switch to the N10C80 queue and assign the nodes to it, it stops working.
Is there a way to reset PBS to the state it was in when I first installed it?
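
Regarding the earlier "noroute ... Dest not found" message: it may simply mean that pbs_comm had no leaf registered under 192.168.122.1:15001 at that moment. The comm log in this post can be checked for that with a few lines of Python. This is a rough sketch that assumes the "Leaf registered address" wording seen above; `registered_leaves` and `is_registered` are hypothetical helper names.

```python
import re

# Matches the registration lines as they appear in the comm log above.
LEAF_RE = re.compile(r"Leaf registered address (\d+\.\d+\.\d+\.\d+):(\d+)")


def registered_leaves(log_text):
    """Return the set of (ip, port) pairs that registered with pbs_comm."""
    return {(ip, int(port)) for ip, port in LEAF_RE.findall(log_text)}


def is_registered(log_text, ip, port):
    """True if the given endpoint shows up as a registered leaf."""
    return (ip, port) in registered_leaves(log_text)


if __name__ == "__main__":
    sample = (
        "02/03/2022 00:31:55;0c06;Comm@argo;TPP;Comm@argo(Thread 3);"
        "tfd=27, Leaf registered address 192.168.122.20:15003\n"
        "02/03/2022 00:31:58;0c06;Comm@argo;TPP;Comm@argo(Thread 1);"
        "tfd=28, Leaf registered address 192.168.122.1:15001\n"
    )
    print(is_registered(sample, "192.168.122.1", 15001))
```

In the log above, 192.168.122.1:15001 registers at 00:31:58, a few seconds after the moms, so a mom that tried to reach the server before then would not have found it.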

  1. Take a PBS snapshot (to save the existing configuration)
  2. Stop the PBS services on the PBS server
  3. Move $PBS_HOME/datastore to $PBS_HOME/datastore_old
  4. Run $PBS_EXEC/libexec/installdb
  5. Start the PBS services
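
The five steps above can be written down as a dry-run plan before anything is executed. This is only a sketch under several assumptions: PBS_HOME is /var/spool/pbs and PBS_EXEC is /opt/pbs (the paths that appear in the logs in this thread), the snapshot is taken with `pbs_snapshot -o <dir>`, and the services are controlled through a systemd unit named `pbs`. Adjust all of these to your installation; `reset_plan` is a hypothetical helper name.

```python
import shlex


def reset_plan(pbs_home="/var/spool/pbs", pbs_exec="/opt/pbs",
               snapshot_dir="/tmp/pbs_snapshot"):
    """Return the reset steps as a list of argv lists, without running them."""
    datastore = f"{pbs_home}/datastore"
    return [
        # 1. Save the existing configuration (assumed flag: -o <output dir>).
        ["pbs_snapshot", "-o", snapshot_dir],
        # 2. Stop the PBS services (assumed systemd unit name: pbs).
        ["systemctl", "stop", "pbs"],
        # 3. Move the old datastore aside instead of deleting it.
        ["mv", datastore, datastore + "_old"],
        # 4. Recreate the datastore, as given in step 4 above.
        [f"{pbs_exec}/libexec/installdb"],
        # 5. Start the PBS services again.
        ["systemctl", "start", "pbs"],
    ]


if __name__ == "__main__":
    # Print the commands for review; run them by hand as root once checked.
    for argv in reset_plan():
        print(shlex.join(argv))
```

Printing the plan first makes it easy to confirm the paths before the datastore is moved.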

Hope this helps

Hello adarsh, and have a good week,
Yes, I was able to reset PBS, but my problem remains.
I am deploying a new image to my vnodes. If that does not work, I will post new PBS data, including the 10 CPU nodes, since I now cannot run jobs on them either.
I would kindly ask for your help again; I have redone the whole cluster installation from scratch.
Thanks!


Hi adarsh,
In the node's mom_logs I found the following puzzling error:
02/11/2022 06:19:40;0100;pbs_mom;Req;;Type 1 request received from root@192.168.122.1:15001, sock=2
02/11/2022 06:19:40;0100;pbs_mom;Req;;Type 3 request received from root@192.168.122.1:15001, sock=2
02/11/2022 06:19:40;0100;pbs_mom;Req;;Type 5 request received from root@192.168.122.1:15001, sock=2
02/11/2022 06:19:40;0020;pbs_mom;Fil;/var/spool/pbs/spool/2.argo.OU;secure create of file failed for job 2.argo for user 1001
02/11/2022 06:19:40;0020;pbs_mom;Fil;/var/spool/pbs/spool/2.argo.ER;secure create of file failed for job 2.argo for user 1001
02/11/2022 06:19:40;0001;pbs_mom;Job;2.argo;Unable to open standard output/error
02/11/2022 06:19:40;0001;pbs_mom;Job;2.argo;job not started, Retry -3
02/11/2022 06:19:40;0100;pbs_mom;Job;2.argo;task 00000001 cput=00:00:00
02/11/2022 06:19:40;0008;pbs_mom;Job;2.argo;kill_job

I checked the /var/spool/pbs/spool/ folder, but it's empty!
I can't figure out what is happening here. Can you help me?

  • This might be related to your file system / mount permissions.
  • You can run the commands below to report any issues, fix them, and then check again.

$PBS_EXEC/sbin/pbs_probe -v
$PBS_EXEC/sbin/pbs_probe -f
$PBS_EXEC/sbin/pbs_probe -v
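
In addition to pbs_probe, the spool directory from the "secure create of file failed" message can be inspected directly on the node. The sketch below assumes that $PBS_HOME/spool is expected to be a world-writable directory with the sticky bit set (mode 1777); treat that expected mode as an assumption and defer to what pbs_probe reports. `spool_dir_problems` is a hypothetical helper name.

```python
import os
import stat


def spool_dir_problems(path):
    """Return a list of human-readable problems found with a spool directory."""
    try:
        st = os.stat(path)
    except OSError as exc:
        return [f"cannot stat {path}: {exc.strerror}"]
    if not stat.S_ISDIR(st.st_mode):
        return [f"{path} is not a directory"]
    problems = []
    # Assumed expectation: mode 1777, so any job owner can create files here.
    if not st.st_mode & stat.S_IWOTH:
        problems.append("not world-writable")
    if not st.st_mode & stat.S_ISVTX:
        problems.append("sticky bit not set")
    return problems


if __name__ == "__main__":
    # Path taken from the mom log in this thread.
    for problem in spool_dir_problems("/var/spool/pbs/spool"):
        print(problem)
```

This only looks at the mode bits. A read-only or otherwise restricted mount can still block file creation even when the mode looks right, so the mount options on the node are worth checking too.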

Thanks, it worked!
I had some mount problems. Thanks again for your help!
