Job performance is lower when scheduled through PBS

Hi all:
I am using OpenPBS v20.0.1 on our computing cluster, and I am also a heavy user of molecular dynamics software such as GROMACS and LAMMPS. Recently, I noticed that job performance is much lower when a job is submitted through PBS, compared with running it directly on the same computing node through an SSH login.

The performance of LAMMPS through PBS is: 5.363 ns/day, 4.475 hours/ns, 62.068 timesteps/s, 91.0% CPU use with 24 MPI tasks x 1 OpenMP threads

The Device Time Information through PBS is:
Data Transfer: 7.1626 s.
Neighbor copy: 0.0257 s.
Neighbor build: 0.0926 s.
Force calc: 10.1331 s.
Device Overhead: 2.5805 s.
Average split: 1.0000.
Lanes / atom: 4.
Vector width: 32.
Max Mem / Proc: 33.78 MB.
CPU Neighbor: 0.3908 s.
CPU Cast/Pack: 24.4776 s.
CPU Driver_Time: 0.1674 s.
CPU Idle_Time: 3.9796 s.

Meanwhile, when running the same job with the same resource configuration on the same computing node through an SSH login (without PBS), the performance is: 11.133 ns/day, 2.156 hours/ns, 128.850 timesteps/s, 86.2% CPU use with 24 MPI tasks x 1 OpenMP threads

The Device Time Information through ssh is:
Data Transfer: 69.7440 s.
Neighbor copy: 0.0488 s.
Neighbor build: 0.4698 s.
Force calc: 67.2331 s.
Device Overhead: 17.6636 s.
Average split: 1.0000.
Lanes / atom: 4.
Vector width: 32.
Max Mem / Proc: 31.92 MB.
CPU Neighbor: 0.6567 s.
CPU Cast/Pack: 42.5587 s.
CPU Driver_Time: 0.4871 s.
CPU Idle_Time: 43.5711 s.

It is evident that running directly without PBS is roughly twice as fast as running through PBS. I have no idea how this happens. Could you give me some advice?

Sincerely Pan

Could you please share the job scripts used to run LAMMPS:

  • with openPBS and env output
  • without openPBS and env output

Which MPI are you using: Intel MPI or Open MPI?
Was this MPI compiled from source with openPBS TM support?
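
If it is Open MPI, a quick way to check (just a sketch on my side) is to look at the ompi_info output; the TM components are listed only when Open MPI was configured with --with-tm:

ompi_info | grep -w tm    # expect lines such as "MCA plm: tm" and "MCA ras: tm" if TM support is built in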

The contents of my PBS job script are:

#PBS -q gpu
#PBS -l walltime=168:00:00
#PBS -l select=4:ncpus=6:ngpus=1:mem=12gb:mpiprocs=6:host=g6:ompthreads=1

cd $PBS_O_WORKDIR
module purge
module load lammps/2021-Sep-29-gpu-cuda-11.0-gcc-9.2
mpirun -np 24 lmp -sf gpu -pk gpu 0 -in in.NiPd > lmp_out_pbs.dat

The env output with openPBS is:

LAMMPS (29 Sep 2021)
using 1 OpenMP thread(s) per MPI task
Lattice spacing in x,y,z = 3.5200000 3.5200000 3.5200000
Reading data file …
orthogonal box = (0.0000000 0.0000000 0.0000000) to (176.00000 176.00000 176.00000)
3 by 2 by 4 MPI processor grid
reading atoms …
500000 atoms
reading velocities …
500000 velocities
read_data CPU = 1.204 seconds

  • Using acceleration for eam/alloy:
  • with 6 proc(s) per device.
  • Horizontal vector operations: ENABLED
  • Shared memory system: No

Device 0: NVIDIA GeForce RTX 3080, 68 CUs, 8.5/9.8 GB, 1.7 GHZ (Mixed Precision)
Device 1: NVIDIA GeForce RTX 3080, 68 CUs, 1.7 GHZ (Mixed Precision)
Device 2: NVIDIA GeForce RTX 3080, 68 CUs, 1.7 GHZ (Mixed Precision)
Device 3: NVIDIA GeForce RTX 3080, 68 CUs, 1.7 GHZ (Mixed Precision)

Initializing Device and compiling on process 0…Done.
Initializing Devices 0-3 on core 0…Done.
Initializing Devices 0-3 on core 1…Done.
Initializing Devices 0-3 on core 2…Done.
Initializing Devices 0-3 on core 3…Done.
Initializing Devices 0-3 on core 4…Done.
Initializing Devices 0-3 on core 5…Done.

My job script without PBS is a bash script; its contents are:

module purge
module load lammps/2021-Sep-29-gpu-cuda-11.0-gcc-9.2
mpirun -np 24 lmp -sf gpu -pk gpu 0 -in in.NiPd > lmp_out_direct.dat

The env output without pbs is:
LAMMPS (29 Sep 2021)
OMP_NUM_THREADS environment is not set. Defaulting to 1 thread. (src/comm.cpp:98)
using 1 OpenMP thread(s) per MPI task
Lattice spacing in x,y,z = 3.5200000 3.5200000 3.5200000
Reading data file …
orthogonal box = (0.0000000 0.0000000 0.0000000) to (176.00000 176.00000 176.00000)
3 by 2 by 4 MPI processor grid
reading atoms …
500000 atoms
reading velocities …
500000 velocities
read_data CPU = 0.732 seconds

  • Using acceleration for eam/alloy:
  • with 6 proc(s) per device.
  • Horizontal vector operations: ENABLED
  • Shared memory system: No

Device 0: NVIDIA GeForce RTX 3080, 68 CUs, 8.5/9.8 GB, 1.7 GHZ (Mixed Precision)
Device 1: NVIDIA GeForce RTX 3080, 68 CUs, 1.7 GHZ (Mixed Precision)
Device 2: NVIDIA GeForce RTX 3080, 68 CUs, 1.7 GHZ (Mixed Precision)
Device 3: NVIDIA GeForce RTX 3080, 68 CUs, 1.7 GHZ (Mixed Precision)

Initializing Device and compiling on process 0…Done.
Initializing Devices 0-3 on core 0…Done.
Initializing Devices 0-3 on core 1…Done.
Initializing Devices 0-3 on core 2…Done.
Initializing Devices 0-3 on core 3…Done.
Initializing Devices 0-3 on core 4…Done.
Initializing Devices 0-3 on core 5…Done.

I compiled LAMMPS with openmpi-4.0.6, which I built from source. I don’t know what openPBS TM is, so I don’t think I am using it.

Thank you @tpan1039. Could you please try the script below and check whether it makes any difference.

#PBS -q gpu
#PBS -l walltime=168:00:00
#PBS -l select=1:ncpus=24:ngpus=1:mem=48gb:mpiprocs=24:host=g6:ompthreads=1
cd $PBS_O_WORKDIR
module purge
module load lammps/2021-Sep-29-gpu-cuda-11.0-gcc-9.2
mpirun -np 24 lmp -sf gpu -pk gpu 0 -in in.NiPd > lmp_out_pbs.dat

Hi @adarsh, the script you posted uses only 1 GPU card. The env output is:

LAMMPS (29 Sep 2021)
using 1 OpenMP thread(s) per MPI task
Lattice spacing in x,y,z = 3.5200000 3.5200000 3.5200000
Reading data file …
orthogonal box = (0.0000000 0.0000000 0.0000000) to (176.00000 176.00000 176.00000)
3 by 2 by 4 MPI processor grid
reading atoms …
500000 atoms
reading velocities …
500000 velocities
read_data CPU = 1.183 seconds

CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE

Your simulation uses code contributions which should be cited:

  • GPU package (short-range, long-range and three-body potentials):
    The log file lists these citations in BibTeX format.

CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE


  • Using acceleration for eam/alloy:
  • with 24 proc(s) per device.
  • Horizontal vector operations: ENABLED
  • Shared memory system: No

Device 0: NVIDIA GeForce RTX 3080, 68 CUs, 4.8/9.8 GB, 1.7 GHZ (Mixed Precision)

And the performance is 3.861 ns/day, 6.216 hours/ns, 44.686 timesteps/s
88.4% CPU use with 24 MPI tasks x 1 OpenMP threads

I also tried the following script, which uses 4 GPU cards in one chunk:
#PBS -q gpu
#PBS -l walltime=168:00:00
#PBS -l select=1:ncpus=24:ngpus=4:mem=48gb:mpiprocs=24:host=g6:ompthreads=1

The output is:
LAMMPS (29 Sep 2021)
using 1 OpenMP thread(s) per MPI task
Lattice spacing in x,y,z = 3.5200000 3.5200000 3.5200000
Reading data file …
orthogonal box = (0.0000000 0.0000000 0.0000000) to (176.00000 176.00000 176.00000)
3 by 2 by 4 MPI processor grid
reading atoms …
500000 atoms
reading velocities …
500000 velocities
read_data CPU = 1.199 seconds

CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE

Your simulation uses code contributions which should be cited:

  • GPU package (short-range, long-range and three-body potentials):
    The log file lists these citations in BibTeX format.

CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE-CITE


  • Using acceleration for eam/alloy:
  • with 6 proc(s) per device.
  • Horizontal vector operations: ENABLED
  • Shared memory system: No

Device 0: NVIDIA GeForce RTX 3080, 68 CUs, 8.5/9.8 GB, 1.7 GHZ (Mixed Precision)
Device 1: NVIDIA GeForce RTX 3080, 68 CUs, 1.7 GHZ (Mixed Precision)
Device 2: NVIDIA GeForce RTX 3080, 68 CUs, 1.7 GHZ (Mixed Precision)
Device 3: NVIDIA GeForce RTX 3080, 68 CUs, 1.7 GHZ (Mixed Precision)

Initializing Device and compiling on process 0…Done.
Initializing Devices 0-3 on core 0…Done.
Initializing Devices 0-3 on core 1…Done.
Initializing Devices 0-3 on core 2…Done.
Initializing Devices 0-3 on core 3…Done.
Initializing Devices 0-3 on core 4…Done.
Initializing Devices 0-3 on core 5…Done.

The performance is: 5.413 ns/day, 4.434 hours/ns, 62.652 timesteps/s
91.2% CPU use with 24 MPI tasks x 1 OpenMP threads

This performance is no different from the previous run that used 4 chunks.

Hi, I recompiled Open MPI from source with openPBS TM support, and I did more tests with and without PBS.
I found that the performance difference only occurs for GPU jobs. There is no difference in performance if I run LAMMPS on pure CPU, with or without PBS.

I tested the job running on one GPU card with PBS, using the following script:

#PBS -q gpu
#PBS -l walltime=168:00:00
#PBS -l select=1:ncpus=24:ngpus=1:mem=48gb:mpiprocs=24:ompthreads=1:host=g6

cd $PBS_O_WORKDIR

module purge
module load lammps/2021-Sep-29-gpu-cuda-11.0-gcc-9.2

mpirun -np 24 lmp -sf gpu -pk gpu 0 -in in.NiPd > lmp_out_pbs_gpu_1.dat

The performance with PBS on one GPU card is: 3.854 ns/day, 6.228 hours/ns, 44.605 timesteps/s, 88.5% CPU use with 24 MPI tasks x 1 OpenMP threads

I also tested the same job running on one GPU card without PBS, using a bash script:

CUDA_VISIBLE_DEVICES="0"
module purge
module load lammps/2021-Sep-29-gpu-cuda-11.0-gcc-9.2
mpirun -np 24 lmp -sf gpu -pk gpu 0 -in in.NiPd > lmp_out_direct_gpu_1.dat

The performance without PBS on one GPU card is: 4.737 ns/day, 5.067 hours/ns, 54.824 timesteps/s, 85.7% CPU use with 24 MPI tasks x 1 OpenMP threads

It is evident that my GPU jobs run somewhat slower under openPBS. I don’t know how this happens. The contents of my pbs_cgroups.json are:

{
    "cgroup_prefix"         : "pbs_jobs",
    "exclude_hosts"         : [],
    "exclude_vntypes"       : ["no_cgroups"],
    "run_only_on_hosts"     : [],
    "periodic_resc_update"  : true,
    "vnode_per_numa_node"   : false,
    "online_offlined_nodes" : true,
    "use_hyperthreads"      : true,
    "ncpus_are_cores"       : false,
    "cgroup" : {
        "cpuacct" : {
            "enabled"         : true,
            "exclude_hosts"   : [],
            "exclude_vntypes" : []
        },
        "cpuset" : {
            "enabled"            : true,
            "exclude_cpus"       : [],
            "exclude_hosts"      : [],
            "exclude_vntypes"    : [],
            "mem_fences"         : true,
            "mem_hardwall"       : false,
            "memory_spread_page" : false
        },
        "devices" : {
            "enabled"         : true,
            "exclude_hosts"   : [],
            "exclude_vntypes" : [],
            "allow" : [
                "c 195:* m",
                ["infiniband/rdma_cm","rwm"],
                ["fuse","rwm"],
                ["net/tun","rwm"],
                ["tty","rwm"],
                ["ptmx","rwm"],
                ["console","rwm"],
                ["null","rwm"],
                ["zero","rwm"],
                ["full","rwm"],
                ["random","rwm"],
                ["urandom","rwm"],
                ["cpu/0/cpuid","rwm","*"],
                ["nvidia-modeset", "rwm"],
                ["nvidia-uvm", "rwm"],
                ["nvidia-uvm-tools", "rwm"],
                ["nvidiactl", "rwm"]
            ]
        },
        "hugetlb" : {
            "enabled"         : false,
            "exclude_hosts"   : [],
            "exclude_vntypes" : [],
            "default"         : "0MB",
            "reserve_percent" : 0,
            "reserve_amount"  : "0MB"
        },
        "memory" : {
            "enabled"         : true,
            "exclude_hosts"   : [],
            "exclude_vntypes" : [],
            "soft_limit"      : false,
            "default"         : "256MB",
            "reserve_percent" : 0,
            "reserve_amount"  : "64MB"
        },
        "memsw" : {
            "enabled"         : false,
            "exclude_hosts"   : [],
            "exclude_vntypes" : [],
            "default"         : "256MB",
            "reserve_percent" : 0,
            "reserve_amount"  : "64MB"
        }
    }
}

Thank you @tpan1039 for sharing this information; I appreciate your analysis/results. I am still not sure how openPBS would affect the performance, as it is running the same set of commands.

Without openPBS, how does this command line know to use 4 discrete GPU cards (does it automatically set CUDA_VISIBLE_DEVICES)?

mpirun -np 24 lmp -sf gpu -pk gpu 0 -in in.NiPd > lmp_out_pbs.dat

Sorry, I don’t know how it knows to use 4 cards. Every GPU computing node in our cluster has 4 GPU cards. I think it just takes all of the GPU resources for the calculation.
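
For example (just my guess, untested), one could check what the processes can see right before mpirun with:

nvidia-smi -L                                                 # lists every GPU on the node
echo ${CUDA_VISIBLE_DEVICES:-"(not set: all devices visible)"}

This would match the logs above, where -pk gpu 0 ended up using all four devices.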

Could you please try the script below.
Reference: 7.4.1. GPU package — LAMMPS documentation
Check: the use of "-sf gpu"
Check: the use of "-pk"

#PBS -q gpu
#PBS -l walltime=168:00:00
#PBS -l select=1:ncpus=24:ngpus=4:mem=48gb:mpiprocs=24:ompthreads=1:host=g6

cd $PBS_O_WORKDIR

module purge
module load lammps/2021-Sep-29-gpu-cuda-11.0-gcc-9.2

mpirun -np 24 lmp -sf gpu -pk gpu 4 -in in.NiPd > lmp_out_pbs_gpu_1.dat

I tested it with -pk gpu 4, and the performance is the same as with -pk gpu 0.

Thank you @tpan1039. Sorry, I do not have a clue why these performance variations are seen.

A few more things you could try:

  • Perhaps the environment variables are significantly different between a PBS job and an ssh session. In your PBS job, right before the mpirun command, insert
env | sort > env_pbs

Similarly, in the ssh session, right before executing mpirun, type

env | sort > env_ssh

Then compare them:

diff env_pbs env_ssh
  • Another possibility is that some system process that runs only when PBS is active is stealing the CPU away often enough to slow the program down. You could test this by running with only 20 CPUs, leaving some available for these stray processes. Run PBS and ssh each using -np 20 and see if you still get the big difference between their timings. (Both numbers should be worse than the -np 24 numbers.)

  • Yet another possibility is that the pbs_cgroups hook is somehow causing trouble. If you don’t want to disable the hook everywhere, you could temporarily disable it on your test node (g6) by changing the hook configuration file to have

exclude_hosts [ "g6" ]

(I think; untested)
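
Something like this, perhaps (again untested; the qmgr export/import commands are the usual way to edit a hook's configuration):

qmgr -c "export hook pbs_cgroups application/x-config default" > pbs_cgroups.json
# edit pbs_cgroups.json so the top-level entry reads:  "exclude_hosts" : ["g6"],
qmgr -c "import hook pbs_cgroups application/x-config default pbs_cgroups.json"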

Another possibility occurred to me:

  • You might have at least one MPI process sharing a core with another MPI process. To check this, when the PBS job is running, ssh into the node and run
ps -Lu root -N -O psr,pcpu,user

Each of the MPI processes should have a unique psr and they should all have about the same pcpu.
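
Another quick cross-check (a sketch, assuming the LAMMPS binary is named lmp) is to dump the affinity list of each rank with taskset:

for pid in $(pgrep -u $USER -x lmp); do
    taskset -cp $pid    # prints the list of CPUs this process is allowed to run on
done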

Hi @dtalcott, thank you for your reply. I have done more tests; let me first show what I found.

1: The performance difference only occurs for GPU jobs. There is no difference when I run LAMMPS on pure CPU, with or without PBS.
2: The difference only occurs when I use more than one MPI process per GPU card, no matter how many cards I use. In other words, if I run only one MPI process on one GPU, there is no performance difference. BTW, I have compiled my Open MPI with CUDA awareness and openPBS TM support.
3: The strange thing is that if I fully occupy the resources on our GPU node with 48 MPI processes, there is no significant performance difference, i.e. 9.01 ns/day vs 8.97 ns/day. (PS: every GPU node on our cluster has 24 physical CPU cores, i.e. at most 48 hyperthreads.)

I have also compared the environment variables as you mentioned above, but I was unable to figure out any significant difference between them. The output of "diff env_pbs env_ssh" is as follows:

1,2c1,3
< _mlstatus=$?;
< return $_mlstatus

}
}
}
5c6,7
< BASH_FUNC_module%%=() { eval /usr/bin/tclsh /global/software/Modules-5.0.0/libexec/modulecmd.tcl bash "$@";


BASH_FUNC_module%%=() { _module_raw “$@” 2>&1
BASH_FUNC__module_raw%%=() { eval /usr/bin/tclsh /global/software/Modules-5.0.0/libexec/modulecmd.tcl bash "$@";
6a9
C_INCLUDE_PATH=/global/software/libpng-1.2.59-gcc9.2/include:/global/software/GSL-2.7-gcc9.2/include:/global/software/fftw-3.3.10-gcc-9.2-openmpi/include:/global/software/Eigen-3.3.9-gcc-9.2/include/eigen3:/global/software/GCC-9.2/include
10d12
< CUDA_DEVICE_ORDER=PCI_BUS_ID
12d13
< CUDA_VISIBLE_DEVICES=GPU-5a49fab1-7517-e4e4-b3fb-cbc7d6f56b47,GPU-79778d38-81aa-35a2-70c7-50b96908a37b,GPU-8dccabbf-fb20-81f4-9862-cff4fa95a223,GPU-b8a4aa18-ddc0-1e44-ec4f-4a9a1fbf96b5
14,15d14
< C_INCLUDE_PATH=/global/software/libpng-1.2.59-gcc9.2/include:/global/software/GSL-2.7-gcc9.2/include:/global/software/fftw-3.3.10-gcc-9.2-openmpi/include:/global/software/Eigen-3.3.9-gcc-9.2/include/eigen3:/global/software/GCC-9.2/include
< ENVIRONMENT=BATCH
22a22,31
LANG=en_US.UTF-8
LC_ADDRESS=zh_CN.UTF-8
LC_IDENTIFICATION=zh_CN.UTF-8
LC_MEASUREMENT=zh_CN.UTF-8
LC_MONETARY=zh_CN.UTF-8
LC_NAME=zh_CN.UTF-8
LC_NUMERIC=zh_CN.UTF-8
LC_PAPER=zh_CN.UTF-8
LC_TELEPHONE=zh_CN.UTF-8
LC_TIME=zh_CN.UTF-8
23a33,34
LESSCLOSE=/usr/bin/lesspipe %s %s
LESSOPEN=| /usr/bin/lesspipe %s
24a36
LMFILES=/global/software/Modules-5.0.0/modulefiles/gcc/9.2:/global/software/Modules-5.0.0/modulefiles/openmpi/4.0.6-gcc-9.2-cuda-11.0:/global/software/Modules-5.0.0/modulefiles/eigen/3.3.9-gcc-9.2:/global/software/Modules-5.0.0/modulefiles/fftw/3.3.10-gcc-9.2-openmpi-4.0.6:/global/software/Modules-5.0.0/modulefiles/cuda/11.0:/global/software/Modules-5.0.0/modulefiles/gsl/2.7-gcc-9.2:/global/software/Modules-5.0.0/modulefiles/libpng/1.2.59-gcc-9.2:/global/software/Modules-5.0.0/modulefiles/lammps/2021-Sep-29-gpu-cuda-11.0-gcc-9.2
26a39,40
LS_COLORS=rs=0:di=01;34:ln=01;36:mh=00:pi=40;33:so=01;35:do=01;35:bd=40;33;01:cd=40;33;01:or=40;31;01:mi=00:su=37;41:sg=30;43:ca=30;41:tw=30;42:ow=34;42:st=37;44:ex=01;32:.tar=01;31:.tgz=01;31:.arc=01;31:.arj=01;31:.taz=01;31:.lha=01;31:.lz4=01;31:.lzh=01;31:.lzma=01;31:.tlz=01;31:.txz=01;31:.tzo=01;31:.t7z=01;31:.zip=01;31:.z=01;31:.Z=01;31:.dz=01;31:.gz=01;31:.lrz=01;31:.lz=01;31:.lzo=01;31:.xz=01;31:.zst=01;31:.tzst=01;31:.bz2=01;31:.bz=01;31:.tbz=01;31:.tbz2=01;31:.tz=01;31:.deb=01;31:.rpm=01;31:.jar=01;31:.war=01;31:.ear=01;31:.sar=01;31:.rar=01;31:.alz=01;31:.ace=01;31:.zoo=01;31:.cpio=01;31:.7z=01;31:.rz=01;31:.cab=01;31:.wim=01;31:.swm=01;31:.dwm=01;31:.esd=01;31:.jpg=01;35:.jpeg=01;35:.mjpg=01;35:.mjpeg=01;35:.gif=01;35:.bmp=01;35:.pbm=01;35:.pgm=01;35:.ppm=01;35:.tga=01;35:.xbm=01;35:.xpm=01;35:.tif=01;35:.tiff=01;35:.png=01;35:.svg=01;35:.svgz=01;35:.mng=01;35:.pcx=01;35:.mov=01;35:.mpg=01;35:.mpeg=01;35:.m2v=01;35:.mkv=01;35:.webm=01;35:.ogm=01;35:.mp4=01;35:.m4v=01;35:.mp4v=01;35:.vob=01;35:.qt=01;35:.nuv=01;35:.wmv=01;35:.asf=01;35:.rm=01;35:.rmvb=01;35:.flc=01;35:.avi=01;35:.fli=01;35:.flv=01;35:.gl=01;35:.dl=01;35:.xcf=01;35:.xwd=01;35:.yuv=01;35:.cgm=01;35:.emf=01;35:.ogv=01;35:.ogx=01;35:.aac=00;36:.au=00;36:.flac=00;36:.m4a=00;36:.mid=00;36:.midi=00;36:.mka=00;36:.mp3=00;36:.mpc=00;36:.ogg=00;36:.ra=00;36:.wav=00;36:.oga=00;36:.opus=00;36:.spx=00;36:.xspf=00;36:
MAIL=/var/mail/pan
27a42
_mlstatus=$?;
29d43
< MODULESHOME=/global/software/Modules-5.0.0
30a45,50
MODULESHOME=/global/software/Modules-5.0.0
__MODULES_LMALTNAME=gcc/9.2&as|gcc/default&as|gcc/latest:openmpi/4.0.6-gcc-9.2-cuda-11.0&as|openmpi/default&as|openmpi/latest:eigen/3.3.9-gcc-9.2&as|eigen/default&as|eigen/latest:fftw/3.3.10-gcc-9.2-openmpi-4.0.6&as|fftw/default&as|fftw/latest:cuda/11.0&as|cuda/default&as|cuda/latest:gsl/2.7-gcc-9.2&as|gsl/default&as|gsl/latest:libpng/1.2.59-gcc-9.2&as|libpng/default&as|libpng/latest:lammps/2021-Sep-29-gpu-cuda-11.0-gcc-9.2&as|lammps/default&as|lammps/latest
__MODULES_LMCONFLICT=gcc/9.2&gcc:openmpi/4.0.6-gcc-9.2-cuda-11.0&openmpi:eigen/3.3.9-gcc-9.2&eigen:fftw/3.3.10-gcc-9.2-openmpi-4.0.6&fftw:cuda/11.0&cuda:gsl/2.7-gcc-9.2&gsl:libpng/1.2.59-gcc-9.2&libpng:lammps/2021-Sep-29-gpu-cuda-11.0-gcc-9.2&lammps
__MODULES_LMPREREQ=lammps/2021-Sep-29-gpu-cuda-11.0-gcc-9.2&gcc/9.2&openmpi/4.0.6-gcc-9.2-cuda-11.0&eigen/3.3.9-gcc-9.2&fftw/3.3.10-gcc-9.2-openmpi-4.0.6&cuda/11.0&gsl/2.7-gcc-9.2&libpng/1.2.59-gcc-9.2
__MODULES_LMTAG=gcc/9.2&auto-loaded:openmpi/4.0.6-gcc-9.2-cuda-11.0&auto-loaded:eigen/3.3.9-gcc-9.2&auto-loaded:fftw/3.3.10-gcc-9.2-openmpi-4.0.6&auto-loaded:cuda/11.0&auto-loaded:gsl/2.7-gcc-9.2&auto-loaded:libpng/1.2.59-gcc-9.2&auto-loaded
__MODULES_SHARE_MANPATH=:1
35d54
< NCPUS=1
37,58c56
< OMP_NUM_THREADS=1
< PATH=/global/software/libpng-1.2.59-gcc9.2/bin:/global/software/GSL-2.7-gcc9.2/bin:/global/software/fftw-3.3.10-gcc-9.2-openmpi/bin:/global/software/OpenMpi-4.0.6-gcc9.2-cuda11.0/bin:/global/software/GCC-9.2/bin:/global/software/lammps-2021-Sep29-gpu/bin:/global/software/Modules-5.0.0/bin:/bin:/usr/bin:/snap/bin:/usr/local/bin:/home/pan/bin:/global/software/cuda-11.0/bin
< PBS_ENVIRONMENT=PBS_BATCH
< PBS_JOBCOOKIE=4540711662BE5B30719DA27B29B4E091
< PBS_JOBDIR=/home/pan
< PBS_JOBID=811.l0
< PBS_JOBNAME=lmp_tt.pbs
< PBS_MOMPORT=15003
< PBS_NODEFILE=/var/spool/pbs/aux/811.l0
< PBS_NODENUM=0
< PBS_O_HOME=/home/pan
< PBS_O_HOST=g3.dynstar
< PBS_O_LANG=en_US.UTF-8
< PBS_O_LOGNAME=pan
< PBS_O_MAIL=/var/mail/pan
< PBS_O_PATH=/global/software/libpng-1.2.59-gcc9.2/bin:/global/software/GSL-2.7-gcc9.2/bin:/global/software/fftw-3.3.10-gcc-9.2-openmpi/bin:/global/software/OpenMpi-4.0.6-gcc9.2-cuda11.0/bin:/global/software/GCC-9.2/bin:/global/software/lammps-2021-Sep29-gpu/bin:/global/software/Modules-5.0.0/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/snap/bin:/usr/local/bin:/home/pan/bin:/global/software/cuda-11.0/bin
< PBS_O_QUEUE=gpu
< PBS_O_SHELL=/bin/bash
< PBS_O_SYSTEM=Linux
< PBS_O_WORKDIR=/home/pan/lmp_test
< PBS_QUEUE=gpu
< PBS_TASKNUM=1


PATH=/global/software/libpng-1.2.59-gcc9.2/bin:/global/software/GSL-2.7-gcc9.2/bin:/global/software/fftw-3.3.10-gcc-9.2-openmpi/bin:/global/software/OpenMpi-4.0.6-gcc9.2-cuda11.0/bin:/global/software/GCC-9.2/bin:/global/software/lammps-2021-Sep29-gpu/bin:/global/software/Modules-5.0.0/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/snap/bin:/usr/local/bin:/home/pan/bin:/global/software/cuda-11.0/bin
62,63c60
< SHELL=/bin/bash
< SHLVL=2


return $_mlstatus
65c62,67
< TMPDIR=/var/tmp/pbs.811.l0


SHELL=/bin/bash
SHLVL=1
SSH_CLIENT=172.16.12.100 42548 22
SSH_CONNECTION=172.16.12.100 42548 172.16.2.8 22
SSH_TTY=/dev/pts/0
TERM=xterm
67d68
< XDG_DATA_DIRS=/usr/local/share:/usr/share:/var/lib/snapd/desktop
69,76c70
< LMFILES=/global/software/Modules-5.0.0/modulefiles/gcc/9.2:/global/software/Modules-5.0.0/modulefiles/openmpi/4.0.6-gcc-9.2-cuda-11.0:/global/software/Modules-5.0.0/modulefiles/eigen/3.3.9-gcc-9.2:/global/software/Modules-5.0.0/modulefiles/fftw/3.3.10-gcc-9.2-openmpi-4.0.6:/global/software/Modules-5.0.0/modulefiles/cuda/11.0:/global/software/Modules-5.0.0/modulefiles/gsl/2.7-gcc-9.2:/global/software/Modules-5.0.0/modulefiles/libpng/1.2.59-gcc-9.2:/global/software/Modules-5.0.0/modulefiles/lammps/2021-Sep-29-gpu-cuda-11.0-gcc-9.2
< __MODULES_LMALTNAME=gcc/9.2&as|gcc/default&as|gcc/latest:openmpi/4.0.6-gcc-9.2-cuda-11.0&as|openmpi/default&as|openmpi/latest:eigen/3.3.9-gcc-9.2&as|eigen/default&as|eigen/latest:fftw/3.3.10-gcc-9.2-openmpi-4.0.6&as|fftw/default&as|fftw/latest:cuda/11.0&as|cuda/default&as|cuda/latest:gsl/2.7-gcc-9.2&as|gsl/default&as|gsl/latest:libpng/1.2.59-gcc-9.2&as|libpng/default&as|libpng/latest:lammps/2021-Sep-29-gpu-cuda-11.0-gcc-9.2&as|lammps/default&as|lammps/latest
< __MODULES_LMCONFLICT=gcc/9.2&gcc:openmpi/4.0.6-gcc-9.2-cuda-11.0&openmpi:eigen/3.3.9-gcc-9.2&eigen:fftw/3.3.10-gcc-9.2-openmpi-4.0.6&fftw:cuda/11.0&cuda:gsl/2.7-gcc-9.2&gsl:libpng/1.2.59-gcc-9.2&libpng:lammps/2021-Sep-29-gpu-cuda-11.0-gcc-9.2&lammps
< __MODULES_LMPREREQ=lammps/2021-Sep-29-gpu-cuda-11.0-gcc-9.2&gcc/9.2&openmpi/4.0.6-gcc-9.2-cuda-11.0&eigen/3.3.9-gcc-9.2&fftw/3.3.10-gcc-9.2-openmpi-4.0.6&cuda/11.0&gsl/2.7-gcc-9.2&libpng/1.2.59-gcc-9.2
< __MODULES_LMTAG=gcc/9.2&auto-loaded:openmpi/4.0.6-gcc-9.2-cuda-11.0&auto-loaded:eigen/3.3.9-gcc-9.2&auto-loaded:fftw/3.3.10-gcc-9.2-openmpi-4.0.6&auto-loaded:cuda/11.0&auto-loaded:gsl/2.7-gcc-9.2&auto-loaded:libpng/1.2.59-gcc-9.2&auto-loaded
< __MODULES_SHARE_MANPATH=:1
< }
< }


XDG_DATA_DIRS=/usr/local/share:/usr/share:/var/lib/snapd/desktop

Finally, I noticed that the output of "ps -Lu root -N -O psr,pcpu,user" shows that some lmp entries share the same PID, and some even share the same PSR. I have no idea what this means. For example:

10081 11 80.4 pan R ? 00:01:03 lmp -sf gpu -pk gpu 4 -in in.NiPd -log lmp_out_pbs_gpu.dat
10081 25 0.0 pan S ? 00:00:00 lmp -sf gpu -pk gpu 4 -in in.NiPd -log lmp_out_pbs_gpu.dat
10081 13 0.0 pan S ? 00:00:00 lmp -sf gpu -pk gpu 4 -in in.NiPd -log lmp_out_pbs_gpu.dat
10081 29 0.0 pan S ? 00:00:00 lmp -sf gpu -pk gpu 4 -in in.NiPd -log lmp_out_pbs_gpu.dat
10081 29 0.0 pan S ? 00:00:00 lmp -sf gpu -pk gpu 4 -in in.NiPd -log lmp_out_pbs_gpu.dat
10081 25 0.0 pan S ? 00:00:00 lmp -sf gpu -pk gpu 4 -in in.NiPd -log lmp_out_pbs_gpu.dat

This certainly is a puzzle! Things to try:

  • At this point, I think we need to distinguish between plain PBS and PBS with pbs_cgroups hook. The hook sets several limits on how the job runs, including some related to memory management. It would be good to see if these are part of the problem. So run it once with pbs_cgroups disabled.

  • Next is to distinguish between batch and interactive. For this, use the -I option (capital i) to qsub to get an interactive session

qsub -I job_script

Where job_script is the path to your job. When the job starts, you’ll have an interactive session on the node. Run your script with

bash job_script
  • It appears your application runs for only about a minute? In that case, duplicate the mpirun line in the job so it runs twice, one after the other. It might tell us something if the timings for the successive runs are very different.

==
Your ps output shows that lmp starts multiple threads, in spite of OMP_NUM_THREADS=1. It also shows that some of these get placed on the secondary hardware threads (psr > 23).

Try a different ps to check that all the base threads for each MPI process get placed on the primary core thread:

ps -u pan -O psr,pcpu,user

Again, look for MPI processes assigned to the same psr. Also, for any assigned a psr > 23.

==
As an aside, to paste computer output to the forum, first type ``` on a line by itself, then paste the output, then type another ``` line. This would have made the diff output easier to read.

You are enabling cpusets and device isolation but have vnode_per_numa_node set to false, which can lead to GPU jobs using a GPU on a different socket from the one where the CPUs are allocated. If, e.g., the first GPU device listed is tied to the socket with the high-numbered CPUs, it’s not going to work well in many cases. In any case, try to determine, by inspecting CUDA_VISIBLE_DEVICES and /sys/fs/cgroup/cpuset/…/cpus, whether the GPU is indeed located on the same socket as the CPUs.
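
For example, from inside the job you could compare the two (a sketch; nvidia-smi topo -m prints the CPU and NUMA affinity of every GPU):

echo $CUDA_VISIBLE_DEVICES
grep Cpus_allowed_list /proc/self/status    # the CPUs the job's cpuset actually allows
nvidia-smi topo -m                          # which CPUs sit next to each GPU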

Perhaps you should disable cpuset and device isolation and grab all GPUs from the job to see if the difference between the PBS job and running it interactively disappears.

If so, you’ll need to understand how to correctly place jobs (and request the correct number of GPUs). With vnode_per_numa_node set to true, cpusets, and device isolation, it is certainly possible to get things to work well (even if you want to use GPUs on two sockets, you can use two chunks and vscatter to get CPUs and GPUs on both sockets). The advantage is that, once it works, you can control performance a lot better when you’re sharing the node between more than one job.

@dtalcott @alexis.cousein Hi, I am sorry to reply so late, due to the severe epidemic situation in my city. I have tested an interactive session through PBS; it turned out that its performance is no different from that of the batch PBS job. Fortunately, I found that this performance difference was caused by my cgroups hook settings: if I disable the pbs_cgroups hook on g6 by setting "exclude_hosts" : ["g6"], the performance difference disappears.
However, I don’t know how to

disable cpuset and device isolation and grab all GPUs from the job

I tried moving the devices block into the cpuset block, and found that GPU scheduling then no longer works.
Actually, I set vnode_per_numa_node to true at the beginning, but I found that it caused problems with resources_available.vnode: this attribute on some nodes could change to <various>, and no job could be submitted to those nodes with that strange vnode value. After I forced vnode_per_numa_node to false, this no longer happened.

I am not an expert with either cgroups or GPUs, so this is just guessing, but you might try re-enabling the hook on g6. After setting vnode_per_numa_node so it is true only on g6:

vnode_per_numa_node: "host in: g6"

Then, use pbsnodes -av to list vnode information, which should report separate vnodes only for g6.

Next, change your select statement to

#PBS -l select=2:ncpus=12:ngpus=2:mem=24gb:mpiprocs=12:host=g6:ompthreads=1
#PBS -l place=vscatter

Here, I’m assuming the g6 node has two sockets, each with 12 CPU cores and 2 GPUs.

The idea is to group cores, memory, and GPUs so most references stay on the same socket, rather than needing to go off-socket.