Modules unavailable when loading from PBS script

Hello,

We are experiencing an issue where, when we try to load modules within a PBS script, we get “command not found” for the module command. See below.
[randerson10@login0002 Python_examples]$ qsub basic_python.pbs
40.wwadm01
[randerson10@login0002 Python_examples]$ cat test.e40
/var/spool/pbs/mom_priv/jobs/40.wwadm01.SC: line 14: module: command not found
/var/spool/pbs/mom_priv/jobs/40.wwadm01.SC: line 18: python: command not found

We are able to run the module command from a shell as regular users on the compute nodes, and all our software shows up and loads flawlessly.
[randerson10@node0001 ~]$ module load go
[randerson10@node0001 ~]$ module list
Currently Loaded Modules:

  1) python/3.8.6-intel-uly7   2) go/1.19.1

$PATH for both regular users and root contains the locations of our modules.

Is there any explanation for this behavior? The modules only fail to load when they are loaded from a script.

It appears the shell environment is not being set as expected for your job. My guess is that the shell startup files are not being run. To help diagnose this, edit your basic_python.pbs script so that the first executable lines are

/usr/bin/ps -fu randerson10
/usr/bin/env
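For example, the top of basic_python.pbs might look like this (a sketch; the #PBS directives shown are placeholders for whatever you already have):

#!/bin/bash
#PBS -N basic_python          # placeholder directives
#PBS -l select=1:ncpus=1

# Diagnostics: show how the job shell was started and what environment it sees
/usr/bin/ps -fu randerson10
/usr/bin/env

# ... rest of the original script follows ...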

Also, just to verify which module command you are expecting, on your terminal run

module --version

+1 @dtalcott

Also, you can try launching an interactive test job and checking whether you can load a module:

Test 1: qsub -I
Test 2: qsub -V -I
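For example (a sketch; the resource requests are just placeholders):

# Test 1: interactive job, job-side environment only
qsub -I -l select=1:ncpus=1 -l walltime=00:30:00
# then, at the compute-node prompt:
module --version
module avail

# Test 2: the same, but with the submission environment forwarded via -V
qsub -V -I -l select=1:ncpus=1 -l walltime=00:30:00
module --version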

You could try to use the absolute path!?
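Note that with Lmod, module is a shell function defined by the init script rather than a binary on $PATH, so the closest equivalent to an absolute path is calling the lmod executable directly. A rough sketch, assuming the standard OpenHPC install location:

# Roughly what the module shell function does under the hood (path is an assumption)
eval "$(/opt/ohpc/admin/lmod/lmod/libexec/lmod bash load go)"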

Adding /usr/bin/ps -fu randerson10 and /usr/bin/env had no effect.

The module --version output is just the following; module avail, list, and load all behave as expected.
Modules based on Lua: Version 8.7.32 2023-08-28 12:42 -05:00
by Robert McLay mclay@tacc.utexas.edu

I am using OpenHPC with Lmod, and the lmod.sh file within /etc/profile.d/ is below. It looks alright to my eye, and it has not been modified, so it is the default that ships with OpenHPC.

[root@login0002 init]# cat /etc/profile.d/lmod.sh

#!/bin/sh
# -*- shell-script -*-
########################################################################
#  This is the system wide source file for setting up
#  modules:
#
########################################################################

# NOOP if running under known resource manager
if [ -n "$SLURM_NODELIST" ] || [ -n "$PBS_NODEFILE" ]; then
    return
fi

export LMOD_SETTARG_CMD=":"
export LMOD_FULL_SETTARG_SUPPORT=no
export LMOD_COLORIZE=no
export LMOD_PREPEND_BLOCK=normal

if [ $EUID -eq 0 ]; then
    export MODULEPATH=/opt/ohpc/admin/modulefiles:/opt/ohpc/pub/modulefiles
else
    export MODULEPATH=/opt/ohpc/pub/modulefiles
fi

export BASH_ENV=/opt/ohpc/admin/lmod/lmod/init/bash

# Initialize modules system
. /opt/ohpc/admin/lmod/lmod/init/bash >/dev/null

# Load baseline OpenHPC environment
module try-add ohpc

I’m not familiar with OpenHPC and Lmod, but the following from /etc/profile.d/lmod.sh looks suspect:

if [ -n "$SLURM_NODELIST" ] || [ -n "$PBS_NODEFILE" ]; then
    return
fi

Given that PBS sets PBS_NODEFILE, this says that Lmod deliberately skips setting up modules when running under PBS. I have no idea why.

As adarsh suggested, if you run an interactive job (qsub -I), is the module command available?

In any case, put the following near the top of your job script and see if it helps:

source /opt/ohpc/admin/lmod/lmod/init/bash >/dev/null
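A sketch of the placement (the directives and module name are placeholders):

#!/bin/bash
#PBS -N test                  # your existing directives
#PBS -l select=1:ncpus=1

# Initialize Lmod by hand, since /etc/profile.d/lmod.sh returns early under PBS
source /opt/ohpc/admin/lmod/lmod/init/bash >/dev/null

module load python            # placeholder module name
cd $PBS_O_WORKDIR
python hello.py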

So running an interactive job on the node is not working as you thought.

I am unable to load any modules because the module command is not available.

I tried adding source /opt/ohpc/admin/lmod/lmod/init/bash >/dev/null, but that did not fix the problem of initializing the module environment.

The lmod.sh script I referenced above was only on the login node. After placing it on the compute nodes, I can now run an interactive job and the module command is available. However, when I submit a non-interactive job, my job script’s error output still comes back with “module: command not found”.
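For reference, one way to push the file out by hand looks like this (node names are placeholders; updating the compute node image would be the cleaner long-term fix):

for n in node0001 node0002; do      # placeholder node names
    scp /etc/profile.d/lmod.sh root@"$n":/etc/profile.d/
done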

Any ideas as to why that would be?

Using Google, I found that this is a known issue with OpenHPC.

https://lists.openhpc.community/g/users/topic/18100521

My suggestion: Make a copy of /etc/profile.d/lmod.sh, but remove the three lines mentioned above:

if [ -n "$SLURM_NODELIST" ] || [ -n "$PBS_NODEFILE" ]; then
    return
fi

Then, at the start of your PBS script, source that copy:

source my_lmod.sh
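For example (assuming /opt/ohpc/pub is shared with, and readable on, the compute nodes; any path visible to the job works):

# make the copy somewhere the job can read it
cp /etc/profile.d/lmod.sh /opt/ohpc/pub/my_lmod.sh
# edit /opt/ohpc/pub/my_lmod.sh and delete the early-return block quoted above

# then, near the top of the PBS script:
source /opt/ohpc/pub/my_lmod.sh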

What is happening is that lmod is expecting PBS to forward your entire environment when you qsub the job. This is not default behavior, but you can force it by adding -V to the qsub arguments. However, because no other shell startup scripts have a similar behavior, you’ll end up with a mixture of the qsub environment and the node environment.
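For completeness, forwarding the submission environment looks like this (with the mixed-environment caveat above):

qsub -V basic_python.pbs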

Ahhh I see. Yes, those lines definitely looked to be part of the problem.

I commented out that conditional and sourced the lmod.sh file in the job script.

I am now getting the following error when submitting a basic python job. I believe the module environment is still not getting processed correctly.

/var/spool/pbs/mom_priv/jobs/99.wwadm01.SC: line 20: 343688 Illegal instruction (core dumped) python hello.py

My entire job script is here.

#!/bin/bash
#PBS -q default
#PBS -N test

# serial jobs: ONLY 1 processor core is requested
#PBS -l select=1:mem=2gb:ncpus=1
#PBS -l walltime=08:00:00
#PBS -m abe
#PBS -W group_list=x-ccast-prj-saula
#PBS -o test

source /etc/profile.d/lmod.sh

module load python/3.8.6-gcc-2pmf

cd $PBS_O_WORKDIR

python hello.py