Modules unavailable when loading from PBS script

Hello,

We are experiencing an issue where, when we try to load modules within a PBS script, we get “command not found” for the module command. See below.
[randerson10@login0002 Python_examples]$ qsub basic_python.pbs
40.wwadm01
[randerson10@login0002 Python_examples]$ cat test.e40
/var/spool/pbs/mom_priv/jobs/40.wwadm01.SC: line 14: module: command not found
/var/spool/pbs/mom_priv/jobs/40.wwadm01.SC: line 18: python: command not found

We are able to run the module command from a shell as regular users on the compute nodes, and all our software shows up and loads flawlessly.
[randerson10@node0001 ~]$ module load go
[randerson10@node0001 ~]$ module list
Currently Loaded Modules:

  1) python/3.8.6-intel-uly7   2) go/1.19.1

$PATH for both regular users and root contains the locations of our modules.

Is there any explanation for this behavior? The modules only fail to load when they are loaded from a script.

It appears the shell environment is not being set as expected for your job. My guess is that the shell startup files are not being run. To help diagnose this, edit your basic_python.pbs script so that the first executable lines are

/usr/bin/ps -fu randerson10
/usr/bin/env
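For example, the top of basic_python.pbs might look like this (a sketch; the #PBS directives shown are placeholders for whatever you already have):

#!/bin/bash
#PBS -N basic_python          # placeholder directives
#PBS -l select=1:ncpus=1

# Diagnostics: show how the job shell was started and what environment it sees
/usr/bin/ps -fu randerson10
/usr/bin/env

# ... rest of the original script follows ...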

Also, just to verify which module command you are expecting, on your terminal run

module --version

+1 @dtalcott

Also, you can try launching an interactive test job and checking whether you can load a module:

Test 1: qsub -I
Test 2: qsub -V -I
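For example (a sketch; the resource requests are just placeholders):

# Test 1: interactive job, job-side environment only
qsub -I -l select=1:ncpus=1 -l walltime=00:30:00
# then, at the compute-node prompt:
module --version
module avail

# Test 2: the same, but with the submission environment forwarded via -V
qsub -V -I -l select=1:ncpus=1 -l walltime=00:30:00
module --version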

You could try to use the absolute path!?
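Note that with Lmod, module is a shell function defined by the init script rather than a binary on $PATH, so the closest equivalent to an absolute path is calling the lmod executable directly. A rough sketch, assuming the standard OpenHPC install location:

# Roughly what the module shell function does under the hood (path is an assumption)
eval "$(/opt/ohpc/admin/lmod/lmod/libexec/lmod bash load go)"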

Adding /usr/bin/ps -fu randerson10 and /usr/bin/env had no effect.

The module --version output is just the following; module avail, list, and load all behave as expected.
Modules based on Lua: Version 8.7.32 2023-08-28 12:42 -05:00
by Robert McLay mclay@tacc.utexas.edu

I am using OpenHPC with Lmod, and the lmod.sh file within /etc/profile.d/ is below. It looks alright to my eye, and it has not been modified, so it is the default that ships with OpenHPC.

[root@login0002 init]# cat /etc/profile.d/lmod.sh

#!/bin/sh
# -*- shell-script -*-
########################################################################
#  This is the system wide source file for setting up
#  modules:
#
########################################################################

# NOOP if running under known resource manager
if [ -n "$SLURM_NODELIST" ] || [ -n "$PBS_NODEFILE" ]; then
    return
fi

export LMOD_SETTARG_CMD=":"
export LMOD_FULL_SETTARG_SUPPORT=no
export LMOD_COLORIZE=no
export LMOD_PREPEND_BLOCK=normal

if [ $EUID -eq 0 ]; then
    export MODULEPATH=/opt/ohpc/admin/modulefiles:/opt/ohpc/pub/modulefiles
else
    export MODULEPATH=/opt/ohpc/pub/modulefiles
fi

export BASH_ENV=/opt/ohpc/admin/lmod/lmod/init/bash

# Initialize modules system
. /opt/ohpc/admin/lmod/lmod/init/bash >/dev/null

# Load baseline OpenHPC environment
module try-add ohpc

I’m not familiar with OpenHPC and Lmod, but the following from /etc/profile.d/lmod.sh looks suspect:

if [ -n "$SLURM_NODELIST" ] || [ -n "$PBS_NODEFILE" ]; then
    return
fi

Given that PBS sets PBS_NODEFILE, this says that Lmod deliberately skips setting up modules when running under PBS. I have no idea why.

As adarsh suggested, if you run an interactive job (qsub -I), is the module command available?

In any case, put the following near the top of your job script and see if it helps:

source /opt/ohpc/admin/lmod/lmod/init/bash >/dev/null
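A sketch of the placement (the directives and module name are placeholders):

#!/bin/bash
#PBS -N test                  # your existing directives
#PBS -l select=1:ncpus=1

# Initialize Lmod by hand, since /etc/profile.d/lmod.sh returns early under PBS
source /opt/ohpc/admin/lmod/lmod/init/bash >/dev/null

module load python            # placeholder module name
cd $PBS_O_WORKDIR
python hello.py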

So running an interactive job on the node is not working as you thought.

I am unable to load any modules because the module command is not available.

I tried adding source /opt/ohpc/admin/lmod/lmod/init/bash >/dev/null, but that did not fix the problem of initializing the module environment.

The lmod.sh script I referenced above was only on the login node. After placing it on the compute nodes, I can now run an interactive job and the module command is available. However, when I submit a non-interactive job, my job script’s error output still comes back with “module: command not found”.
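For reference, one way to push the file out by hand looks like this (node names are placeholders; updating the compute node image would be the cleaner long-term fix):

for n in node0001 node0002; do      # placeholder node names
    scp /etc/profile.d/lmod.sh root@"$n":/etc/profile.d/
done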

Any ideas as to why that would be?

Using Google, I found that this is a known issue with OpenHPC.

https://lists.openhpc.community/g/users/topic/18100521

My suggestion: Make a copy of /etc/profile.d/lmod.sh, but remove the three lines mentioned above:

if [ -n "$SLURM_NODELIST" ] || [ -n "$PBS_NODEFILE" ]; then
    return
fi

Then, at the start of your PBS script, source that copy:

source my_lmod.sh
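For example (assuming /opt/ohpc/pub is shared with, and readable on, the compute nodes; any path visible to the job works):

# make the copy somewhere the job can read it
cp /etc/profile.d/lmod.sh /opt/ohpc/pub/my_lmod.sh
# edit /opt/ohpc/pub/my_lmod.sh and delete the early-return block quoted above

# then, near the top of the PBS script:
source /opt/ohpc/pub/my_lmod.sh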

What is happening is that lmod is expecting PBS to forward your entire environment when you qsub the job. This is not default behavior, but you can force it by adding -V to the qsub arguments. However, because no other shell startup scripts have a similar behavior, you’ll end up with a mixture of the qsub environment and the node environment.
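For completeness, forwarding the submission environment looks like this (with the mixed-environment caveat above):

qsub -V basic_python.pbs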

Ahhh I see. Yes, those lines definitely looked to be part of the problem.

I commented out that conditional and sourced the lmod.sh file in the job script.

I am now getting the following error when submitting a basic python job. I believe the module environment is still not getting processed correctly.

/var/spool/pbs/mom_priv/jobs/99.wwadm01.SC: line 20: 343688 Illegal instruction (core dumped) python hello.py

My entire job script is here.

#!/bin/bash
#PBS -q default
#PBS -N test

# serial jobs: ONLY 1 processor core is requested
#PBS -l select=1:mem=2gb:ncpus=1
#PBS -l walltime=08:00:00
#PBS -m abe
#PBS -W group_list=x-ccast-prj-saula
#PBS -o test

source /etc/profile.d/lmod.sh

module load python/3.8.6-gcc-2pmf

cd $PBS_O_WORKDIR

python hello.py