I have installed OpenPBS on an OpenHPC 1.3.x setup with xCAT. I earlier used Torque, with which I faced the same issue.
The issue is that when I qsub a job, it produces an error file stating that module was not found, so I'm unable to load modules on the compute nodes. But when I log in from the master to the computes and type module, it executes without any problem.
I tried sourcing lmod.sh, which did not work.
PFA the screenshots.
Please note, you need to try the batch command lines manually on the compute nodes and see whether they work (I mean from the export line to echo $a; execute them one by one on the compute node and check whether they run without any issues).
Please try the below batch script if you are using OpenPBS. You should check the stdout and stderr files created in the job submission directory for each job.
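Something along these lines; the exact export line and the variable a are guesses based on the steps described above:

#!/bin/bash
#PBS -N module_test
#PBS -l select=1

# source the module init so the module command is defined (path is a guess)
. /etc/profile.d/lmod.sh
export a=$(module list 2>&1)
echo $a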
Dear Adarsh,
I set up passwordless SSH to the computes and synced the users, and now I see this.
Is there any package I'm missing?
Regarding OpenHPC, there is a conflict between environment-modules and Lmod.
I'll need Lmod for loading modules, since my setup requires that modular approach.
On compute1, as “jai”, can you run the below one after the other and check whether they run without any issues? The problem here is that the module command is not found; I am not sure whether we are sourcing the correct module init file in the script, or whether PATH includes the module path.
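For instance, something like the below (the lmod.sh path is a guess; use whatever init file exists on your system):

type -a module                                  # is module a function, alias, or binary?
echo $PATH                                      # does PATH include the modules/lmod directories?
ls /etc/profile.d | grep -i -e module -e lmod   # which init scripts exist?
. /etc/profile.d/lmod.sh && module avail        # does sourcing the init make module work?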
Dear Adarsh,
PFA the screenshot from the compute node.
If I qsub on the compute node, I get the output shown.
Now if I ssh to the master, it is able to log in.
I also have /etc/hosts identical on the master and all computes.
User jai is able to connect without a password, but a NIS setup is required.
As suggested, I ran a simple bash script, with no errors whatsoever.
I'm really not sure why the module command is not working.
As far as I know, module is a function and not a binary; if I do type -a module, it shows that it maps to modulecmd. If I use it as a command on the compute, it works without any issues; the module command works on the computes as well.
My question is: which file needs to be sourced in the PBS script so that it recognizes module as a command? Declaring the function manually doesn't work either; it says modulecmd not found.
I need the module command, as it is the one used to load modules such as Intel, etc.
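The declaration I tried was roughly along these lines (my reconstruction of the classic environment-modules definition; the modulecmd path is a guess):

module() {
    eval $(/usr/bin/modulecmd bash "$@")
}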
As we can see here, module is running on the computes.
I might have forgotten to mention: the bash script works! But it doesn't find the module command on the computes, which I find very weird given that it is available.
Are there any additional PATHs I need to add?
I also ran which module on the compute and got the same output as you see in the screenshot.
I wouldn’t expect the modules initialization to be sensitive to whether it was running in a batch or interactive session. Note that PBS sets some environment variables for you, including ENVIRONMENT=BATCH.
On my local workstation I have the file /etc/profile.d/modules.sh that contains these lines:
# work out which shell is running this script
shell=$(/usr/bin/basename $(/bin/ps -p $$ -ocomm=))
# source the matching modules init file, falling back to the generic sh one
if [ -f /usr/share/modules/init/$shell ]; then
    . /usr/share/modules/init/$shell
else
    . /usr/share/modules/init/sh
fi
You can test whether or not it’s working by submitting an interactive job with “qsub -I” (that is a capital “eye”). Once you get a shell prompt, you can further explore your environment without having to repeatedly modify a job script.
I'm pretty sure modules.sh isn't available on my master and computes; am I missing a package?
There was mention of a clash between environment-modules and Lmod on the OpenHPC forums; is that related?
Or should I simply create this file?
If you want all users to have modules enabled, then it’s best to add the file to /etc/profile.d
If you just want to experiment with it yourself you may update your ~/.bashrc or ~/.profile
Looking back at your initial post, that looks like a problem returning the job output from the compute node. Did you get that resolved?
Yes, I got that resolved; the job output from bash now returns properly. It was an issue with syncing users on the computes: I had forgotten to run the syncusers script via xdcp compute -F <path to the syncusers script>.
After much research and scratching my head for a long time: I don't have the /usr/share/modules/init/sh file, but I do see a directory /opt/ohpc/admin/lmod/8.1.18/init containing the per-shell init files that ideally should have been in /usr/share/modules/init/.
Another thing I've noticed is that when I qsub -I, the computes don't pick up the LMOD_CMD variable, and I'm not sure why.
Since modules.sh is not available on my master and computes, would sourcing the Lmod init (i.e. /opt/ohpc/admin/lmod/8.1.18/init) in /etc/bashrc instead of modules.sh do the job (see the sketch below)?
I also don't understand why some variables set on the master are not picked up by the compute.
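Something like this in /etc/bashrc is what I have in mind (just a sketch; I'm guessing the file name under init/):

# source the OHPC Lmod init for bash, if present
if [ -f /opt/ohpc/admin/lmod/8.1.18/init/bash ]; then
    . /opt/ohpc/admin/lmod/8.1.18/init/bash
fi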
Dear Michael,
I tried adding the script you gave, with the necessary changes for Lmod (which I'm not really sure are correct; I'm guessing here). Now when I check the computes via qsub -I, the module command runs, but I'm not sure it is the right one, because it doesn't produce any output.
The OHPC package is self contained under /opt/ohpc so you usually don’t want both the system lmod package and OHPC lmod package installed at the same time. The purpose of OHPC is to make this collection of HPC tools available with a minimum amount of effort for the administrator, as opposed to having to download, compile, and configure them individually as had been done traditionally. The OHPC recipes are well written and should be followed closely.
It’s a bit hard to follow the screenshots you provided. It would be much easier to cut and paste the text of the session in future responses.
You appear to have “master” working as you want it to, with $LMOD_CMD pointing to the OHPC instance of lmod. In your interactive job, LMOD_CMD isn’t set on compute1. That appears to be the problem, correct?
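You can confirm inside the interactive job with something like:

type module        # function, alias, or not found?
echo $LMOD_CMD     # empty output here would confirm the init never ran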
If you are unable to get the system to set LMOD_CMD in the user sessions, you can tell PBS to pass it on to the job when submitted. For example…
qsub -I -v LMOD_CMD
or, if you want to share all of the environment variables:
qsub -I -V
That is a known problem with the older BASH version of the profile.d script that initializes modules.
It defines module as a bash function. When you run a job, PBS starts a login shell, and the profile.d script defines the function there. That shell then executes your script, which starts another shell. The second shell inherits the environment variables (which prevents the profile.d script from initialising modules once more), but it doesn't inherit the function.
Note that with more recent versions of bash and environment-modules you no longer have this problem, since /usr/share/Modules/init/bash will actually export the functions and re-import them in subshells if needed (instead of just thinking there is ‘nothing to be done’).
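You can see the difference with a trivial example:

# variables are inherited by subshells; functions are not, unless exported
greet() { echo hello; }
export GREETING=hi
bash -c 'echo $GREETING'   # prints: hi
bash -c 'greet'            # fails: greet: command not found
export -f greet            # bash-specific: export the function too
bash -c 'greet'            # prints: hello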
This is on RedHat 8 with the environment-modules-4.5.2-1.el8.x86_64 rpm installed:
[alexis@dragon ~]$ qsub
Job script will be read from standard input. Submit with CTRL+D.
module list
4395.dragon
[alexis@dragon ~]$ cat STDIN.o4395
[alexis@dragon ~]$ cat STDIN.e4395
No Modulefiles Currently Loaded.
The workaround is to initialize modules in /etc/bashrc (which will initialize it in all bash shells), or to modify the profile.d script for modules initialisation (to either do the initialisation again regardless of what is found in the environment, or to define the missing function if it does not exist).
Obviously the latter is more parsimonious: initialise modules only if “module --version” returns an error, so it is initialised only when needed.
Unfortunately, what you need to do exactly depends on the versions of environment-modules and bash you're using; on mine you don't need to do anything, on yours evidently you do. You'll have to read the /etc/profile.d script and /usr/share/Modules/init/bash (assuming that is the path, but check the profile.d script to confirm) to see exactly what is happening and how to fix it.
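As a sketch of the parsimonious variant (the init path is an assumption; take the real one from your profile.d script):

# (re)initialise modules only when the module function is actually missing
if ! module --version >/dev/null 2>&1; then
    . /usr/share/Modules/init/bash
fi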