Starting the MoM

Hello everyone,

I am new to PBS Professional and would like to learn more about the principles of HPC. I need your help here :slight_smile:

I am having trouble running a small cluster (2 nodes) on CentOS 7.
After I start PBS on the compute node (MoM) I get this output:
Starting PBS
… /var/spool/pbs needs updating
Running /opt/pbs/libexec/pbs_habitat to update it.
.
.
*** Invalid entry in /var/spool/pbs/mom_priv/config
*** for clienthost: CHANGE_THIS_TO_PBS_PRO_SERVER_HOSTNAME

Even though I changed the name in /etc/pbs.conf, the file PBS_HOME/mom_priv/config stays the same; it is not updated automatically when I run /etc/init.d/pbs.
Should I always change /var/spool/pbs/mom_priv/config manually? I assumed it would be updated automatically from the entry in /etc/pbs.conf.

If I do so (manually change the entry to the proper name of the server, pbs_head, as set in /etc/hosts), the output is like this:
Starting PBS
PBS mom

But after running /etc/init.d/pbs status I am still told that “pbs_mom is not running”.
Starting pbs_mom manually does not work either (pbs status says it is not running, and I don’t see pbs_mom in the list of processes). The server starts without any problems :confused:

Could you please help me and tell me how to set up the configuration properly and start the MoM?

Here is the output of pbs_probe:

====== System Information =======

sysname=Linux
nodename=localhost.localdomain
release=3.10.0-327.el7.x86_64
version=#1 SMP Thu Nov 19 22:10:57 UTC 2015
machine=x86_64

====== Problems in PBS EXEC Hierarchy =======

Permission/Ownership Problems:

/opt/pbs/sbin/pbs_mom
(-rwxr-xr-x , root , root) needs to be (-rwx------ , root, group id < 10)

/opt/pbs/bin/nqs2pbs
(-rwxr-xr-x , root , root) needs to be (-rwx------ , root, group id < 10)
Real Path Problems:
/opt/pbs/sbin/pbs-report, No such file or directory

/opt/pbs/etc/pbs_habitat, No such file or directory

/opt/pbs/etc/pbs_init.d, No such file or directory

/opt/pbs/etc/pbs_postinstall, No such file or directory

/opt/pbs/lib/pbs_sched.a, No such file or directory

/opt/pbs/lib/pm, No such file or directory

/opt/pbs/man, No such file or directory

/opt/pbs/tcltk, No such file or directory

/opt/pbs/python, No such file or directory
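
For the clienthost error reported at the top of this post, a quick consistency check between /etc/pbs.conf and the MoM config can help. The following is only a minimal sketch: the function names are made up for illustration, and it assumes the MoM config uses `$clienthost <hostname>` lines and that pbs.conf holds plain KEY=VALUE lines.

```python
# Placeholder string taken from the error message quoted in this post.
PLACEHOLDER = "CHANGE_THIS_TO_PBS_PRO_SERVER_HOSTNAME"


def parse_pbs_conf(text):
    """Parse KEY=VALUE lines from pbs.conf-style text into a dict."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip()
    return settings


def clienthosts(mom_config_text):
    """Return the host names found on $clienthost lines of a MoM config."""
    hosts = []
    for line in mom_config_text.splitlines():
        parts = line.split()
        if len(parts) >= 2 and parts[0] == "$clienthost":
            hosts.append(parts[1])
    return hosts


def check_clienthost(pbs_conf_text, mom_config_text):
    """Return a list of problems found (empty if the two files agree)."""
    problems = []
    server = parse_pbs_conf(pbs_conf_text).get("PBS_SERVER")
    hosts = clienthosts(mom_config_text)
    if PLACEHOLDER in hosts:
        problems.append("placeholder clienthost still present")
    if server and server not in hosts:
        problems.append(f"PBS_SERVER {server} has no $clienthost entry")
    return problems
```

To use it on a real node, read the contents of /etc/pbs.conf and /var/spool/pbs/mom_priv/config and pass both strings to `check_clienthost`; reading the MoM config will probably require root.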

So as not to clutter the previous post, I am writing a new one:

Surprisingly, after a simple reboot the MoM process started automatically without any problems (I guess this is how it is supposed to work), and since then I have been able to stop and start the process without any problems. I don’t know why I had to restart the system to make it work; there was nothing in the documentation about it, but it seems to be working now :slight_smile:

Not sure what happened. After changing the name in the config file, you need to restart the PBS services.

Anyway, glad to know it's working now after a reboot. BTW, pbs_probe is broken in this release of pbspro and the community is helping fix that - so we should have something working soon.

Thanks and Regards,
Subhasis

Thank you for your answer.

Now I am trying to add the nodes to the cluster, but unfortunately I get this response:
qmgr obj=node1 svr=default : Unauthorized Request
qmgr: Error (15007) returned from server

node1 is the alias for the compute node, and ping works properly.
The guide says: “Get the short name returned by the gethostname command where you will run the MoM.” But how do I actually run gethostname on the MoM? The hostname command returns “node1”, which seems fine.

I will be really grateful for your help :slight_smile:

In order to add nodes, you must run qmgr as root or as a user with manager privileges. You should do so on the server node itself. If it still fails, you should investigate your network configuration. Do your nodes have multiple network interfaces?
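
As an illustration of the step above, here is a small sketch that builds the qmgr call for adding a node. The `create node` directive is standard qmgr syntax; the helper names and the `dry_run` switch are only illustrative, and the script is assumed to run on the server host as root or as a PBS manager.

```python
import subprocess


def create_node_command(node_name):
    """Build the qmgr invocation that adds a node called node_name."""
    return ["qmgr", "-c", f"create node {node_name}"]


def add_node(node_name, dry_run=True):
    """Return the command; actually run it only when dry_run is False."""
    cmd = create_node_command(node_name)
    if not dry_run:
        # check=True raises CalledProcessError if qmgr exits with a
        # non-zero status.
        subprocess.run(cmd, check=True)
    return cmd
```

Calling `add_node("node1")` only returns the command list, so it can be inspected first; `add_node("node1", dry_run=False)` would run it.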

Actually, that was exactly the issue here: the privileges when running qmgr. The problem is solved.
If I have any other issues while managing the MoMs or setting up jobs, may I post the questions here, or should I start a separate thread?

Thanks @jendker

Absolutely, please post any questions about running PBS in this forum. The community members would love to help with your issues.

Regards,
Subhasis

I need your help again. The node information shown by pbsnodes -a looks correct (number of CPUs, memory and so on), but when I submit a job I receive the answer: qsub: Bad UID for job execution. I found some answers on the internet regarding Torque, but none for PBS Pro itself. I am submitting the job as root; if I don’t, the job stays in held mode.

Besides that, could you please tell me if xpbs is available? I could not find it.

The most common reason for seeing that message is that you are submitting your jobs as root. Please try submitting a job as a non-root user.
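
A submission wrapper can guard against this case. This is just a sketch: `qsub_command` is a hypothetical helper, and it only builds the command rather than running it.

```python
def qsub_command(script_path, euid):
    """Build a qsub invocation, refusing to do so for root (UID 0)."""
    if euid == 0:
        raise PermissionError("submit jobs as a regular user, not as root")
    return ["qsub", script_path]
```

In a real script the second argument would come from `os.geteuid()`, and the returned list could be passed to `subprocess.run`.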

Sorry, I’ve edited my previous post.

When I try to submit as a non-root user, the job stays in held mode. I would expect it to be executed by the compute nodes, since the resource requirements in the job's bash script are set low.

You are making progress. A held job is better than no job at all. :slight_smile:

A job may be placed in held state for a number of reasons. Most commonly, there were too many failed attempts to run it. The server gives up after 20 failed attempts to run a job. Take a look at the mom and server logs to confirm that is the case. They should give you an indication of what is going wrong. Also, take a look at PBS_HOME/mom_priv/config to make sure there is a clienthost entry for the server.
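
When going through the logs, counting how often certain messages recur can make a retry loop easier to spot. The sketch below does only plain substring matching and makes no assumption about the log line format; the default marker strings are examples taken from messages quoted in this thread and should be adapted.

```python
from collections import Counter

# Example marker strings; replace with whatever you are hunting for.
MARKERS = ("No Password Entry", "kill_job", "no active tasks")


def count_markers(log_lines, markers=MARKERS):
    """Count how many lines contain each marker substring."""
    counts = Counter()
    for line in log_lines:
        for marker in markers:
            if marker in line:
                counts[marker] += 1
    return counts
```

MoM logs normally live under PBS_HOME/mom_logs, so `count_markers(open(path))` on one of those files would give a quick summary.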

xpbs is not available in the open source release of PBS Pro, but we would be very pleased if someone contributed a Python based GUI to take its place!

Hehe thank you :smiley:

What I can see in the log files of mom is repeatedly:
Type 1 request from…
Type 3 request from…
Type 5 request from…
No Password Entry from …
kill_job
node1 cput=0…
no active tasks

And on repeat.
I guess the problem must be with the password, then. Should I set up SSH authentication between the server and the nodes? Or is it something else?

Thank you for help :slight_smile:

Sounds like you’re missing a password entry for the user that submitted the job on the execution host. PBS Pro is calling getpwnam() to lookup the user name, but the call is failing on the execution host. Please ensure the user exists on the execution host.
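
To check this directly on an execution host, the same lookup can be done from Python, whose `pwd.getpwnam()` wraps the C library call of the same name. A minimal sketch (the function name is just for illustration):

```python
import pwd


def user_exists(username):
    """Return True if the local password database knows this user."""
    try:
        pwd.getpwnam(username)
    except KeyError:
        # getpwnam raises KeyError when there is no matching entry.
        return False
    return True
```

Running `user_exists("xyz")` on each compute node, with the name of the submitting user, should return True everywhere before jobs can start.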

So if I understand correctly, I need to create the same user on the execution nodes as the one on the server from which I submit the jobs? So if I submit as xyz@pbs_head, I need to have accounts xyz@node1 and xyz@node2 as well? Or should I use the pbsdata account that was created for job submission? Was that the purpose of creating this account, or is it used for something different? Many questions :slight_smile:

FYI: I have already set up passwordless ssh.

The execution nodes need to have the same user as the one that submitted the job. This is because PBS eventually runs the job with the credentials of the user who submitted it. To do that, PBS needs to switch to that user on the execution hosts, so the user must be present on the execution hosts.

The pbsdata account is only used to operate the embedded postgres database, and is not supposed to be used to run jobs.

Keep your questions coming - we love to hear back.

Regards,
Subhasis

Thank you very much @subhasisb, these are really nice words :slight_smile:

The first jobs have started to work, as I can see in htop :smiley: However, in the logs I see that there is a problem with the connection. As described on page AG-1041, I set up ssh authorisation with my chosen method (public keys), and it works between all the compute nodes and the server. The copy mechanism is set to scp as described in AG-1042:

/etc/pbs.conf

PBS_SERVER=pbs_head
PBS_START_SERVER=0
PBS_START_SCHED=0
PBS_START_COMM=0
PBS_START_MOM=1
PBS_EXEC=/opt/pbs
PBS_HOME=/var/spool/pbs
PBS_CORE_LIMIT=unlimited
PBS_SCP=/bin/scp

And the segment of the mom_log:

So it looks like it works, but at the same time I can see errors in the log files :slight_smile: How can I make sure that the batch jobs are executed properly?

Hi @jendker

Looks like your job started running and when it finished, PBS mom was trying to stage out the files (output, error files). However, the error messages seem to indicate that the passwordless scp did not work. It tried various authentication methods, but eventually gave up.

The IFL request that failed is 54 (libpbs.h #define PBS_BATCH_CopyFiles 54). Basically this means that the stageout of the stderr, stdout or similar files failed. Since your /etc/pbs.conf sets PBS_SCP to /bin/scp, that is the tool that was used for the copy, so you may need to check the scp configuration.

HTH
Subhasis

Hmm, I don’t get it. The communication between xyz@pbs_head (which submits the job) and all the other xyz users on the nodes works without any problem. Between which accounts does the copy take place: between the regular users or between the root accounts? I don’t know for which users I should set up the passwordless connection: xyz@nodes or root@nodes? :confused:

Stageout copies the files as the user (i.e. the “euser” attribute of the job), to the host specified in the Output_Path/Error_Path attribute of the job.
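
One way to test this is to try a non-interactive ssh login in the same direction as the stageout: as the job owner, from the execution host, to the host named in Output_Path. The sketch below uses the standard OpenSSH options `BatchMode` and `ConnectTimeout`; the helper names and the 5 second timeout are arbitrary choices.

```python
import subprocess


def passwordless_check_command(user, host):
    """Build an ssh call that fails instead of prompting for a password."""
    return [
        "ssh",
        "-o", "BatchMode=yes",      # never ask for a password or passphrase
        "-o", "ConnectTimeout=5",   # give up quickly on unreachable hosts
        f"{user}@{host}",
        "true",                     # harmless remote command
    ]


def can_login_without_password(user, host):
    """Return True if the ssh call above exits with status 0."""
    result = subprocess.run(
        passwordless_check_command(user, host),
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0
```

For the setup in this thread that would mean running `can_login_without_password("xyz", "pbs_head")` on node1 and node2 while logged in as xyz.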

If you submit from a location on your cluster with shared storage, please add $usecp lines in the MoM config files to tell MoM to use “plain” cp instead of scp.
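
A `$usecp` line has the form `$usecp <host>:<remote prefix> <local prefix>`. The sketch below renders such a line and imitates, in a simplified way, how a destination like `host:/path` could be mapped to a local path. It is an illustration, not MoM's actual matching code; the plain prefix comparison and the use of `fnmatch` for host wildcards are assumptions.

```python
from fnmatch import fnmatch


def usecp_line(host_pattern, remote_prefix, local_prefix):
    """Render one $usecp directive for PBS_HOME/mom_priv/config."""
    return f"$usecp {host_pattern}:{remote_prefix} {local_prefix}"


def map_with_usecp(destination, rules):
    """Map 'host:/path' to a local path, or None if no rule applies.

    rules is a list of (host_pattern, remote_prefix, local_prefix).
    None means the copy would fall back to scp.
    """
    host, _, path = destination.partition(":")
    for host_pattern, remote_prefix, local_prefix in rules:
        if fnmatch(host, host_pattern) and path.startswith(remote_prefix):
            return local_prefix + path[len(remote_prefix):]
    return None
```

With /home shared between pbs_head and the nodes, `usecp_line("pbs_head", "/home", "/home")` gives `$usecp pbs_head:/home /home`, which would go into the MoM config on each node, followed by a MoM restart.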