PBS-server not running

Hello,

I see that this is an old post, but I am wondering if anybody can help. I installed OpenPBS and, strangely, when I run "sudo /etc/init.d/pbs start" it says:

Starting PBS
PBS comm already running.
PBS scheduler already running.
PBS Server already running.

but when I do “/etc/init.d/pbs status” it shows:

pbs_server is not running
pbs_sched is not running
pbs_comm is not running

I am confused. I ran a few jobs; they remain queued, and the error file shows:

/var/spool/pbs/mom_priv/jobs/2001.PiMaster.SC: line 28: /home/pi/demo/fpi-serial: Permission denied

and the output file is:

SSH is enabled and the default password for the ‘pi’ user has not been changed.
This is a security risk - please login as the ‘pi’ user and type ‘passwd’ to set a new password.

Thanks for helping
Best

Could you please check the following:

  1. SELinux is disabled. If you disable it now, restart the system.
  2. The firewall is disabled, or it allows ports 15001 to 15009 and 17001.
  3. All the systems in the cluster have a static IP address and a hostname that DNS can resolve, and /etc/hosts is populated correctly and is identical on all the systems.
  4. Password-less SSH works for all users, without a password prompt or a host key check, in each direction:
  • PBS Server to compute node
  • Compute node to PBS Server
  • Compute node(s) to compute node(s)
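
The /etc/hosts part of point 3 could be checked with a small helper like the one below. This is only a sketch: missing_hosts is a made-up name, not a PBS tool, and it only looks names up in a hosts file (ignoring case); it does not test DNS or SSH.

```shell
#!/bin/sh
# missing_hosts is a hypothetical helper, not part of PBS.
# Usage: missing_hosts HOSTS_FILE NAME...
# Prints every NAME that has no entry in HOSTS_FILE.
missing_hosts() (
    hosts_file=$1
    shift
    for name in "$@"; do
        awk -v want="$name" '
            /^[ \t]*#/ { next }    # skip comment lines
            {
                # field 1 is the address, fields 2..NF are its names
                for (i = 2; i <= NF; i++)
                    if (tolower($i) == tolower(want)) found = 1
            }
            END { exit found ? 0 : 1 }
        ' "$hosts_file" || echo "$name"
    done
)
```

Running missing_hosts /etc/hosts followed by the names of every machine in the cluster should print nothing on each machine if the files are complete.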

Please check and share the server, scheduler and comm logs. The tail end of each log should show some reason for the exit.

Please share your job submission script, with any sensitive information blocked out.
It seems there was a permission problem executing line 28 of the job script, which is /home/pi/demo/fpi-serial.
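
A quick way to narrow down a "Permission denied" like this one is to test the file as the job's user on the node where the job ran. The helper below is a minimal sketch; check_executable is a hypothetical name. It only looks at whether the file exists and is executable for the current user, so it would not catch other causes such as a file system mounted without execute permission on a different node.

```shell
#!/bin/sh
# check_executable is a hypothetical helper, not part of PBS.
# Usage: check_executable PATH
# Returns 0 if PATH exists and the current user may execute it.
check_executable() {
    if [ ! -e "$1" ]; then
        echo "missing: $1"
        return 2
    elif [ ! -x "$1" ]; then
        echo "not executable: $1"
        return 1
    fi
    echo "ok: $1"
}
```

For example, check_executable /home/pi/demo/fpi-serial run as user pi on each compute node; if it reports "not executable", chmod +x on the file is the usual fix.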

Hope this helps

Hello Adarsh, and thanks for your help. I am new to installing PBS, even though I used it a lot to run my simulations a few years ago. I am building a cluster of Raspberry Pis and need PBS; for now I am testing it before I install all the nodes. The current setup is 1 master (PiMaster) and 2 nodes (Pi01 and Pi02).

For your points:

  1. I don’t think SELinux is installed

  2. Ports 15001-15009 and 17001: here is the nmap output for 15001-15009

    pi@PiMaster:/mnt/nfs $ sudo nmap -p 15001-15009 192.168.0.106
    Starting Nmap 7.70 ( https://nmap.org ) at 2021-03-05 08:26 EST
    Nmap scan report for PiMaster (192.168.0.106)
    Host is up (0.000072s latency).

    PORT STATE SERVICE
    15001/tcp open unknown
    15002/tcp closed onep-tls
    15003/tcp closed unknown
    15004/tcp closed unknown
    15005/tcp closed unknown
    15006/tcp closed unknown
    15007/tcp open unknown
    15008/tcp closed unknown
    15009/tcp closed unknown

    Nmap done: 1 IP address (1 host up) scanned in 0.52 seconds

  3. The /etc/hosts file is the same on all machines:

127.0.0.1 localhost
::1 localhost ip6-localhost ip6-loopback
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters

192.168.0.106 PiMaster
192.168.0.102 Pi01
192.168.0.103 Pi02

  4. SSH from the master to the nodes, and back, works without a password.

  5. Can you please advise how to do this:
    Please check and share the server / scheduler / comm logs , the tail end of the logs should have some reason for exiting.

  6. The submission script is

#PBS -lselect=3:ncpus=4
mpiexec /mnt/nfsshare/helloworld

Thanks again for your help.
Best and stay safe

The script I referred to in my first submission is:

#!/bin/bash
# Calculate pi by integrating f(x)=4/(1+x^2)
# Request 1 node with 1 free CPU core, since this code uses no parallel features
# by default. If Portland compilers are used with the -mautopar flag, change the
# ppn value to something between 2 and 8, and adjust the last line of this job file
# accordingly.
#PBS -l nodes=1:ppn=1
# Reserve 24 hours on selected cores
#PBS -l walltime=24:00:00
# Give the job a descriptive name for emails (name must start with a letter)
#PBS -N Finding_pi_serial
# Send mail to address given below when the job begins, ends normally, or aborts
#PBS -m bea
#PBS -M mazeh01@hotmail.com

cd $PBS_O_WORKDIR

# Single-CPU version
/home/pi/demo/fpi-serial

# Example 4-CPU version. Only works if code was compiled with Portland compilers
# and if you set the ppn value above to 4; commented out by default.
# ./fpi-serial -np 4

Output file is:
SSH is enabled and the default password for the ‘pi’ user has not been changed.
This is a security risk - please login as the ‘pi’ user and type ‘passwd’ to set a new password.

Error file is:
/var/spool/pbs/mom_priv/jobs/3003.PiMaster.SC: line 28: /home/pi/demo/fpi-serial: Permission denied

Strangely, the output file contains the last message that is printed when I open an SSH connection:

pi@PiMaster:~/demo $ ssh pi01
Linux Pi01 5.10.17-v7+ #1403 SMP Mon Feb 22 11:29:51 GMT 2021 armv7l

The programs included with the Debian GNU/Linux system are free software;
the exact distribution terms for each program are described in the
individual files in /usr/share/doc/*/copyright.

Debian GNU/Linux comes with ABSOLUTELY NO WARRANTY, to the extent
permitted by applicable law.
Last login: Fri Mar 5 08:17:38 2021 from 192.168.0.103

SSH is enabled and the default password for the ‘pi’ user has not been changed.
This is a security risk - please login as the ‘pi’ user and type ‘passwd’ to set a new password.

pi@Pi01:~ $

Sorry, I don’t know what happened to the large characters, maybe the symbols in the script statements.

Thank you for sharing the above details @nmazeh

  1. You can update your script as below

#!/bin/bash
#PBS -l select=1:ncpus=1
#PBS -l walltime=24:00:00
#PBS -N Finding_pi_serial
#PBS -m abe
#PBS -M mazeh01@hotmail.com
cd $PBS_O_WORKDIR
/home/pi/demo/fpi-serial

Quick queries:
0. Please share the output of these commands

cat /etc/pbs.conf
qstat --version

  1. Is /home/pi/demo/fpi-serial accessible on Pi01 and Pi02? Is /home shared between the PBS Server and the PBS compute nodes?
  2. Please check and share whether the services are running on the PBS Server and the PBS compute nodes by running the command below

ps -ef | grep pbs_

Please start the PBS services once more (systemctl start pbs or /etc/init.d/pbs start) and check the logs in these locations:

source /etc/pbs.conf
cd $PBS_HOME/server_logs # location of pbs server logs
cd $PBS_HOME/sched_logs # location of pbs sched logs
cd $PBS_HOME/comm_logs #location of pbs_comm logs

On the compute node, the MoM logs are here (after /etc/init.d/pbs start):

source /etc/pbs.conf
cd $PBS_HOME/mom_logs #location of PBS MOM logs
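
To look at the tail end of a log without changing directory, something like the sketch below may help. It assumes that each daemon writes one file per day named YYYYMMDD inside its log directory; pbs_log_path is a hypothetical helper name, not a PBS command.

```shell
#!/bin/sh
# pbs_log_path is a hypothetical helper, not a PBS command.
# Usage: pbs_log_path PBS_HOME DAEMON [YYYYMMDD]
# DAEMON is one of: server, sched, comm, mom.
# Assumes log files are named after the date, one file per day.
pbs_log_path() {
    # Fall back to today's date when no date is given.
    echo "$1/$2_logs/${3:-$(date +%Y%m%d)}"
}
```

After sourcing /etc/pbs.conf, the last lines of today's server log could then be read with: sudo tail -n 50 "$(pbs_log_path "$PBS_HOME" server)"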

Adarsh,

Thanks so much for helping me out. I fixed a few things about the files and the shared directory, and ran just the simple Helloworld code. I don’t have an error anymore, but the output is not complete. Below is what you requested.

cat /etc/pbs.conf
PBS_SERVER=PiMaster
PBS_START_SERVER=1
PBS_START_SCHED=1
PBS_START_COMM=1
PBS_START_MOM=0
PBS_EXEC=/opt/pbs
PBS_HOME=/var/spool/pbs
PBS_CORE_LIMIT=unlimited
PBS_SCP=/usr/bin/scp

qstat --version
pbs_version = 20.0.0

Script
#!/bin/bash
#PBS -l select=2:ncpus=4
#PBS -l walltime=24:00:00
#PBS -N Helloworld
#PBS -m abe
#PBS -M mazeh01@hotmail.com
cd $PBS_O_WORKDIR
/mnt/nfs/helloworld

error file:

output file:
SSH is enabled and the default password for the ‘pi’ user has not been changed.
This is a security risk - please login as the ‘pi’ user and type ‘passwd’ to se$

Hello world from processor Pi01, rank 0 out of 1 processors

pi@PiMaster:/mnt/nfs $ ps -ef | grep pbs_
root 621 1 0 17:56 ? 00:00:00 /opt/pbs/sbin/pbs_comm
root 640 1 0 17:56 ? 00:00:00 /opt/pbs/sbin/pbs_sched
root 714 1 0 17:56 ? 00:00:00 /opt/pbs/sbin/pbs_ds_monitor monitor
postgres 843 738 0 17:56 ? 00:00:00 postgres: postgres pbs_datastore 192.168.0.106(38472) idle
root 895 1 0 17:56 ? 00:00:00 /opt/pbs/sbin/pbs_server.bin
pi 1257 922 0 18:14 pts/0 00:00:00 grep --color=auto pbs_

pi@Pi01:/mnt/nfs $ ps -ef | grep pbs_
root 582 1 0 16:25 ? 00:00:00 /opt/pbs/sbin/pbs_mom
pi 1302 1013 0 18:14 pts/0 00:00:00 grep --color=auto pbs_

pi@Pi02:/mnt/nfs $ ps -ef | grep pbs_
root 588 1 0 16:25 ? 00:00:00 /opt/pbs/sbin/pbs_mom
pi 1151 1019 0 18:15 pts/0 00:00:00 grep --color=auto pbs_

pi@PiMaster:/mnt/nfs $ systemctl start pbs
==== AUTHENTICATING FOR org.freedesktop.systemd1.manage-units ===
Authentication is required to start ‘pbs.service’.
Authenticating as: , (pi)
Password:
==== AUTHENTICATION COMPLETE ===

pi@PiMaster:/mnt/nfs $ sudo /etc/init.d/pbs start
Starting PBS
PBS comm already running.
PBS scheduler already running.
PBS Server already running.

On Master
03/05/2021 00:00:00;0002;Server@pimaster;Svr;Log;Log opened
03/05/2021 00:00:00;0002;Server@pimaster;Svr;Server@pimaster;pbs_version=20.0.0
03/05/2021 00:00:00;0002;Server@pimaster;Svr;Server@pimaster;pbs_build=mach=N/A:security=N/A:configure_args=N/A
03/05/2021 00:00:00;0002;Server@pimaster;Svr;Server@pimaster;hostname=pimaster;pbs_leaf_name=N/A;pbs_mom_node_name=N/A
03/05/2021 00:00:00;0002;Server@pimaster;Svr;Server@pimaster;ipv4 interface lo: ip6-loopback
03/05/2021 00:00:00;0002;Server@pimaster;Svr;Server@pimaster;ipv4 interface eth0: PiMaster
03/05/2021 00:00:00;0002;Server@pimaster;Svr;Server@pimaster;ipv6 interface lo: ip6-loopback
03/05/2021 00:00:00;0002;Server@pimaster;Svr;Act;Account file /var/spool/pbs/server_priv/accounting/20210305 opened
03/05/2021 08:12:02;0100;Server@pimaster;Req;;Type 0 request received from root@pimaster, sock=16
03/05/2021 08:12:03;0100;Server@pimaster;Req;;Type 95 request received from root@pimaster, sock=17
03/05/2021 08:12:03;0100;Server@pimaster;Req;;Type 21 request received from root@pimaster, sock=16
03/05/2021 08:12:03;0100;Server@pimaster;Req;;Type 0 request received from root@pimaster, sock=16
03/05/2021 08:12:03;0100;Server@pimaster;Req;;Type 95 request received from root@pimaster, sock=17
03/05/2021 08:12:03;0100;Server@pimaster;Req;;Type 17 request received from root@pimaster, sock=16
03/05/2021 08:12:03;0086;Server@pimaster;Svr;Server@pimaster;Shutdown request from root@pimaster
03/05/2021 08:12:03;0086;Server@pimaster;Svr;Server@pimaster;Starting to shutdown the server, type is Quick
03/05/2021 08:12:03;0001;Server@pimaster;Svr;Server@pimaster;PBS server internal error (15011) in svr_save_db, Failed to save server Execution of Prepare$
server closed the connection unexpectedly
This probably means the server terminated abnormally

On node Pi01
03/05/2021 08:12:03;0002;pbs_mom;Svr;Log;Log opened
03/05/2021 08:12:03;0002;pbs_mom;Svr;pbs_mom;pbs_version=20.0.0
03/05/2021 08:12:03;0002;pbs_mom;Svr;pbs_mom;pbs_build=mach=N/A:security=N/A:configure_args=N/A
03/05/2021 08:12:03;0002;pbs_mom;Svr;pbs_mom;hostname=pi01;pbs_leaf_name=N/A;pbs_mom_node_name=N/A
03/05/2021 08:12:03;0002;pbs_mom;Svr;pbs_mom;ipv4 interface lo: ip6-loopback
03/05/2021 08:12:03;0002;pbs_mom;Svr;pbs_mom;ipv4 interface eth0: Pi01
03/05/2021 08:12:03;0002;pbs_mom;Svr;pbs_mom;ipv6 interface lo: ip6-loopback
03/05/2021 08:12:03;0001;pbs_mom;Svr;pbs_mom;im_eof, Premature end of message from addr 192.168.0.106:15001 on stream 0
03/05/2021 08:12:03;0002;pbs_mom;Svr;im_eof;Server closed connection.
03/05/2021 08:12:03;0002;pbs_mom;Svr;pbs_mom;HELLO sent to server at PiMaster:15001, stream:1
03/05/2021 08:12:03;0001;pbs_mom;Svr;pbs_mom;im_eof, Premature end of message from addr 192.168.0.106:15001 on stream 1
03/05/2021 08:12:03;0002;pbs_mom;Svr;im_eof;Server closed connection.
03/05/2021 08:12:03;0c06;pbs_mom;TPP;pbs_mom(Thread 0);Connection to pbs_comm PiMaster:17001 down
03/05/2021 08:12:03;0001;pbs_mom;Svr;net_down_handler;net down handler called
03/05/2021 08:12:41;0002;pbs_mom;Svr;pbs_mom;caught signal 15
03/05/2021 08:12:41;0d80;pbs_mom;TPP;pbs_mom(Main Thread);Shutting down TPP transport Layer
03/05/2021 08:12:41;0d80;pbs_mom;TPP;pbs_mom(Thread 0);Thrd exiting, had 1 connections
03/05/2021 08:12:41;0002;pbs_mom;Svr;pbs_mom;Is down
03/05/2021 08:12:41;0002;pbs_mom;Svr;Log;Log closed

on node Pi02
03/05/2021 08:12:03;0002;pbs_mom;Svr;Log;Log opened
03/05/2021 08:12:03;0002;pbs_mom;Svr;pbs_mom;pbs_version=20.0.0
03/05/2021 08:12:03;0002;pbs_mom;Svr;pbs_mom;pbs_build=mach=N/A:security=N/A:configure_args=N/A
03/05/2021 08:12:03;0002;pbs_mom;Svr;pbs_mom;hostname=pi02;pbs_leaf_name=N/A;pbs_mom_node_name=N/A
03/05/2021 08:12:03;0002;pbs_mom;Svr;pbs_mom;ipv4 interface lo: ip6-loopback
03/05/2021 08:12:03;0002;pbs_mom;Svr;pbs_mom;ipv4 interface eth0: pi02
03/05/2021 08:12:03;0002;pbs_mom;Svr;pbs_mom;ipv6 interface lo: ip6-loopback
03/05/2021 08:12:03;0001;pbs_mom;Svr;pbs_mom;im_eof, Premature end of message from addr 192.168.0.106:15001 on stream 0
03/05/2021 08:12:03;0002;pbs_mom;Svr;im_eof;Server closed connection.
03/05/2021 08:12:03;0002;pbs_mom;Svr;pbs_mom;HELLO sent to server at PiMaster:15001, stream:1
03/05/2021 08:12:03;0001;pbs_mom;Svr;pbs_mom;im_eof, Premature end of message from addr 192.168.0.106:15001 on stream 1
03/05/2021 08:12:03;0002;pbs_mom;Svr;im_eof;Server closed connection.
03/05/2021 08:12:03;0c06;pbs_mom;TPP;pbs_mom(Thread 0);Connection to pbs_comm PiMaster:17001 down
03/05/2021 08:12:03;0001;pbs_mom;Svr;net_down_handler;net down handler called
03/05/2021 08:13:23;0002;pbs_mom;Svr;pbs_mom;caught signal 15
03/05/2021 08:13:23;0d80;pbs_mom;TPP;pbs_mom(Main Thread);Shutting down TPP transport Layer
03/05/2021 08:13:23;0d80;pbs_mom;TPP;pbs_mom(Thread 0);Thrd exiting, had 1 connections
03/05/2021 08:13:23;0002;pbs_mom;Svr;pbs_mom;Is down
03/05/2021 08:13:23;0002;pbs_mom;Svr;Log;Log closed
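
In the logs above, the fields are separated by semicolons, and the lines that describe failures all carry the code 0001 in the second field. A minimal sketch for pulling only those lines out of a long log follows; pbs_error_lines is a hypothetical name, and treating 0001 as "the interesting lines" is an assumption based on this thread's logs rather than on the PBS documentation.

```shell
#!/bin/sh
# pbs_error_lines is a hypothetical helper, not a PBS command.
# Reads PBS log lines on stdin and prints only those whose second
# ';'-separated field is exactly 0001.
pbs_error_lines() {
    awk -F';' '$2 == "0001"'
}
```

It would be used as a filter, for example: sudo cat /var/spool/pbs/server_logs/20210305 | pbs_error_lines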

Hello Adarsh,

Do you see indications of anything wrong? I am still trying to figure out why the PBS script is not fully executed. If I execute an mpiexec with a hostfile it works but with the PBS script it terminates before completing the task.
Thanks for your help.

Hi @nmazeh

Could you please try a simple script? Please run the script below (make sure the quotes and hyphens are still correct after you copy and paste, as they might get altered):
cat pbsscript.sh

#!/bin/bash
#PBS -N test
#PBS -l select=1:ncpus=1
echo "Hi"
date
echo "############################"
env
echo "############################"
echo "bye"
exit 0

qsub pbsscript.sh

  • check stdout and stderr of this job.
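
Since copy and paste from a web page can silently replace plain quotes and hyphens with typographic ones, a check along these lines may be useful before submitting. It is a sketch only; find_smart_punctuation is a hypothetical name, and it looks for just four typographic quote characters and the en dash.

```shell
#!/bin/sh
# find_smart_punctuation is a hypothetical helper.
# Usage: find_smart_punctuation FILE
# Prints "line:text" for each line containing a typographic quote or en dash.
find_smart_punctuation() (
    ldq=$(printf '\342\200\234')    # left double quote, U+201C
    rdq=$(printf '\342\200\235')    # right double quote, U+201D
    lsq=$(printf '\342\200\230')    # left single quote, U+2018
    rsq=$(printf '\342\200\231')    # right single quote, U+2019
    ndash=$(printf '\342\200\223')  # en dash, U+2013
    # -F treats the patterns as fixed strings, -n prefixes line numbers
    LC_ALL=C grep -n -F -e "$ldq" -e "$rdq" -e "$lsq" -e "$rsq" -e "$ndash" "$1"
)
```

If find_smart_punctuation pbsscript.sh prints nothing, the script contains none of these characters.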

Please let us know the outcome

Hi Adarsh

Here is the output file:

SSH is enabled and the default password for the ‘pi’ user has not been changed.
This is a security risk - please login as the ‘pi’ user and type ‘passwd’ to set a new password.

Hi
Tue Mar 9 17:00:18 EST 2021
########################
SHELL=/bin/bash
PBS_TASKNUM=1
PBS_JOBID=4022.PiMaster
PBS_O_PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/local/games:/usr/games:/opt/pbs/bin:/opt/mpi/bin
PBS_NODEFILE=/var/spool/pbs/aux/4022.PiMaster
PBS_O_SYSTEM=Linux
NO_AT_BRIDGE=1
PBS_ENVIRONMENT=PBS_BATCH
PWD=/home/pi
PBS_O_QUEUE=dev
LOGNAME=pi
PBS_JOBCOOKIE=39A141E42E4A36E510D46C8C805912D6
MANPATH=:/opt/pbs/share/man
PBS_O_HOME=/home/pi
PBS_MOMPORT=15003
PBS_JOBNAME=test
PBS_NODENUM=0
HOME=/home/pi
NCPUS=1
PBS_JOBDIR=/home/pi
TMPDIR=/var/tmp/pbs.4022.PiMaster
PBS_O_LANG=en_US.UTF-8
PBS_O_LOGNAME=pi
PBS_O_MAIL=/var/mail/pi
PBS_QUEUE=dev
USER=pi
ENVIRONMENT=BATCH
PBS_O_HOST=pimaster
SHLVL=2
OMP_NUM_THREADS=1
PBS_O_SHELL=/bin/bash
PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/local/games:/usr/games:/opt/pbs/bin
PBS_O_WORKDIR=/mnt/nfs
TEXTDOMAIN=Linux-PAM
_=/usr/bin/env
########################
bye

Let me know.
Thanks

Thank you @nmazeh. There are no issues; the job completed successfully.
Did you find any issues with respect to this job?

Hello Adarsh,
No, I did not find any issues with this specific job but could we test broadcasting to other nodes? I don’t understand why the script I am running is ending prematurely:

-----03/05/2021 08:12:03;0001;Server@pimaster;Svr;Server@pimaster;PBS server internal error (15011) in svr_save_db, Failed to save server Execution of Prepare$
server closed the connection unexpectedly
This probably means the server terminated abnormally------

Are there some settings that we are missing?

Again here is my code:
#include <mpi.h>
#include <stdio.h>
int main(int argc, char** argv) {
    // initialize MPI environment
    MPI_Init(NULL, NULL);
    // get # of processes
    int world_size;
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);
    // get rank of the process
    int world_rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
    // get name of processor
    char processor_name[MPI_MAX_PROCESSOR_NAME];
    int name_len;
    MPI_Get_processor_name(processor_name, &name_len);
    // print a hello world message
    printf("Hello world from processor %s, rank %d out of %d processors\n", processor_na$
    // finalize MPI environment
    MPI_Finalize();
}

and the script:
#!/bin/bash
#PBS -l select=2:ncpus=4
#PBS -l walltime=24:00:00
#PBS -N Helloworld
#PBS -m abe
#PBS -M mazeh01@hotmail.com
cd $PBS_O_WORKDIR
/mnt/nfs/helloworld

And the output file:
SSH is enabled and the default password for the ‘pi’ user has not been changed.
This is a security risk - please login as the ‘pi’ user and type ‘passwd’ to set a n$

Hello world from processor Pi01, rank 0 out of 1 processors

There is an issue: the job does not run on 2 nodes with 4 processors each. Only a single rank, on Pi01, reports.
Thanks.

Without using PBS Pro, can you run this manually on multiple nodes?
You would have to use something like this, depending on the flavour of MPI (Intel, Platform, Open MPI, etc.): mpirun -np <total_no_of_cpus> -hostfile hostfile /mnt/nfs/helloworld

The PBS script should look something like this:

#!/bin/bash
#PBS -l select=2:ncpus=4:mpiprocs=4
#PBS -l place=scatter
#PBS -l walltime=24:00:00
#PBS -N Helloworld
#PBS -m abe
#PBS -M mazeh01@hotmail.com
cd $PBS_O_WORKDIR
total_cores_for_this_job=`cat $PBS_NODEFILE | wc -l`
/path/to/mpirun -np $total_cores_for_this_job -hostfile $PBS_NODEFILE /mnt/nfs/helloworld

Thank you

Adarsh, then I need to understand what is available with OpenPBS. If I run the following script it works:
#!/bin/bash
#PBS -l select=2:ncpus=4:mpiprocs=4
#PBS -l place=scatter
#PBS -l walltime=24:00:00
#PBS -N Helloworld
#PBS -m abe
#PBS -M mazeh01@hotmail.com
cd $PBS_O_WORKDIR
total_cores_for_this_job=`cat $PBS_NODEFILE | wc -l`
mpiexec -np 8 -hostfile machinefile /mnt/nfs/helloworld

But am I using PBS to distribute the work to the nodes? When I have 30 nodes running and need to run multiple instances of my code on different nodes, with some jobs queued, will the above script still be useful?

Does OpenPBS allow for running multiple instances distributed to specific nodes and have a queue for the jobs?

Thank you again for your wonderful help.
Best.

@nmazeh

openPBS supports

  • Serial jobs (SMP)
  • Parallel jobs (MPP / SPMD / MPI )
  • OpenMP / Hybrid jobs
  • Job arrays
  • interactive jobs

Sure, OpenPBS will take care of this transparently by default.
Instead of -hostfile machinefile, you need to use -hostfile $PBS_NODEFILE.

This is supported by OpenPBS by default; it is the core functionality of any workload manager or queuing system.
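
As a sketch of the change described above, the helper below builds the mpiexec command line from a node file instead of a hand-written machinefile. mpi_command is a hypothetical name and it only prints the command. The -np and -hostfile spelling is taken from the scripts in this thread; other MPI flavours may spell these options differently.

```shell
#!/bin/sh
# mpi_command is a hypothetical helper; it prints a command line
# and does not start anything.
# Usage: mpi_command NODEFILE PROGRAM
mpi_command() (
    # One line in the node file is taken as one MPI process.
    nprocs=$(( $(wc -l < "$1") ))
    echo "mpiexec -np $nprocs -hostfile $1 $2"
)
```

Inside a job script it could be called as mpi_command "$PBS_NODEFILE" /mnt/nfs/helloworld; once the printed line looks right, that line itself is what the job script should run.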

Please search for MPI in this document: https://www.altair.com/pdfs/pbsworks/PBSAdminGuide2020.1.pdf

Cheers

Instructions for submitting jobs that use MPI are in the PBS Professional User’s Guide.

Thank you all for your help. Where is the PBS_NODEFILE stored, and is there a format for listing the nodes in it?
Cheers

Please try this script, using

qsub sample.sh

cat sample.sh

#!/bin/bash
#PBS -N pbsnodefile
#PBS -l select=2:ncpus=4:mpiprocs=4
echo $PBS_NODEFILE
cat $PBS_NODEFILE
exit 0
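
In the env output earlier in this thread, PBS_NODEFILE was /var/spool/pbs/aux/4022.PiMaster, so the file is created per job under $PBS_HOME/aux and named after the job ID. Its format is plain text with one host name per line, one line per MPI process. The sketch below summarises such a file; nodefile_summary is a hypothetical name.

```shell
#!/bin/sh
# nodefile_summary is a hypothetical helper, not a PBS command.
# Usage: nodefile_summary NODEFILE
# Prints "host count" for each distinct host, sorted by host name.
nodefile_summary() {
    sort "$1" | uniq -c | awk '{ print $2, $1 }'
}
```

Adding nodefile_summary "$PBS_NODEFILE" to the sample script should, for select=2:ncpus=4:mpiprocs=4 placed on two hosts, list each of the two hosts with a count of 4.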

I did as below and the job is in queue without progress

pi@PiMaster:/mnt/nfs $ cat sample.sh
#!/bin/bash
#PBS -N pbsnodefile
#PBS -l select=2:ncpus=4:mpiprocs=4
echo $PBS_NODEFILE
cat $PBS_NODEFILE
exit 0
pi@PiMaster:/mnt/nfs $ qstat
Job id         Name              User  Time Use  S  Queue
5001.PiMaster  My-OpenMP-Scrip*  pi    0         Q  dev
pi@PiMaster:/mnt/nfs $ qstat
Job id         Name              User  Time Use  S  Queue
5001.PiMaster  My-OpenMP-Scrip*  pi    0         Q  dev
pi@PiMaster:/mnt/nfs $
