Setting up openPBS on only one computer

Hello everyone,

I am new to HPC and am having trouble setting up openPBS on a Debian 10 (Buster) computer (a single machine, not a cluster). I want to set it up for our workgroup to run ORCA and Gaussian for quantum chemistry. The aim is for our group members to log in to the computer via SSH and submit their ORCA/Gaussian jobs to a queueing system, which then processes them one after another. I initially started with SLURM but gave up after failing to get it to run, so now I want to try openPBS. I am testing it on my Raspberry Pi with Raspbian Buster before trying it on the server. I followed the install manual for Debian 10 on GitHub to build openPBS. Everything worked fine (except sudo tcl tk libical1a --> tcl command not found).

The services are running:

  • sudo /etc/init.d/pbs status
    pbs_server is pid 945
    pbs_mom is pid 748
    pbs_sched is pid 762
    pbs_comm is 728

The command qstat -B got me a connection error (111) at first but I found the solution here and solved it by replacing the loopback in /etc/hosts with the IP of the system. Now qstat -B gives me:

Server      Max Tot Que Run Hld Wat Trn Ext Status
raspberrypi   0   0   0   0   0   0   0   0 Active

The command pbsnodes -a returns:

  • pbsnodes: Server has no node list

I tried to run a sleep job but qsub returns:

  • qsub: No default queue specified

As I said, I am not familiar with this kind of software, and I am grateful for any help because I do not know how to proceed. Since the services are running, I guess the problem has something to do with the configuration or settings.

Please make sure your /etc/hosts is populated and is the same on the head node (PBS server) and the compute nodes (the hostnames must be DNS resolvable).

  1. Add a node:

qmgr -c "create node NODENAME"
     or
qmgr -c "create node NODENAME Mom=NODENAME.openpbs.org"

     NODENAME is the hostname of the compute node, i.e. the output of the hostname command on that node.

  2. Add a queue:

qmgr -c "create queue testq queue_type=e,enabled=t,started=t"

  3. Check the output of pbsnodes -aSjv for the node status.

  4. Check the output of qstat -Bf and qmgr -c "print server" for the remaining status and configuration details.

  5. Submit a job:

qsub -q workq -- /bin/sleep 100
qsub -q testq -- /bin/sleep 1000
qstat -answ1  # shows the job status
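
Because the original error was "qsub: No default queue specified", it may also help to set the server's default_queue so that qsub works without -q. The following is only a sketch for a single machine: the node name raspberrypi and the queue testq come from this thread, while the function names and the PBS_DRY_RUN variable are hypothetical helpers for this example. With PBS_DRY_RUN=1 the qmgr commands are printed instead of executed, so they can be reviewed first.

```shell
#!/bin/sh
# Sketch: configure a single-machine PBS (server and MoM on the same host).
# run_qmgr, pbs_single_node_setup and PBS_DRY_RUN are hypothetical names.

run_qmgr() {
    if [ "${PBS_DRY_RUN:-0}" = "1" ]; then
        # Dry run: only print the command, using plain ASCII quotes.
        printf 'qmgr -c "%s"\n' "$1"
    else
        # Real run: requires root (or a PBS manager account).
        qmgr -c "$1"
    fi
}

pbs_single_node_setup() {
    node=$1
    queue=$2
    run_qmgr "create node $node" &&
    run_qmgr "create queue $queue queue_type=e,enabled=t,started=t" &&
    run_qmgr "set server default_queue=$queue"
}

# Example (dry run):
#   PBS_DRY_RUN=1; pbs_single_node_setup raspberrypi testq
```

Once a default queue is set, a plain qsub -- /bin/sleep 100 should be routed to it.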

Hope this helps

Hello adarsh, thank you for your help.

All qmgr commands return error 15007:

  • qmgr -c “create node raspberrypi”
    qmgr obj=raspberrypi svr=default: Unauthorized Request
    qmgr: Error (15007) returned from server

After checking the server logs, I think the problem might be a connection error:

  • pi@raspberrypi:~ $ tail -n 10 /var/spool/pbs/server_logs/20201006
    10/06/2020 23:19:43;0100;Server@raspberrypi;Req;;Type 21 request received from Scheduler@raspberrypi.fritz.box, sock=15
    10/06/2020 23:19:43;0100;Server@raspberrypi;Req;;Type 71 request received from Scheduler@raspberrypi.fritz.box, sock=15
    10/06/2020 23:19:43;0100;Server@raspberrypi;Req;;Type 58 request received from Scheduler@raspberrypi.fritz.box, sock=15
    10/06/2020 23:19:43;0080;Server@raspberrypi;Req;req_reject;Reject reply code=15064, aux=0, type=58, from Scheduler@raspberrypi.fritz.box
    10/06/2020 23:19:43;0001;Server@raspberrypi;Svr;Server@raspberrypi;is_request, bad attempt to connect from 192.168.178.92:15003
    10/06/2020 23:19:47;0001;Server@raspberrypi;Svr;Server@raspberrypi;is_request, bad attempt to connect from 192.168.178.92:15003
    10/06/2020 23:19:51;0001;Server@raspberrypi;Svr;Server@raspberrypi;is_request, bad attempt to connect from 192.168.178.92:15003
    10/06/2020 23:20:08;0001;Server@raspberrypi;Svr;Server@raspberrypi;is_request, bad attempt to connect from 192.168.178.92:15003
    10/06/2020 23:20:12;0001;Server@raspberrypi;Svr;Server@raspberrypi;is_request, bad attempt to connect from 192.168.178.92:15003
    10/06/2020 23:20:18;0001;Server@raspberrypi;Svr;Server@raspberrypi;is_request, bad attempt to connect from 192.168.178.92:15003

This is the server log after a fresh reboot of the system.
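
Repeated entries in a log like this are easier to spot when the messages are counted. The following is a minimal sketch, assuming the semicolon-separated layout of the lines shown above, where the message is the sixth field. summarize_pbs_log is a hypothetical helper name, not a PBS tool.

```shell
#!/bin/sh
# Sketch: count how often each message occurs in a PBS server log.
# Assumed line layout: timestamp;event code;daemon;class;object;message
# A message that itself contains ';' is cut off at the first ';'.

summarize_pbs_log() {
    # Reads the files given as arguments, or stdin if there are none.
    awk -F';' '
        NF >= 6 { count[$6]++ }
        END     { for (m in count) printf "%d\t%s\n", count[m], m }
    ' "$@" | sort -rn
}

# Example:
#   summarize_pbs_log /var/spool/pbs/server_logs/20201006
```

The most frequent message is printed first, which here would be the repeated "bad attempt to connect" entry.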

These are the contents of my hosts file. I cannot see an error here. (I commented out some lines.)

  • pi@raspberrypi:~ $ cat /etc/hosts
    #127.0.0.1 localhost
    #::1 localhost ip6-localhost ip6-loopback
    #ff02::1 ip6-allnodes
    #ff02::2 ip6-allrouters

    #127.0.1.1 raspberrypi
    #192.168.178.92 raspberrypi
    192.168.178.92 raspberrypi.fritz.box

and my pbs.conf

  • pi@raspberrypi:~ $ cat /etc/pbs.conf
    PBS_SERVER=raspberrypi.fritz.box
    PBS_START_SERVER=1
    PBS_START_SCHED=1
    PBS_START_COMM=1
    PBS_START_MOM=1
    PBS_EXEC=/opt/pbs
    PBS_HOME=/var/spool/pbs
    PBS_CORE_LIMIT=unlimited
    PBS_SCP=/usr/bin/scp

The PBS_SERVER name is the same as the hostname:

  • pi@raspberrypi:~ $ hostname
    raspberrypi
    pi@raspberrypi:~ $ hostname -f
    raspberrypi.fritz.box

I am also able to ping the hostname.

After running a qmgr command, this entry appears in the server log:

  • 10/06/2020 23:42:43;0080;Server@raspberrypi;Req;req_reject;Reject reply code=15007, aux=0, type=9, from pi@raspberrypi.fritz.box

Admin commands should be run as the root user. Please share the output of pbsnodes -aSjv and qstat -Bf, and check the following:

  1. Run qmgr -c "set server flatuid = true" # use it with caution

  2. Ports 15001 to 15009 and 17001 are not blocked (server to nodes, nodes to server, and between the nodes), or the firewall allows these ports.

  3. SELinux is disabled # if you disable it now, the system must be rebooted.

In /etc/hosts, try adding the short hostname as an alias for the server:
192.168.178.92 raspberrypi.fritz.box raspberrypi

And in /etc/pbs.conf:
PBS_SERVER=raspberrypi

Make sure the same /etc/hosts is present on all participating systems.

Thank you

Hi adarsh,

I changed the hosts file and the pbs.conf file as you suggested.

I ran qmgr -c "set server flatuid = true " without error but did not notice any change. However, after running qmgr -c “create node raspberrypi” as root I got:

  • root@raspberrypi:~# qmgr -c “create node raspberrypi”
    Unknown Host.
    qmgr: cannot connect to server node
    Unknown Host.
    qmgr: cannot connect to server raspberrypi”

–> no authentication error as root user

pbsnodes gives

  • root@raspberrypi:~# pbsnodes -aSjv
    pbsnodes: Server has no node list

and qstat -Bf

  • root@raspberrypi:~# qstat -Bf
    Server: raspberrypi
    server_state = Active
    server_host = raspberrypi.fritz.box
    scheduling = True
    total_jobs = 0
    state_count = Transit:0 Queued:0 Held:0 Waiting:0 Running:0 Exiting:0 Begun
    :0
    log_events = 511
    mailer = /usr/sbin/sendmail
    mail_from = adm
    query_other_jobs = True
    resources_default.ncpus = 1
    default_chunk.ncpus = 1
    scheduler_iteration = 600
    flatuid = True
    resv_enable = True
    node_fail_requeue = 310
    max_array_size = 10000
    pbs_license_min = 0
    pbs_license_max = 2147483647
    pbs_license_linger_time = 31536000
    license_count = Avail_Global:1000000 Avail_Local:1000000 Used:0 High_Use:0
    pbs_version = 20.0.0
    eligible_time_enable = False
    max_concurrent_provision = 5
    power_provisioning = False
    max_job_sequence_id = 9999999

root@raspberrypi:~#

I scanned the ports on the server IP (the one in /etc/hosts):

  • sudo nmap -p 15001-15009 192.168.178.92
    Starting Nmap 7.70 ( https://nmap.org ) at 2020-10-07 17:58 CEST
    Nmap scan report for raspberrypi.fritz.box (192.168.178.92)
    Host is up (0.020s latency).

    PORT      STATE  SERVICE
    15001/tcp open   unknown
    15002/tcp open   onep-tls
    15003/tcp open   unknown
    15004/tcp closed unknown
    15005/tcp closed unknown
    15006/tcp closed unknown
    15007/tcp open   unknown
    15008/tcp closed unknown
    15009/tcp closed unknown

    PORT      STATE  SERVICE
    17001/tcp open   unknown

I should mention again that in my case the server and the compute node are one and the same machine. No cluster intended.

My Linux distro does not use SELinux or a firewall.

Please share the output of these commands:

  1. ps -ef | grep pbs_
  2. hostname
  3. ping $(hostname)
  4. pbs_hostn -v $(hostname)

Could you please check the PBS server logs when you run the commands below?

root@raspberrypi:~# qmgr -c “create node raspberrypi”
root@raspberrypi:~# qmgr -c “create node raspberrypi Mom=raspberrypi.fritz.box”

  • root@raspberrypi:~# ps -ef | grep pbs_
    root 727 1 0 22:28 ? 00:00:00 /opt/pbs/sbin/pbs_comm
    root 747 1 0 22:28 ? 00:00:00 /opt/pbs/sbin/pbs_mom
    root 761 1 0 22:28 ? 00:00:00 /opt/pbs/sbin/pbs_sched
    root 842 1 0 22:28 ? 00:00:00 /opt/pbs/sbin/pbs_ds_monitor monitor
    postgres 941 868 0 22:28 ? 00:00:00 postgres: postgres pbs_datastore 192.168.178.92(51502) idle
    root 942 1 0 22:28 ? 00:00:00 /opt/pbs/sbin/pbs_server.bin
    root 1086 1065 0 22:30 pts/0 00:00:00 grep pbs_
  • root@raspberrypi:~# hostname
    raspberrypi

  • root@raspberrypi:~# ping $(hostname)
    PING raspberrypi.fritz.box (192.168.178.92) 56(84) bytes of data.
    64 bytes from raspberrypi.fritz.box (192.168.178.92): icmp_seq=1 ttl=64 time=0.086 ms
  • root@raspberrypi:~# pbs_hostn -v $(hostname)
    primary name: raspberrypi.fritz.box (from gethostbyname())
    aliases: raspberrypi
    address length: 4 bytes
    address: 192.168.178.92 (1555212480 dec) name: raspberrypi.fritz.box
    root@raspberrypi:~#

The log after running qmgr -c “create node raspberrypi”:

  • 10/07/2020 22:37:47;0001;Server@raspberrypi;Svr;Server@raspberrypi;is_request, bad attempt to connect from 192.168.178.92:15003

This log is repeated every 18 seconds. Executing the command only seems to reset the 18 sec timer and gives the following message in the shell:

Unknown Host.
qmgr: cannot connect to server node
Unknown Host.
qmgr: cannot connect to server raspberrypi”

qmgr -c “create node raspberrypi Mom=raspberrypi.fritz.box” results in the following server log:

  • 10/07/2020 22:47:46;0100;Server@raspberrypi;Req;;Type 0 request received from root@raspberrypi.fritz.box, sock=17
    10/07/2020 22:47:46;0100;Server@raspberrypi;Req;;Type 95 request received from root@raspberrypi.fritz.box, sock=18

It seems that the services + server are running and that the server receives some commands but is still unable to connect to itself on the same machine?

Thanks again for sharing the requested information

Please add the below line to your /etc/pbs.conf file

PBS_LEAF_NAME=192.168.178.92

  • restart the pbs services
  • add the compute node using qmgr

** I hope hostname -i output is 192.168.178.92

Thank you for your patience.

I hope hostname -i output is 192.168.178.92

It is.

Please add the below line to your /etc/pbs.conf file

PBS_LEAF_NAME=192.168.178.92

  • restart the pbs services
  • add the compute node using qmgr

I added the line and restarted the computer, but adding a node via qmgr changed nothing:

  • Unknown Host.
    qmgr: cannot connect to server node
    Unknown Host.
    qmgr: cannot connect to server raspberrypi”

My guess is that you still have an issue in your /etc/hosts file. I would restore the entries for localhost and then put both the long and short names on the 192.168.178.92 line. The result should include these lines:

127.0.0.1       localhost
::1             localhost ip6-localhost ip6-loopback
192.168.178.92  raspberrypi.fritz.box raspberrypi
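
Whether a hosts file follows this layout can be checked with a short script. The following is a minimal sketch: hosts_lookup, is_loopback and check_pbs_hosts are hypothetical helper names, and the rule that the PBS hostname should not map to a loopback address is an assumption based on the connection error (111) described at the start of the thread.

```shell
#!/bin/sh
# Sketch: check that a hosts file maps the PBS hostnames to a real address.

# hosts_lookup FILE NAME: print the first address that FILE maps NAME to.
hosts_lookup() {
    awk -v name="$2" '
        { sub(/#.*/, "") }    # strip comments, including commented-out lines
        { for (i = 2; i <= NF; i++) if ($i == name) { print $1; exit } }
    ' "$1"
}

# is_loopback ADDR: succeed if ADDR is 127.x.x.x or ::1.
is_loopback() {
    case $1 in
        127.*|::1) return 0 ;;
        *)         return 1 ;;
    esac
}

# check_pbs_hosts FILE NAME...: report each name; fail on the first problem.
check_pbs_hosts() {
    file=$1
    shift
    for n in "$@"; do
        addr=$(hosts_lookup "$file" "$n")
        if [ -z "$addr" ]; then
            echo "$n: no entry in $file"
            return 1
        fi
        if is_loopback "$addr"; then
            echo "$n: maps to loopback address $addr"
            return 1
        fi
        echo "$n -> $addr"
    done
}

# Example:
#   check_pbs_hosts /etc/hosts raspberrypi raspberrypi.fritz.box
```

Run against the earlier hosts file, where the short name only appeared on commented-out lines, the check would report that raspberrypi has no entry.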

Next, you report that qmgr says:

  • Unknown Host.
    qmgr: cannot connect to server node
    Unknown Host.
    qmgr: cannot connect to server raspberrypi”

Is there really a ” at the end of the last line from qmgr? If so, you somehow introduced a special character into the configuration somewhere. Since you’re just starting, it might be good to reinitialize PBS and start fresh. I haven’t done this in a while, but I think the command is

/opt/pbs/sbin/pbs_server -t create

Ignore that last part about re-initializing. The problem is that this bulletin board system converts computer quotes into text quotes unless you tell it to leave them alone. Thus, if someone asks you to type:

qmgr -c "s foo bar"

The board converts that to

qmgr -c “s foo bar”

If you copy/paste that directly from the web, it tells qmgr to try to connect to server bar”.

So, go back and try the commands again, copying them manually and using only computer quotes ".
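
A pasted command can also be checked for typographic quotes before it is run. The following is a minimal sketch: has_curly_quotes and straighten_quotes are hypothetical helper names, and only the two double-quote characters discussed here (U+201C and U+201D) are handled.

```shell
#!/bin/sh
# Sketch: detect and replace typographic double quotes in pasted commands.

# has_curly_quotes STRING: succeed if STRING contains a typographic double quote.
has_curly_quotes() {
    case $1 in
        *“*|*”*) return 0 ;;
        *)       return 1 ;;
    esac
}

# straighten_quotes: read text on stdin and write it with ASCII double quotes.
straighten_quotes() {
    sed -e 's/“/"/g' -e 's/”/"/g'
}

# Example:
#   pasted='qmgr -c “create node raspberrypi”'
#   has_curly_quotes "$pasted" && printf '%s\n' "$pasted" | straighten_quotes
```

Retyping the quotes by hand works just as well; the script only makes the difference visible.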


After your suggestion I think I found the error. It was indeed a copy-and-paste error. Now the qmgr commands are working. I successfully created the node “raspberry” and the queue “testq”. When I try to run a test job as root I get a UID error. I guess only normal users are permitted to run jobs, since it worked with a normal user?


Correct.
Running jobs as the root user is not allowed by default. It can be enabled in qmgr with s s acl_roots=root (s s is the abbreviation for set server).

I recently installed openPBS on the machine in my workgroup, but I had to switch the OS to CentOS 8, so I installed openpbs-server.rpm. The services are running, although I had to disable SELinux for that. The problem is that most commands do not work: they never return and I have to abort them with Ctrl-C in the terminal. For example, pbsnodes --version works but pbsnodes -aSjv does not, and qmgr does not work either. This happens regardless of whether I am logged in as root or as a normal user. I even changed the owner of /opt/pbs from root to the standard user via chown -R. What could be wrong here?

Please check the permissions of pbs_iff and pbs_rcp

source /etc/pbs.conf
chmod 4755 $PBS_EXEC/sbin/pbs_iff
chmod 4755 $PBS_EXEC/sbin/pbs_rcp

Also, you might have to run $PBS_EXEC/sbin/pbs_probe -fv (this might correct the permissions on the top-level PBS folders; you would have to fix the rest yourself).
Hope this helps
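
The result of the chmod 4755 commands can be verified by testing for the setuid bit. The following is a minimal sketch: report_setuid is a hypothetical helper name. It also prints the numeric owner of each file, on the assumption that a setuid helper depends on its owner, which the chown -R mentioned above may have changed.

```shell
#!/bin/sh
# Sketch: report the setuid bit and numeric owner of the given files.

report_setuid() {
    for f in "$@"; do
        if [ ! -e "$f" ]; then
            echo "$f: not found"
            continue
        fi
        # Third column of "ls -ldn" is the numeric user id of the owner.
        owner=$(ls -ldn "$f" | awk '{ print $3 }')
        if [ -u "$f" ]; then
            echo "$f: setuid set, owner uid $owner"
        else
            echo "$f: setuid NOT set, owner uid $owner"
        fi
    done
}

# Example:
#   . /etc/pbs.conf
#   report_setuid "$PBS_EXEC/sbin/pbs_iff" "$PBS_EXEC/sbin/pbs_rcp"
```

An owner uid of 0 means the file belongs to root.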

I found the problem. I was using a non-login shell. The commands are working via SSH. Now pbs status shows me that pbs_server is not running (mom, sched and comm are running).

  1. /etc/init.d/pbs start
  2. check and share the $PBS_HOME/server_logs/YYYYMMDD
  3. check and share the datastore logs