I am new to HPC and am having trouble setting up OpenPBS on a Debian 10 (Buster) computer (a single machine, not a cluster). I want to set it up for our workgroup to run ORCA and Gaussian for quantum chemistry. The aim is to let our group members log in to the computer via SSH and submit their ORCA/Gaussian jobs to a queueing system, which then processes them one after another. I initially started with SLURM but gave up after failing to get it to run. Now I want to try OpenPBS. I am testing it on my Raspberry Pi with Raspbian Buster before trying it on the server. I followed the install manual for Debian 10 on GitHub to build OpenPBS. Everything worked fine (except sudo tcl tk libical1a --> tcl command not found).
The services are running:
sudo /etc/init.d/pbs status
pbs_server is pid 945
pbs_mom is pid 748
pbs_sched is pid 762
pbs_comm is 728
The command qstat -B got me a connection error (111) at first but I found the solution here and solved it by replacing the loopback in /etc/hosts with the IP of the system. Now qstat -B gives me:
Server Max Tot Que Run Hld Wat Trn Ext Status
raspberrypi 0 0 0 0 0 0 0 0 Active
The command pbsnodes -a returns:
pbsnodes: Server has no node list
I tried to run a sleep job but qsub returns:
qsub: No default queue specified
As I said, I am not familiar with this kind of program and would be grateful for any help, because I do not know how to proceed. Since the services are running, I guess the problem has something to do with the configuration or settings.
qmgr -c “create node raspberrypi”
qmgr obj=raspberrypi svr=default: Unauthorized Request
qmgr: Error (15007) returned from server
After checking the server logs, I think the problem might be a connection error:
pi@raspberrypi:~ $ tail -n 10 /var/spool/pbs/server_logs/20201006
10/06/2020 23:19:43;0100;Server@raspberrypi;Req;;Type 21 request received from Scheduler@raspberrypi.fritz.box, sock=15
10/06/2020 23:19:43;0100;Server@raspberrypi;Req;;Type 71 request received from Scheduler@raspberrypi.fritz.box, sock=15
10/06/2020 23:19:43;0100;Server@raspberrypi;Req;;Type 58 request received from Scheduler@raspberrypi.fritz.box, sock=15
10/06/2020 23:19:43;0080;Server@raspberrypi;Req;req_reject;Reject reply code=15064, aux=0, type=58, from Scheduler@raspberrypi.fritz.box
10/06/2020 23:19:43;0001;Server@raspberrypi;Svr;Server@raspberrypi;is_request, bad attempt to connect from 192.168.178.92:15003
10/06/2020 23:19:47;0001;Server@raspberrypi;Svr;Server@raspberrypi;is_request, bad attempt to connect from 192.168.178.92:15003
10/06/2020 23:19:51;0001;Server@raspberrypi;Svr;Server@raspberrypi;is_request, bad attempt to connect from 192.168.178.92:15003
10/06/2020 23:20:08;0001;Server@raspberrypi;Svr;Server@raspberrypi;is_request, bad attempt to connect from 192.168.178.92:15003
10/06/2020 23:20:12;0001;Server@raspberrypi;Svr;Server@raspberrypi;is_request, bad attempt to connect from 192.168.178.92:15003
10/06/2020 23:20:18;0001;Server@raspberrypi;Svr;Server@raspberrypi;is_request, bad attempt to connect from 192.168.178.92:15003
This is the server log after a fresh reboot of the system.
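To see at a glance how often these rejections occur and where they come from, the log can be summarised with a short filter. This is only a sketch: it assumes the semicolon-separated layout shown in the excerpt above, and count_bad_connects is a name made up for illustration.

```shell
# Count "bad attempt to connect" entries per source address in a PBS server log.
# Assumes the semicolon-separated layout shown in the excerpt above.
count_bad_connects() {
    awk -F';' '
        /bad attempt to connect from/ {
            n = split($NF, words, " ")   # last field holds the message text
            seen[words[n]]++             # last word is the address:port
        }
        END { for (addr in seen) print seen[addr], addr }
    ' "$@"
}

# Usage (log path taken from the post above):
# count_bad_connects /var/spool/pbs/server_logs/20201006
```

Each output line is a count followed by the address:port that was rejected. With no file argument the function reads from standard input.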
These are the contents of my hosts file. I cannot see an error here. (I commented out some lines.)
I changed the hosts file and the pbs.conf file as you suggested.
I ran qmgr -c "set server flatuid = true " without error but did not notice any change. However, after running qmgr -c “create node raspberrypi” as root I got:
root@raspberrypi:~# qmgr -c “create node raspberrypi”
Unknown Host.
qmgr: cannot connect to server node
Unknown Host.
qmgr: cannot connect to server raspberrypi”
--> no authentication error as root user
pbsnodes gives
root@raspberrypi:~# pbsnodes -aSjv
pbsnodes: Server has no node list
I scanned the ports for the server IP (the one in /etc/hosts):
sudo nmap -p 15001-15009 192.168.178.92
Starting Nmap 7.70 ( https://nmap.org ) at 2020-10-07 17:58 CEST
Nmap scan report for raspberrypi.fritz.box (192.168.178.92)
Host is up (0.020s latency).
PORT STATE SERVICE
15001/tcp open unknown
15002/tcp open onep-tls
15003/tcp open unknown
15004/tcp closed unknown
15005/tcp closed unknown
15006/tcp closed unknown
15007/tcp open unknown
15008/tcp closed unknown
15009/tcp closed unknown
PORT STATE SERVICE
17001/tcp open unknown
I should mention again that in my case the server and the compute node are one and the same machine. No cluster intended.
My Linux distro does not use SELinux or a firewall.
The log after using qmgr -c “create node raspberrypi”:
10/07/2020 22:37:47;0001;Server@raspberrypi;Svr;Server@raspberrypi;is_request, bad attempt to connect from 192.168.178.92:15003
This log entry is repeated every 18 seconds. Executing the command only seems to reset the 18-second timer and prints the following message in the shell:
Unknown Host.
qmgr: cannot connect to server node
Unknown Host.
qmgr: cannot connect to server raspberrypi”
qmgr -c “create node raspberrypi Mom=raspberrypi.fritz.box” results in the following server log:
10/07/2020 22:47:46;0100;Server@raspberrypi;Req;;Type 0 request received from root@raspberrypi.fritz.box, sock=17
10/07/2020 22:47:46;0100;Server@raspberrypi;Req;;Type 95 request received from root@raspberrypi.fritz.box, sock=18
It seems that the services + server are running and that the server receives some commands but is still unable to connect to itself on the same machine?
My guess is that you still have an issue in your /etc/hosts file. I would restore the entries for localhost and then put both the long and short names on the 192.168.178.92 line. The result should include these lines:
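A sketch of what those entries would presumably look like is embedded in the example below. The address and host names are the ones from this thread, but the exact layout is an assumption, not a copy of anyone's actual file. The helper hosts_lookup is a made-up name; it reports which address a hosts-style file assigns to a name, so you can confirm that the host name does not map to a loopback address.

```shell
# Print the address that a hosts-style file assigns to a given name.
# Usage: hosts_lookup NAME [FILE]   (FILE defaults to /etc/hosts)
hosts_lookup() {
    awk -v name="$1" '
        { sub(/#.*/, "") }                     # drop comments
        { for (i = 2; i <= NF; i++)            # field 1 is the address
              if ($i == name) { print $1; exit } }
    ' "${2:-/etc/hosts}"
}

# Assumed sample content, reconstructed from the advice above.
sample=$(mktemp)
cat > "$sample" <<'EOF'
127.0.0.1       localhost
192.168.178.92  raspberrypi.fritz.box raspberrypi
EOF

hosts_lookup raspberrypi "$sample"            # address for the short name
hosts_lookup raspberrypi.fritz.box "$sample"  # address for the long name
rm -f "$sample"
```

On the live system, getent hosts raspberrypi shows what the resolver actually returns for the name.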
Unknown Host.
qmgr: cannot connect to server node
Unknown Host.
qmgr: cannot connect to server raspberrypi”
Is there really a ” at the end of the last line from qmgr? If so, you somehow introduced a special character into the configuration somewhere. Since you're just starting, it might be good to reinitialize PBS and start fresh. I haven't done this in a while, but I think the command is
Ignore that last part about re-initializing. The problem is that this bulletin board system converts computer quotes into text quotes unless you tell it to leave them alone. Thus, if someone asks you to type:
qmgr -c "s foo bar"
The board converts that to
qmgr -c “s foo bar”
If you copy/paste that directly from the web, it tells qmgr to try to connect to server bar”.
So, go back and try the commands again, copying them manually and using only computer quotes ".
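If retyping is error-prone, pasted text can also be cleaned mechanically. The following is a minimal sketch, assuming the text is UTF-8 and that only the four common typographic quote characters need replacing; fix_quotes is an illustrative name, not a PBS tool.

```shell
# Replace typographic quotes with plain ASCII quotes in text read from stdin.
fix_quotes() {
    sed -e 's/“/"/g' -e 's/”/"/g' \
        -e "s/‘/'/g" -e "s/’/'/g"
}

# Example: clean a command copied from a web page before running it.
printf '%s\n' 'qmgr -c “s foo bar”' | fix_quotes
# prints: qmgr -c "s foo bar"
```

It is still worth reading the cleaned command before running it, since other characters (dashes, for example) can be converted by web pages as well.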
After your suggestion I think I found the error. It was indeed a copy&paste error. Now the qmgr commands are working. I successfully created the node "raspberry" and the queue "testq". When I try to run a test job as root I get an UID error. I guess only normal users are permitted to run jobs because it worked with a normal user?
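For readers in the same situation, a single-host setup along these lines might look like the sketch below. The names raspberrypi and testq are taken from this thread as examples. The qmgr_cmd wrapper and the DRY_RUN variable are illustrative additions, not part of PBS: by default the script only prints the commands, and it runs them only when DRY_RUN=0 is set. The real commands need to be run as root or as a PBS manager.

```shell
# Sketch: register the local host as a node and create a default execution queue.
# DRY_RUN and qmgr_cmd() are illustrative helpers, not part of PBS.
DRY_RUN="${DRY_RUN:-1}"

qmgr_cmd() {
    if [ "$DRY_RUN" = "1" ]; then
        printf 'qmgr -c "%s"\n' "$1"   # only show what would be executed
    else
        qmgr -c "$1"
    fi
}

setup_single_host() {
    node="$1"; queue="$2"
    qmgr_cmd "create node $node"
    qmgr_cmd "create queue $queue queue_type=execution"
    qmgr_cmd "set queue $queue enabled=true"
    qmgr_cmd "set queue $queue started=true"
    qmgr_cmd "set server default_queue=$queue"
}

setup_single_host raspberrypi testq
```

Setting default_queue is what addresses the earlier "No default queue specified" message from qsub; without it, jobs have to name the queue explicitly with qsub -q.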
I recently installed OpenPBS on the machine in my workgroup, but I had to switch the OS to CentOS 8, so I installed the openpbs-server.rpm. The services are running, but I had to disable SELinux for that. The problem is that most commands do not work: I have to abort them with Ctrl-C in the terminal. For example, pbsnodes --version works but pbsnodes -aSjv does not, and qmgr does not work either. This happens regardless of whether I am logged in as root or as a normal user. I even changed the owner of /opt/pbs from root to the standard user via chown -R. What could be wrong here?
Also, you might have to run $PBS_EXEC/sbin/pbs_probe -fv (this might correct the permissions on the top-level PBS folders; the rest you would have to update yourself).
Hope this helps
I found the problem. I was using a non-login shell. The commands are working via SSH. Now pbs status shows me that pbs_server is not running (mom, sched and comm are running).
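One thing that may be worth checking at this point, offered as a sketch rather than a diagnosis, is whether /etc/pbs.conf is set to start the server at all. The helper name pbs_start_flags is made up for illustration; the PBS_START_* keys are the usual daemon start entries in pbs.conf.

```shell
# List the daemon start flags found in a pbs.conf-style file.
# Usage: pbs_start_flags [FILE]   (FILE defaults to /etc/pbs.conf)
pbs_start_flags() {
    grep -E '^PBS_START_(SERVER|SCHED|COMM|MOM)=' "${1:-/etc/pbs.conf}"
}

# A value of 0 for PBS_START_SERVER would mean the init script
# does not start pbs_server on this host.
# pbs_start_flags
```

If the flag is already 1, the most recent file under the server_logs directory is the next place to look for the reason the server did not come up.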