Hello Community,
We’re currently migrating our cloud-based HPC cluster, which uses the OpenPBS scheduler, from CentOS 7 to Ubuntu 20.04. We’re in bad shape, because maintenance support for Ubuntu 20.04 ends in May 2025 and OpenPBS does not support Ubuntu releases after 20.04, so moving away from OpenPBS would be a great challenge for us.
Is there any possibility of getting an OpenPBS version that supports Ubuntu 22.04 in 2025?
This is a very hot topic, and I think it is common to many of us.
Is there anyone who is working on this problem?
I’m facing this right now. I’ll try to compile from the source and let you know.
1
git clone https://github.com/openpbs/openpbs.git
2
apt install gcc make libtool libhwloc-dev libx11-dev \
  libxt-dev libedit-dev libical-dev ncurses-dev perl \
  postgresql-server-dev-all postgresql-contrib python3-dev tcl-dev tk-dev swig \
  libexpat-dev libssl-dev libxext-dev libxft-dev autoconf \
  automake g++ libcjson-dev
3
cd openpbs
./autogen.sh
./configure --prefix=/opt/pbs
make
sudo make install
4
sudo /opt/pbs/libexec/pbs_postinstall
5
edit /etc/pbs.conf and set PBS_START_MOM=1
chmod 4755 /opt/pbs/sbin/pbs_iff /opt/pbs/sbin/pbs_rcp
Attention: it will not work if the server is bound to 127.0.0.1. I had to create a host name that references the real IP address. You can check this with netstat -anp. If you see 127.0.0.1:15001 it will fail. If you see 0.0.0.0:15001 it may work.
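The netstat check above can be scripted. This is only a minimal sketch, not part of OpenPBS: classify_bind is a hypothetical helper name, and it assumes listener lines in the usual netstat or ss layout, where the local address appears as ADDRESS:PORT and the state column reads LISTEN. It only looks at IPv4 listeners.

```shell
# classify_bind PORT
# Reads netstat/ss output on stdin and reports how PORT is bound.
# Hypothetical helper for illustration; not an OpenPBS command.
classify_bind() {
    awk -v port="$1" '
        # Only listening sockets matter; established connections are skipped.
        /LISTEN/ {
            for (i = 1; i <= NF; i++) {
                if ($i == ("127.0.0.1:" port)) loop = 1
                if ($i == ("0.0.0.0:" port))   wild = 1
            }
        }
        END {
            if (wild)      print "all-interfaces"
            else if (loop) print "loopback-only"
            else           print "not-listening"
        }'
}

# Example usage (run as root so netstat can see every socket):
#   netstat -anp | classify_bind 15001
```

A result of loopback-only corresponds to the failing case described above, and all-interfaces to the case that may work.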
In Ubuntu 22.04 you will use sudo systemctl start pbs
instead of
sudo service pbs_server start
sudo service pbs_sched start
sudo service pbs_mom start
As root, you need to create the master PBS node.
Imagine that my hostname is gput4:
qmgr -c "create node gput4"
I did a dummy test and it worked. Now I’m figuring out how to add more nodes.
Thank you Julio. I get this error when running pbsnodes -a:
pbsnodes -a
Connection refused
pbsnodes: cannot connect to server ubuntu, error=15010
My server is named “ubuntu”. Everything in your instructions seemed to work, and I see pbs running in the systemctl output. Could this be something I can solve?
Please share the /etc/hosts file and the /etc/pbs.conf file of the PBS server.
Here is the /etc/hosts file:
127.0.1.1 ubuntu
127.0.0.1 localhost
::1 localhost ip6-localhost ip6-loopback
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
And here is the /etc/pbs.conf file:
PBS_SERVER=ubuntu
PBS_START_SERVER=1
PBS_START_SCHED=1
PBS_START_COMM=1
PBS_START_MOM=1
PBS_EXEC=/opt/pbs
PBS_HOME=/var/spool/pbs
PBS_CORE_LIMIT=unlimited
PBS_SCP=/usr/bin/scp
Thank you @dholler
This is the problem: in /etc/hosts the hostname is mapped to a loopback address (127.0.1.1). Replace that loopback address with the static IP address of the host.
Please follow this: Job stack in queue after fresh install | Permission error 15008
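To spot this condition quickly, the hosts file can be scanned for loopback mappings of the server name. The following is a rough sketch under stated assumptions: loopback_entries is a hypothetical helper, not a PBS tool, and it treats any 127.x.x.x address or ::1 as loopback.

```shell
# loopback_entries HOSTS_FILE NAME
# Prints every loopback address that HOSTS_FILE maps NAME to.
# No output means NAME has no loopback mapping in that file.
# Hypothetical helper for illustration; not an OpenPBS command.
loopback_entries() {
    awk -v name="$2" '
        {
            sub(/#.*/, "")              # drop comments
            for (i = 2; i <= NF; i++)   # field 1 is the address, the rest are names
                if ($i == name && ($1 ~ /^127\./ || $1 == "::1"))
                    print $1
        }' "$1"
}

# Example usage:
#   loopback_entries /etc/hosts ubuntu
```

With the /etc/hosts file posted above, this prints 127.0.1.1 for the name ubuntu, which is the entry that needs to be replaced.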
Thank you @adarsh. After following the other thread, I get the correct output for this:
pbs_hostn -v ubuntu
primary name: ubuntu (from gethostbyname())
aliases: -none-
address length: 4 bytes
address: (17043628 dec) name: ubuntu
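The address line prints the IPv4 address as a single decimal number. A small sketch for turning it back into dotted-quad form, assuming the number is the raw 32-bit address as stored on a little-endian machine, so that the lowest byte is the first octet; dec_to_ipv4 is a hypothetical helper, not a PBS command.

```shell
# dec_to_ipv4 N
# Prints N as a dotted quad, taking the lowest byte as the first octet.
# Assumption: N is the in-memory 32-bit address read on a little-endian host.
dec_to_ipv4() {
    n="$1"
    printf '%d.%d.%d.%d\n' \
        $(( n & 255 )) \
        $(( (n >> 8) & 255 )) \
        $(( (n >> 16) & 255 )) \
        $(( (n >> 24) & 255 ))
}

# Example usage:
#   dec_to_ipv4 17043628
```

For 17043628 this gives 172.16.4.1, a private address rather than a loopback one, which is what the /etc/hosts change was meant to achieve.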
However, I still get this error with pbsnodes:
pbsnodes -a
Connection refused
pbsnodes: cannot connect to server ubuntu, error=15010
My Ubuntu firewall is inactive and here is my updated /etc/hosts file:
cat /etc/hosts
ubuntu
localhost
I removed the static IP from these command outputs, but it indeed shows up in the terminal. Do I need to adjust settings for port 15010?
Please read this document.
Your /etc/hosts needs to be updated.
For example, if 192.168.65.190 (servernode), 192.168.65.191 (compute1), and 192.168.65.192 (compute2) are the static IPs associated with the respective hostnames:
pbsdata@servernode:~$ cat /etc/hosts
127.0.0.1 localhost
192.168.65.190 servernode
192.168.65.191 compute1
192.168.65.192 compute2
# The following lines are desirable for IPv6 capable hosts
::1 ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
Thank you, I will read the document you sent. I am using one server node as my compute node. I understand how the IPs normally point to other nodes, but I am not sure yet what else I need to do for my machine.
You would like to run everything on one node: PBS server, scheduler, and MoM.
Then your configuration in /etc/pbs.conf is correct.
Make sure that ports 15001 to 15009, 17001, and 22 are open for communication within that node (or you can disable the firewall if the node is not in a DMZ).
Make sure your host “ubuntu” is given a static IP and that it is listed in /etc/hosts.
Restart the PBS services once these changes are made. If there are any issues, share a screenshot and the logs from the /var/spool/pbs folder.
Thank you @adarsh. I opened those ports and restarted PBS, but no luck yet. Here is the /var/spool/pbs/server_logs/20241211 file:
cat /var/spool/pbs/server_logs/20241211
12/11/2024 09:14:32;0002;Server@ubuntu;Svr;Log;Log opened
12/11/2024 09:14:32;0002;Server@ubuntu;Svr;Server@ubuntu;pbs_version=23.06.06
12/11/2024 09:14:32;0002;Server@ubuntu;Svr;Server@ubuntu;pbs_build=mach=N/A:security=N/A:configure_args=N/A
12/11/2024 09:14:32;0002;Server@ubuntu;Svr;Server@ubuntu;hostname=ubuntu;pbs_leaf_name=N/A;pbs_mom_node_name=N/A
12/11/2024 09:14:32;0002;Server@ubuntu;Svr;Server@ubuntu;ipv4 interface lo: localhost
12/11/2024 09:14:32;0002;Server@ubuntu;Svr;Server@ubuntu;ipv4 interface enp5s0: ubuntu
12/11/2024 09:14:32;0002;Server@ubuntu;Svr;Server@ubuntu;ipv6 interface lo: localhost
12/11/2024 09:14:32;0002;Server@ubuntu;Svr;Server@ubuntu;ipv6 interface enp5s0: ubuntu
12/11/2024 09:14:32;0006;Server@ubuntu;Fil;Server@ubuntu;Version 23.06.06, started, initialization type = 1
12/11/2024 09:14:32;0002;Server@ubuntu;Svr;Server@ubuntu;pbs_status_db exit code 1
12/11/2024 09:14:32;0002;Server@ubuntu;Svr;Server@ubuntu;Starting PBS dataservice
12/11/2024 09:14:35;0002;Server@ubuntu;Svr;Server@ubuntu;connected to PBS dataservice@ubuntu
12/11/2024 09:14:35;0d80;Server@ubuntu;TPP;Server@ubuntu(Main Thread);TPP authentication method = resvport
12/11/2024 09:14:35;0c06;Server@ubuntu;TPP;Server@ubuntu(Main Thread);TPP leaf node names = 172.16.4.1:15001,127.0.0.1:15001,172.16.4.1:15001
12/11/2024 09:14:35;0d80;Server@ubuntu;TPP;Server@ubuntu(Main Thread);Initializing TPP transport Layer
12/11/2024 09:14:35;0d80;Server@ubuntu;TPP;Server@ubuntu(Main Thread);Max files allowed = 16384
12/11/2024 09:14:35;0d80;Server@ubuntu;TPP;Server@ubuntu(Main Thread);TPP initialization done
12/11/2024 09:14:35;0d80;Server@ubuntu;TPP;Server@ubuntu(Main Thread);Connecting to pbs_comm ubuntu:17001
12/11/2024 09:14:35;0c06;Server@ubuntu;TPP;Server@ubuntu(Thread 0);Thread ready
12/11/2024 09:14:35;0002;Server@ubuntu;n/a;setup_env;read environment from /var/spool/pbs/pbs_environment
12/11/2024 09:14:35;0000;Server@ubuntu;Svr;Server@ubuntu;Supported authentication method: resvport
12/11/2024 09:14:35;0c06;Server@ubuntu;TPP;Server@ubuntu(Thread 0);tpp_mbox_read;Unable to read from msg box
12/11/2024 09:14:35;0c06;Server@ubuntu;TPP;Server@ubuntu(Thread 0);tpp_mbox_read;Unable to read from msg box
12/11/2024 09:14:35;0c06;Server@ubuntu;TPP;Server@ubuntu(Thread 0);tpp_mbox_read;Unable to read from msg box
12/11/2024 09:14:35;0c06;Server@ubuntu;TPP;Server@ubuntu(Thread 0);Registering address 172.16.4.1:15001 to pbs_comm ubuntu:17001
12/11/2024 09:14:35;0c06;Server@ubuntu;TPP;Server@ubuntu(Thread 0);Connected to pbs_comm ubuntu:17001
12/11/2024 09:14:35;0c06;Server@ubuntu;TPP;Server@ubuntu(Thread 0);tpp_mbox_read;Unable to read from msg box
12/11/2024 09:14:35;0c06;Server@ubuntu;TPP;Server@ubuntu(Thread 0);tpp_mbox_read;Unable to read from msg box
12/11/2024 09:14:35;0002;Server@ubuntu;Svr;Server@ubuntu;Stopping PBS dataservice
12/11/2024 09:14:36;0c06;Server@ubuntu;TPP;Server@ubuntu(Thread 0);tpp_mbox_read;Unable to read from msg box
12/11/2024 09:14:36;0c06;Server@ubuntu;TPP;Server@ubuntu(Thread 0);tpp_mbox_read;Unable to read from msg box
And here is the /var/spool/pbs/comm_logs/20241211 file:
12/11/2024 09:14:32;0002;Comm@ubuntu;Svr;Log;Log opened
12/11/2024 09:14:32;0002;Comm@ubuntu;Svr;Comm@ubuntu;pbs_version=23.06.06
12/11/2024 09:14:32;0002;Comm@ubuntu;Svr;Comm@ubuntu;pbs_build=mach=N/A:security=N/A:configure_args=N/A
12/11/2024 09:14:32;0002;Comm@ubuntu;Svr;Comm@ubuntu;hostname=ubuntu;pbs_leaf_name=N/A;pbs_mom_node_name=N/A
12/11/2024 09:14:32;0002;Comm@ubuntu;Svr;Comm@ubuntu;ipv4 interface lo: localhost
12/11/2024 09:14:32;0002;Comm@ubuntu;Svr;Comm@ubuntu;ipv4 interface enp5s0: ubuntu
12/11/2024 09:14:32;0002;Comm@ubuntu;Svr;Comm@ubuntu;ipv6 interface lo: localhost
12/11/2024 09:14:32;0002;Comm@ubuntu;Svr;Comm@ubuntu;ipv6 interface enp5s0: ubuntu
12/11/2024 09:14:32;0002;Comm@ubuntu;Svr;Comm@ubuntu;/opt/pbs/sbin/pbs_comm ready (pid=2173384), Proxy Name:ubuntu:17001, Threads:4
12/11/2024 09:14:32;0000;Comm@ubuntu;Svr;Comm@ubuntu;Supported authentication method: resvport
12/11/2024 09:14:32;0c06;Comm@ubuntu;TPP;Comm@ubuntu(Thread 1);Thread ready
12/11/2024 09:14:32;0c06;Comm@ubuntu;TPP;Comm@ubuntu(Thread 0);Thread ready
12/11/2024 09:14:32;0c06;Comm@ubuntu;TPP;Comm@ubuntu(Thread 3);Thread ready
12/11/2024 09:14:32;0c06;Comm@ubuntu;TPP;Comm@ubuntu(Thread 2);Thread ready
12/11/2024 09:14:32;0c06;Comm@ubuntu;TPP;Comm@ubuntu(Thread 0);tpp_mbox_read;Unable to read from msg box
12/11/2024 09:14:32;0c06;Comm@ubuntu;TPP;Comm@ubuntu(Thread 0);tpp_mbox_read;Unable to read from msg box
12/11/2024 09:14:32;0c06;Comm@ubuntu;TPP;Comm@ubuntu(Thread 1);tpp_mbox_read;Unable to read from msg box
12/11/2024 09:14:32;0c06;Comm@ubuntu;TPP;Comm@ubuntu(Thread 1);tpp_mbox_read;Unable to read from msg box
12/11/2024 09:14:32;0c06;Comm@ubuntu;TPP;Comm@ubuntu(Thread 1);tpp_mbox_read;Unable to read from msg box
12/11/2024 09:14:32;0c06;Comm@ubuntu;TPP;Comm@ubuntu(Thread 1);tfd=14, Leaf registered address 172.16.4.1:15003
12/11/2024 09:14:34;0c06;Comm@ubuntu;TPP;Comm@ubuntu(Thread 1);tpp_mbox_read;Unable to read from msg box
12/11/2024 09:14:34;0c06;Comm@ubuntu;TPP;Comm@ubuntu(Thread 1);tpp_mbox_read;Unable to read from msg box
12/11/2024 09:14:34;0c06;Comm@ubuntu;TPP;Comm@ubuntu(Thread 1);tpp_mbox_read;Unable to read from msg box
12/11/2024 09:14:35;0c06;Comm@ubuntu;TPP;Comm@ubuntu(Thread 0);tpp_mbox_read;Unable to read from msg box
12/11/2024 09:14:35;0c06;Comm@ubuntu;TPP;Comm@ubuntu(Thread 0);tpp_mbox_read;Unable to read from msg box
12/11/2024 09:14:35;0c06;Comm@ubuntu;TPP;Comm@ubuntu(Thread 2);tpp_mbox_read;Unable to read from msg box
12/11/2024 09:14:35;0c06;Comm@ubuntu;TPP;Comm@ubuntu(Thread 2);tpp_mbox_read;Unable to read from msg box
12/11/2024 09:14:35;0c06;Comm@ubuntu;TPP;Comm@ubuntu(Thread 2);tpp_mbox_read;Unable to read from msg box
12/11/2024 09:14:35;0c06;Comm@ubuntu;TPP;Comm@ubuntu(Thread 2);tfd=16, Leaf registered address 172.16.4.1:15001
12/11/2024 09:14:36;0c06;Comm@ubuntu;TPP;Comm@ubuntu(Thread 1);tpp_mbox_read;Unable to read from msg box
12/11/2024 09:14:36;0c06;Comm@ubuntu;TPP;Comm@ubuntu(Thread 1);tpp_mbox_read;Unable to read from msg box
12/11/2024 09:14:36;0c06;Comm@ubuntu;TPP;Comm@ubuntu(Thread 1);tpp_mbox_read;Unable to read from msg box
12/11/2024 09:14:36;0c06;Comm@ubuntu;TPP;Comm@ubuntu(Thread 1);tpp_mbox_read;Unable to read from msg box
12/11/2024 09:14:36;0c06;Comm@ubuntu;TPP;Comm@ubuntu(Thread 2);tpp_mbox_read;Unable to read from msg box
12/11/2024 09:14:37;0c06;Comm@ubuntu;TPP;Comm@ubuntu(Thread 2);tpp_mbox_read;Unable to read from msg box
12/11/2024 09:14:37;0c06;Comm@ubuntu;TPP;Comm@ubuntu(Thread 2);tpp_mbox_read;Unable to read from msg box
12/11/2024 09:14:37;0c06;Comm@ubuntu;TPP;Comm@ubuntu(Thread 2);tfd=16, Connection from leaf 172.16.4.1:15001 down
12/11/2024 09:14:37;0c06;Comm@ubuntu;TPP;Comm@ubuntu(Thread 1);tpp_mbox_read;Unable to read from msg box
The last line is repeated every 10 seconds, so I didn’t post all those repeats. Similar errors show up in the /var/spool/pbs/mom_logs/20241211 file:
12/11/2024 09:21:18;0c06;pbs_mom;TPP;pbs_mom(Thread 0);tpp_mbox_read;Unable to read from msg box
12/11/2024 09:21:36;0002;pbs_mom;Svr;pbs_mom;HELLO sent to server at ubuntu:15001, stream:42
12/11/2024 09:21:36;0c06;pbs_mom;TPP;pbs_mom(Thread 0);tpp_mbox_read;Unable to read from msg box
12/11/2024 09:21:36;0c06;pbs_mom;TPP;pbs_mom(Thread 0);tpp_mbox_read;Unable to read from msg box
12/11/2024 09:21:36;0c06;pbs_mom;TPP;pbs_mom(Thread 0);tpp_mbox_read;Unable to read from msg box
12/11/2024 09:21:36;0001;pbs_mom;Svr;pbs_mom;im_eof, Premature end of message from addr 172.16.4.1:15001 on stream 42
12/11/2024 09:21:36;0002;pbs_mom;Svr;im_eof;Server closed connection.
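Most of the lines in these logs are the repeated tpp_mbox_read message, which makes the other events hard to see. A minimal sketch for filtering them out when reading a log; pbs_log_signal is a hypothetical helper name, and the match string is taken verbatim from the logs above.

```shell
# pbs_log_signal
# Reads a PBS log on stdin and drops the repeated tpp_mbox_read lines,
# leaving the remaining events (registrations, disconnects, errors).
# Hypothetical helper for illustration; not an OpenPBS command.
pbs_log_signal() {
    grep -v 'tpp_mbox_read;Unable to read from msg box'
}

# Example usage:
#   pbs_log_signal < /var/spool/pbs/mom_logs/20241211
```

Applied to the MoM log excerpt above, this leaves the HELLO line, the im_eof "Premature end of message" line, and the "Server closed connection." line.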
I found this thread that says the error was addressed. I will reach out there too.