Is there any news on the next release?

Hello, first of all thanks for the great software that OpenPBS is and the way it is provided free of charge.

I would like to know if there are any plans for the next release. There is nothing under milestones on the GitHub page, so I’m curious: I’m interested in NVIDIA DGX support (with MIG), which right now is only available on master.

We can always compile from master, but since the last release was almost a year ago, I’m trying to wait for the next release version instead of compiling it by myself.

Thanks.

Thank you and welcome!

The team has had several conversations about the “next release”, but nothing has been decided. Current thinking is that nothing would happen until at least early Fall, but, again, discussions are ongoing. I suggest building from master if you need the latest features.

Let us know how it goes!


Hi @billnitzberg thanks for the reply. I’ve already downloaded the source and was able to easily compile OpenPBS. It’s up and running now.

But now I have a question. I did this on the “headnode” machine, but now I want to push the binaries to my compute nodes. It does not seem reasonable to run ./configure, make and make install on each compute node.

Is there any script available to generate .rpm files from the compiled version? If not, I’m thinking of copying just the pbs_mom binaries to the compute nodes. Is there any advice on doing so? Anything I should look out for?

Thank you.

Using RPMs is probably the best way – I thought there were instructions on how to build those, but don’t see them in the INSTALL. (Maybe there is a different way now.)

Anyone else know if there are RPM build instructions? (If not, then, yep… build on one MOM and copy from there.)

Well, a quick search on the Contributor’s Portal (openpbs.atlassian.net) revealed:

https://openpbs.atlassian.net/wiki/spaces/PBSPro/pages/13991940/Building+OpenPBS+Using+rpmbuild

Aha! That was good. Thanks @billnitzberg

I’ll try it and report back. I had thought that Confluence and Jira were deprecated for OpenPBS.

Hello @billnitzberg, I was able to successfully build the RPM packages. I’m running some tests right now to check that everything is right.

But as a disclaimer, I had to change the openpbs.spec file to properly generate the RPM files without debug package ID. To do it I added this line just before %install: %define debug_package %{nil}
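
A minimal sketch of that spec change, assuming a POSIX awk is available. The function name disable_debug_package and the output file name in the usage comment are hypothetical; only the inserted line and its position (just before %install) come from the post above.

```sh
# Hypothetical helper: print a spec file with the debug package disabled.
# It inserts the %define line once, immediately before the first %install line.
disable_debug_package() {
    awk '
        /^%install/ && !done { print "%define debug_package %{nil}"; done = 1 }
        { print }
    ' "$1"
}

# Usage (file names are assumptions):
#   disable_debug_package openpbs.spec > openpbs.spec.new
```

Writing to a new file rather than editing in place keeps the original spec around for comparison before building.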

So that’s it. I’ll start testing right now.
Thanks!

Well let’s start the debugging.

The execution node cannot connect to the server. Looking at the logs, it seems that pbs_sched is not starting up. Not sure why, because if I try to start the sched by hand it complains that another scheduler is running:

[root@headnode ~]# /opt/pbs/sbin/pbs_sched
pbs_sched: Resource temporarily unavailable (11) in pbs_sched, another scheduler running
pbs_sched: another scheduler running
[root@headnode ~]# ss -tlpn | grep pbs
LISTEN 0 1000 0.0.0.0:17001 0.0.0.0:* users:(("pbs_comm",pid=1389382,fd=15))
LISTEN 0 256 0.0.0.0:15001 0.0.0.0:* users:(("pbs_server.bin",pid=1389601,fd=9))

But I can’t see anything attached as scheduler.

On the other hand, on a given client, pbs_mom fails to register itself:

06/06/2021 23:58:05;0c06;pbs_mom;TPP;pbs_mom(Thread 0);Connection to pbs_comm headnode:17001 down
06/06/2021 23:58:05;0001;pbs_mom;Svr;net_down_handler;net down handler called
06/06/2021 23:58:17;0c06;pbs_mom;TPP;pbs_mom(Thread 0);Registering address 172.26.255.254:15003 to pbs_comm headnode:17001
06/06/2021 23:58:17;0c06;pbs_mom;TPP;pbs_mom(Thread 0);Connected to pbs_comm headnode:17001
06/06/2021 23:58:17;0001;pbs_mom;Svr;net_restore_handler;net restore handler called
06/06/2021 23:58:35;0002;pbs_mom;Svr;pbs_mom;HELLO sent to server at headnode:15001, stream:74
06/06/2021 23:58:35;0001;pbs_mom;Svr;pbs_mom;im_eof, Premature end of message from addr 172.26.255.254:15001 on stream 74
06/06/2021 23:58:35;0002;pbs_mom;Svr;im_eof;Server closed connection.

Trying to guess what’s wrong right now.

EDIT: It appears that pbs_sched is running, but it does not open any ports:
[root@headnode ~]# ps ax | grep -i pbs
1394155 ? Ssl 0:00 /opt/pbs/sbin/pbs_comm
1394170 ? Ssl 0:00 /opt/pbs/sbin/pbs_sched
1394240 ? Ss 0:00 /opt/pbs/sbin/pbs_ds_monitor monitor
1394276 ? S 0:00 /usr/bin/postgres -D /var/spool/pbs/datastore -p 15007
1394331 ? Ss 0:00 postgres: postgres pbs_datastore 172.26.255.254(50348) idle
1394332 ? Ssl 0:00 /opt/pbs/sbin/pbs_server.bin
1394348 pts/75 S+ 0:00 grep --color=auto -i pbs

  1. Make sure ports 15001 to 15007 and 17001 are not blocked, SELinux is disabled, and the system has been rebooted after disabling SELinux.
  2. Make sure /etc/hosts is populated correctly (DNS is properly configured for forward and reverse address resolution) on the headnode and across all the compute nodes.
    Note: a static IP is a must, and it should resolve, forward and reverse, to the correct hostname.
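
As a rough way to check the port side of this list, here is a small sketch that reads `ss -tln` style output on stdin and prints every port from its arguments that has no listener. The function name missing_listeners is hypothetical, and which ports to pass is up to you; the thread only mentions the 15001 to 15007 range and 17001.

```sh
# Hypothetical helper: read socket listing text on stdin and print each
# port given as an argument for which no LISTEN line is found.
missing_listeners() {
    ss_output=$(cat)
    for port in "$@"; do
        printf '%s\n' "$ss_output" |
            grep -Eq "LISTEN.*:${port}[[:space:]]" || echo "$port"
    done
}

# Usage (port list is an assumption):
#   ss -tln | missing_listeners 15001 15002 15003 17001
```

A port showing up as missing only means nothing is bound to it locally; it says nothing about whether a firewall sits between two hosts.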

Starting and Stopping of all pbs services:
systemctl start pbs
systemctl stop pbs
systemctl status pbs
[ This will start all the services enabled (set to 1) in /etc/pbs.conf; we do not have to start each service explicitly. ]

Probably best to stop the pbs services first and make sure there are no stray processes left; if there are, kill them. I hope you are not running two different WLMs on the same server; if you are, you need to uninstall or disable the other one.
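
To spot stray processes after a stop, something like the following sketch can filter `ps ax` output down to the PBS daemons. It assumes the five-column `ps ax` layout seen elsewhere in this thread (PID, TTY, STAT, TIME, COMMAND); the function name pbs_daemons is hypothetical, and the daemon names are the ones that appear in the listings posted here.

```sh
# Hypothetical helper: read `ps ax` output on stdin and print the PID and
# command of each PBS daemon, ignoring grep, postgres and other processes.
pbs_daemons() {
    awk '$5 ~ /\/pbs_(server\.bin|sched|comm|mom|ds_monitor)$/ { print $1, $5 }'
}

# Usage:
#   systemctl stop pbs
#   ps ax | pbs_daemons    # anything printed here survived the stop
```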

Please check the output of qstat -Bf and pbsnodes -aSjv

Hi @adarsh thanks for helping out.

SELinux is disabled on the machine and it has been rebooted, so it’s not in permissive mode. The firewall is not blocking anything either. All the services start and open the right ports; only pbs_sched fails to do so:

[root@headnode ~]# ss -tlpn | grep pbs
LISTEN 0 1000 0.0.0.0:17001 0.0.0.0:* users:(("pbs_comm",pid=1394155,fd=15))
LISTEN 0 256 0.0.0.0:15001 0.0.0.0:* users:(("pbs_server.bin",pid=1394332,fd=9))

But pbs_sched is in fact running:
[root@headnode ~]# ps ax | grep -i pbs_sched
1394170 ? Ssl 0:00 /opt/pbs/sbin/pbs_sched
1921813 pts/49 S+ 0:00 grep --color=auto -i pbs_sched

DNS is fully working too: everything is populated in /etc/hosts and xCAT’s named is working. I can resolve everything: hostname (short), fully qualified domain name and reverse IP lookups.

So basically that’s it: the nodes can’t register with the server either.

Here’s the output that you requested:
[root@headnode ~]# qstat -Bf
Server: headnode.domain.tld
server_state = Active
server_host = headnode
scheduling = True
total_jobs = 0
state_count = Transit:0 Queued:0 Held:0 Waiting:0 Running:0 Exiting:0 Begun:0
default_queue = workq
log_events = 511
mailer = /usr/sbin/sendmail
mail_from = adm
query_other_jobs = True
resources_default.ncpus = 1
resources_default.place = scatter
default_chunk.ncpus = 1
scheduler_iteration = 600
resv_enable = True
node_fail_requeue = 310
max_array_size = 10000
default_qsub_arguments = -V
pbs_license_min = 0
pbs_license_max = 2147483647
pbs_license_linger_time = 31536000
license_count = Avail_Global:1000000 Avail_Local:1000000 Used:0 High_Use:0
pbs_version = 20.0.0
eligible_time_enable = False
job_history_enable = True
max_concurrent_provision = 5
power_provisioning = False
max_job_sequence_id = 9999999

[root@headnode ~]# pbsnodes -aSjv
                                                      mem          ncpus    nmics  ngpus
vnode                state          njobs  run  susp  f/t          f/t      f/t    f/t    jobs
n01.domain.tld       state-unknown  0      0    0     0kb/0kb      0/0      0/0    0/0    --

[root@headnode ~]# cat /etc/pbs.conf

PBS_SERVER=headnode.domain.tld
PBS_START_SERVER=1
PBS_START_SCHED=1
PBS_START_COMM=1
PBS_START_MOM=0
PBS_EXEC=/opt/pbs
PBS_HOME=/var/spool/pbs
PBS_CORE_LIMIT=unlimited
PBS_SCP=/usr/bin/scp

Could this be a bug on master?

State-unknown has this reason: Node is not usable. Since server’s latest start, no communication
with this vnode. May be network or hardware problem, or no MoM on vnode

PBS Server:
cat /etc/hosts | grep -e headnode -e n01
pbs_hostn -v headnode
pbs_hostn -v n01

n01:
cat /etc/hosts | grep -e headnode -e n01
pbs_hostn -v headnode
pbs_hostn -v n01

Between PBS Server and n01, ports 15001 to 15007 and 17001 should be open for communication.
I hope these systems do not have multiple network adapters pointing to different IPs / hostnames.

Can you try deleting the node and adding it again?
qmgr : d n n01
qmgr : c n n01 Mom=n01.domain.tld or c n n01

Hi @adarsh thanks again.

Yes, this is definitely a communication problem, but something is off. The node can connect to the headnode, but the connection is ended prematurely. I tried attaching strace to pbs_mom: it tries to talk to the server but fails immediately.

To narrow things down I’ve enabled pbs_mom on the headnode, so that the network is not involved, and the same issue happens. So here is the requested output:

on the headnode

[root@headnode ~]# cat /etc/hosts | grep -e headnode -e n01
172.26.255.254 headnode headnode.domain.tld
172.26.0.1 n01 n01.domain.tld
172.27.0.1 n01-ib0 n01-ib0.domain.tld

[root@headnode ~]# pbs_hostn -v headnode
primary name: headnode (from gethostbyname())
aliases: headnode.domain.tld
address length: 4 bytes
address: 172.26.255.254 (4278131372 dec) name: headnode

[root@headnode ~]# pbs_hostn -v n01
primary name: n01 (from gethostbyname())
aliases: n01.domain.tld
address length: 4 bytes
address: 172.26.0.1 (16784044 dec) name: n01

on the compute node

[root@n01 ~]# cat /etc/hosts | grep -e headnode -e n01

[root@n01 ~]# pbs_hostn -v headnode
primary name: headnode.domain.tld (from gethostbyname())
aliases: -none-
address length: 4 bytes
address: 172.26.255.254 (4278131372 dec) name: headnode.domain.tld

[root@n01 ~]# pbs_hostn -v n01
primary name: n01.domain.tld (from gethostbyname())
aliases: -none-
address length: 4 bytes
address: 172.26.0.1 (16784044 dec) name: n01.domain.tld
[root@n01 ~]#

Observe that /etc/hosts on the compute node does not have the required info because it’s provided by DNS:

[root@n01 ~]# nslookup headnode
Server: 172.26.255.254
Address: 172.26.255.254#53

Name: headnode.domain.tld
Address: 172.26.255.254

[root@n01 ~]# nslookup headnode.domain.tld
Server: 172.26.255.254
Address: 172.26.255.254#53

Name: headnode.domain.tld
Address: 172.26.255.254

[root@n01 ~]# nslookup 172.26.255.254
254.255.26.172.IN-ADDR.ARPA name = headnode.domain.tld.
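
One thing the pbs_hostn outputs above make easy to compare is the primary name each host reports for the same address: the headnode reports "headnode", while n01 reports "headnode.domain.tld". Whether that difference matters for this problem is not established in this thread. The sketch below just extracts the primary name so the two sides can be compared mechanically; the function name primary_name is hypothetical, and it assumes the "primary name:" line format shown above.

```sh
# Hypothetical helper: read `pbs_hostn -v` output on stdin and print only
# the primary name it reports.
primary_name() {
    sed -n 's/^primary name: *\([^ ]*\).*/\1/p'
}

# Usage, run on each host and compare the results by eye or with diff:
#   pbs_hostn -v headnode | primary_name
#   pbs_hostn -v n01 | primary_name
```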

For the last experiment, with MOM on the headnode, the service started but stays in a non-functioning state.

[root@headnode ~]# ps ax | grep -i pbs
2588 ? Ssl 0:00 /opt/pbs/sbin/pbs_comm
2611 ? Ssl 0:00 /opt/pbs/sbin/pbs_mom
2709 ? Ssl 0:00 /opt/pbs/sbin/pbs_sched
3216 ? Ss 0:00 /opt/pbs/sbin/pbs_ds_monitor monitor
3270 ? S 0:00 /usr/bin/postgres -D /var/spool/pbs/datastore -p 15007
3447 ? Ss 0:00 postgres: postgres pbs_datastore 172.26.255.254(57626) idle
3564 ? Ssl 0:00 /opt/pbs/sbin/pbs_server.bin
15578 pts/77 S+ 0:00 grep --color=auto -i pbs
[root@headnode ~]# ss -tlpn | grep -i pbs
LISTEN 0 1000 0.0.0.0:17001 0.0.0.0:* users:(("pbs_comm",pid=2588,fd=15))
LISTEN 0 256 0.0.0.0:15001 0.0.0.0:* users:(("pbs_server.bin",pid=3564,fd=9))
LISTEN 0 256 0.0.0.0:15002 0.0.0.0:* users:(("pbs_mom",pid=2611,fd=7))
LISTEN 0 256 0.0.0.0:15003 0.0.0.0:* users:(("pbs_mom",pid=2611,fd=8))

PS: I was reading about PTL. I think I will recompile OpenPBS with PTL enabled so we can at least see what’s wrong. Or is this a bad idea?


PBS_START_MOM=0 is set in /etc/pbs.conf; it should be set to 1 if you want pbs_mom on the headnode to start and stop along with the other pbs services.

You would also need to add the headnode as a compute node using qmgr:
qmgr : create node headnode

and then run pbsnodes -av or pbsnodes -aSjv
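
To see at a glance which daemons a given pbs.conf enables, a sketch like this can help. It assumes the PBS_START_*=1 convention shown in the /etc/pbs.conf posted earlier in the thread; the function name enabled_daemons is hypothetical.

```sh
# Hypothetical helper: print the name of each daemon whose PBS_START_* entry
# is set to 1 in the given pbs.conf (SERVER, SCHED, COMM, MOM).
enabled_daemons() {
    sed -n 's/^PBS_START_\([A-Z]*\)=1[[:space:]]*$/\1/p' "$1"
}

# Usage:
#   enabled_daemons /etc/pbs.conf
```

If MOM is not in the output on a host that is supposed to run jobs, the pbs service will not start pbs_mom there.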

Hi @adarsh, I’ve already done that. And as expected, it didn’t work either.

I really believe now that the software is broken. PBS says that the headnode is down, and it seems to have picked up the compute node’s information: the system RAM and ncpus on the headnode are different, but it is shown with the data from the compute node:

[root@headnode ~]# pbsnodes -aSjv
                                                      mem          ncpus    nmics  ngpus
vnode                state          njobs  run  susp  f/t          f/t      f/t    f/t    jobs
adano01.domain.tld   state-unknown  0      0    0     0kb/0kb      0/0      0/0    0/0    --
headnode.domain.tld  down           0      0    0     503gb/503gb  256/256  0/0    0/0    --

And no, pbs_mom is not down; it’s running.

I’m considering installing the PTL release, is there any test that you could recommend to check if there’s something broken within the binaries?

Thanks.

Thank you @ferrao. I would say a re-installation would be useful.