Proper way to configure PBS on multiple NIC system

runapp · March 14, 2019, 9:01am

I’ve always been confused by all the hostname settings, and today, tired of endless trial and error, I decided to dig into part of the source code.
I suppose it’s common to see a cluster running PBS where the master node has two NICs, one for the external network and one for intra-cluster communication. Let’s say the cluster is named “HPC”. It then seems reasonable to configure hosts and the other files as follows:

#hosts
127.0.0.1   localhost
1.2.3.4     HPC.your.domain.name           HPC
10.0.0.1   node0.local node0   node-mgmt

#hosts.equiv
HPC
HPC.your.domain.name
node0
node-mgmt

#hostname
HPC.your.domain.name
#$(hostname) -> HPC
#$(hostname -f) -> HPC.your.domain.name

And then set the pbs.conf on the master node like this:

#pbs.conf on master node
PBS_SERVER=HPC
PBS_LEAF_NAME=node0

On the execution node, pbs.conf looks like:

#pbs.conf on execution node
PBS_SERVER=node0
PBS_LEAF_NAME=node1

However, this configuration causes some commands to return PBS internal errors. After checking the source code, it seems that client-side commands, such as qrun XXX, connect to the server from the external address, 1.2.3.4, rather than from 127.0.0.1 or 10.0.0.1. But pbs_sched, which actually receives the request, only accepts the hostnames localhost and node0 (the latter from PBS_LEAF_NAME), so it drops the connection as unauthorized.
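The mismatch described above can be reproduced without PBS. The following is a minimal sketch, assuming the scheduler's check amounts to "reverse-resolve the peer address and compare its names against an allow-list"; that is a reading of this post, not a confirmed description of the pbs_sched source. The helper names (`parse_hosts`, `address_of`, `is_authorized`) are hypothetical, and the hosts data is the example from this post.

```python
# Hosts entries copied from the example configuration in this thread.
HOSTS = """\
127.0.0.1   localhost
1.2.3.4     HPC.your.domain.name           HPC
10.0.0.1    node0.local node0   node-mgmt
"""

def parse_hosts(text):
    """Return {address: [names...]} from hosts-file text, skipping comments."""
    table = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        address, *names = line.split()
        table.setdefault(address, []).extend(names)
    return table

def address_of(name, table):
    """Forward lookup: first address whose entry lists `name` (case-insensitive)."""
    for address, names in table.items():
        if name.lower() in (n.lower() for n in names):
            return address
    return None

def is_authorized(peer_address, allowed_names, table):
    """Assumed check: at least one name of the peer address is in the allow-list."""
    names = {n.lower() for n in table.get(peer_address, [])}
    return bool(names & {a.lower() for a in allowed_names})

table = parse_hosts(HOSTS)
allowed = ["localhost", "node0"]      # names the post says pbs_sched accepts

# PBS_SERVER=HPC resolves to the external address, which maps back to no allowed name.
peer = address_of("HPC", table)
print(peer, is_authorized(peer, allowed, table))            # 1.2.3.4 False

# The internal address maps back to node0, which is allowed.
internal = address_of("node0", table)
print(internal, is_authorized(internal, allowed, table))    # 10.0.0.1 True
```

Under these assumptions, any connection that arrives from 1.2.3.4 is rejected, because neither HPC nor HPC.your.domain.name is in the accepted list.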

Besides breaking the client-side commands, this also prevents the scheduler from actually scheduling any jobs, because pbs_server is considered unauthorized as well. I’m not sure about this, since I haven’t dug into pbs_server's codebase. However, the logs in the sched_logs folder do contain ‘pbs_sched: badconn, node0 on port 661 unauthorized host’ every few minutes.

I really don’t understand why PBS is designed this way. I’ve read the admin book and didn’t find anything helpful. Are there any special considerations? Is my hostname configuration wrong? Or did I miss something? Any help will be appreciated.

adarsh · March 19, 2019, 1:57pm

Please try this:

Make sure /etc/hosts is well populated on all the nodes (DNS and reverse DNS are working fine).
qmgr -c "set server flatuid=true"

Edit $PBS_HOME/mom_priv/config (restart the pbs_mom service after updating this file):
$clienthost HPC
$clienthost node0

Create a file called clientfile on the PBS server/scheduler host in the location below:
/var/spool/pbs/sched_priv/clientfile

cat /var/spool/pbs/sched_priv/clientfile
$clienthost node0

Start the PBS scheduler as shown below, either manually or by updating the startup scripts:
/opt/pbs/sbin/pbs_sched -c /var/spool/pbs/sched_priv/clientfile
[If pbs_sched is already running, kill the pbs_sched daemon and start it manually as above.]
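
The first step above (forward and reverse lookups agreeing on every node) can be checked with a few lines of Python. This is only a sketch: `round_trip` is a hypothetical helper, and the resolver functions are passed as parameters so the check can be exercised with stand-ins. The defaults are the standard library's `socket.gethostbyname` and `socket.gethostbyaddr`.

```python
import socket

def round_trip(name, forward=socket.gethostbyname, reverse=socket.gethostbyaddr):
    """Resolve name -> address -> names; report whether `name` comes back.

    Returns (address, ok). Lookup failures raise OSError subclasses
    (socket.gaierror / socket.herror), which the caller should handle.
    """
    address = forward(name)
    primary, aliases, _addresses = reverse(address)
    known = {primary.lower(), *(alias.lower() for alias in aliases)}
    return address, name.lower() in known
```

Running `round_trip("HPC")` and `round_trip("node0")` on the server and on each execution node, and comparing the results, should show whether any node maps a name to an unexpected address or fails to map the address back to that name.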
