I’ve always been confused by hostname handling, and today, tired of endless trial and error, I decided to dig into part of the source code.
I suppose it’s common to see a cluster running PBS where the master node has two NICs: one for the external network and one for intra-cluster communication. Let’s say the cluster is named “HPC”. It then seems reasonable to configure hosts and the related files as follows:
#hosts
127.0.0.1 localhost
1.2.3.4 HPC.your.domain.name HPC
10.0.0.1 node0.local node0 node-mgmt
#hosts.equiv
HPC
HPC.your.domain.name
node0
node-mgmt
#hostname
HPC.your.domain.name
#$(hostname) -> HPC
#$(hostname -f) -> HPC.your.domain.name
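As a sanity check on the hosts file above, here is a minimal Python sketch that parses the same entries and shows which address each name maps to, and which name each address maps back to. The helper `parse_hosts` is hypothetical and only models the file itself; it does not consult DNS or the system resolver, so treat it as an illustration rather than what PBS does.

```python
def parse_hosts(text):
    """Return (name_to_ip, ip_to_canonical) built from hosts-file text."""
    name_to_ip = {}
    ip_to_canonical = {}
    for line in text.splitlines():
        # Drop comments and surrounding whitespace.
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        ip, *names = line.split()
        if not names:
            continue
        # The first name on a line is the canonical one for that address.
        ip_to_canonical.setdefault(ip, names[0])
        for name in names:
            name_to_ip.setdefault(name, ip)
    return name_to_ip, ip_to_canonical


# Same entries as the hosts file in this post.
HOSTS = """\
127.0.0.1 localhost
1.2.3.4 HPC.your.domain.name HPC
10.0.0.1 node0.local node0 node-mgmt
"""

name_to_ip, ip_to_canonical = parse_hosts(HOSTS)
print(name_to_ip["HPC"], name_to_ip["node0"])
print(ip_to_canonical["1.2.3.4"], ip_to_canonical["10.0.0.1"])
```

The point is that HPC and node0 name the same machine but map to different addresses, so a check that only knows one of the names will not recognize a connection arriving from the other address.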
And then set pbs.conf on the master node like this:
#pbs.conf on master node
PBS_SERVER=HPC
PBS_LEAF_NAME=node0
On the execution node, pbs.conf looks like:
#pbs.conf on execution node
PBS_SERVER=node0
PBS_LEAF_NAME=node1
However, this configuration causes some commands to return PBS Internal Errors. After checking the source code, it seems that client-side commands, such as qrun XXX, connect to the server from the external address, 1.2.3.4, rather than 127.0.0.1 or 10.0.0.1. However, pbs_sched, which actually receives the request, only accepts the hostnames localhost and node0 (the latter taken from PBS_LEAF_NAME), so it drops the connection, treating it as unauthorized.
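To make my reading of the behavior concrete, here is a small Python model of the kind of check I think is happening. This is not PBS's actual code: `is_authorized` and the `ADDRESSES` table are hypothetical, and the assumption is simply that the scheduler compares the peer address against the addresses of localhost and the PBS_LEAF_NAME host.

```python
# Hypothetical address table mirroring the hosts file in this post.
ADDRESSES = {
    "localhost": "127.0.0.1",
    "HPC": "1.2.3.4",
    "node0": "10.0.0.1",
}


def is_authorized(peer_ip, allowed_names, addresses=ADDRESSES):
    """Accept the peer only if its address belongs to an allowed hostname."""
    allowed_ips = {addresses[name] for name in allowed_names if name in addresses}
    return peer_ip in allowed_ips


# Assumed allow list: localhost plus the PBS_LEAF_NAME of the master node.
allowed = ["localhost", "node0"]

print(is_authorized("1.2.3.4", allowed))    # connection from the external address
print(is_authorized("10.0.0.1", allowed))   # connection from the internal address
print(is_authorized("127.0.0.1", allowed))  # connection over loopback
```

Under this model only the connection from 1.2.3.4 is rejected, which matches what I see with qrun when PBS_SERVER=HPC makes the client use the external address.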
Besides affecting client-side commands, it also seems to prevent the scheduler from actually scheduling any jobs, because pbs_server is also considered unauthorized. I’m not sure about this, as I haven’t dug into pbs_server's codebase. However, I did see that the logs in the sched_logs folder contain ‘pbs_sched: badconn, node0 on port 661 unauthorized host’ every few minutes.
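For anyone who wants to check their own scheduler logs for the same symptom, here is a rough Python sketch that counts these messages per host and port. The pattern is based only on the message text quoted above; real log lines carry extra fields that I am not assuming anything about, and `count_badconn` is just a helper name I made up.

```python
import re
from collections import Counter

# Matches only the message text quoted in this post.
BADCONN = re.compile(r"badconn, (\S+) on port (\d+) unauthorized host")


def count_badconn(lines):
    """Count 'unauthorized host' rejections per (host, port)."""
    counts = Counter()
    for line in lines:
        match = BADCONN.search(line)
        if match:
            counts[(match.group(1), int(match.group(2)))] += 1
    return counts


sample = [
    "pbs_sched: badconn, node0 on port 661 unauthorized host",
    "some unrelated line",
    "pbs_sched: badconn, node0 on port 661 unauthorized host",
]
print(count_badconn(sample))
```

To run it on a real file, pass an open file object instead of `sample`, since iterating over a file yields its lines.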
I really don’t understand why PBS is designed this way. I’ve read the admin guide and didn’t find anything helpful. Are there any special considerations? Is my hostname configuration wrong? Or did I miss something? Any help would be appreciated.