I think we’re conflating a couple of issues.
As a result, we’re also using two different names: PBS_MOM_NODE_NAME and PBS_MOM_HOST_NAME.
PBS_MOM_NODE_NAME is not used to fix the issue with PBS_LEAF_NAME that was mentioned in the original post. It is used to ensure that when the MoM starts up it uses a name for the natural node that is consistent with the name that was used on the server to create the node.
For example, if you have a host with this in /etc/hosts:
192.168.2.11 nid0001 preproc01
PBS_MOM_NODE_NAME allows you to run qmgr -c "create node preproc01" on the server without ill effects.
The problem arises because when MoM starts up it builds a list of vnodes. That list is either just the natural node or a list of local vnodes (configured with a v2 configuration file, an exechost_startup hook, or even an exechost_periodic hook). To build this list, MoM obviously cannot check what the server thinks the name of the natural node is: the server may not even be running! So by default it assumes that the name of the natural node is the (non-canonicalized) hostname.
So if sites want to use an alias (or even a name bound to another IP address owned by the host!) to create nodes on hosts, instead of the “official” hostname, we need a way to tell MoM what we’ve done on the server at startup.
Otherwise many odd things happen. For example, if you manipulate the vnode list in an exechost_startup hook, the vnode list on MoM becomes a list with the natural node named after the output of “hostname”. When the UPDATE2 message hits the server, it renders the ‘original’ node created on the server stale and enforces the new (and presumably not improved!) naming for the vnodes. Which, of course, means that any resources set using qmgr on the original vnode are now on a stale vnode…
Hence the name PBS_MOM_NODE_NAME. It’s the name of the natural vnode when that name differs from the hostname. In the code, it corresponds not to “mom_host”, which is the canonicalized hostname, but to mom_short_name.
So it should, in my opinion, never be called PBS_MOM_HOST_NAME since it is actually used when it is NOT the hostname.
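To make the intended behaviour concrete, here is a minimal sketch of the name selection described above. It is not the actual MoM code (which is C); the function name, the dict standing in for the parsed configuration, and the injectable gethostname parameter are all assumptions made for illustration.

```python
import socket


def natural_vnode_name(conf, gethostname=socket.gethostname):
    """Pick the name MoM would use for its natural vnode.

    conf is a hypothetical dict of configuration settings. gethostname
    is injectable so the logic can be exercised without touching the
    real host.
    """
    override = conf.get("PBS_MOM_NODE_NAME")
    if override:
        # The site created the node on the server under this name,
        # so MoM must report the same name at startup.
        return override
    # Default: the plain hostname, with no name-service lookup, since
    # the server may not even be running when MoM starts.
    return gethostname()


# With the /etc/hosts entry from the example above:
print(natural_vnode_name({}, lambda: "nid0001"))
print(natural_vnode_name({"PBS_MOM_NODE_NAME": "preproc01"}, lambda: "nid0001"))
```

The first call returns the hostname nid0001, the second returns preproc01, which matches what was created with qmgr on the server.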
The problem with multihost jobs is actually an internal problem that arose when PBS_LEAF_NAME was created. It does not concern this variable. PBS_LEAF_NAME changes the address with which MoM registers with pbs_comm. As a result, it also changes the Mom= attribute of the vnodes involved on the server. That, in turn, changes what is used in the exec_host2 attribute of jobs, and the code erroneously checks whether it’s “part of the job” by matching its canonicalized hostname against that attribute. Instead, it should be checking for a match against the canonicalized PBS_LEAF_NAME. That will always work, provided that name resolution is consistent across the cluster (i.e. PBS_LEAF_NAME resolves to an address that is canonicalized to the same name on the MoM and on the server).
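A rough sketch of the check I have in mind, again in Python rather than the real C code. The helper names are hypothetical, PBS_LEAF_NAME is treated as a single name, and the exec_host2 layout of host:port/index*ncpus chunks joined by "+" is my assumption here. The canonicalize function is injectable so the example below runs without real name resolution.

```python
import socket


def canonical_name(name):
    """Canonicalize a host name via the resolver (needs working DNS or /etc/hosts)."""
    infos = socket.getaddrinfo(name, None, flags=socket.AI_CANONNAME)
    return infos[0][3] or name


def exec_host2_hosts(exec_host2):
    """Extract the host part of each chunk (assumed layout, see above)."""
    return [chunk.split("/", 1)[0].split(":", 1)[0]
            for chunk in exec_host2.split("+")]


def mom_is_part_of_job(exec_host2, conf, hostname, canonicalize=canonical_name):
    """True if this MoM appears in the job's exec_host2.

    Matches on the canonicalized PBS_LEAF_NAME when one is set and
    falls back to the hostname otherwise.
    """
    own = canonicalize(conf.get("PBS_LEAF_NAME") or hostname).lower()
    return any(canonicalize(h).lower() == own
               for h in exec_host2_hosts(exec_host2))


# Fake resolver: the leaf name is bound to a second address on the host.
table = {"nid0001": "nid0001.cluster", "nid0001-ib": "nid0001-ib.cluster"}
fake = lambda name: table.get(name, name)
job = "nid0001-ib.cluster:15002/0*2+nid0002.cluster:15002/0*2"

print(mom_is_part_of_job(job, {"PBS_LEAF_NAME": "nid0001-ib"}, "nid0001", fake))
print(mom_is_part_of_job(job, {}, "nid0001", fake))
```

The first call matches on the leaf name and returns True. The second mimics the current behaviour of matching on the hostname only, and returns False even though the MoM is in the job.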