Allow dots in PBS_MOM_NODE_NAME

Hi all,

I have come up with the following design to allow dots in the value of the PBS_MOM_NODE_NAME variable in the pbs.conf file.

https://pbspro.atlassian.net/wiki/spaces/PD/pages/1824423937/Allow+dots+in+PBS+MOM+NODE+NAME

Please let me know of any comments and suggestions.

Thank you,
Minghui

That is only part of the issue.

The design document that introduced PBS_MOM_NODE_NAME overloaded the first prototype implementation so that the variable is also used when gethostname() fails (in other words, on hosts where gethostname() fails you can still make PBSPro start by specifying PBS_MOM_NODE_NAME).

By doing that, it also introduced some restrictions that are unnecessary for the MoM short name; in particular, while DNS hostnames can indeed not contain underscores, there are many non-DNS resolver frameworks in which hostnames with an underscore are valid, and some people will naturally disambiguate node0001 from clusters A and B as node0001_clustera and node0001_clusterb (your design proposal would make node0001.clustera work, in addition to what works now, which is node0001-clustera).

Certainly, the qmgr command does not require vnode names used as natural node names to be valid hostnames as per the DNS RFC 952 (amended by RFC 1123).

I would think that what makes most sense is to enforce the “must comply with RFC952/1123” rules only when gethostname() fails. If gethostname() succeeds, we can use its result as the hostname and simply accept PBS_MOM_NODE_NAME as the MoM short name, whether or not it is a valid hostname. After all, nothing in the naming of the variable suggests it’s being overloaded for other purposes.

The reference documentation also does not state precisely what a “legal name for a host” is: we COULD refer to RFC 952 or RFC 1123 or quote the relevant section to define it, but we don’t. It is possibly a documentation error that should be corrected, but then ideally qmgr should also enforce that on the “create node” command, to avoid sites painting themselves into a corner without realising it (until they start writing hooks, which may come a lot later).
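For reference, here is a minimal Python sketch of what “legal name for a host” would mean if the documentation pointed at RFC 952 as amended by RFC 1123. The function name is_rfc1123_hostname is made up for illustration and is not part of PBS; the 253-character cap is the commonly used limit for the textual form and is an assumption here.

```python
import re

# One label: letters, digits and hyphens only, no leading or trailing hyphen,
# 1 to 63 characters. RFC 1123 relaxed RFC 952 to allow a leading digit.
_LABEL = re.compile(r"[A-Za-z0-9]([A-Za-z0-9-]{0,61}[A-Za-z0-9])?")


def is_rfc1123_hostname(name: str) -> bool:
    """Return True if every dot-separated label of `name` is a legal label."""
    if not name or len(name) > 253:  # assumed overall length cap
        return False
    return all(_LABEL.fullmatch(label) for label in name.split("."))


if __name__ == "__main__":
    for candidate in ("node0001-clustera", "node0001.clustera", "node0001_clustera"):
        print(candidate, is_rfc1123_hostname(candidate))
```

Under these rules node0001-clustera and node0001.clustera pass, while node0001_clustera is rejected because of the underscore.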

Since being more strict than what history allowed is always a bit troublesome, regardless of how we fix documentation (and qmgr for “really new” nodes that are created afresh rather than picked up from an old datastore, where we could still pick up vnode names that are illegal when new), and since if people use a syntax that is in theory “not allowed” we don’t have to punish them, I would favour accepting non-RFC952 names for the PBS_MOM_NODE_NAME verbatim, except if gethostname() failure forces us to use it as a hostname.

See e.g.

https://books.google.be/books?id=-afmBwAAQBAJ&pg=PA382&lpg=PA382&dq=LDAP+hostname+underscore&source=bl&ots=JfchNgzMjC&sig=ACfU3U35ZrwQJYGZ4lZKBlbEjOm7U-ABeg&hl=en&sa=X&ved=2ahUKEwjOqdeM9vLpAhXViFwKHc18BwwQ6AEwA3oECAgQAQ#v=onepage&q=LDAP%20hostname%20underscore&f=false

and you’ll see that OpenLDAP is documented as accepting underscores in valid hostnames as well.

Hi Alexis,

Let me know if I got what you were saying correctly:

If gethostname() succeeded, use its return value as hostname, and use PBS_MOM_NODE_NAME verbatim as the mom short name (it does not need to comply with RFC 952/1123).
If gethostname() failed, we use PBS_MOM_NODE_NAME as the hostname too. In this case PBS_MOM_NODE_NAME needs to comply with RFC 952/1123.

Yes, that’s a perfect summary (I knew I was being too verbose :wink: ).

Of course, as far as the everything-after-dot-removal is concerned: only remove everything after the dots if PBS_MOM_NODE_NAME was not set.
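To make the agreed behaviour concrete, here is a minimal Python sketch of the decision logic as summarised above. It is not the MoM implementation: resolve_mom_names, system_hostname_or_none and the regular expression are hypothetical names invented for this illustration, and treating "gethostname() failed and PBS_MOM_NODE_NAME is unset" as an error follows the original design document.

```python
import re
import socket
from typing import Optional, Tuple

# Assumed reading of RFC 952/1123: dot-separated labels of letters, digits
# and hyphens, each 1-63 characters, no leading or trailing hyphen.
_LABEL = r"[A-Za-z0-9]([A-Za-z0-9-]{0,61}[A-Za-z0-9])?"
_HOSTNAME = re.compile(_LABEL + r"(\." + _LABEL + r")*")


def system_hostname_or_none() -> Optional[str]:
    """Return the gethostname() result, or None if the call fails."""
    try:
        return socket.gethostname() or None
    except OSError:
        return None


def resolve_mom_names(
    system_hostname: Optional[str], pbs_mom_node_name: Optional[str]
) -> Tuple[str, str]:
    """Return (mom_host, mom_short_name) following the two cases above."""
    if system_hostname is not None:
        # gethostname() succeeded: its result is the hostname.
        if pbs_mom_node_name:
            # Used verbatim: no RFC check, dots are kept.
            return system_hostname, pbs_mom_node_name
        # Variable unset: strip everything after the first dot.
        return system_hostname, system_hostname.split(".", 1)[0]
    # gethostname() failed: the variable must double as a hostname.
    if (
        pbs_mom_node_name
        and len(pbs_mom_node_name) <= 253
        and _HOSTNAME.fullmatch(pbs_mom_node_name)
    ):
        return pbs_mom_node_name, pbs_mom_node_name
    raise RuntimeError("Unable to obtain my host name")


if __name__ == "__main__":
    print(resolve_mom_names(system_hostname_or_none(), "node0001_clustera"))
```

With a working gethostname(), a value such as node0001_clustera or node0001.clustera is taken as the MoM short name unchanged; the RFC check only comes into play in the fallback branch.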

Thanks. The original design document says:

If administrators want to use an alias (or a name bound to another IP address on the host) to create nodes rather than the default hostname, PBS_MOM_NODE_NAME provides them with the ability to override the default hostname.

According to what you said, this will no longer be true. PBS_MOM_NODE_NAME will serve as a backup solution for MoM to obtain its hostname. So this part of the original design needs to be revised.

That is a misreading of the original design document.

“create nodes” refers to the natural node name given in the “create nodes” command in qmgr, and the natural node when MoM starts up is controlled by the MoM short name, which indeed needs to be set to PBS_MOM_NODE_NAME, but is not required to be a hostname.

The original design document does indeed need to be revisited, but in THIS section:

" If PBS_MOM_NODE_NAME is unset and the call to gethostbyname [sic] fails OR if PBS_MOM_NODE_NAME is set and the value does not conform to RFCs 952 and 1123, the following message will be printed to the log:
Unable to obtain my host name"

But this line is already internally inconsistent with the heading above it, which reads “Log messages when MoM fails to identify its hostname” [emphasis mine].

MoM only fails to identify its hostname if gethostname() fails. [Incidentally, the design document incorrectly refers to gethostbyname(), which is to find a host when you know the name, but here we are trying to find the name in the first place.]

So if gethostname() succeeds, there is no need for PBS_MOM_NODE_NAME to comply with hostname RFCs.

So that suggests that the internally consistent paragraph there would be:

“If the call to gethostname fails and PBS_MOM_NODE_NAME is set and the value does not conform to RFCs 952 and 1123, the following message will be printed to the log: Unable to obtain my host name”

The implementation also has important implications that do not derive from the design document: if PBS_MOM_NODE_NAME is set, it also sets mom_host to PBS_MOM_NODE_NAME, even if gethostname() succeeds.

That is not mandated by either of the interface headings of the design document nor even the sentence that is inconsistent with the heading; indeed, when you read the heading “when MoM fails to identify its hostname”, to me it implies that this is only done ON THE CONDITION that gethostname has failed (at least that is what the vernacular “when” suggests to me).

That was, BTW, already discussed with the original author of that EDD and he agreed. There is a code fragment in an internal JIRA ticket that was encoding what we both thought made most sense for an internally consistent design document and implementation.

And yes, I was a reviewer for that design document (having come up with the first but non-official PBS_MOM_NODE_NAME implementation) and I missed the fact that the second interface’s heading implied something different from what was written within that section. I read the last sentence of that section as what I thought it would be with the heading in mind.

Incidentally, the previously existing everything-after-dot removal does make sense to get the mom_shortname provided that PBS_MOM_NODE_NAME was not specified.

The intent is to indeed make vnode names short by default, except if there is a reason to make them long (because some software expects it to be a FQDN and/or because the short name would not be unique in the cluster, if there is e.g. both a node0001.clustera and node0001.clusterb host, or because there is no name except for the IP address which is only complete when it is long).

And PBS_MOM_NODE_NAME is the mechanism to make them “long” if necessary. Or different (e.g. on Cray X* systems, to use mnemonic names for MoMs tied to the addresses rather than the nidXXXX names that can change if a node goes down and its function is taken over by different hardware).

I updated the design document:

https://pbspro.atlassian.net/wiki/spaces/PD/pages/1824423937/Revision+to+PBS+MOM+NODE+NAME+configuration+variable

Please let me know what you think. Thanks.