PP-610: On a Cray X-series, periodically synchronize PBS with ALPS inventory

This message is to inform the community that there is a new design document available for review:


Please provide your comments.


I have a few questions; the answers can be used to clarify the external design:

  • The external design says the hook will “restart the MoM”. That seems like overkill when sending a HUP would be enough to get the PBS MoM to ask ALPS for new inventory. Is there a reason we need to restart the MoM?

  • Will the hook run on non-Cray systems? If it does, will it do nothing, or will there be errors? The hook seems useful for the Cray X-series; can it be enabled by default there, but not on other platforms?

  • Will there be a timeout for the hook?

  • For interfaces #10 and #11, what is printed for the second <name>?

I should update the document to say ‘HUP the MoM’ instead of ‘restart the MoM’, since a HUP is what the hook will be doing (SIGHUP is trapped and eventually a new ALPS inventory call gets made).
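For reference, here is a minimal sketch of what ‘HUP the MoM’ could look like from Python. The PID file path and the helper names `read_pid` and `send_hup` are assumptions for illustration; they are not taken from the EDD or the hook itself.

```python
import os
import signal

# Assumed location of the MoM PID file; the real path depends on PBS_HOME.
MOM_PID_FILE = "/var/spool/pbs/mom_priv/mom.lock"


def read_pid(pid_file):
    """Return the PID stored on the first line of pid_file."""
    with open(pid_file) as f:
        return int(f.readline().strip())


def send_hup(pid, kill=os.kill):
    """Send SIGHUP to pid; the process keeps running and handles the signal.

    kill is injectable so the helper can be exercised without
    signalling a real process.
    """
    kill(pid, signal.SIGHUP)
```

Usage would be something like `send_hup(read_pid(MOM_PID_FILE))`. Unlike a restart, the MoM process stays up and only re-reads its state.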

The current design is for Cray systems. If others are aware of possible uses in non-Cray environments (and enabling/disabling the hook depending on platform), please comment.

The timeout for the hook is 90 seconds. I will update the document to include this.

For interfaces #10 and #11, it should say ‘list of nodes’ instead of ‘name’, since we are listing the node ID(s) on which PBS and ALPS differ.

As pointed out in the PP-586 discussion, even Cray systems often include non-Cray MOMs:

I believe a PBS hook can only be enabled/disabled for the entire PBS cluster, so the design should ensure that the hook behaves sanely on non-Cray MOMs as well as doing its job on Cray X-series MOMs. (Detecting that it is not running on an X-series MOM and exiting gracefully would be one approach.)

An ‘xthostname’ file (present on a Cray) can be used to make this determination, i.e. if the host is not a Cray, the hook exits.
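A rough sketch of that check, assuming the file lives at ‘/etc/xthostname’ and that its presence alone is enough to identify a Cray host. The helper name `is_cray_host` is made up for illustration.

```python
import os

# Assumed path of the file that is present on a Cray.
XTHOSTNAME_PATH = "/etc/xthostname"


def is_cray_host(path=XTHOSTNAME_PATH):
    """Return True if the xthostname file exists on this host."""
    return os.path.isfile(path)
```

The hook body would then start with something like `if not is_cray_host(): return`, so a non-Cray MoM logs the condition and does nothing else.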

Great. The design should probably be updated to state that the hook will only do its work on Cray X-series MOMs.

The EDD page v.4 looks good to me.

What are the possible values for < host name > in the message:

Interface 3: Mom log entry: No < host name > file found on this host

For the message in interface 7 the first instance of < name > is the hostname of the login node responsible for performing the inventory query while the 2nd instance of < name > is the hostname of the current/local mom, correct?

What, specifically, do the 2 instances of < list of nodes > represent in interfaces 10 and 11? It looks like the first is the list of “unknown” nodes but what is the second, the entire list of “known” nodes?

Interface 14: The details section still refers to “restarting” the MoM.

Thanks for your comments.

Interface 3: ‘host name’ refers to the “/etc/xthostname” file.

Interface 7: Your comment is accurate.

Interface 10 and 11: I will update this statement in the EDD, since ‘list of nodes’ got repeated inadvertently. ‘list of nodes’ refers to ‘unknown’ nodes.
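To illustrate what the ‘unknown’ node lists amount to, here is a small sketch that compares the two inventories as sets. The function name and the assumption that one list covers nodes seen only by ALPS and the other nodes seen only by PBS are mine, not wording from the EDD.

```python
def unknown_nodes(pbs_nodes, alps_nodes):
    """Return (only_in_alps, only_in_pbs) as sorted lists of node IDs."""
    pbs, alps = set(pbs_nodes), set(alps_nodes)
    # Nodes ALPS reports that PBS does not know, and the reverse.
    return sorted(alps - pbs), sorted(pbs - alps)
```

If both lists come back empty, the inventories agree and there is nothing to log.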

Interface 14: I will update the ‘details’ section.

Interface 3: So why not just say “xthostname” in the message?

Interface 7: Maybe add something like my comment to the details? Having the same label in the message description can be confusing.

Interface 3: I kept the interface generic (i.e. ‘host name’) since I’m already explaining in the details section that the hook looks for the ‘/etc/xthostname’ file.

Interface 7: I have updated the details section based on your comment.

EDD is acceptable to me

Design proposal looks good to me.

The EDD page v.6 looks good to me.

The hook should be automatically enabled on Cray X* series systems but disabled elsewhere. I have updated the external design to reflect this. Please have a look.

Thanks for making that change to the EDD, it looks good!

@lisa-altair change looks good.

The updates in EDD page v.7 look good to me.