There has been some discussion already about the danger of catching a KeyboardInterrupt: if the hook doesn’t exit, it will sit there and the job will remain in the ‘E’ state.
The problem is that there is no fail_action for the execjob_end hook, and we need to mark a node offline because an alarm indicates there was some sort of problem. We have a couple of possible solutions:
1. Have a timeout for the whole hook, set to some amount of time shorter than the hook alarm. Once the timeout fires, the hook itself sets the node offline; when the hook alarm later goes off, nothing special needs to happen.
2. Catch the exception, but put a timeout on the portion of the hook that marks the node offline. That way, if something goes wrong and it takes a long time, the hook will still end/exit (see the sketch below).
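For concreteness, here is a rough sketch of what the second option could look like as an execjob_end hook. It assumes the hook alarm surfaces in the script as a KeyboardInterrupt and that a Python-level SIGALRM timer works inside the hook; the cleanup function, timeout value, and messages are placeholders, not anything from the design.

```python
# Minimal sketch of option 2: catch the alarm's KeyboardInterrupt and bound
# the offline step with its own timer so the hook always exits.
import signal
import pbs

OFFLINE_TIMEOUT = 5  # seconds; keep this well below the remaining hook alarm


def offline_timed_out(signum, frame):
    # If marking vnodes offline hangs, give up so the hook still exits.
    raise RuntimeError("timed out while marking vnodes offline")


def end_of_job_cleanup():
    # Placeholder for the hook's real end-of-job work.
    pass


e = pbs.event()
try:
    end_of_job_cleanup()
    e.accept()
except KeyboardInterrupt:
    # The alarm fired before the cleanup finished; offline this MoM's
    # vnodes, but only for a bounded amount of time.
    signal.signal(signal.SIGALRM, offline_timed_out)
    signal.alarm(OFFLINE_TIMEOUT)
    try:
        for vname in e.vnode_list.keys():
            e.vnode_list[vname].state = pbs.ND_OFFLINE
            e.vnode_list[vname].comment = "offlined by execjob_end hook alarm"
    except Exception as err:
        pbs.logmsg(pbs.LOG_WARNING, "could not offline vnodes: %s" % err)
    finally:
        signal.alarm(0)
    e.reject("execjob_end hook alarmed; vnodes marked offline")
```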
We discussed the Cray Shasta design document at the architecture meeting last week. Here is my assessment of what we concluded:
Using a hook configuration file for managing configuration values is the right choice for scalability. Using node and/or server attributes for these values would force each MoM to pull the data from the server, which would not scale well.
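For illustration, a MoM hook can read such a configuration file locally along these lines; the JSON format and the poll_interval key are assumptions made for the sketch, not part of the design.

```python
# Read the hook's imported configuration file on the local node instead of
# pulling attributes from the server.
import json
import pbs

config = {}
config_file = pbs.hook_config_filename  # path to the imported config file, if any
if config_file:
    with open(config_file) as fp:
        config = json.load(fp)

poll_interval = config.get("poll_interval", 30)  # hypothetical key with a default
pbs.logmsg(pbs.LOG_DEBUG, "poll_interval set to %s" % poll_interval)
```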
Please define the JACS acronym early on in the document.
When a Cray administrator (or a service acting on their behalf) restarts MoM after node health has been restored, they should do so in polling mode (pbs_mom -p) to handle the case where other jobs may have been running when MoM was taken down.
If MoM was taken down while returning output to the submission host, the copy operation may need to be restarted. PBS should already handle this, but you may want to make note of it.
Were other options discussed with regard to node health check failures? Killing MoM seems like a fairly drastic measure that could interrupt normal PBS operations for other jobs running on the node. Could we use signals to tell MoM that incoming jobs are not permitted, but existing jobs may continue to run? Then another signal to tell MoM to resume normal operation.
Questions of my own:
Might a Shasta site want different configuration values for different nodes or vnode types? We attempt to accommodate this in the cgroups hook configuration file. It has been suggested that we increase the flexibility of this method by allowing different groups of nodes to have entirely independent configurations, listing multiple configurations together with the nodes or vnode types they are assigned to.
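Purely as a strawman for the question above, a multi-configuration file and its matching logic might look something like the following; every key, pattern, and value here is hypothetical.

```python
# One config file carrying several independent configurations, each tagged
# with the vnode-name patterns it applies to.
import fnmatch

example_config = {
    "configurations": [
        {"applies_to": ["compute_*"], "settings": {"poll_interval": 30}},
        {"applies_to": ["gpu_*"], "settings": {"poll_interval": 10}},
    ]
}


def settings_for(vnode_name, config):
    """Return the settings of the first configuration whose patterns match the vnode."""
    for entry in config["configurations"]:
        if any(fnmatch.fnmatch(vnode_name, pat) for pat in entry["applies_to"]):
            return entry["settings"]
    return {}


# e.g. settings_for("gpu_0042", example_config) -> {"poll_interval": 10}
```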
Hi @mkaro, thanks for the review comments.
We did discuss other options with regard to node health check failures, and we agree that killing the MoM seems drastic. We discussed marking the node “offline” with Cray; however, the problem is that PBS won’t know when to bring the node back up/online. Cray asked how PBS knows when to bring a node up again on other platforms, and we said that on other platforms the MoM goes down/up with the node. Cray decided to mimic that behavior. Without a way to query the state of a node, there is no automatic way for PBS to bring a MoM/node back online.
We did think about the potential for different nodes having different hook configurations; however, it seemed like over-engineering for a use case we don’t currently have (and aren’t sure we are going to have).
Thanks for making the minor changes to the design. Please note that the intent for the -p parameter is to poll for previously running jobs since the parent/child relationship between MoM and the job process no longer exists. I suggest you update the sentence you added.
I’ve updated the design to clarify that all vnodes reported by a MoM will be offlined if the hook wants to offline nodes. This is done via the fail_action hook attribute.
The design changes make sense to me. Thanks for updating it.
I think it is sufficient to mention once that the offlining of vnodes will be done via fail_action = offline_vnodes. It is not necessary to keep mentioning it everywhere.
Perhaps you can add a section about the default settings of the PBS_cray_atom.HK file and then the fail_action setting can be mentioned there?
The new addition looks good.
I was trying to say that once you added fail_action = offline_vnodes the first time, you didn’t have to keep adding “(fail_action = offline_vnodes)” everywhere. But if you want to leave it, that’s fine.
The design looks good to me. Thanks!