There has been some discussion already about the danger of catching a KeyboardInterrupt: if the hook doesn’t exit, it will sit there and the job will remain in the ‘E’ state.
The problem is that there is no fail_action for the execjob_end hook, and we need to mark a node offline because an alarm indicates there was some sort of problem. We have a couple of possible solutions:
1. Have a timeout for the whole hook, set to some amount of time shorter than the hook alarm. Once the timeout fires, the hook itself sets the node offline; when the hook alarm later goes off, nothing special needs to happen.
2. Catch the exception, but put a timeout on the portion of the hook that marks the node offline. That way, if something goes wrong and it takes a long time, the hook will still end/exit (see the sketch below).
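For concreteness, here is a rough sketch of what the second option could look like as an execjob_end hook. It assumes the hook alarm surfaces in the script as a KeyboardInterrupt and that a Python-level SIGALRM timer works inside the hook; the cleanup function, timeout value, and messages are placeholders, not anything from the design.

```python
# Minimal sketch of option 2: catch the alarm's KeyboardInterrupt and bound
# the offline step with its own timer so the hook always exits.
import signal
import pbs

OFFLINE_TIMEOUT = 5  # seconds; keep this well below the remaining hook alarm


def offline_timed_out(signum, frame):
    # If marking vnodes offline hangs, give up so the hook still exits.
    raise RuntimeError("timed out while marking vnodes offline")


def end_of_job_cleanup():
    # Placeholder for the hook's real end-of-job work.
    pass


e = pbs.event()
try:
    end_of_job_cleanup()
    e.accept()
except KeyboardInterrupt:
    # The alarm fired before the cleanup finished; offline this MoM's
    # vnodes, but only for a bounded amount of time.
    signal.signal(signal.SIGALRM, offline_timed_out)
    signal.alarm(OFFLINE_TIMEOUT)
    try:
        for vname in e.vnode_list.keys():
            e.vnode_list[vname].state = pbs.ND_OFFLINE
            e.vnode_list[vname].comment = "offlined by execjob_end hook alarm"
    except Exception as err:
        pbs.logmsg(pbs.LOG_WARNING, "could not offline vnodes: %s" % err)
    finally:
        signal.alarm(0)
    e.reject("execjob_end hook alarmed; vnodes marked offline")
```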
We discussed the Cray Shasta design document at the architecture meeting last week. Here is my assessment of what we concluded:
Using a hook configuration file for managing configuration values is the right choice for scalability. Using node and/or server attributes for these values would force each MoM to pull the data from the server, which would not scale well.
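For illustration, a MoM hook can read such a configuration file locally along these lines; the JSON format and the poll_interval key are assumptions made for the sketch, not part of the design.

```python
# Read the hook's imported configuration file on the local node instead of
# pulling attributes from the server.
import json
import pbs

config = {}
config_file = pbs.hook_config_filename  # path to the imported config file, if any
if config_file:
    with open(config_file) as fp:
        config = json.load(fp)

poll_interval = config.get("poll_interval", 30)  # hypothetical key with a default
pbs.logmsg(pbs.LOG_DEBUG, "poll_interval set to %s" % poll_interval)
```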
Please define the JACS acronym early on in the document.
When a Cray administrator (or a service acting on their behalf) restarts MoM after node health has been restored, they should do so in polling mode (pbs_mom -p) to handle the case where other jobs may have been running when MoM was taken down.
If MoM was taken down while returning output to the submission host, the copy operation may need to be restarted. PBS should already handle this, but you may want to make note of it.
Were other options discussed with regard to node health check failures? Killing MoM seems like a fairly drastic measure that could interrupt normal PBS operations for other jobs running on the node. Could we use signals to tell MoM that incoming jobs are not permitted, but existing jobs may continue to run? Then another signal to tell MoM to resume normal operation.
Questions of my own:
Might a Shasta site want different configuration values for different nodes or vnode types? We attempt to accommodate this in the cgroups hook configuration file. It has been suggested that we increase the flexibility of this method by allowing different groups of nodes to have entirely independent configurations, listing multiple configurations together with the nodes or vnode types they are assigned to.
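Purely as a strawman for the question above, a multi-configuration file and its matching logic might look something like the following; every key, pattern, and value here is hypothetical.

```python
# One config file carrying several independent configurations, each tagged
# with the vnode-name patterns it applies to.
import fnmatch

example_config = {
    "configurations": [
        {"applies_to": ["compute_*"], "settings": {"poll_interval": 30}},
        {"applies_to": ["gpu_*"], "settings": {"poll_interval": 10}},
    ]
}


def settings_for(vnode_name, config):
    """Return the settings of the first configuration whose patterns match the vnode."""
    for entry in config["configurations"]:
        if any(fnmatch.fnmatch(vnode_name, pat) for pat in entry["applies_to"]):
            return entry["settings"]
    return {}


# e.g. settings_for("gpu_0042", example_config) -> {"poll_interval": 10}
```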
Hi @mkaro, thanks for the review comments.
We did discuss other options with regard to node health check failures, and we agree that killing the MoM seems drastic. We discussed marking the node “offline” with Cray; however, the problem is that PBS won’t know when to bring the node back up/online. Cray asked how PBS knows when to bring a node up again on other platforms, and we said that on other platforms the MoM goes down/up with the node. Cray decided to mimic that behavior. Without a way to query the state of a node, there is no automatic way for PBS to bring a MoM/node back online.
We did think about the potential for different nodes having different hook configurations; however, it seemed like over-engineering for a use case we don’t currently have (and aren’t sure we are going to have).
Thanks for making the minor changes to the design. Please note that the intent for the -p parameter is to poll for previously running jobs since the parent/child relationship between MoM and the job process no longer exists. I suggest you update the sentence you added.
I’ve updated the design to clarify that all vnodes reported by a MoM will be offlined if the hook wants to offline nodes. This is done via the fail_action hook attribute.
The design changes make sense to me. Thanks for updating it.
I think it is sufficient to mention once that the offlining of vnodes will be done via fail_action = offline_vnodes. It is not necessary to keep mentioning it everywhere.
Perhaps you can add a section about the default settings of the PBS_cray_atom.HK file and then the fail_action setting can be mentioned there?
The new addition looks good.
I was trying to say that once you added fail_action = offline_vnodes the first time, you didn’t have to keep adding “(fail_action = offline_vnodes)” everywhere. But if you want to leave it, that’s fine.
The design looks good to me. Thanks!