Design document for node state change hook event

monkeystate · July 31, 2020, 7:43pm

Hello,

A design proposal for adding a node state change hook event has just been created. The purpose is to enable admins to deploy site-specific scripts that execute when a node changes state.

This is a WIP and feedback would be appreciated.

Thank you

bhroam · July 31, 2020, 8:39pm

Hey @monkeystate,
Thanks for taking up this endeavour, it will be a great addition to our hook infrastructure.

I have a few suggestions.

We usually expose things on the event, not off of pbs itself. How about you expose the old/new node states off of pbs.event() (pbs.event().old_state / pbs.event().new_state)?
This hook could be a lot more beneficial if we expose the whole node as well. This way you can do more than just account for time in different states. You can have pbs.event().node, just like other hooks have pbs.event().job. If you do this, you won’t have to keep track of the amount of time since the last node state change. This is already an attribute on the node itself (last_state_change_time).
I know the design is WIP, but maybe you could add a section on exactly what is changing for hook writers, and how they will use the hook. The internals you have are also useful, but not so much for the doc team. Also include what happens when you accept/reject this hook. My guess is nothing special other than a log message, but it will be good to say so explicitly.

Bhroam

monkeystate · August 3, 2020, 3:24pm

Hello @bhroam,

Thanks so much for your suggestions! We’ll be following up shortly.

monkeystate · August 19, 2020, 3:31pm

Hello @bhroam,

An update to the node state change design doc has just been published. This represents recent discussions among myself, @pershey, @toonen, @weallcock, as well as others.

The design is still a WIP and is ahead of the source code in our repo, but the document does reflect ideas we plan to implement next. Feedback on this updated draft would be appreciated.

Wrt your suggestions:

We agree; my initial design draft incorrectly stated our position on this issue. The new draft explicitly specifies that state change data is exposed via pbs.event().vnode.state_change
The updated design exposes vnode attributes in addition to the state change data.
I added a first draft “Info for hook writers” subsection to the Technical Details section that will continue to evolve with the implementation.

Thank you

bhroam · August 20, 2020, 12:24am

Hey @monkeystate,
Thank you for your updates to your document. I have a few more suggestions.

I think the section for hook writers needs a little more information. I’d rather the changes to the hook event not appear only in the example.
I suggest the following additions:

Add a bullet describing what is in pbs.event().state_change. I assume it is just new_state and old_state, but being explicit is good.
All events have a type exposed in the hook interface, so it is possible to write one hook for multiple hook events. pbs.event() tells you the type, so you know which event triggered the hook and can do different things accordingly. Could you say what the type is for this event?
Explicitly say what will happen if you accept() or reject() the hook. I could see people assuming a reject() can do anything from actually stopping the state change to just printing a message in the server log.

The internals section lists a new structure for the state change. It has a timestamp in it. I’m wondering what this is for? If it is so you can determine how long it was between the last state change and now, that is not required. There is an attribute ‘last_state_change_time’ which is the timestamp of the last time the state changed. As long as you trigger the hook prior to the actual state change, this will be the last time the state changed.
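The calculation described here can be sketched as follows. This is only an illustration: it assumes last_state_change_time is available to the hook as seconds since the epoch, and seconds_in_previous_state is a hypothetical helper name, not part of the design.

```python
import time


def seconds_in_previous_state(last_state_change_time, now=None):
    """Return how long the vnode spent in the state it is about to leave.

    Assumes `last_state_change_time` is an epoch timestamp in seconds and
    that the hook runs before the attribute is updated for the new change.
    """
    if now is None:
        now = time.time()
    # Clamp at zero in case of small clock differences.
    return max(0, int(now) - int(last_state_change_time))


# Example with fixed numbers: the last change was 60 seconds before "now".
print(seconds_in_previous_state(1000, now=1060))
```

With this approach the event itself would not need to carry a separate timestamp.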

Something to think about: you don’t necessarily need the old state as part of the event. You are exposing the vnode attributes, and the state is one of them. Although it can’t hurt having it there.

Bhroam

bayucan · August 20, 2020, 10:41pm

I’ve looked into your design. We’ll need the actual name of the hook event (e.g. queuejob, modifyjob, periodic, etc.) and the parameters accompanying it. The way I read it, you want the hook event to run the hook script whenever a node’s state changes. I was thinking of making this more generic: a hook that runs whenever a vnode attribute is updated, although for now only a change to the ‘state’ value would trigger it. How about this idea:
name of hook event: modifyvnode
Python global constant: pbs.MODIFYVNODE

Parameters:
pbs.event().vnode_o - vnode object view of the vnode before the state change (original state or other attribute values)
pbs.event().vnode - vnode object view now, including the state change (and other attributes)

So in the Python script, one would see:
pbs.event().vnode_o.state - example pbs.ND_FREE <- old state value
pbs.event().vnode.state - example pbs.ND_DOWN <- new state value

State values actually already have global constants that can be matched as in:
pbs.ND_FREE
pbs.ND_OFFLINE
pbs.ND_DOWN
pbs.ND_STALE
pbs.ND_JOBBUSY
pbs.ND_JOB_EXCLUSIVE
pbs.ND_RESV_EXCLUSIVE
pbs.ND_BUSY
pbs.ND_STATE_UNKNOWN
pbs.ND_PROV
pbs.ND_WAIT_PROV
pbs.ND_UNRESOLVABLE
pbs.ND_SLEEP
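
A hook written against this proposal might look roughly like the sketch below. The pbs module only exists inside the PBS hook interpreter, so the sketch takes the event as a parameter and exercises it with a stand-in object. The function name summarize_state_change, the stand-in event, the use of a name attribute on the vnode, and the integer state values are all assumptions made for illustration.

```python
from types import SimpleNamespace


def summarize_state_change(event):
    """Return a one-line summary of a vnode state change, or None if unchanged.

    `event` is assumed to expose `vnode_o` (before) and `vnode` (after),
    as proposed above for the modifyvnode hook event.
    """
    old_state = event.vnode_o.state
    new_state = event.vnode.state
    if old_state == new_state:
        return None  # some other attribute changed, not the state
    return "vnode %s: state %s -> %s" % (event.vnode.name, old_state, new_state)


# Stand-in for pbs.event(); the numeric states are placeholders,
# not real pbs.ND_* values.
fake_event = SimpleNamespace(
    vnode_o=SimpleNamespace(name="node01", state=0),
    vnode=SimpleNamespace(name="node01", state=2),
)
print(summarize_state_change(fake_event))
```

Inside a real hook the function would presumably be called with pbs.event(), and the old and new states compared against constants such as pbs.ND_FREE and pbs.ND_DOWN rather than placeholder integers.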

monkeystate · August 25, 2020, 1:06pm

Hello @bhroam and @bayucan,

Thanks very much for your continued feedback!

We like patterns, and following an existing approach (such as the one used for modifyjob) is very interesting. We’ll be following up shortly.

subhasisb · August 27, 2020, 4:14am

@toonen @monkeystate you suggested (in a meeting) adding a “filter” criterion so that such hooks do not get triggered too frequently. That sounded like a very cool idea.

How about we expand the filter idea to all hooks? Of course that can be a separate change/proposal and does not need to be clubbed with this one.

I agree it will be useful for the overall performance of the server in cases where lots of object updates are happening (thousands of nodes changing states and other attribute values, and thousands of jobs updating attributes every second). A similar case already happens with the modifyjob hook, wherein the scheduler updates a large number of jobs with a comment stating why it could not run them. If a modifyjob hook is present, it could get triggered for each of those comment updates even though that particular hook may not care about them. If we could bypass setting up the whole Python environment upfront, it would save us quite a bit of compute cost.

Of course, the filter has to be a simple one like “call this hook if there is a change in these attributes only”. If we wanted a dynamic “formula” like thing, we might need the python interpreter in the first place!
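
A minimal sketch of such a static filter is shown below, written in Python only for readability (in the server the check would have to run before any interpreter is started). The function should_trigger and the dictionary representation of attributes are hypothetical and not part of any existing PBS interface.

```python
def should_trigger(old_attrs, new_attrs, watched):
    """Return True if any watched attribute differs between the two snapshots."""
    return any(old_attrs.get(name) != new_attrs.get(name) for name in watched)


# Only the comment changed, so a hook that watches "state" is skipped
# while a hook that watches "comment" still runs.
old = {"state": "free", "comment": None}
new = {"state": "free", "comment": "not enough resources"}
print(should_trigger(old, new, {"state"}))
print(should_trigger(old, new, {"comment"}))
```

Because the check is a plain comparison over a fixed list of attribute names, it needs no per-hook logic, which is what keeps it cheap.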

monkeystate · August 28, 2020, 2:12pm

Hello @subhasisb,

Thanks for your very useful feedback! As you suggest we will open a separate topic in the near future to further explore this idea.

Thank you

monkeystate · September 21, 2020, 7:30pm

Hello @bhroam, @bayucan, and @subhasisb,

The third iteration of the node state change design doc has just been published. The design is still a WIP (for example the content of the proposed pbs log entry may yet change) but it’s closer to its final state. The design is backed by a first draft WIP implementation that is a fair representation of the approach.

(Note: as previously mentioned, the design for a general hook filter is out of scope for this design and is therefore not included in this iteration.)

Feedback would be appreciated.

Thank you,
@monkeystate, @pershey, @pmrich, @sdass, @toonen, @weallcock

bayucan · September 23, 2020, 3:46pm

@monkeystate : current design looks good. Thanks for taking up my suggestions.

subhasisb · September 24, 2020, 5:37am

@monkeystate the current design doc looks good. It also keeps a path open for the future where more than node state changes can be triggering the hook (if and when such a requirement arises).

bhroam · September 24, 2020, 7:06pm

Thanks for making all the changes @monkeystate. The design looks good to me.

Bhroam

monkeystate · November 4, 2020, 3:12pm

Hello @bhroam, @bayucan, and @subhasisb,

The node state change design doc is no longer a WIP and now includes a pull request! The proposed implementation is the result of a collaboration between @pershey, @toonen, and myself.

Please note that in this iteration two new “state list” functions have been added to the python vnode object. Otherwise the design is largely unchanged from previous versions.

Your consideration would be appreciated.

Thank you

bayucan · November 6, 2020, 11:00pm

@monkeystate The design is looking good. I’ve reviewed the PR.
