Design document for node state change hook event

Hello,

A design proposal for adding a node state change hook event has just been created. The purpose is to enable admins to deploy site-specific scripts that execute when a node changes state.

This is a WIP and feedback would be appreciated.

Thank you

Hey @monkeystate,
Thanks for taking up this endeavour, it will be a great addition to our hook infrastructure.

I have a few suggestions.

  1. We usually expose things on the event, not off of pbs itself. How about you expose the old/new node states off of pbs.event() (pbs.event().old_state/pbs.event.new_state)
  2. This hook could be a lot more beneficial if we expose the whole node as well. This way you can do more than just account for time in different states. You can have pbs.event().node, just like other hooks hav pbs.event().job. If you do this, you won’t have to keep track of the amount of time since the last node state. This is already an attribute on the node itself (last_state_change_time)
  3. I know the design is WIP, but maybe you could add a section on exactly what is changing for hook writers, and how they will use the hook. The internals you have are also useful, but not so much for doc team. Also include what happens when you accept/reject this hook. My guess is nothing special other than a log message, but it will be good to explicitly say.

Bhroam

Hello @bhroam,

Thanks so much for your suggestions! We’ll be following up shortly.

Hello @bhroam,

An update to the node state change design doc has just been published. This represents recent discussions among myself, @pershey, @toonen, @weallcock, as well as others.

The design is still a WIP and is ahead of the source code in our repo, but the document does reflect ideas we plan to implement next. Feedback on this updated draft would be appreciated.

Wrt your suggestions:

  1. We agree; my initial design draft incorrectly stated our position on this issue. The new draft explicitly specifies that state change data is exposed via pbs.event().vnode.state_change

  2. The updated design exposes vnode attributes in addition to the state change data.

  3. I added a first draft “Info for hook writers” subsection to the Technical Details section that will continue to evolve with the implementation.

Thank you

Hey @monkeystate,
Thank you for your updates to your document. I have a few more suggestions.

I think the section for hook writers needs a little more information. I’d rather not see the changes to the hook event only be in the example.
I suggest the following additions

  1. add a bullet talking about what is in pbs.event().state_change. I assume it is just new_state and old_state, but being explicit is good.
  2. All events have a type exposed in the hook interface so it is possible to write one hook for multiple hook events. pbs.event() tells you the type so you know which event triggered you so you can do different things. Could you say what this is?
  3. Explicitly say what will happen if you accept() or reject() the hook. I could see people assuming a reject() can do anything from actually stopping the state change, to just printing a message in the server log.

In the internals section lists a new structure for the state change. It has a timestamp in it. I’m wondering what this is for? If it is so you can determine how long it was between the last state change and now, that is not required. There is an attribute ‘last_state_change_time’ which is the timestamp of the last time the state changed. As long as you trigger the hook prior to the actual state change, this will be the last time the state changed.

Something to think about is you don’t necessarily need the old state as part of the event. You are exposing the vnode attributes, and the state is part of it. Although it can’t hurt having it there.

Bhroam

1 Like

I’ve looked into your design. So we’ll need the actual name of the hook event (e.g. queuejob, modifyjob, periodic, etc…), and the parameters accompanying the hook event. So the way I read it, is that you want the hook event to run the hook script whenever there’s a change in state of the node. I was thinking maybe making this more generic. That is, a hook that runs whenever there’s an update to a vnode attribute, although for now, only the ‘state’ change value would trigger this hook. How about this idea:
name of hook event: modifyvnode
Python global constant: pbs.MODIFYVNODE

Parameters:
pbs.event().vnode_o - vnode object view of the vnode before the state change (original state or other attribute values)
pbs.event().vnode - vnode object view now, including the state change (and other attributes)

So in the Python script, one would see:
pbs.event().vnode_o.state - example pbs.ND_FREE <- old state value
pbs.event().vnode.state - example pbs.ND_DOWN <- new state value

State values actually already have global constants that can be matched as in:
pbs.ND_FREE
pbs.ND_OFFLINE
pbs.ND_DOWN
pbs.ND_STALE
pbs.ND_JOBBUSY
pbs.ND_JOB_EXCLUSIVE
pbs.ND_RESV_EXCLUSIVE
pbs.ND_BUSY
pbs.ND_STATE_UNKNOWN
pbs.ND_PROV
pbs.ND_WAIT_PROV
pbs.ND_UNRESOLVABLE
pbs.ND_SLEEP

1 Like

Hello @bhroam and @bayucan,

Thanks very much for your continued feedback!

We like patterns, and following an existing approach (such as is used for modifyjob) is very interesting. We’ll be following up shortly.

1 Like

@toonen @monkeystate you suggested (in a meeting) about adding a “filter” criteria so that such hooks do not get triggered too frequently. That sounded like a very cool idea.

How about we expand the filter idea to all hooks? Of course that can be a separate change/proposal and does not need to be clubbed with this one.

I agree it will be useful for overall performance of the server in cases where lots of object updates are happening (thousands of nodes changing states and other attribute values and thousands of jobs updating attributes every second). A similar case already happens with modifyjob hook wherein the scheduler updates a large number of jobs with a comment stating why it could not run it. If a modifyjob hook is present, it could get triggered for each of those comment updates, whereas it is possible that the particular hook does not care about that - and if we could bypass setting up the whole python environment upfront, it would save us quite some compute cost.

Of course, the filter has to be a simple one like “call this hook if there is a change in these attributes only”. If we wanted a dynamic “formula” like thing, we might need the python interpreter in the first place!

Hello @subhasisb,

Thanks for your very useful feedback! As you suggest we will open a separate topic in the near future to further explore this idea.

Thank you

Hello @bhroam, @bayucan, and @subhasisb,

The third iteration of the node state change design doc has just been published. The design is still a WIP (for example the content of the proposed pbs log entry may yet change) but it’s closer to its final state. The design is backed by a first draft WIP implementation that is a fair representation of the approach.

(Note as previously mentioned the design for a general hook filter is out of scope for this design and therefore not included in this iteration.)

Feedback would be appreciated.

Thank you,
@monkeystate, @pershey, @pmrich, @sdass, @toonen, @weallcock

1 Like

@monkeystate : current design looks good. Thanks for taking up my suggestions.

@monkeystate the current design doc looks good. It also keeps a path open for the future where more than node state changes can be triggering the hook (if and when such a requirement arises).

Thanks for making all the changes @monkeystate. The design looks good to me.

Bhroam