A design proposal for adding a node state change hook event has just been created. The purpose is to enable admins to deploy site-specific scripts that execute when a node changes state.
Hey @monkeystate,
Thanks for taking up this endeavour, it will be a great addition to our hook infrastructure.
I have a few suggestions.
We usually expose things on the event, not off of pbs itself. How about you expose the old/new node states off of pbs.event() (pbs.event().old_state/pbs.event.new_state)
This hook could be a lot more beneficial if we expose the whole node as well. This way you can do more than just account for time in different states. You can have pbs.event().node, just like other hooks hav pbs.event().job. If you do this, you wonāt have to keep track of the amount of time since the last node state. This is already an attribute on the node itself (last_state_change_time)
I know the design is WIP, but maybe you could add a section on exactly what is changing for hook writers, and how they will use the hook. The internals you have are also useful, but not so much for doc team. Also include what happens when you accept/reject this hook. My guess is nothing special other than a log message, but it will be good to explicitly say.
The design is still a WIP and is ahead of the source code in our repo, but the document does reflect ideas we plan to implement next. Feedback on this updated draft would be appreciated.
Wrt your suggestions:
We agree; my initial design draft incorrectly stated our position on this issue. The new draft explicitly specifies that state change data is exposed via pbs.event().vnode.state_change
The updated design exposes vnode attributes in addition to the state change data.
I added a first draft āInfo for hook writersā subsection to the Technical Details section that will continue to evolve with the implementation.
Hey @monkeystate,
Thank you for your updates to your document. I have a few more suggestions.
I think the section for hook writers needs a little more information. Iād rather not see the changes to the hook event only be in the example.
I suggest the following additions
add a bullet talking about what is in pbs.event().state_change. I assume it is just new_state and old_state, but being explicit is good.
All events have a type exposed in the hook interface so it is possible to write one hook for multiple hook events. pbs.event() tells you the type so you know which event triggered you so you can do different things. Could you say what this is?
Explicitly say what will happen if you accept() or reject() the hook. I could see people assuming a reject() can do anything from actually stopping the state change, to just printing a message in the server log.
In the internals section lists a new structure for the state change. It has a timestamp in it. Iām wondering what this is for? If it is so you can determine how long it was between the last state change and now, that is not required. There is an attribute ālast_state_change_timeā which is the timestamp of the last time the state changed. As long as you trigger the hook prior to the actual state change, this will be the last time the state changed.
Something to think about is you donāt necessarily need the old state as part of the event. You are exposing the vnode attributes, and the state is part of it. Although it canāt hurt having it there.
Iāve looked into your design. So weāll need the actual name of the hook event (e.g. queuejob, modifyjob, periodic, etcā¦), and the parameters accompanying the hook event. So the way I read it, is that you want the hook event to run the hook script whenever thereās a change in state of the node. I was thinking maybe making this more generic. That is, a hook that runs whenever thereās an update to a vnode attribute, although for now, only the āstateā change value would trigger this hook. How about this idea:
name of hook event: modifyvnode
Python global constant: pbs.MODIFYVNODE
Parameters:
pbs.event().vnode_o - vnode object view of the vnode before the state change (original state or other attribute values)
pbs.event().vnode - vnode object view now, including the state change (and other attributes)
So in the Python script, one would see:
pbs.event().vnode_o.state - example pbs.ND_FREE <- old state value
pbs.event().vnode.state - example pbs.ND_DOWN <- new state value
State values actually already have global constants that can be matched as in:
pbs.ND_FREE
pbs.ND_OFFLINE
pbs.ND_DOWN
pbs.ND_STALE
pbs.ND_JOBBUSY
pbs.ND_JOB_EXCLUSIVE
pbs.ND_RESV_EXCLUSIVE
pbs.ND_BUSY
pbs.ND_STATE_UNKNOWN
pbs.ND_PROV
pbs.ND_WAIT_PROV
pbs.ND_UNRESOLVABLE
pbs.ND_SLEEP
@toonen@monkeystate you suggested (in a meeting) about adding a āfilterā criteria so that such hooks do not get triggered too frequently. That sounded like a very cool idea.
How about we expand the filter idea to all hooks? Of course that can be a separate change/proposal and does not need to be clubbed with this one.
I agree it will be useful for overall performance of the server in cases where lots of object updates are happening (thousands of nodes changing states and other attribute values and thousands of jobs updating attributes every second). A similar case already happens with modifyjob hook wherein the scheduler updates a large number of jobs with a comment stating why it could not run it. If a modifyjob hook is present, it could get triggered for each of those comment updates, whereas it is possible that the particular hook does not care about that - and if we could bypass setting up the whole python environment upfront, it would save us quite some compute cost.
Of course, the filter has to be a simple one like ācall this hook if there is a change in these attributes onlyā. If we wanted a dynamic āformulaā like thing, we might need the python interpreter in the first place!
The third iteration of the node state change design doc has just been published. The design is still a WIP (for example the content of the proposed pbs log entry may yet change) but itās closer to its final state. The design is backed by a first draft WIP implementation that is a fair representation of the approach.
(Note as previously mentioned the design for a general hook filter is out of scope for this design and therefore not included in this iteration.)
@monkeystate the current design doc looks good. It also keeps a path open for the future where more than node state changes can be triggering the hook (if and when such a requirement arises).
Please note that in this iteration two new āstate listā functions have been added to the python vnode object. Otherwise the design is largely unchanged from previous versions.