Design document for endjob hook event

Hi All,

A design proposal for adding an endjob hook event has been created. Its purpose is to allow recording job information and the time at which a job ends, which some sites require for more detailed job accounting.

Design Doc endjob hook event

This is a WIP and feedback would be appreciated.

Thanks!

@toonen @pershey

Hey @sdass
Thanks for writing this design document. I have a couple of comments.

  1. Does this hook fire on a job delete? If not, you might want it to. Right now the hook fires for every reason the job ends except if it gets deleted.
  2. Consider allowing at least some job elements to be modified. Wouldn’t it be more helpful in the endjob hook if you could modify what shows up in the accounting log? Right now all you can do is write to the server log, or do something external to PBS.
  3. In your example hook, you use %% for string formatting, when I think you want a single %.
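To illustrate point 3, here is a minimal plain-Python sketch of the difference between a single % and a doubled %% in a format string. The job id and the message text are made up for illustration and are not taken from the design doc's example hook.

```python
# Hypothetical job id, used only for illustration.
job_id = "123.server"

# A single % starts a conversion specifier, so %s is replaced by the value
# supplied after the % operator.
filled = "job %s has ended" % job_id

# A doubled %% is the escape for a literal percent sign, so "%%s" comes out
# as the text "%s" and no value is substituted.
escaped = "job %%s has ended" % ()

print(filled)
print(escaped)
```

With %%s the logged message would contain the literal text %s rather than the job id.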

Bhroam

@sdass Looks good so far.

  • Please specify the behavior when the endjob hook script encounters an unexpected error causing an unhandled exception, or if the script terminates due to a hook alarm. The resulting action will be like pbs.event().reject()

  • For the format of the new job attribute ‘endtime’, you can just say type ‘pbs.duration’ which is a formal type in PBS hooks.

Also wrt endtime, it is a duration from exactly when to when? Start of user job script to end of script? Duration when resources were allocated to job (i.e., including prologue and epilogue durations)?

Interesting discussion on endtime. How is it different than resources_used.walltime?

Similar to stime, endtime is set to seconds since the Epoch at the time the server is told/detects that the job or subjob has ended. etime was already being used for eligible time, so we named this attribute endtime instead.

In that case, the design document needs updating to indicate the format is epoch time, rather than a duration.
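As a rough illustration of the distinction, the sketch below treats stime and endtime as plain integers holding seconds since the Epoch, as described above. The helper names and the sample values are hypothetical, and nothing here is claimed to match resources_used.walltime.

```python
import time


def format_epoch(seconds):
    """Render seconds since the Epoch as a UTC timestamp string."""
    return time.strftime("%Y-%m-%d %H:%M:%S", time.gmtime(seconds))


def elapsed_seconds(stime, endtime):
    """Seconds between two epoch timestamps (a duration, unlike endtime itself)."""
    return endtime - stime


# Hypothetical values, for illustration only.
stime = 1600000000
endtime = 1600003600

print("ended at (UTC):", format_epoch(endtime))
print("seconds between stime and endtime:", elapsed_seconds(stime, endtime))
```

An epoch value identifies a point in time; a duration only appears once two such values are subtracted.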

Hi @bhroam,

Thank you for the quick responses and suggestions. My apologies, I should’ve responded right away before diving into working on the suggested delete scenario.

  1. Per your suggestion, we looked into firing the endjob hook on job delete (in req_delete.c), specifically only for jobs that were already running.
    What we found was that, when we added a call to process hooks, the hook was actually being called twice: once by the new endjob hook call within req_deletejob2 and then again by job_obit after the mom responds to the delete request.
    So we have now added a call to the endjob process hooks in node_manager to handle the case when the mom cannot be reached and the job is force deleted.

  2. Wrt modifiable job elements, we agree that it would definitely be helpful. However, since we don’t have a specific need on our end that calls for it at the moment, and we need these changes pulled into our production environment, we would like to keep it as is for now and create a PR in the future to address this.

  3. Agreed, updated string formatting per suggestion.

Hi @bayucan,

Thank you for the responses and suggestions.

  1. I’ve updated the design doc per your suggestion, adding a sub-bullet under the bullet that describes the .accept and .reject calls. It says: “In the case where the hook script encounters an exception, the error is logged.”
  2. As @dtalcott and @toonen brought up, I made an error and should have specified that endtime is actually the time in seconds since the Epoch (updated).

@sdass : looks good. Thanks.

Thanks for making the changes @sdass
I notice you say that the hook is called when a job is being deleted and we cannot contact the mom. In the forums you said it will be called when a job is being deleted. I think you should mention that in the document.

Just a small nit about the example hook (I wouldn’t normally care, except it might end up in the docs):

  1. You say pbs.logmsg(pbs.LOG_DEBUG, ’ ‘%s’ ') but don’t have the ‘% (var)’ to fill in the %s.
  2. A few lines down, on the job endtime line, you still have double %%s.
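A minimal sketch of what the corrected logging calls might look like. So that it can be run outside PBS, the pbs module is replaced here by a hypothetical stand-in; inside a real hook you would use import pbs instead. The message text, the job id, and the endtime value are invented for illustration, and job.endtime is the attribute proposed in the design doc, not an established one.

```python
import types

# Hypothetical stand-in for the pbs module so this sketch runs outside PBS.
# In a real hook script, drop this block and use "import pbs".
logged = []
_job = types.SimpleNamespace(id="123.server", endtime=1600003600)
_event = types.SimpleNamespace(job=_job, accept=lambda: None)
pbs = types.SimpleNamespace(
    LOG_DEBUG=0,  # placeholder value, not the real constant
    logmsg=lambda level, msg: logged.append(msg),
    event=lambda: _event,
)

e = pbs.event()

# Without "% (value,)" after the string, the placeholder is logged as the
# literal text '%s'.
pbs.logmsg(pbs.LOG_DEBUG, "endjob hook: job '%s'")

# With the argument supplied, %s is replaced by the value.
pbs.logmsg(pbs.LOG_DEBUG, "endjob hook: job '%s'" % (e.job.id,))
pbs.logmsg(pbs.LOG_DEBUG, "endjob hook: endtime %s" % (e.job.endtime,))

e.accept()

for line in logged:
    print(line)
```

The first logged line keeps the raw placeholder, which is the symptom described in point 1.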

The document is looking really good!
Bhroam

Thanks for taking another look with helpful suggestions @bhroam

I went ahead and updated the design doc to describe the scenario when the job is deleted and we cannot contact the mom. Also, we went back and tested some more and found that when you force delete a running job, we actually do need to add a call to process_hooks in the req_deletejob2 function. The design doc also captures this now.

I have also updated the example script to address the two points described.
Thanks for your help!

@sdass
Thanks for making all the changes. It looks good to me now.

Bhroam

@sdass
Now that I am doing the code review for the endjob hook event, I have a further question. Why is the endjob hook being fired on a qrerun? Also, why is it not being fired if a job was never run and deleted?

To me an endjob hook event should happen when a job enters the ‘F’ state (or leaves the system if history isn’t on). A qrerun will return a job to the ‘Q’ state. Won’t this mean the endjob hook will fire multiple times? Once on the qrerun, and once when the job enters the ‘F’ state at the end? If that is the case, won’t it be important for the hook writer to know the job isn’t leaving the system?

If the endjob hook doesn’t fire where a job is deleted, but not run, you have a situation where the quejob hook fired, but not the endjob hook. You might have done some processing for the job entering the system in the quejob hook. You want to do some job leaving the system processing in the endjob hook. Now there is a situation where the endjob hook might not fire, but the job leaves the system.

I’m assuming this hook is meant for some sort of accounting purposes. Do you want the hook to trigger on a rerun so that you can account for the resources used on the first run of the job before they are discarded?

To me an endjob hook is the opposite of the quejob hook. One triggers when a job enters the system, and the other triggers when the job leaves the system.

Sorry for these questions so late in the game.

Bhroam

@bhroam

The design and implementation have the behavior we require, but you are right that the name doesn’t represent that very well. We are discussing options internally and will get back to you. Thanks for the feedback.

–brian

@bhroam:

The goal of this hook is to provide notification when the server has decided a job is no longer running. As you pointed out, naming the hook ENDJOB implies the job has reached the finished state. To remedy this, we propose renaming the hook to better match its purpose and implementation. Possible names we arrived at are TERMJOB, ENDRUNJOB, RUNENDJOB and EXITJOB. Any feedback you can provide, either on those names or other possibilities, would be greatly appreciated.

Thanks,
–brian

Ahh, now I understand what you are trying to achieve. Yes, I agree that endjob is not the right name.

What you want is basically the server version of the execjob_end hook. A hook that fires when a job leaves execution.

Speaking of the execjob_end hook, is there a reason you chose to not use it?

I asked around Altair, and @arungrover came up with “jobobit” or just “obit” which we all liked.

Bhroam

JOBOBIT was also well liked by us and we have proceeded with the renaming in a separate branch. We will merge the naming changes into the PR after we finish addressing the comments and requests made on GitHub.

As for the EXECJOB_END hook, for reporting and failure analysis purposes, we needed the activity and associated times from the server’s perspective. While the information from the MoM is also useful, we consider the server to be the authoritative and most reliable source of the information we require.

–brian

Thanks for changing the name. I think it’ll better describe what the hook is doing.

Interesting. I would have thought the mom would be more authoritative. It’s where the job is running. When the execjob_end hook is run, you should get everything you need. In any case, I don’t think the execjob_end hook is run on a qrerun, so we’d have to have done something special with it. The last time we considered running the execjob_end hook on another event, we decided to write a new hook for it (execjob_abort). So it’s probably best you are doing what you are doing.

Bhroam

@toonen : one minor update to the design doc: pbs_hook_endjob.py → pbs_hook_jobobit.py