Design document for endjob hook event

Hi All,

A design proposal for adding an endjob hook event has been created. The purpose is to allow recording job information and the time at which a job ends, which some sites require for more detailed job accounting.

Design Doc endjob hook event

This is a WIP and feedback would be appreciated.

Thanks!

@toonen @pershey

Hey @sdass
Thanks for writing this design document. I have a couple of comments.

  1. Does this hook fire on a job delete? If not, you might want it to. Right now the hook fires for every reason the job ends except if it gets deleted.
  2. Consider allowing at least some job elements to be modified. Wouldn’t it be more helpful in the endjob hook if you could modify what shows up in the accounting log? Right now all you can do is write to the server log, or do something external to PBS.
  3. In your example hook, you use %% for string formatting, when I think you want a single %.

Bhroam

@sdass Looks good so far.

  • Please specify the behavior when the endjob hook script encounters an unexpected error causing an unhandled exception, or if the script terminates due to a hook alarm. The resulting action will be like pbs.event().reject()

  • For the format of the new job attribute ‘endtime’, you can just say type ‘pbs.duration’ which is a formal type in PBS hooks.

Also wrt endtime, it is a duration from exactly when to when? Start of user job script to end of script? Duration when resources were allocated to job (i.e., including prologue and epilogue durations)?

Interesting discussion on endtime. How is it different than resources_used.walltime?

Similar to stime, endtime is set to seconds since the Epoch at the time the server is told/detects that the job or subjob has ended. etime was already being used for eligible time, so we named this attribute endtime instead.
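
To illustrate the distinction between a point in time and a duration, here is a minimal Python sketch. It assumes both stime and endtime hold seconds since the Epoch, as described above; the sample values are hypothetical and not taken from the design document.

```python
from datetime import datetime, timezone

# Hypothetical sample values (seconds since the Epoch); not real job data.
stime = 1600000000
endtime = 1600003600

# Each attribute is a point in time, so it converts to a calendar timestamp.
ended_at = datetime.fromtimestamp(endtime, tz=timezone.utc)

# A duration only appears once two points in time are subtracted.
elapsed_seconds = endtime - stime

print(ended_at.isoformat())
print(elapsed_seconds)
```

Note that a difference computed this way is measured between two server-side timestamps, so it need not match resources_used.walltime exactly.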

In that case, the design document needs updating to indicate the format is epoch time, rather than a duration.

Hi @bhroam,

Thank you for the quick responses and suggestions. My apologies, I should’ve responded right away before diving into working on the suggested delete scenario.

  1. Per your suggestion, we looked into firing the endjob hook on job delete (in req_delete.c), specifically only for jobs that were already running.
    What we found was that, once we added a call to process hooks, the hook was actually being called twice: once by the new endjob hook call within req_deletejob2, and then again by job_obit after the mom responds to the delete request.
    So we have now added a call to the endjob process hooks in node_manager to handle the case when the mom cannot be reached and the job is force deleted.

  2. Wrt modifiable job elements, we agree that it would definitely be helpful. However, since we don’t have a specific need for it on our end at the moment and we need these changes pulled into our production environment, we would like to keep it as is for now and create a PR in the future to address this.

  3. Agreed, updated string formatting per suggestion.

Hi @bayucan,

Thank you for the responses and suggestions.

  1. I’ve updated the design doc per your suggestion, adding a sub-bullet under the bullet that describes the .accept and .reject calls. It says the following: “In the case where the hook script encounters an exception, the error is logged.”
  2. As @dtalcott and @toonen brought up, I made an error and should have specified that the endtime is actually the time in seconds since epoch (updated).

@sdass : looks good. Thanks.

Thanks for making the changes @sdass
I notice you say that the hook is called when a job is being deleted and we cannot contact the mom. In the forums you said it will be called when a job is being deleted. I think you should mention that in the document.

Just a small nit about the example hook (I wouldn’t normally care, except it might end up in the docs):

  1. You say pbs.logmsg(pbs.LOG_DEBUG, " '%s' ") but don’t have the ‘% (var)’ to fill in the %s.
  2. A few lines down, on the job endtime line, you still have a doubled % (%%s).
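
For reference, the two formatting points can be reproduced in plain Python outside of PBS. This is a minimal sketch; the job ID and endtime values are hypothetical, and the message text is not the wording used in the design document's example hook.

```python
job_id = "123.server"   # hypothetical job ID
endtime = 1600003600    # hypothetical seconds since the Epoch

# A single % introduces a conversion that the % operator fills in.
filled = "job %s ended at %d" % (job_id, endtime)

# Without the % operator, the placeholder stays in the string untouched.
unfilled = "job '%s' ended"

# A doubled %% is an escape: after formatting it yields a literal percent
# sign, so the "s" that follows is plain text and nothing is substituted.
escaped = "job %%s ended at %d" % endtime

print(filled)
print(unfilled)
print(escaped)
```

In the hook itself, the formatted string would be what gets passed as the second argument to pbs.logmsg.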

The document is looking really good!
Bhroam

Thanks for taking another look with helpful suggestions @bhroam

I went ahead and updated the design doc to describe the scenario when the job is deleted and we cannot contact the mom. Also, we went back and tested some more and found that when you force delete a running job, we actually do need to add a call to process_hooks in the req_deletejob2 function. The design doc now captures this as well.

I have also updated the example script to address the two points described.
Thanks for your help!

@sdass
Thanks for making all the changes. It looks good to me now.

Bhroam