Notification of job failure

Hi,

I would like to open a discussion on how PBS notifies a user of a job failure, especially when a job exceeds its resources.

Right now, it is a bit complicated for users to find out that their job ended because it exceeded its resources. The exit code is >256, which means the job was killed by a signal, and the comment is set to a general note that the job failed. IMHO this is very vague for such an important event.
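To illustrate what the user has to do today, here is a minimal sketch in Python that decodes such an exit code. It assumes the signal number is the exit code minus 256; the function name `describe_exit_status` is made up for this example and is not part of PBS.

```python
import signal


def describe_exit_status(exit_status: int) -> str:
    """Turn a job exit status into a short human-readable note.

    Assumption: a status above 256 encodes 256 + signal number.
    """
    if exit_status > 256:
        signum = exit_status - 256
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # Not a signal number known on this platform.
            name = f"signal {signum}"
        return f"killed by {name}"
    return f"exited with code {exit_status}"


if __name__ == "__main__":
    for status in (0, 1, 265, 271):
        print(status, "->", describe_exit_status(status))
```

Note that even after decoding, the user only learns which signal killed the job, not which resource was exceeded.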

The abort event is always sent by PBS as an email. My first idea is to treat exceeding resources as a job abort too. Shouldn’t the job have its own negative exit code for exceeding resources like walltime or mem? Based on this exit code, the comment could be set to name the exceeded resource, and the appropriate email could be forcibly sent. The user would then know immediately.

The next thought is a bit more complicated and more daring. How about adding an array of events tied to each job, as a new job attribute? This array would record each important event of a job, so you could easily track what happened with the job. Yes, the admin can do similar tracking via logs and printjob, but this new attribute would be available to an ordinary user via qstat… This way we could record interesting events of a job and the user would know. (Not only job abort reasons; I can also imagine useful events like the date of the last top job estimates.)
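As a rough sketch of what such a per-job event array could look like, the Python below keeps a bounded list of events per job. Everything here is hypothetical: the class name `JobEvents`, the event kinds, and the limit of 32 entries are illustrative choices, not existing PBS attributes.

```python
from collections import deque
from datetime import datetime, timezone


class JobEvents:
    """Bounded list of (timestamp, kind, detail) records for one job."""

    def __init__(self, job_id, max_events=32):
        self.job_id = job_id
        # Oldest entries are dropped once max_events is reached.
        self.events = deque(maxlen=max_events)

    def record(self, kind, detail, when=None):
        when = when or datetime.now(timezone.utc)
        self.events.append((when.isoformat(timespec="seconds"), kind, detail))

    def as_lines(self):
        """Render the events the way a qstat-like tool might print them."""
        return [f"{ts} {kind}: {detail}" for ts, kind, detail in self.events]


if __name__ == "__main__":
    ev = JobEvents("1234.server")
    ev.record("top_job_estimate", "estimated start updated")
    ev.record("resource_kill", "walltime exceeded")
    print("\n".join(ev.as_lines()))
```

Capping the number of entries is one way to keep the extra data per job small.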

How do you deal with exceeding resources? Do your users ask what happened with their job in such a case?

For us, it is often a source of additional questions to our user support.

We partially resolved the issue by modifying the mom to log a “resource kill” record to syslog; each midnight we check the syslog for these records and forcibly email the affected users outside PBS. I feel like this can be solved better.
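The nightly check can be sketched roughly as follows in Python. The record layout (`resource kill;job=...;user=...;resource=...`) and the mail domain `example.org` are assumptions made for this example, not the actual format our modified mom writes; sending the messages (e.g. via `smtplib`) is left out.

```python
import re
from collections import defaultdict
from email.message import EmailMessage

# Hypothetical record layout; adjust to whatever the mom really logs.
RECORD = re.compile(
    r"resource kill;job=(?P<job>[^;]+);user=(?P<user>[^;]+);resource=(?P<resource>\S+)"
)


def collect_kills(lines):
    """Group (job, resource) pairs by user from syslog lines."""
    kills = defaultdict(list)
    for line in lines:
        m = RECORD.search(line)
        if m:
            kills[m["user"]].append((m["job"], m["resource"]))
    return dict(kills)


def build_mails(kills, domain="example.org"):
    """Build one summary message per user; the caller sends them."""
    mails = []
    for user, jobs in sorted(kills.items()):
        msg = EmailMessage()
        msg["To"] = f"{user}@{domain}"
        msg["Subject"] = f"{len(jobs)} job(s) killed for exceeding resources"
        msg.set_content("\n".join(f"{job}: exceeded {res}" for job, res in jobs))
        mails.append(msg)
    return mails


if __name__ == "__main__":
    sample = [
        "Mar  3 14:02:11 node12 pbs_mom: resource kill;job=1234.server;user=alice;resource=walltime",
        "Mar  3 14:05:40 node12 sshd[991]: unrelated line",
    ]
    for mail in build_mails(collect_kills(sample)):
        print(mail["To"], "|", mail["Subject"])
```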

For the second thought, I have one more idea: the job can also be killed by the OOM killer for some reason (without PBS knowing)… What do you think of adding a way to inform the mom, via some hook, that a PID on the node was killed by the OOM killer? The mom could check whether the PID belongs to a job and add the event to the array.
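A minimal sketch of the matching step in Python, assuming the kernel log contains lines of the form `Killed process <pid> (<name>)` (the exact wording varies between kernel versions) and that the mom can supply a PID-to-job mapping. The function `oom_events` and the `pid_to_job` dictionary are hypothetical, not an existing PBS interface.

```python
import re

# Matches e.g. "Out of memory: Killed process 4321 (python3) total-vm:..."
OOM_LINE = re.compile(r"Killed process (?P<pid>\d+) \((?P<comm>[^)]*)\)")


def oom_events(kernel_log_lines, pid_to_job):
    """Return (job_id, pid, command) for OOM-killed PIDs that belong to a job."""
    events = []
    for line in kernel_log_lines:
        m = OOM_LINE.search(line)
        if not m:
            continue
        pid = int(m["pid"])
        job = pid_to_job.get(pid)
        if job is not None:  # ignore PIDs that are not part of any job
            events.append((job, pid, m["comm"]))
    return events


if __name__ == "__main__":
    log = [
        "kernel: Out of memory: Killed process 4321 (python3) total-vm:8123456kB",
        "kernel: Out of memory: Killed process 77 (chrome) total-vm:123456kB",
    ]
    print(oom_events(log, {4321: "1234.server"}))
```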

Do OOM kills happen at your site sometimes?

I am willing to do the development if there is interest.

Please feel free to share any thoughts,
Vaclav

Hi Vaclav,

Cool ideas. Thanks for bringing them to the forum.

  1. I think it is a fair ask. I think we should (and probably relatively easily can) change the code so that a forced job kill is reported with a different exit code and a proper message is set. Whether this would adversely impact any site is something I do not know, so we need to be careful in case this would break any existing integration. I am not sure how much value the OOM killer detection would have: if the job dies, mom gets to know anyway and the exit code represents the signal with which it was killed, so this value can be determined and a message added anyway (so if you see SIGKILL, then you know the process died due to a kill signal).

  2. About the array per job to store events of a job: the core idea is cool. However, we need to consider whether we should store all that extra data inside the PBS server. The list of events could be a huge amount of data and could increase memory pressure. One way to address this is to implement an event streaming service with a topic/subscription pattern, so that interested clients can subscribe to the events they care about as they happen. This would benefit a large number of use cases anyway, and it pushes the data and processing of these events to the client (rather than keeping it in the server). If this is something that interests you, we can try to come up with a collaborative design/architecture.
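To make the topic/subscription idea concrete, here is a minimal in-process sketch in Python. It is only an illustration of the pattern, not a proposed design: the class `EventBus`, the topic name `job.resource_kill`, and the event fields are all made up for this example, and a real service would deliver events over the network.

```python
class EventBus:
    """Tiny topic/subscription dispatcher: the publisher keeps no event history."""

    def __init__(self):
        self._subscribers = {}  # topic -> list of callbacks

    def subscribe(self, topic, callback):
        self._subscribers.setdefault(topic, []).append(callback)

    def publish(self, topic, event):
        """Deliver the event to every subscriber of the topic; return how many got it."""
        callbacks = self._subscribers.get(topic, ())
        for callback in callbacks:
            callback(event)
        return len(callbacks)


if __name__ == "__main__":
    bus = EventBus()
    received = []
    # The client decides what to keep; the server side only forwards.
    bus.subscribe("job.resource_kill", received.append)
    bus.publish("job.resource_kill", {"job": "1234.server", "resource": "walltime"})
    bus.publish("job.started", {"job": "1235.server"})  # nobody subscribed
    print(received)
```

The point of the pattern is that storage and processing sit with the subscriber, so the server does not have to hold a growing list of events per job.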

Thanks for the response @subhasisb.

  1. Sounds good. I think I can start working on the design doc for the change.

    • I understand it could affect some sites. With some exaggeration, I would cautiously say it would affect them in a good way :-) E.g. in the case of an interactive job, the exit status is even zero when resources are exceeded… that is definitely misleading.

    • Maybe I am wrong, but I am afraid the OOM kill can be recognized only by checking the system log. If that is true, it is not very convenient. I suppose having mom tail the system log is not an option… even if it were optional?

  2. At first, I thought the data would be stored on the server. In my particular use case, the events should be at hand when something odd happens with a job, so I do not know in advance what I will need. Of course, you have a good point concerning the amount of data stored in the server. The stream of events sounds very good. This could be really useful. I am interested.

Vaclav