Notification of job failure

Hi,

I would like to open a discussion on how PBS notifies a user of a job failure, especially when a job exceeds its resource limits.

Right now, it is somewhat complicated for the user to find out that their job ended because it exceeded its resource limits. The exit code is >256, which only means the job was killed by a signal, and the comment is set to a general note that the job failed. IMHO this is very vague for such an important event.
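For readers not familiar with the convention: an exit status above 256 encodes the killing signal as `exit_status - 256`. A minimal Python sketch to decode it (the helper name is mine, for illustration only):

```python
import signal

def describe_exit_status(exit_status: int) -> str:
    """Decode a PBS job exit status.

    By convention, an exit status greater than 256 means the job's top
    process was killed by a signal; the signal number is exit_status - 256.
    (Illustrative helper, not part of PBS.)
    """
    if exit_status > 256:
        signum = exit_status - 256
        try:
            name = signal.Signals(signum).name
        except ValueError:
            name = f"signal {signum}"
        return f"job killed by {name}"
    return f"job exited normally with code {exit_status}"

print(describe_exit_status(271))  # 271 - 256 = 15 (SIGTERM)
```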

The abort event is always sent by PBS as an email. My first idea is to treat exceeding resources as a job abort too. Shouldn’t the job have its own negative exit code for exceeding a resource like walltime or mem? Based on this exit code, the comment could be set to name the exceeded resource, and the appropriate email could be sent automatically, so the user would know immediately.
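To make the idea concrete, here is a sketch of such a mapping. The negative code values and comment texts are invented for illustration; the actual values would be part of the design:

```python
# Hypothetical negative exit codes for resource-limit kills.
# These constants are invented for illustration, not part of any PBS release.
JOB_EXEC_KILL_WALLTIME = -13
JOB_EXEC_KILL_MEM = -14
JOB_EXEC_KILL_NCPUS = -15

RESOURCE_COMMENTS = {
    JOB_EXEC_KILL_WALLTIME: "job killed: walltime limit exceeded",
    JOB_EXEC_KILL_MEM: "job killed: mem limit exceeded",
    JOB_EXEC_KILL_NCPUS: "job killed: ncpus limit exceeded",
}

def comment_for(exit_code: int) -> str:
    """Map a (hypothetical) negative exit code to a specific job comment."""
    return RESOURCE_COMMENTS.get(exit_code, "job failed")
```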

The next thought is a bit more complicated and more daring. How about adding an array of events tied to each job, i.e. a new job attribute? This array would record each important event of a job, so you could easily track what happened to the job. Yes, the admin can do similar tracking via logs and printjob, but this new attribute would be available to a common user via qstat… This way we could record interesting events of a job and the user would know. (Not only job abort reasons; I can also imagine useful events like the date of the last top-job estimate.)

How do you deal with exceeded resources? Do your users ask what happened to their job in such a case?

For us, it is often a source of additional questions for our user support.

We partially resolved the issue by modifying the mom to log a “resource kill” record to syslog; each midnight we scan the syslog for these records and email the affected users outside of PBS. I feel this could be solved better.
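Conceptually, the nightly scan does something like the following (a simplified Python sketch; the log-line format shown is illustrative only, and the actual mailing step is omitted):

```python
import re

# Hypothetical format of the record our patched mom writes to syslog;
# adjust the pattern to your actual log line.
KILL_RE = re.compile(
    r"pbs_mom: resource kill: job (?P<jobid>\S+) "
    r"user (?P<user>\S+) resource (?P<resource>\S+)"
)

def collect_kills(lines):
    """Group resource-kill records by user for a nightly summary email."""
    by_user = {}
    for line in lines:
        m = KILL_RE.search(line)
        if m:
            by_user.setdefault(m.group("user"), []).append(
                (m.group("jobid"), m.group("resource"))
            )
    return by_user
```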

For the second thought, I have one more idea: the job can also be killed by the OOM killer for some reason (without PBS knowing)… What do you think of adding a way to inform the mom that some PID on the node was killed by the OOM killer, e.g. via a hook? The mom could check whether the PID belongs to a job and add the event to the array.
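For reference, OOM kills show up in the kernel log, so a minimal detection pass could look like this (the exact message wording varies between kernel versions, so the pattern below is a best-effort sketch):

```python
import re

# Kernel OOM-killer lines typically look like
#   "Out of memory: Killed process 4321 (a.out) total-vm:..."
# (older kernels say "Kill process"; wording varies between versions).
OOM_RE = re.compile(
    r"Out of memory: Kill(?:ed)? process (?P<pid>\d+) \((?P<comm>[^)]+)\)"
)

def oom_victims(lines):
    """Return (pid, command) pairs for OOM kills found in kernel log lines."""
    hits = []
    for line in lines:
        m = OOM_RE.search(line)
        if m:
            hits.append((int(m.group("pid")), m.group("comm")))
    return hits
```

The mom (or a hook feeding the mom) could then check whether each reported PID belongs to one of its jobs.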

Do OOM kills happen at your site sometimes?

I am willing to do the development if there is interest.

Please, feel free to share any thoughts,
Vaclav

Hi Vaclav,

Cool ideas. Thanks for bringing them to the forum.

  1. I think it is a fair ask, and we should (probably relatively easily) be able to change the code so that a forced job kill is reported with a different exit code and a proper message. Whether this would adversely impact any site I do not know, so we need to be careful in case it would break an existing integration. I am not sure how much value the OOM-killer detection would add: if the job dies, the mom gets to know anyway, and the exit code represents the signal with which it was killed, so the cause can be determined and a message added anyway (if you see SIGKILL, you know the process died due to a kill signal).

  2. About the per-job array storing job events: the core idea is cool. However, we need to consider whether we should store all that extra data inside the PBS server; the list of events could amount to a huge volume of data and increase memory pressure. One way to address this is to implement an event-streaming service with a topic/subscription pattern, so that interested clients can subscribe to the events they care about as they happen. This would benefit a huge number of use cases anyway, and it pushes the data and the processing of these events to the client rather than keeping them in the server. If this interests you, we can try to come up with a collaborative design/architecture.

Thanks for the response @subhasisb.

  1. Sounds good. I think I can start to work on the design doc for the change.

    • I understand it could affect some sites. With some exaggeration, I would carefully say it would rather be in a good way :-) E.g. in the case of an interactive job, the exit status is even zero when resources are exceeded… that is definitely misleading.

    • Maybe I am wrong, but I am afraid an OOM kill can be recognized only by checking the system log, and if that is true, it is not very convenient. I suppose tailing the system log from the mom is not an option… even as an optional feature?

  2. At first, I thought the data would be stored on the server. In my particular use case, the events should be handy when something odd happens to a job, so I do not know in advance what I would need. Of course, you have a good point about the amount of data stored in the server. The stream of events sounds very good and could be really useful. I am interested.

Vaclav


Just a thought, but how much of this could be done via a mom hook (for a possibly new event)? This would allow sites to customize the behavior and the text of the messages.


Hi @dtalcott, thank you for the thought.

  1. I think it does not even need a new hook event, but the right exit code would help with this task anyway. Without the exit code, the hook wouldn’t know why the job failed… Maybe the hook could try to check the stderr of the job and determine the reason; I am not sure how reliable that would be. …this way the task could even be accomplished without a code modification.
    • IMHO it is a significant drawback that the email would be sent from the worker node. I am not sure it is good to send emails to users from worker nodes; it is better to use the server as the sender, because that is what users expect.
    • I would personally also like to combine this feature with the configurable mailer and aggregate the emails.
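The stderr-checking idea could look roughly like this (a best-effort sketch: the “job killed” line is the kind of message the mom appends to job output when it enforces a limit, but its exact wording may differ between PBS versions):

```python
import re

# When the mom kills a job for exceeding a limit, it appends a line like
#   "=>> PBS: job killed: walltime 3700 exceeded limit 3600"
# to the job's output (wording may vary between PBS versions).
KILLED_RE = re.compile(r"PBS: job killed: (?P<resource>\w+)")

def killed_resource(stderr_text: str):
    """Best-effort guess at the exceeded resource from the job's stderr."""
    m = KILLED_RE.search(stderr_text)
    return m.group("resource") if m else None
```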

I really like the idea of customizing the text of the messages. Our user support asks me for this feature from time to time. It can be partially done using the configurable mailer, but your idea could also lead to a brand-new hook running on the server on a new email event, which would allow modifying the message. I am not sure how hard such a hook would be to add.

I have an overall doubt about the capability as currently discussed. In my opinion and observation, more and more sites are moving towards using cgroups to help manage jobs. In some cases the cgroups configuration takes resource-usage enforcement out of the hands of pbs_mom and gives it to the kernel, and in other cases it may prevent the resource from being exceeded at all (again, at the kernel level). This all means that pbs_mom plays a diminished role in resource enforcement.

How can we make the end user experience consistent whether it is pbs_mom or the kernel enforcing the resource limits?

Instead of putting all the burden on PBS, sites can use tools like the Simple Event Correlator (SEC) to watch the various logs and send appropriate, customized emails.

With a tweak to qalter and pbs_server, SEC could even update the job comment after the job has finished. (I think you currently cannot update the comment of a finished job?)

We use SEC a fair amount, mostly to detect problems with nodes and offline them. We also use it to detect OOMs. For this, we use the “correlator” feature to send a single email, instead of one for each node in the job that OOMs.
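The correlation step described here can be sketched generically (a simplified Python stand-in for the SEC rule, not actual SEC configuration):

```python
from collections import defaultdict

def correlate_ooms(events):
    """Collapse per-node OOM events into one record per job.

    `events` is an iterable of (jobid, node) pairs, e.g. parsed from logs.
    Returns {jobid: sorted list of affected nodes}, so a single email can
    list every node instead of mailing once per node that OOMed.
    """
    per_job = defaultdict(set)
    for jobid, node in events:
        per_job[jobid].add(node)
    return {job: sorted(nodes) for job, nodes in per_job.items()}
```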

Irrespective of whether it is pbs_mom or the kernel killing the job, pbs_mom should get notified. The exact exit code might differ, but I think the mom would notice and could report it; hence, the different exit codes could provide additional information. The actual work of sending emails and adding job messages can be done via hooks that sites can enable if required?

Agreed, several external tools can provide similar functionality. However, the core proposal of recording a “special” exit code for the job when it was killed seems useful to me, no?