Hi,
I would like to open a discussion on how PBS notifies a user of a job failure, especially when a job exceeds its resources.
Right now, it is rather hard for a user to find out that their job ended because it exceeded its resources. The exit code is >256, which only means the job was killed by a signal, and the comment is set to a generic note that the job failed. IMHO this is very vague for such an important event.
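To illustrate how little the exit code tells the user, here is a minimal Python sketch of decoding it. It assumes the signal number is the exit status minus 256; the function name `describe_exit_status` is made up for this example and is not part of PBS.

```python
import signal


def describe_exit_status(exit_status: int) -> str:
    """Turn a job exit status into a short human-readable description."""
    if exit_status < 0:
        # Negative values are treated here as PBS-side failures (assumption).
        return f"job could not be run (code {exit_status})"
    if exit_status > 256:
        # Assumption: signal number = exit status - 256.
        signum = exit_status - 256
        try:
            name = signal.Signals(signum).name
        except ValueError:
            name = "unknown signal"
        return f"killed by signal {signum} ({name})"
    return f"exited with code {exit_status}"


print(describe_exit_status(265))
print(describe_exit_status(271))
```

Both a walltime kill and a manual kill can end up looking the same here, which is exactly the problem: the signal number does not say which resource, if any, was exceeded.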
PBS always sends an email on a job abort event. My first idea is to treat exceeding resources as a job abort too. Shouldn’t the job have its own negative exit code for exceeding resources such as walltime or mem? Based on this exit code, the comment could be set to name the exceeded resource, and the appropriate email could be forcibly sent. The user would then know immediately.
The next thought is a bit more complicated and more daring. How about adding an array of events tied to each job, as a new job attribute? This array would record each important event of a job, so you could easily track what happened to it. Yes, the admin can do similar tracking via logs and printjob, but this new attribute would be available to an ordinary user via qstat… This way we could record interesting events of a job and the user would know. (Not only job abort reasons; I can also imagine useful events like the date of the last top job estimate.)
How do you deal with exceeding resources? Do your users ask what happened with their job in such a case?
For us, it is often a source of additional questions for our user support.
We partially resolved the issue by modifying the mom to write a “resource kill” record to syslog. Each midnight we check the syslog for these records and forcibly email the affected users outside PBS. I feel this could be solved in a better way.
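For anyone curious, the nightly check is roughly of the following shape. This is only a sketch: the record format (`resource kill: job=... user=... resource=...`), the function names, and the `example.org` mail domain are hypothetical placeholders, not what our modified mom actually writes.

```python
import re
from collections import defaultdict
from email.message import EmailMessage

# HYPOTHETICAL record format, invented for this example.
RECORD = re.compile(
    r"resource kill: job=(?P<job>\S+) user=(?P<user>\S+) resource=(?P<res>\S+)"
)


def collect_kills(lines):
    """Group (job id, exceeded resource) pairs by user."""
    kills = defaultdict(list)
    for line in lines:
        m = RECORD.search(line)
        if m:
            kills[m["user"]].append((m["job"], m["res"]))
    return dict(kills)


def build_mails(kills, domain="example.org"):
    """Build one summary mail per user; sending is left to the caller."""
    mails = []
    for user, events in sorted(kills.items()):
        msg = EmailMessage()
        msg["To"] = f"{user}@{domain}"
        msg["Subject"] = f"{len(events)} job(s) killed for exceeding resources"
        msg.set_content("\n".join(f"{job}: exceeded {res}" for job, res in events))
        mails.append(msg)
    return mails


sample = [
    "Jan  1 10:00:00 node1 pbs_mom: resource kill: job=101.srv user=alice resource=mem",
    "Jan  1 11:00:00 node1 sshd: unrelated line",
    "Jan  1 12:00:00 node2 pbs_mom: resource kill: job=102.srv user=alice resource=walltime",
]
mails = build_mails(collect_kills(sample))
```

The obvious drawback is the delay: the user hears about the kill up to a day later, and only through a channel that PBS itself knows nothing about.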
For the second thought, I have one more idea: a job can also be killed by the OOM killer for some reason (without PBS knowing)… What do you think of adding a way to inform the mom, via a hook, that some PID on the node was killed by the OOM killer? The mom could check whether the PID belongs to a job and add the event to the array.
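A rough Python sketch of what I have in mind for the OOM case follows. It assumes the kernel log line contains the usual `Killed process <pid> (<name>)` text; the PID-to-job mapping, the per-job event list, and the function name `record_oom_kill` are all hypothetical stand-ins, since no such interface exists in the mom today.

```python
import re
import time

# Matches the "Killed process <pid> (<name>)" part of a kernel OOM message.
OOM_LINE = re.compile(r"Killed process (?P<pid>\d+) \((?P<comm>[^)]*)\)")


def record_oom_kill(log_line, pid_to_job, job_events, now=None):
    """Append an OOM event to the owning job's event list.

    pid_to_job and job_events are hypothetical stand-ins for what the
    mom would know about the processes and jobs on its node.
    Returns the job id, or None if the line or the PID does not match.
    """
    m = OOM_LINE.search(log_line)
    if m is None:
        return None
    pid = int(m["pid"])
    job_id = pid_to_job.get(pid)
    if job_id is None:
        return None  # the killed process does not belong to any job here
    event = {
        "time": now if now is not None else time.time(),
        "type": "oom_kill",
        "pid": pid,
        "comm": m["comm"],
    }
    job_events.setdefault(job_id, []).append(event)
    return job_id


events = {}
line = "kernel: Out of memory: Killed process 4321 (a.out) total-vm:1024kB"
owner = record_oom_kill(line, {4321: "103.srv"}, events, now=0)
```

The per-job list of small records is the same array of events proposed above, so an OOM kill would simply be one more event type next to the resource kills.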
Do OOM kills sometimes happen on your systems?
I am willing to do the development if there is interest.
Please feel free to share any thoughts,
Vaclav