Where do I find the jobs that were killed by pbs_mom due to a resource violation? Is information about these jobs stored in a database? Can I handle these jobs using a hook?
Jobs that were killed for a resource violation can be found:
- by checking the suspected job IDs in the MoM log $PBS_HOME/mom_logs/YYYYMMDD
- by running $PBS_EXEC/unsupported/pbs_dtj -n 10
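To illustrate the first option, here is a minimal sketch of a MoM log scan in Python. It rests on two assumptions that you should check against your own logs: that each line is semicolon-separated as timestamp;code;daemon;object type;object name;message, and that a resource violation shows up as a Job record whose message contains the word "exceeded". The function names are made up for this example.

```python
import re

# Assumption: violation messages contain the word "exceeded",
# e.g. "walltime 70 exceeded limit 60". Adjust to what your logs show.
EXCEEDED = re.compile(r"exceeded", re.IGNORECASE)


def find_violations(lines):
    """Yield (timestamp, job_id, message) for Job records mentioning 'exceeded'."""
    for line in lines:
        # Assumed layout: timestamp;code;daemon;object type;object name;message
        fields = line.rstrip("\n").split(";", 5)
        if len(fields) < 6 or fields[3] != "Job":
            continue  # not a job record, or not in the assumed layout
        timestamp, job_id, message = fields[0], fields[4], fields[5]
        if EXCEEDED.search(message):
            yield timestamp, job_id, message


def scan_file(path):
    """Scan one daily MoM log file and return all matching records."""
    with open(path, errors="replace") as fh:
        return list(find_violations(fh))
```

Calling scan_file() on one file under $PBS_HOME/mom_logs returns a list of tuples that you can print or load into whatever store you prefer.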
If you want to make sure jobs stay within their limits, you can use cgroups.
If this is not what you mean by handling these jobs with hooks, please let us know what you would like to do.
Thanks for the answer. Using cgroups may be useful to us.
I need to identify when a job is killed due to resource violation and perform some operation. For example, I need to send a custom notification to administrators; I need to store information about these jobs in a database for future analysis. This information along with other data from our users (such as project, department and others) will be useful for administrators to tailor the training given to users using our cluster.
Thank you for this information.
These might be some of the ways to address, report, or store this:
- run pbs_dtj on every job and check for "exceeded" or equivalent messages, or scan the MoM logs and store the matching jobs
- get the Exit_status in the hook events and record the job details in the accounting logs or write them to a file
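As a rough sketch of the second option, the hook body below appends one JSON line per job that ended with a non-zero Exit_status. Several things here are assumptions: that the hook is attached to an execution event where Exit_status is already set (execjob_epilogue or execjob_end are the likely candidates, please verify in the hooks guide), that RECORD_FILE is a path the MoM can write to, and the helper name format_record, which is made up for this example. A non-zero Exit_status alone does not say why the job ended, so you would probably still combine this with the MoM log messages.

```python
import json
import time

# Hypothetical output path; choose a location the MoM can write to.
RECORD_FILE = "/var/tmp/pbs_failed_jobs.jsonl"


def format_record(job_id, exit_status, now=None):
    """Return one JSON line for a job, or None if it has no non-zero exit status."""
    if exit_status is None or int(exit_status) == 0:
        return None
    record = {
        "job_id": str(job_id),
        "exit_status": int(exit_status),
        "recorded_at": int(now if now is not None else time.time()),
    }
    return json.dumps(record, sort_keys=True)


try:
    import pbs  # only importable inside the PBS hook interpreter
except ImportError:
    pbs = None  # lets format_record be tried out outside PBS

if pbs is not None:
    event = pbs.event()
    job = event.job
    # Assumption: Exit_status is populated for the event this hook is bound to.
    line = format_record(job.id, job.Exit_status)
    if line is not None:
        with open(RECORD_FILE, "a") as fh:
            fh.write(line + "\n")
        pbs.logmsg(pbs.LOG_DEBUG, "recorded failed job %s" % job.id)
    event.accept()
```

Writing plain JSON lines keeps the hook small; a separate script can later load the file into a database and join it with project or department data.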