Determine cause of job failure

I have several users who are submitting jobs. The jobs change from state R to E several times, and then eventually end up in state H.

Although the jobs are submitted with the -o out.txt and -e out.err options, I am not getting any output.

How can I determine why these jobs are erroring out? I have also checked the pbs_server.log file, but see nothing useful there.

There is a utility called tracejob (/opt/pbs/bin/tracejob) which will be on the head node as it comes from the PBS server package. man tracejob will tell you the details.
e.g. tracejob -n 2 123307.hpcnode1
-n <days> reports information from up to that many days in the past (2 in the example above).
That may well give some useful information.
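If you end up tracing a lot of jobs, a small wrapper can save some typing. This is only a sketch: trace_job is a hypothetical helper name, not part of PBS, and the default of 2 days is just taken from the example above. It assumes tracejob lives at /opt/pbs/bin/tracejob and that you run it as a user who can read the PBS logs (usually root).

```shell
#!/bin/sh
# Hypothetical helper for this thread; trace_job is not a PBS command.
# Override TRACEJOB if tracejob is installed somewhere else on your head node.
TRACEJOB="${TRACEJOB:-/opt/pbs/bin/tracejob}"

trace_job() {
    # $1 = full job ID (e.g. 123307.hpcnode1)
    # $2 = how many days of logs to search (assumed default: 2)
    if [ -z "$1" ]; then
        echo "usage: trace_job JOBID [DAYS]" >&2
        return 2
    fi
    "$TRACEJOB" -n "${2:-2}" "$1"
}
```

After sourcing the file, trace_job 123307.hpcnode1 searches the last 2 days and trace_job 123307.hpcnode1 7 searches the last week.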
Mike
