When I submit an array of 1,000 dummy jobs to our PBS cluster (version 20.0.1), roughly 35% of them end in a “post job file processing error”. These dummy jobs don’t perform any task besides a single “echo” statement. Even so, 347/1000 of the jobs end with the following error:
Are you really sending the job output to “/output”? Does the user have write access to the root directory?
Also, all of your jobs’ stdout and stderr go to the same file. This can cause trouble when multiple jobs try to write to it at the same time. Try removing your PBS -o and -e arguments so each subjob uses distinct files by default, and see if that makes the problem go away.
(If you really want the stdout and stderr for a given job to go to the same file, take a look at the -j qsub option. E.g., -o foo.out -j oe.)
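A minimal sketch of what that looks like in a submission script (the job name and file names here are just placeholders):

    #!/bin/bash
    # With no -o/-e directives, PBS writes separate <jobname>.o<id> and
    # <jobname>.e<id> files for each (sub)job in the submission directory,
    # so concurrent jobs never race on a shared output file.
    #PBS -N dummy
    echo "hello from ${PBS_JOBID}"

    # If you do want a single combined file per job, merge stderr into
    # stdout rather than pointing -o and -e at the same shared path:
    #   #PBS -o foo.out
    #   #PBS -j oe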
You might get more information about the exact failure by consulting the mom logs on the execution hosts.
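For example, on one of the exec hosts (assuming the default PBS_HOME of /var/spool/pbs; the job id below is a placeholder):

    # MoM logs are one file per day, named YYYYMMDD; grep for one of the
    # failing job ids to see why the copy-out step failed
    grep "1234567" /var/spool/pbs/mom_logs/$(date +%Y%m%d)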
I see this error from time to time. Not saying this is the issue, but have you checked the undelivered folder (/var/spool/undelivered on the exec host) to see if the output files are there by chance? Or are they not being written at all?
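Something like this on the exec host would show whether PBS parked the output there (path as above; on some installs it may sit under PBS_HOME instead):

    # Files here keep the job id in their names, so failed jobs are easy to spot
    ls -l /var/spool/undelivered/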
Long story short: do the failures have a node in common? A common file system? Something else?
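One quick way to check, if job history is enabled on the server (the job ids below are placeholders):

    # Show where each failing job ran and how it exited
    for j in 1001 1002 1003; do
        qstat -fx "$j" | grep -E "exec_host|Exit_status"
    done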
An aside, but similar, about that error:
I have a wrapper for my cp command because of an annoying NFSv4 file system. I get that error when I update an image and someone forgets to check or flubs the wrapper (the file is copied in my case, but PBS treats the permissions warning this file system produces as a failure).
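For reference, a rough sketch of that kind of wrapper (the warning text matched here is a placeholder; adapt it to whatever your NFSv4 mount actually prints, and note it assumes the permissions warning is the only thing that went wrong):

    #!/bin/bash
    # Run the real cp, but don't let an NFSv4 "preserving permissions"
    # warning turn an otherwise successful copy into a failure that PBS
    # then reports as a post job file processing error.
    out=$(/bin/cp "$@" 2>&1)
    rc=$?
    if [ $rc -ne 0 ] && printf '%s' "$out" | grep -qi "preserving permissions"; then
        rc=0
    fi
    [ -n "$out" ] && printf '%s\n' "$out" >&2
    exit $rc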