Jobs are killed because of "post job file processing error"

When I submit a dummy array of 1000 jobs to our PBS cluster (version 20.0.1), roughly 35% of the jobs end in a “post job file processing error”. These dummy jobs don’t do anything besides a single “echo” statement. Even so, 347 of the 1000 jobs end with the following error:

From adm@address Fri Feb 3 09:27:51 2023
Return-Path: adm@address
X-Original-To: user@cm.cluster
Delivered-To: user@address
Received: by address (Postfix, from userid 0)
id 4887F160000CD; Fri, 3 Feb 2023 09:27:51 -0500 (EST)
To: user@address

Subject: PBS JOB 6809[973].cluster

Message-Id: 20230203142751.4887F160000CD@address
Date: Fri, 3 Feb 2023 09:27:51 -0500 (EST)
From: root adm@address
PBS Job Id: 6809[973].cluster
Job Name: test.sh

Post job file processing error; job 6809[973].cluster on host n01

What is causing this error and how do I stop it from happening in the future?

The test script I am using is as follows:

#!/bin/bash
#PBS -S /bin/bash
#PBS -o /output
#PBS -e /output
#PBS -J 1-1000

echo $PBS_JOBID

Are you really sending the job output to “/output”? Does the user have write access to the root directory?

Also, all of your jobs’ stdout and stderr streams go to the same file. This can cause trouble when multiple jobs try to write at the same time. Try removing your PBS -o and -e arguments so each subjob uses distinct files by default, and see if that makes the problem go away.

(If you really want the stdout and stderr for a given job to go to the same file, take a look at the -j qsub option. E.g., -o foo.out -j oe.)
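For example, a minimal version of the test script with those directives dropped might look like this (the /some/writable/dir path in the comments is just a placeholder):

#!/bin/bash
#PBS -S /bin/bash
#PBS -J 1-1000

# With no -o/-e directives, PBS writes a distinct output and error
# file for each subjob (by default in the directory the job was
# submitted from), so the subjobs never share an output file.

# If you do want stdout and stderr merged into one file per subjob,
# uncomment something like:
# #PBS -o /some/writable/dir/
# #PBS -j oe

echo $PBS_JOBID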

You might get more information about the exact failure by consulting the mom logs on the execution hosts.
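For example, assuming a default PBS_HOME of /var/spool/pbs (adjust for your installation), something along these lines should surface the relevant entries:

# On the execution host (e.g. n01), grep the mom log for the subjob:
grep '6809\[973\]' /var/spool/pbs/mom_logs/20230203

# Or, from a host with the PBS client commands, pull the logged
# events for that subjob over the last few days:
tracejob -n 3 '6809[973].cluster'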

All of the output is going to unique files; I just replaced the file path with “/output” in the question.

I see this error from time to time. I’m not saying this is the issue, but have you checked the undelivered folder (/var/spool/undelivered on the exec host) to see if the output files are there by chance? Or are they not being written at all?
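Something like this on the exec host, for example (the second path is a guess at a PBS_HOME-based layout; adjust to your install):

ls -l /var/spool/undelivered/ /var/spool/pbs/undelivered/ 2>/dev/null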

Long story short, do the failures have a node in common? A common file system? Something else?
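One quick way to check for a common node, assuming the failure notifications all land in your local mailbox (e.g. /var/mail/$USER), is to count failures per host from the mails:

grep 'Post job file processing error' /var/mail/$USER \
  | sed 's/.*on host //' \
  | sort | uniq -c | sort -rn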

Aside, but similar, about that error:
I have a wrapper for my cp command because of an annoying NFSv4 file system. I get that same error when I update an image and someone forgets to check or flubs the wrapper (the file is copied in my case, but PBS treats the permissions warning this file system causes as a failure).
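For what it’s worth, the wrapper boils down to something like this (a rough sketch, not my exact script): treat the copy as successful if the destination matches the source, even when NFSv4 makes cp exit nonzero while copying permissions/attributes.

#!/bin/bash
# cp wrapper: tolerate the NFSv4 permissions warning.
# Usage: cpwrap SRC DST (single-file sketch; no directory handling)
src="$1"
dst="$2"

if cp -p "$src" "$dst"; then
    exit 0
fi

# cp complained; check whether the data actually made it across.
if [ -f "$dst" ] && cmp -s "$src" "$dst"; then
    exit 0
fi

exit 1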