Subjobs getting held | too many failed attempts to run subjob

Looking for some input and thoughts. Thanks ahead of time.

I have OpenPBS 20.0.1 installed on Ubuntu 18.04 managing a cluster at my employer. The jobs typically reference data hosted on Pixstor GPFS, and our workflows consist almost entirely of array jobs. For a long time I've been seeing occasional subjobs (never an entire array) with the comment below in the qstat -xf output:

comment = Job Array Held,
     too many failed attempts to run subjob

and sometimes this comment:

comment = Job run at Mon Mar 27 at 18:24 on (xxx.xxx.xxx:ncpus=1
:mem=4194304kb) and failed

Earlier this week I had nearly all the DIMMs in all of the compute nodes replaced under warranty (Dell identified the DIMM serial numbers as belonging to a known bad batch). Since then this failure has become much more common.

One of the frustrating things about this failure is that the subjob state often shows as finished even when the "too many failed attempts" comment is present.
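For reference, this is roughly the kind of check I mean, listing the state and comment for every subjob of an array (the array job ID below is just a placeholder):

qstat -xf -t '123456[]' | grep -E 'Job Id:|job_state|comment'

The affected subjobs come back as job_state = F while still carrying that comment.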

Here's the tracejob output for the specific subjob from the example above that produced the second comment:

03/27/2023 18:24:40 S enqueuing into workq, state 1 hop 1
03/27/2023 18:24:40 S Job Run at request of Scheduler@xxx.xxx.xxx on exec_vnode (xxx.xxx.xxx:ncpus=1:mem=4194304kb)
03/27/2023 18:24:40 S Obit received momhop:1 serverhop:1 state:4 substate:41
03/27/2023 18:24:40 S dequeuing from workq, state 1
03/27/2023 18:24:40 S enqueuing into workq, state 1 hop 1
03/27/2023 18:24:40 S Job Run at request of Scheduler@ch3lahpchn1 on exec_vnode (xxx.xxx.xxx:ncpus=1:mem=4194304kb)
03/27/2023 18:24:40 L Job run
03/27/2023 18:24:40 L Job run
03/27/2023 18:24:40 S Obit received momhop:2 serverhop:2 state:4 substate:41
03/27/2023 18:24:41 S dequeuing from workq, state 1
03/27/2023 18:24:41 S enqueuing into workq, state 1 hop 1
03/27/2023 18:24:41 S Job Run at request of Scheduler@xxx.xxx.xxx on exec_vnode (xxx.xxx.xxx:ncpus=1:mem=4194304kb)
03/27/2023 18:24:41 L Job run
03/27/2023 18:24:46 S Obit received momhop:3 serverhop:3 state:4 substate:42
03/27/2023 18:24:46 S Exit_status=1 resources_used.cpupercent=100 resources_used.cput=00:00:02 resources_used.mem=427644kb resources_used.ncpus=1 resources_used.vmem=2584820kb
resources_used.walltime=00:00:05
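The trace above only has S (server) and L (scheduler) entries, no M lines from the MoM. To see the MoM side for the same window, something like the following on the execution host should work (the job ID and date are placeholders, and I'm assuming the default PBS_HOME of /var/spool/pbs):

tracejob -n 2 '123456[78]'
grep '123456\[78\]' /var/spool/pbs/mom_logs/20230327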