Subjobs getting held | too many failed attempts to run subjob

Looking for some input and thoughts. Thanks ahead of time.

I have OpenPBS 20.0.1 installed on Ubuntu 18.04 managing a cluster at my employer. The jobs typically reference data hosted on Pixstor GPFS, and our workflows consist almost entirely of array jobs. For a long time I've been seeing occasional subjobs (never an entire array) with the comment below in the qstat -xf output:

comment = Job Array Held,
     too many failed attempts to run subjob

and sometimes this comment:

comment = Job run at Mon Mar 27 at 18:24 on (xxx.xxx.xxx:ncpus=1
:mem=4194304kb) and failed

Earlier this week I had nearly all the DIMMs in all of the compute nodes replaced under warranty (Dell identified the DIMM serial numbers as belonging to a known bad batch). Since then this failure has become much more common.

One of the frustrating things about this failure is that the subjob state often shows as finished even when the "too many failed attempts" comment is present.
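For reference, this is roughly the kind of check I mean, listing the state and comment for every subjob of an array (the array job ID below is just a placeholder):

qstat -xf -t '123456[]' | grep -E 'Job Id:|job_state|comment'

The affected subjobs come back as job_state = F while still carrying that comment.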

Here's the tracejob output for the specific subjob from the example above that produced the second comment:

03/27/2023 18:24:40 S enqueuing into workq, state 1 hop 1
03/27/2023 18:24:40 S Job Run at request of Scheduler@xxx.xxx.xxx on exec_vnode (xxx.xxx.xxx:ncpus=1:mem=4194304kb)
03/27/2023 18:24:40 S Obit received momhop:1 serverhop:1 state:4 substate:41
03/27/2023 18:24:40 S dequeuing from workq, state 1
03/27/2023 18:24:40 S enqueuing into workq, state 1 hop 1
03/27/2023 18:24:40 S Job Run at request of Scheduler@ch3lahpchn1 on exec_vnode (xxx.xxx.xxx:ncpus=1:mem=4194304kb)
03/27/2023 18:24:40 L Job run
03/27/2023 18:24:40 L Job run
03/27/2023 18:24:40 S Obit received momhop:2 serverhop:2 state:4 substate:41
03/27/2023 18:24:41 S dequeuing from workq, state 1
03/27/2023 18:24:41 S enqueuing into workq, state 1 hop 1
03/27/2023 18:24:41 S Job Run at request of Scheduler@xxx.xxx.xxx on exec_vnode (xxx.xxx.xxx:ncpus=1:mem=4194304kb)
03/27/2023 18:24:41 L Job run
03/27/2023 18:24:46 S Obit received momhop:3 serverhop:3 state:4 substate:42
03/27/2023 18:24:46 S Exit_status=1 resources_used.cpupercent=100 resources_used.cput=00:00:02 resources_used.mem=427644kb resources_used.ncpus=1 resources_used.vmem=2584820kb
resources_used.walltime=00:00:05
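The trace above only has S (server) and L (scheduler) entries, no M lines from the MoM. To see the MoM side for the same window, something like the following on the execution host should work (the job ID and date are placeholders, and I'm assuming the default PBS_HOME of /var/spool/pbs):

tracejob -n 2 '123456[78]'
grep '123456\[78\]' /var/spool/pbs/mom_logs/20230327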