How to Restart/Rerun a Completed Job in a Job Array?

I am working with job arrays and encountered an issue. Some jobs in my job array fail for various reasons. I need to restart specific failed jobs.

Here’s the scenario:

  • I submit a job array.
  • The jobs in the array execute, but some fail.
  • The job array completes.

How can I restart the i-th job in the array?

I tried using the qrerun command, but it fails with the message “Job already finished.”

Is there a command or method to restart a specific job within a job array that has already completed?

Thank you for your help

  • A specific subjob of a job array cannot be resubmitted or rerun with qrerun.
  • You might have to resubmit the job array and run only that specific range of subjobs.
  • The exit status of the job array is sent when the entire job array has completed.

Thanks for the answer.

In case I get errors from the following indices: 1, 2, 3, 5, 7, 11, 13, 17, 19, 23

How can I resubmit them?

Reading the qsub man page, it seems I cannot cherry-pick the indices but must input a range.

 -J <range> [%<max subjobs>]
               Makes this job an array job.   Sets job's array
               attribute to True.

               Use the range argument to specify the indices of
               the subjobs of the array.  range is specified in
               the form X-Y[:Z] where X is the first index, Y is
               the upper bound on the indices, and Z is the stepping
               factor.   For example,   2-7:2 will produce
               indices of 2, 4, and 6.  If Z is not specified, it
               is taken to be 1.  Indices must be greater than or
               equal to zero.

               Use the optional %max subjobs argument to set a
               limit on the number of subjobs that can be running
               at one time.   This sets the value of the
               max_run_subjobs job attribute to the specified
               maximum.

               Job arrays are always rerunnable.

Is it possible to do that?
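
To make the X-Y[:Z] syntax from the man page excerpt concrete, here is a minimal Python sketch that expands a range string into the indices it denotes. The function name expand_range is hypothetical and this is not part of PBS; it only mirrors the rules quoted above (step defaults to 1, indices must be non-negative, Y is an upper bound).

```python
def expand_range(spec: str) -> list[int]:
    """Expand an 'X-Y[:Z]' range string into the list of subjob indices."""
    bounds, _, step_text = spec.partition(":")
    first_text, _, upper_text = bounds.partition("-")
    first, upper = int(first_text), int(upper_text)
    # If Z is not specified, it is taken to be 1 (per the man page excerpt).
    step = int(step_text) if step_text else 1
    if first < 0 or upper < first or step < 1:
        raise ValueError(f"invalid range: {spec!r}")
    # Y is an upper bound, so it is included only if the stepping lands on it.
    return list(range(first, upper + 1, step))


print(expand_range("2-7:2"))  # [2, 4, 6], matching the man page example
```

No single X-Y[:Z] range expands to exactly 1, 2, 3, 5, 7, 11, 13, 17, 19, 23, because the gaps between those indices are not constant.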

Correct, subjobs cannot be run individually (that would be the same as a standard job rather than a job array).

You would have to resubmit the entire job array.

There is no option for cherry-picking indices, hence it is not possible.

Community members might share or add their suggestions. Sorry
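
One possible suggestion, sketched in Python: since -J only takes ranges, the failed indices can be grouped into contiguous runs, each of which could then be passed as its own range to a separate qsub -J submission. The helper name contiguous_runs is hypothetical. Whether qsub accepts a one-element range such as 5-5 is an assumption that is not confirmed in this thread and may depend on the PBS version.

```python
def contiguous_runs(indices):
    """Group indices into (first, last) tuples of consecutive values."""
    runs = []
    for i in sorted(set(indices)):
        if runs and i == runs[-1][1] + 1:
            # Extend the current run by one index.
            runs[-1] = (runs[-1][0], i)
        else:
            # Start a new run at this index.
            runs.append((i, i))
    return runs


failed = [1, 2, 3, 5, 7, 11, 13, 17, 19, 23]
for first, last in contiguous_runs(failed):
    # Each line is a candidate argument for a separate "qsub -J <range>" call.
    print(f"{first}-{last}")
```

For the indices from the question this yields eight separate ranges, which shows why range-only selection is awkward for scattered failures.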

Hello,
Can you please explain your problem in detail?

Thanks, @adarsh for the detailed response. Now it is clear to me.


@chloeadams, here are the steps in detail:

  1. Job Array Submission:
  • User foo submits a job array of 100 subjobs.
  2. Job Scheduling:
  • The scheduler starts the job and assigns subjobs to nodes.
  3. Job Execution and Failures:
  • Some subjobs fail for various reasons during execution.
  4. Job Array Completion:
  • The job array completes with some subjobs failed.
  5. Error Detection:
  • User foo notices the failed subjobs.
  6. Rescheduling Failed Jobs:
  • User foo wants to reschedule only the failed subjobs.
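
For the last step, one workaround (a sketch, not something confirmed in this thread) is to submit a new, dense job array with one subjob per failed index, for example -J 1-10 for ten failures, and let each subjob look up the original index it stands in for. The sketch below assumes that PBS sets the PBS_ARRAY_INDEX environment variable in each subjob, as PBS Professional does; the function name original_index and the hard-coded list are hypothetical.

```python
import os


def original_index(failed, array_index):
    """Map a 1-based position in the new array to the failed subjob's original index."""
    if not 1 <= array_index <= len(failed):
        raise IndexError(f"array index {array_index} outside 1..{len(failed)}")
    return failed[array_index - 1]


# Failed indices from the earlier run (hypothetical; could also be read from a file).
failed = [1, 2, 3, 5, 7, 11, 13, 17, 19, 23]

# Assumption: PBS exports PBS_ARRAY_INDEX inside each subjob.
# Default to 1 so the script can also be tried outside a job.
array_index = int(os.environ.get("PBS_ARRAY_INDEX", "1"))
print(original_index(failed, array_index))
```

The job script would then use the printed value wherever it previously used the array index directly, so only the failed work is repeated.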