Automatic re-queue of failed jobs

Has anyone run into a situation where a job re-queues itself after a power failure? We had an unusual power failure and lost a set of compute nodes and the private eth switch. The running jobs went back to a ‘Q’ (queued) state for some reason. Once the compute nodes were restored, the jobs automatically restarted, went to an ‘R’ state, and overwrote all previously generated data.

I’ve tried to reproduce the situation and have been unsuccessful:

Any time I rebooted the slave compute node while the master compute node stayed up, the job would fail immediately. The data remained intact and the .o file was copied back to the head node.

Any time I rebooted the master compute node, the job would hang in the queue in an ‘R’ state. All the files were kept in the working directory except for the .o file.

In my testing the scheduler did what I hoped it would: the data stayed intact and we didn’t lose potentially days of work. I’m just wondering whether there is a situation where jobs go back to a ‘Q’ state on failure, or if what we saw was just an anomaly.

Note: the nodes are diskless.

Thanks in advance, Kyle

Kyle,

  • Jobs are only requeueable if the qsub option ‘-r’ is set to ‘y’ (the default is ‘y’); see the example after this list.
  • A requeued job’s .OU and .ER files are copied to the PBS Server’s $PBS_HOME/spool directory so they can be appended to on the next run.
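
If you want specific jobs never to be requeued automatically, one option is to mark them non-rerunnable. A minimal sketch (the script name and job ID below are only placeholders):

    # Submit the job as non-rerunnable so the server cannot requeue it after a node failure
    qsub -r n my_job.sh

    # Or change an already-queued job to non-rerunnable
    qalter -r n 1234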

Could you please check the node_fail_requeue value set in your PBS configuration? (qmgr -c "p s" | grep -i requeue)
node_fail_requeue: the time the server waits for the primary execution host to come back up before it requeues or deletes that host’s jobs. Setting a value of 0 disables it. If the parameter is unset, it reverts to 310 seconds after the PBS server daemon is restarted.