What happens to a job when a node gets shut down while the job is running

Ans · November 10, 2018, 5:19pm

Dear All,

What happens to a job when a node gets shut down while the job is running in a cluster?

Thanks,
ANS.

prakashcv13 · November 10, 2018, 6:03pm

When the execution host goes down, the server will lose contact with it and the job will be re-queued (rerunnable jobs) or deleted (non-rerunnable jobs) depending on the value of the server attribute node_fail_requeue.

From the admin guide -

The node_fail_requeue attribute can take these values:

Greater than zero: The server waits for the specified number of seconds after losing contact with a primary execution host, then attempts to contact the primary execution host, and if it cannot, requeues any jobs that can be rerun and deletes any jobs that cannot be rerun.

Zero: Jobs are not requeued; they are left in the Running state until the execution vnode is recovered, whether or not the server has contact with their Mother Superior.

Less than zero: The attribute is treated as if it were set to 1, and jobs are deleted or requeued after the server has been out of contact with Mother Superior for 1 second.

The default value for this attribute is 310, meaning that when the server loses contact with an execution host, it waits for 310 seconds after losing contact with Mother Superior before requeueing or deleting jobs.
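To make the three cases above easier to compare, here is a minimal shell sketch. The function name describe_node_fail_requeue is hypothetical and is not a PBS command; it only restates the admin guide text quoted above and does not talk to a PBS server.

```shell
# Hypothetical helper (not part of PBS): restates how the server is
# documented to interpret a given node_fail_requeue value.
describe_node_fail_requeue() {
  value="$1"
  if [ "$value" -gt 0 ]; then
    echo "wait ${value}s, then requeue rerunnable jobs and delete non-rerunnable jobs"
  elif [ "$value" -eq 0 ]; then
    echo "leave jobs Running until the execution vnode recovers"
  else
    echo "treated as 1: requeue or delete after 1s out of contact"
  fi
}

# Walk through the default (310), a small value, zero, and a negative value.
for v in 310 5 0 -1; do
  printf '%s -> %s\n' "$v" "$(describe_node_fail_requeue "$v")"
done
```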

Thanks,
Prakash

Ans · January 19, 2019, 2:05pm

Hi Prakash,

Thank you for the reply.

How can I check whether the attribute is set? Also, could you let me know how the value affects behavior when it is greater than zero, for example if I set it to 5 or 10?

So if it is set to 5, does that mean the job will wait 5 minutes before being requeued?

Thanks,
ANS

adarsh · January 19, 2019, 6:02pm

Please run the command below to check whether the attribute is set:
qstat -Bf | grep node_fail_requeue

"So if it is set to 5, does that mean the job will wait 5 minutes before being requeued?"
[Answer]: No, it will wait for 5 seconds. The value is in seconds, not minutes.
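If you want just the number rather than the whole grep line, something like the sketch below may help. The function name get_node_fail_requeue is hypothetical, and the sample text is only an illustration of the "name = value" layout I am assuming qstat -Bf prints; it is not real output from a cluster.

```shell
# Hypothetical helper: reads `qstat -Bf` style text on stdin and prints
# the value of node_fail_requeue. Returns non-zero if the attribute is
# not listed in the input.
get_node_fail_requeue() {
  awk '$1 == "node_fail_requeue" && $2 == "=" { print $3; found = 1 }
       END { exit found ? 0 : 1 }'
}

# On a real cluster you would run:
#   qstat -Bf | get_node_fail_requeue
# and, as a PBS manager, change the value (in seconds) with:
#   qmgr -c "set server node_fail_requeue = 5"

# Illustrative sample input (assumed layout, not captured from a server).
sample='Server: pbsserver
    server_state = Active
    node_fail_requeue = 310'

printf '%s\n' "$sample" | get_node_fail_requeue   # prints 310
```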

Ans · January 23, 2019, 6:04am

Thank you.

But what happens if a job that needs 4 days to complete has already run for a day or two when the execution host goes offline? Will the job start again from step 1, or from the point where the execution host went down?

Thanks,
ANS

Ans · January 23, 2019, 6:18am

Jobs are restarted from the beginning when the execution host goes down and they are requeued.

Can we make the job restart from where it stopped, for example by using checkpointing from PBS?

Thanks,
ANS

adarsh · January 23, 2019, 7:24pm

Ans,

The jobs will be killed or requeued depending on the value of the node_fail_requeue attribute (see post #2 by prakashcv13 above).

The application that you are running should support checkpoint and resume. We have many applications that support checkpoint and restart.

Manual method:
The application periodically writes out restart files every "n" iterations. To stop the run, a terminate script is uploaded or a signal is sent. The same job is then resubmitted with all the restart files, and it starts from where it left off rather than from scratch.
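As a rough illustration of the manual method, the sketch below simulates an application that records its last completed step in a restart file and resumes from it on the next run. Everything here is hypothetical (the function name run_with_restart, the restart.dat file name, the step counts); a real application would write its own restart files in its own format.

```shell
# Hypothetical stand-in for an application with restart support.
# Usage: run_with_restart RESTART_FILE TOTAL_STEPS [STOP_AFTER]
# STOP_AFTER simulates the run being interrupted after that many steps.
run_with_restart() {
  restart_file="$1"
  total="$2"
  stop_after="${3:-$2}"

  # Resume from the last recorded step if a restart file exists.
  step=0
  if [ -s "$restart_file" ]; then
    step=$(cat "$restart_file")
  fi

  done_now=0
  while [ "$step" -lt "$total" ] && [ "$done_now" -lt "$stop_after" ]; do
    step=$((step + 1))
    done_now=$((done_now + 1))
    echo "$step" > "$restart_file"   # "checkpoint" after every step
  done
  echo "$step"
}

workdir=$(mktemp -d)
run_with_restart "$workdir/restart.dat" 10 4   # prints 4 (run interrupted)
run_with_restart "$workdir/restart.dat" 10     # prints 10 (resumed at step 5)
```

The second call does not redo steps 1 to 4 because it reads the restart file first, which is the behavior you want after the job is requeued.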

PBS controlled:
If checkpoint and restart must be handled automatically by PBS Pro, please integrate the application's checkpoint scripts with PBS Pro. Please refer to section 9.3, Checkpoint and Restart, in the admin guide: High-performance Computing (HPC) and Cloud Solutions | Altair

Ans · January 24, 2019, 9:38am

Thank you, I will check and update.
