What happens to a job when a node gets shut down while the job is running

Dear All,

What happens to a job when a node gets shut down while the job is running in a cluster?

Thanks,
ANS.

Hi @Ans,

When the execution host goes down, the server will lose contact with it and the job will be re-queued (rerunnable jobs) or deleted (non-rerunnable jobs) depending on the value of the server attribute node_fail_requeue.

From the admin guide -

The node_fail_requeue attribute can take these values:
Greater than zero
The server waits for the specified number of seconds after losing contact with a primary execution host, then attempts to contact the primary execution host, and if it cannot, requeues any jobs that can be rerun and deletes any jobs that cannot be rerun.
Zero
Jobs are not requeued; they are left in the Running state until the execution vnode is recovered, whether or not the server has contact with their Mother Superior.
Less than zero
The attribute is treated as if it were set to 1, and jobs are deleted or requeued after the server has been out of contact with Mother Superior for 1 second.

The default value for this attribute is 310, meaning that when the server loses contact with an execution host, it waits for 310 seconds after losing contact with Mother Superior before requeueing or deleting jobs.
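To make the three cases concrete, here is a minimal Python sketch that models the rules quoted above. It is only an illustration of the documented behaviour, not PBS code; the function names `effective_wait_seconds` and `job_outcome` are made up for this example.

```python
from typing import Optional

DEFAULT_NODE_FAIL_REQUEUE = 310  # default value quoted from the admin guide


def effective_wait_seconds(node_fail_requeue: int) -> Optional[int]:
    """Return how many seconds the server waits before acting, or None if it never acts."""
    if node_fail_requeue == 0:
        return None  # jobs are left in the Running state
    if node_fail_requeue < 0:
        return 1  # negative values are treated as if set to 1
    return node_fail_requeue  # the unit is seconds


def job_outcome(node_fail_requeue: int, rerunnable: bool) -> str:
    """Describe what happens to a job whose primary execution host stays unreachable."""
    if effective_wait_seconds(node_fail_requeue) is None:
        return "left running"
    return "requeued" if rerunnable else "deleted"
```

For example, `job_outcome(DEFAULT_NODE_FAIL_REQUEUE, rerunnable=False)` gives "deleted", while a value of 0 gives "left running" for both rerunnable and non-rerunnable jobs.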

Thanks,
Prakash

Hi Prakash,

Thank you for the reply.

How can I check whether the attribute is set? Also, can you let me know how the value affects the behaviour when it is greater than zero, for example if I set it to 5 or 10?

So if it is set to 5, does that mean the job will wait 5 minutes before being requeued?

Thanks,
ANS

Please run the command below to check whether the attribute is set:
qstat -Bf | grep node_fail_requeue
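If you want to do the same check from a script, a small Python sketch like the one below could parse the `qstat -Bf` text. The helper name `parse_node_fail_requeue` and the sample text are assumptions made for illustration; the sample is not captured from a real server, and the `name = value` layout is assumed.

```python
import re
from typing import Optional


def parse_node_fail_requeue(qstat_bf_output: str) -> Optional[int]:
    """Return the node_fail_requeue value found in `qstat -Bf` text, or None if absent."""
    match = re.search(
        r"^\s*node_fail_requeue\s*=\s*(-?\d+)\s*$",
        qstat_bf_output,
        flags=re.MULTILINE,
    )
    return int(match.group(1)) if match else None


# Illustrative sample only, not real server output.
SAMPLE = """Server: pbs-server
    server_state = Active
    node_fail_requeue = 310
"""
```

In practice the text could come from `subprocess.run(["qstat", "-Bf"], capture_output=True, text=True).stdout`. A result of `None` means the line was not found in the output.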

So if it is set to 5 means that the job will wait for 5 min for requeuing.
[Answer]: No, the value is in seconds, so the server will wait for 5 seconds.

Thank you.

But what happens if a job that needs 4 days to complete has already run for a day or two when the execution host goes offline? Will the job start again from step 1, or from the point where the execution host went down?

Thanks,
ANS

Jobs are restarted from the beginning when the execution host goes down and they are requeued.

Can we make sure the job restarts from where it stopped, for example by using checkpointing from PBS?

Thanks,
ANS

Ans,

  1. The jobs will be killed or requeued depending on the node_fail_requeue attribute value
    (see post #2 by prakashcv13 in this thread: What happens to job when a node gets shut down while job is running)

The application that you are running should support checkpoint and resume. We have many applications that support checkpoint and restart.

Manual method:
The application periodically writes out restart files, once every "n" iterations. To stop the run, a terminate script is uploaded or a signal is sent. The same job is then resubmitted with all the restart files, and it starts from where it left off rather than from scratch.
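As a rough illustration of the manual method, the Python sketch below shows an application-level loop that saves its state every n steps and resumes from the restart file when it is run again. Everything here (the `run` function, the JSON restart file, the `stop_at` parameter that simulates the host going down) is hypothetical; a real application would use its own restart format.

```python
import json
import os
import tempfile


def run(total_steps, restart_path, every_n=10, stop_at=None):
    """Accumulate a running sum, saving state to restart_path every `every_n` steps.

    `stop_at` simulates the execution host going down after that many steps.
    Returns the (step, total) state that was reached.
    """
    # Resume from the restart file if one exists, otherwise start from scratch.
    if os.path.exists(restart_path):
        with open(restart_path) as f:
            state = json.load(f)
    else:
        state = {"step": 0, "total": 0}

    while state["step"] < total_steps:
        if stop_at is not None and state["step"] >= stop_at:
            break  # simulated node failure; progress since the last save is lost
        state["step"] += 1
        state["total"] += state["step"]
        if state["step"] % every_n == 0:
            # Write to a temporary file and rename it, so a crash
            # cannot leave a half-written restart file behind.
            tmp = restart_path + ".tmp"
            with open(tmp, "w") as f:
                json.dump(state, f)
            os.replace(tmp, restart_path)
    return state["step"], state["total"]


if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "restart.json")
    run(100, path, stop_at=35)  # first submission, interrupted at step 35
    print(run(100, path))       # resubmission, resumes from the step-30 restart file
```

Note that the resubmitted run picks up from the last saved restart point (step 30 here), not from the exact moment of the failure, so the restart interval decides how much work is repeated.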

PBS controlled:
If checkpoint and restart have to be handled automatically by PBS Pro, please integrate the application's checkpoint scripts with PBS Pro. Please refer to section 9.3, Checkpoint and Restart, in the admin guide: High-performance Computing (HPC) and Cloud Solutions | Altair

Thank you, I will check and update.