Dear All,
What happens to a job when a node is shut down while the job is running in a cluster?
Thanks,
ANS.
Hi @Ans,
When the execution host goes down, the server will lose contact with it and the job will be re-queued (rerunnable jobs) or deleted (non-rerunnable jobs) depending on the value of the server attribute node_fail_requeue.
From the admin guide -
The node_fail_requeue attribute can take these values:
Greater than zero
The server waits for the specified number of seconds after losing contact with a primary execution host, then attempts to contact the primary execution host, and if it cannot, requeues any jobs that can be rerun and deletes any jobs that cannot be rerun.
Zero
Jobs are not requeued; they are left in the Running state until the execution vnode is recovered, whether or not the server has contact with their Mother Superior.
Less than zero
The attribute is treated as if it were set to 1, and jobs are deleted or requeued after the server has been out of contact with Mother Superior for 1 second.
The default value for this attribute is 310, meaning that when the server loses contact with an execution host, it waits for 310 seconds after losing contact with Mother Superior before requeueing or deleting jobs.
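To make the three cases concrete, here is a minimal sketch that restates the admin guide text above as a shell function. The function name describe_node_fail_requeue is made up for illustration and is not part of PBS; it only prints what the server is described as doing for a given value.

```shell
# Hypothetical helper (not a PBS command): print how the server is described
# as treating a given node_fail_requeue value, per the admin guide text above.
describe_node_fail_requeue() {
    value="$1"
    if [ "$value" -gt 0 ]; then
        echo "wait ${value}s, then requeue rerunnable jobs and delete the rest"
    elif [ "$value" -eq 0 ]; then
        echo "leave jobs in the Running state until the vnode is recovered"
    else
        # Negative values are treated as if the attribute were set to 1.
        echo "treated as 1: requeue or delete after 1s without contact"
    fi
}

describe_node_fail_requeue 310   # the default
describe_node_fail_requeue 0
describe_node_fail_requeue -3
```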
Thanks,
Prakash
Hi Prakash,
Thank you for the reply.
How can I check whether the attribute is set? Also, can you let me know how the value affects the behaviour when it is greater than zero, for example if I set it to 5 or 10?
If it is set to 5, does that mean the job will wait for 5 minutes before being requeued?
Thanks,
ANS
Please run the command below and check whether the attribute is set:
qstat -Bf | grep node_fail_requeue
Regarding "So if it is set to 5 means that the job will wait for 5 min for requeuing":
[Answer]: No, the value is in seconds, so the server will wait for 5 seconds.
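If you want just the number rather than the whole grep line, something like the sketch below may help. The helper name get_node_fail_requeue and the sample text are made up for illustration; the sample only imitates the "name = value" layout of qstat -Bf output so that the snippet runs without a PBS server.

```shell
# Hypothetical helper: read `qstat -Bf` style output on stdin and print the
# value of node_fail_requeue, or nothing if the attribute is not listed.
get_node_fail_requeue() {
    awk -F' = ' '$1 ~ /node_fail_requeue$/ { print $2 }'
}

# Made-up sample that imitates the layout of `qstat -Bf` output.
sample='Server: pbs-server
    server_state = Active
    node_fail_requeue = 5'

printf '%s\n' "$sample" | get_node_fail_requeue   # prints: 5

# On a real cluster (not run here):
#   qstat -Bf | get_node_fail_requeue
#   qmgr -c "set server node_fail_requeue = 10"   # value is in seconds
```

Changing the value with qmgr normally requires PBS manager privileges, so please check with your administrator before trying the last line.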
Thank you.
But what happens if a job that needs 4 days to complete has already run for a day or two when the execution host goes offline? Will the job start again from step 1, or from the point where the execution host went down?
Thanks,
ANS
Jobs are getting restarted when the execution host goes down and they are requeued.
Can we make sure the job restarts from where it stopped, for example from a PBS checkpoint?
Thanks,
ANS
Ans,
The application that you are running should support checkpoint and restart. We have many applications that do.
Manual method:
The application periodically writes out restart files every n iterations. To stop the run, a terminate script is uploaded or a signal is sent. The same job is then resubmitted with all the restart files, and it starts from where it left off rather than from scratch.
PBS controlled:
If checkpoint and restart have to be handled automatically by PBS Pro, please integrate the application's checkpoint scripts with PBS Pro. Please refer to section 9.3, Checkpoint and Restart, in the admin guide: High-performance Computing (HPC) and Cloud Solutions | Altair
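As a rough illustration of the manual method above, here is a small sketch of an iterative run that records a restart point every n iterations and picks it up again on the next run. Everything in it is an assumption made for the example: run_steps, the restart file holding a single iteration number, and the stop_after argument that simulates the host going down. A real application writes its own restart files in its own format.

```shell
# Hypothetical stand-in for an application with manual checkpoint/restart.
run_steps() {
    restart_file="$1"      # records the last checkpointed iteration
    total="$2"             # total number of iterations
    every="$3"             # checkpoint interval n
    stop_after="${4:-0}"   # simulate a node failure at this iteration (0 = never)

    # Resume from the restart file if one exists, otherwise start from 0.
    start=0
    if [ -f "$restart_file" ]; then
        start=$(cat "$restart_file")
    fi

    i="$start"
    while [ "$i" -lt "$total" ]; do
        i=$((i + 1))
        # ... one iteration of real work would go here ...
        if [ $((i % every)) -eq 0 ]; then
            echo "$i" > "$restart_file"
        fi
        if [ "$stop_after" -gt 0 ] && [ "$i" -ge "$stop_after" ]; then
            echo "interrupted at $i"
            return 1
        fi
    done
    echo "resumed from $start, finished at $i"
}

demo_dir=$(mktemp -d)
run_steps "$demo_dir/restart" 20 5 7 || true   # prints: interrupted at 7
run_steps "$demo_dir/restart" 20 5             # prints: resumed from 5, finished at 20
rm -rf "$demo_dir"
```

Note that the second run resumes from iteration 5, not 7: work done after the last restart file was written is lost, which is why the checkpoint interval matters for long jobs.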
Thank you, I will check and update.