What happens if a node goes down while a job is staging out?

squallabc · September 17, 2019, 8:18am

Hi Team,

node_fail_requeue reruns a job if its node goes down while the job is running.

Could you let me know what happens if the job is in stage-out (rcp of the output to the target) when the node goes down? Does node_fail_requeue also work in this case, or is there another option or hook that can handle this situation?

Thanks.

adarsh · September 17, 2019, 12:04pm

node_fail_requeue is the time the server waits for the primary execution host to come back up before it requeues or deletes the host’s jobs. Setting a value of 0 disables it. If the parameter is unset, it reverts to the default (310 seconds) after the PBS server daemon is restarted.
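
For reference, a minimal sketch of how this timeout could be changed from a script. It only wraps the usual `qmgr -c "set server ..."` form; the helper names and the 600-second value used in the note below are illustrative, not something recommended in this thread.

```python
import subprocess


def build_node_fail_requeue_command(seconds):
    """Return the qmgr argv that sets the server's node_fail_requeue."""
    if seconds < 0:
        # Sketch-level guard: only 0 (disabled) or a positive wait is handled here.
        raise ValueError("seconds must be >= 0")
    return ["qmgr", "-c", f"set server node_fail_requeue = {seconds}"]


def set_node_fail_requeue(seconds):
    """Apply the setting; needs PBS manager privileges on the server host."""
    # check=True raises CalledProcessError if qmgr rejects the command.
    subprocess.run(build_node_fail_requeue_command(seconds), check=True)
```

Calling `set_node_fail_requeue(600)` would ask the server to wait 600 seconds before acting on the jobs of a host that has gone down.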

The stageout would not succeed as the communication between the node and the server is affected.

What would you like to do in this situation? What is your goal?

squallabc · September 18, 2019, 6:48am

Thanks Adarsh.

I want to know how to requeue the job when the node goes down while the job is in the stageout stage.

adarsh · September 18, 2019, 7:39am

Thank you

There are no hook events that can requeue a job by detecting that the node is down, or has lost communication, during the stageout phase.
During stageout the job is in “E” state. If the node goes down at this point, the job tends to stay in “E” state until node_fail_requeue kicks in, which might requeue or delete the job.
A node being down can mean any of these:
- pbs_mom service is stopped
- pbs_mom host is shut down
- pbs_mom has lost network communication with PBS Server
- pbs_mom host is not resolvable

Good scenario, I do not have an answer. Did you get a chance to try this?

As the root user you can qrerun the job if you know the node is going to go down, or you can do it from some kind of external script.
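
A rough sketch of what such an external script might look like. It assumes the job IDs come from `qselect -s E` (one ID per line) and that `qrerun -W force` is wanted when the MoM is unreachable. Whether the server actually accepts a qrerun for a job in “E” state on a dead node was not confirmed in this thread, so treat this as something to test, not a known fix. The function names are hypothetical.

```python
import subprocess


def parse_job_ids(qselect_output):
    """Split qselect output into a list of job IDs, skipping blank lines."""
    return [line.strip() for line in qselect_output.splitlines() if line.strip()]


def list_exiting_jobs():
    """Return the IDs of jobs currently in E (exiting) state."""
    result = subprocess.run(
        ["qselect", "-s", "E"], check=True, capture_output=True, text=True
    )
    return parse_job_ids(result.stdout)


def requeue_jobs(job_ids, force=False, dry_run=True):
    """Build one qrerun command per job and run them unless dry_run is set.

    Returns the list of commands so the caller can log or inspect them.
    """
    commands = []
    for job_id in job_ids:
        cmd = ["qrerun"]
        if force:
            # -W force asks the server to requeue even if the MoM is unreachable.
            cmd += ["-W", "force"]
        cmd.append(job_id)
        commands.append(cmd)
        if not dry_run:
            # check=False: one failed qrerun should not stop the remaining jobs.
            subprocess.run(cmd, check=False)
    return commands
```

A cron job or a cloud termination handler could call `requeue_jobs(list_exiting_jobs(), force=True, dry_run=False)` as root. Filtering the list down to jobs whose execution host is actually down is left out here.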

squallabc · September 19, 2019, 2:33am

Thanks again.

We are running PBS in the cloud. Sometimes an instance node is removed without any notification, so a job can fail at any time. We just need to prepare a solution for our customers.

We will see if node_fail_requeue works for ‘E’ jobs.

adarsh · September 19, 2019, 7:49am

Thank you. If it is not confidential, could you please explain your setup?

As long as the cloud nodes are known to the PBS Server and their DNS is resolvable (and /etc/hosts is populated correctly), they can join and leave at any time. You can increase the node_fail_requeue time so you have more lead time for the nodes to come back.

Could you please give more information on this? How is the node removed?

Thank you

squallabc · September 23, 2019, 2:48am

Per our observation, the job is deleted instead of requeued if it is in ‘E’ state when the node is gone.

We are using spot instances in AWS, so the node will not come back once it is removed. ;(

Thanks.
