What happens if a node goes down while a job is staging out?

Hi Team,

node_fail_requeue will rerun a job if its node goes down while the job is running.

Could you let me know what happens if the job is in stage-out (rcp of the output to the target) when the node goes down? Does node_fail_requeue also work in this case, or is there any other option or hook that can handle this situation?

Thanks.

node_fail_requeue is the time the server waits for the primary execution host to come back up
before it requeues or deletes that host’s jobs. Setting it to 0 disables it. If the parameter is unset, it reverts to the default (310 seconds) after the PBS server daemon is restarted.
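As an illustration of the setting described above, here is a minimal sketch that builds the `qmgr` command for changing or unsetting node_fail_requeue. The helper names `node_fail_requeue_command` and `apply_node_fail_requeue` are hypothetical and only wrap the standard `qmgr -c "set server ..."` syntax; running it assumes `qmgr` is on the PATH and that you have PBS manager privileges.

```python
import subprocess

# Default quoted in this thread; used here only for documentation purposes.
DEFAULT_NODE_FAIL_REQUEUE = 310  # seconds


def node_fail_requeue_command(seconds=None):
    """Return the qmgr argument list that sets or unsets node_fail_requeue.

    seconds=None builds an "unset" command; any integer builds a "set" command.
    A value of 0 disables the feature, as described in the thread.
    """
    if seconds is None:
        return ["qmgr", "-c", "unset server node_fail_requeue"]
    return ["qmgr", "-c", f"set server node_fail_requeue = {int(seconds)}"]


def apply_node_fail_requeue(seconds=None):
    """Run the command on the PBS server host (needs qmgr and privileges)."""
    subprocess.run(node_fail_requeue_command(seconds), check=True)
```

For example, `apply_node_fail_requeue(600)` would ask the server to wait 600 seconds before requeuing or deleting jobs on a failed primary execution host.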

The stageout would not succeed as the communication between the node and the server is affected.

What would you like to do in this situation? What is your goal?

Thanks Adarsh.

I wanted to know how to requeue the job if the node goes down while the job is in the stageout phase.

Thank you

  • There are no hook events that can requeue the job by checking whether communication is lost or the node is down during the stageout phase.
  • During stageout the job is in the “E” state. If the node goes down at this point, the job tends to remain in the “E” state and the node_fail_requeue timer starts; the server might then requeue or delete the job.
  • A node being “down” can mean any of the following:
    - pbs_mom service is stopped
    - pbs_mom host is shut down
    - pbs_mom has lost network communication with PBS Server
    - pbs_mom is not resolvable

Good scenario. I do not have an answer. Did you get a chance to try this?

As root, you can qrerun the job if you know the node is about to go down or if you have some kind of external script.
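One possible shape for such an external script is sketched below. It is only a sketch under assumptions: it assumes `qstat -f -F json` and `pbsnodes -a -F json` are available and return top-level `Jobs` and `nodes` objects, and the helper names (`hosts_of`, `jobs_on_down_nodes`, `qrerun_commands`, `run_once`) are hypothetical. Whether qrerun is accepted for a job that is already in the “E” state is not confirmed in this thread, so treat that part as something to test.

```python
import json
import subprocess


def hosts_of(exec_host):
    """Extract host names from an exec_host string such as 'n1/0*2+n2/1'."""
    return {chunk.split("/")[0] for chunk in exec_host.split("+") if chunk}


def jobs_on_down_nodes(jobs, nodes, states=("R", "E")):
    """Return sorted IDs of jobs in the given states that use a down node."""
    down = {
        name
        for name, info in nodes.items()
        # A node state may be a comma-separated list, e.g. 'state-unknown,down'.
        if "down" in [s.strip() for s in info.get("state", "").split(",")]
    }
    affected = []
    for job_id, attrs in jobs.items():
        if attrs.get("job_state") not in states:
            continue
        if hosts_of(attrs.get("exec_host", "")) & down:
            affected.append(job_id)
    return sorted(affected)


def qrerun_commands(job_ids):
    """Build one qrerun command per job; nothing is executed here."""
    return [["qrerun", job_id] for job_id in job_ids]


def _pbs_json(cmd):
    """Run a PBS command and parse its JSON output."""
    out = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return json.loads(out.stdout)


def run_once():
    """Query PBS and attempt to requeue affected jobs (run as root)."""
    jobs = _pbs_json(["qstat", "-f", "-F", "json"]).get("Jobs", {})
    nodes = _pbs_json(["pbsnodes", "-a", "-F", "json"]).get("nodes", {})
    for cmd in qrerun_commands(jobs_on_down_nodes(jobs, nodes)):
        # check=False: a rejected qrerun should not stop the loop.
        subprocess.run(cmd, check=False)
```

The selection logic is kept separate from the PBS calls so it can be checked with sample data before `run_once()` is scheduled, for example from cron.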

Thanks again.

We are running PBS in the cloud, and sometimes an instance (node) is removed without any notification, so a job can fail at any time. We just need to prepare a solution for our customers.

We will see whether node_fail_requeue works for jobs in the ‘E’ state.

Thank you. If it is not confidential, could you please describe your setup?

  • As long as the cloud nodes are known to the PBS Server and are resolvable through DNS (and /etc/hosts is populated correctly), they can join and leave at any time. You can increase node_fail_requeue so that the nodes have more time to come back.

Could you please give more information on this? How would the node be removed?

Thank you

From what we observed, the job is deleted rather than requeued if it is in the ‘E’ state when the node disappears.

We are using spot instances in AWS, so a node does not come back once it has been removed. ;(

Thanks.