Has anyone run into a situation where a job re-queues itself after a power failure? We had an unusual power failure and dropped a set of compute nodes and the private eth switch. Running jobs went back to a ‘Q’ queued state for some reason. Once the compute nodes were restored the jobs automatically restarted and when to a ‘R’ state overwriting all previously generated data.
I’ve tried to reproduce the situation and have been unsuccessfull:
Anytime I rebooted the slave compute node and the master compute node remained on the job would fail immediately. The data remained intact and the .o file was copied back to the head node.
Anytime I rebooted the master compute node the job would hang in the queue in a ‘R’ state. All the files are kept in the working directory except for the .o file.
In my testing the scheduler did as I hoped it would. This keeps the data intact without loosing potentially days of work. Just wondering if there is a situation where jobs go back to a 'Q’ed state on failure or if this was just an anomaly we saw.
Note - The nodes are disk-less
Thanks in advance, Kyle