Cray will be releasing a new Shasta supercomputer with new interfaces for PBS to use. We have created a design that covers the proposed changes to support Shasta. Please take a look:
There has already been some discussion about the danger of catching a KeyboardInterrupt: if the hook doesn’t exit afterwards, it will sit there and the job will remain in the ‘E’ state.
The problem is that there is no fail_action for the execjob_end hook, and we need to mark the node offline because a hook alarm indicates that something went wrong. We have a couple of possible solutions:
- Have a timeout for the whole hook, and set it to some amount of time shorter than the hook alarm. Once the timeout happens, we set the node offline as part of the hook. When the hook alarm happens, we don’t have to do anything special.
- Catch the exception, but make sure we put a timeout on the portion of the hook that will mark the node offline. That way, in case something goes wrong and takes a long time, the hook will still end/exit.
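
To make the two options easier to compare, here is a rough sketch in plain Python. It is not the actual hook: `cleanup` and `mark_node_offline` are hypothetical callables standing in for the real hook body and for whatever call sets the node offline, the helper names are made up for illustration, and it assumes that running the bounded step in a daemon thread is acceptable inside the hook's interpreter, which would need to be verified.

```python
import threading


def run_with_timeout(func, timeout_s):
    """Run func in a daemon thread. Return True only if it completed
    without raising an Exception within timeout_s seconds."""
    outcome = {"ok": False}
    done = threading.Event()

    def target():
        try:
            func()
            outcome["ok"] = True
        except Exception:
            pass  # a failed call is reported to the caller as False
        finally:
            done.set()

    threading.Thread(target=target, daemon=True).start()
    return done.wait(timeout_s) and outcome["ok"]


def end_hook_with_budget(cleanup, mark_node_offline, budget_s):
    """Option 1: give the whole hook body a budget shorter than the
    hook alarm, and offline the node ourselves if the budget runs out
    (or the body fails)."""
    if run_with_timeout(cleanup, budget_s):
        return "ok"
    mark_node_offline()
    return "offlined"


def end_hook_catching_alarm(cleanup, mark_node_offline, offline_timeout_s):
    """Option 2: catch the alarm (seen by the hook as KeyboardInterrupt),
    then bound the offlining step so the hook still exits."""
    try:
        cleanup()
        return "ok"
    except KeyboardInterrupt:
        if run_with_timeout(mark_node_offline, offline_timeout_s):
            return "offlined"
        return "offline_timed_out"
```

In both variants the hook returns in bounded time, so the job should not be left sitting in the ‘E’ state. The difference is where the risk sits: in option 1 the offlining step itself is unbounded but runs before the alarm fires, while in option 2 it runs after the alarm and is explicitly capped.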