Resilient Interactive Job

Here’s a design proposing a way for an interactive job to survive even after the client host issuing ‘qsub -I’ loses connection to the execution host: Resilient Interactive Job Design

I like it.

In the description of pbs_interact, you say

  • With ‘screen’ being the only value recognized right now, this would essentially do the equivalent of:

ssh <primary_host> screen -r

but using the pbs_mom channel instead of sshd.

Could you explain this more? I’m not sure what is meant by “pbs_mom channel”.

Thanks.

@dtalcott : Thanks. What I meant by ‘using the pbs_mom channel instead of sshd’ is that calling ‘pbs_interact <job-id>’ would talk to the primary pbs_mom daemon executing <job-id>. This is in contrast to using ‘ssh’ to talk to the sshd daemon on the execution host to run ‘screen -r’. I’ll update the design doc to make this clearer.

Do you plan for pbs_interact to borrow a lot of code from qsub -I? That is, it would set up a listening socket and somehow inform the lead MoM where that socket is. The MoM would then connect back to that socket just as it would for qsub, except that instead of launching plain screen, it would launch screen -r? The net result would be that a pbs_interact connection is essentially identical in nature to the original qsub -I connection (except for the port number)?

Yes, that’s the idea.
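
For anyone following along, here is a rough sketch of the connect-back pattern described above. It is not PBS code and does not reflect the actual qsub or pbs_mom implementation. The function names (open_listener, mom_connect_back, reattach) are hypothetical, the "MoM" is just a local thread, and the port is handed over directly rather than relayed through the server. It only shows the order of events: the client listens first, the MoM side connects back, and the command it would launch is ‘screen -r’ rather than plain ‘screen’.

```python
import socket
import threading


def open_listener():
    """Client side: bind an ephemeral port and start listening."""
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.bind(("127.0.0.1", 0))  # port 0 lets the OS pick a free port
    srv.listen(1)
    return srv, srv.getsockname()[1]


def mom_connect_back(host, port, command):
    """Stand-in for the lead MoM (hypothetical): connect back to the
    client's socket and report the command it would launch. A real MoM
    would attach the job's terminal to this connection instead."""
    with socket.create_connection((host, port)) as conn:
        conn.sendall(("launching: " + command + "\n").encode())


def reattach(command="screen -r"):
    """Run one connect-back round trip and return what the client received."""
    srv, port = open_listener()
    # Assumption: in a real system the port would reach the MoM indirectly.
    mom = threading.Thread(
        target=mom_connect_back, args=("127.0.0.1", port, command)
    )
    mom.start()
    conn, _ = srv.accept()
    with conn:
        data = b""
        while True:
            chunk = conn.recv(1024)
            if not chunk:  # peer closed the connection
                break
            data += chunk
    mom.join()
    srv.close()
    return data.decode()


if __name__ == "__main__":
    print(reattach(), end="")
```

In this sketch the only difference between the original session and a reattach is the command string passed to the MoM stand-in, which matches the point that the two connections are essentially the same apart from the port number.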