Autofs and SSH Keys

Looking for some thoughts on how to approach mounting user home directories that hold SSH keys, and the possible effect on job recovery with OpenPBS.

I’m in the process of setting up a new HPC cluster. I’m using an LDAP server to manage users on an Ubuntu 18.04 cluster. I’m considering using LDAP-backed autofs to mount the user home directories from their local machines. These home directories would hold the SSH keys that OpenPBS uses for SSH authentication.
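For context, the rough shape of the setup I have in mind is an auto.master entry that points at an LDAP map, with a wildcard automount entry serving the homes over NFS. The server name, DNs, and export path below are placeholders, not my real values:

```
# /etc/auto.master -- look up the /home map in LDAP (DN is a placeholder)
/home  ldap:ou=auto.home,dc=example,dc=com

# Matching LDAP automount entry (LDIF), standard autofs schema
dn: automountKey=*,ou=auto.home,dc=example,dc=com
objectClass: automount
automountKey: *
automountInformation: -fstype=nfs,rw fileserver:/export/home/&
```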

From my understanding, OpenPBS is capable of recovering interrupted jobs. If the network is lost, and with it the mount of the user home directory and the SSH keys, will that prevent a job from recovering?

OpenPBS has a server attribute for this, node_fail_requeue, which can be set/updated using qmgr -c “set server node_fail_requeue=<seconds>”. This capability should be helpful.
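For example (the 300-second value here is just an illustration; choose a grace period that suits your cluster):

```bash
# Requeue jobs from a failed/unresponsive node after 300 seconds
qmgr -c "set server node_fail_requeue=300"

# Verify the setting
qmgr -c "print server" | grep node_fail_requeue
```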

Please note:

  • Any authentication issue for the user on a compute node will put the job into the hold state, so the user must be able to SSH into the compute node without any issues (i.e., passwordless SSH works for the user) before the job is scheduled onto a node. A quick test is sketched below.
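One way to verify this per user, with BatchMode disabling password prompts so the command fails fast if key-based authentication is broken (the node name is a placeholder):

```bash
# Run as the user; prints OK only if passwordless SSH works
ssh -o BatchMode=yes -o ConnectTimeout=5 computenode01 true && echo OK || echo FAILED
```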

Could you please describe what you mean by interrupted jobs?

By interrupted jobs, I’m referring to jobs that have not been able to finish due to some failure but are recoverable once the necessary services and daemons are working again. I’ve seen some material briefly highlighting PBS being able to recover from failures, and a little documentation referring to recovering jobs.

OpenPBS, or any workload manager, does not know whether jobs have computed successfully with respect to the application, nor can it analyze whether the job’s computations are going in the right direction. Workload managers are like a postman: they schedule jobs onto compute nodes based on policies, and do not know/read the contents of the post (the job). If the exit status of the application command line is 0, the job succeeded; otherwise it failed (see qstat -fx | grep -i exit). But an intelligent parser script as part of the PBS script, one that reads the log output after the main application has exited and decides whether the run was successful and what the exit code should be, would add another dimension to the job runs (see the sketch below).
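A rough sketch of such a parser step (the application name, log file, and the “converged” success marker are assumptions for illustration):

```bash
#!/bin/bash
#PBS -N myjob
#PBS -l select=1:ncpus=4

cd "$PBS_O_WORKDIR"

# Run the main application; capture its raw exit status
./my_app > my_app.log 2>&1
app_status=$?

# Parser step: only report success if the binary exited 0 AND the log
# contains the expected success marker
if [ "$app_status" -eq 0 ] && grep -q "converged" my_app.log; then
    exit 0   # PBS records Exit_status = 0 (success)
else
    exit 1   # PBS records a non-zero Exit_status (failure)
fi
```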

  • PBS can reschedule jobs onto a different node if the node a job is running on has issues, is rebooted, or crashes abruptly, among other scenarios.

  • The PBS server can be configured for High Availability.

  • PBS supports application-level checkpoint and restart.

  • PBS supports suspend and resume of a job.

These jobs have to be resubmitted. If an iterator file or restart file was created before the interruption, and the application is intelligent enough to read that file and pick up where it left off, then upon resubmission the application will run from that point onwards and not from the start.
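For illustration, the resubmitted job script might branch on the presence of such a file (the application name, flag, and file name are assumptions):

```bash
#!/bin/bash
#PBS -N restartable_job
cd "$PBS_O_WORKDIR"

# If a restart/iterator file survived the interruption, resume from it;
# otherwise start the computation from scratch
if [ -f restart.dat ]; then
    ./my_app --restart restart.dat
else
    ./my_app
fi
```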

Thanks for the additional clarification. I’m aware that whether a job is known to have finished or not is based on exit codes, and not on whether the job was successful in terms of some other desired outcome.

> These jobs have to be resubmitted. If an iterator file or restart file was created before the interruption, and the application is intelligent enough to read that file and pick up where it left off, then upon resubmission the application will run from that point onwards and not from the start.

This is what is making me a bit concerned about using autofs to mount user home directories containing their SSH keys. If the network hosting the mount goes down, then I believe PBS would lose the user’s SSH authentication. Whether that would kill an existing job or prevent PBS from rescheduling a job to another node (if needed), I’m not sure.

You could try using SSH host-based authentication, which does not depend on user SSH keys; this is safer than user-key-based authentication. Also, you can implement health-check scripts in OpenPBS hooks (a mom periodic hook, or execjob_begin) to find out whether the home directory is mounted and, depending on that, offline the node or reject jobs, so that jobs are not scheduled onto it. A rough sketch of such a hook is below.
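An untested sketch of the execjob_begin idea; it assumes home directories live under /home/<user> and that the job’s euser attribute is readable at this point:

```python
# Hypothetical execjob_begin hook: home directory mount check
import os
import pbs

e = pbs.event()
home = os.path.join("/home", e.job.euser)

if not os.path.ismount(home):
    # Offline this vnode so the scheduler stops placing jobs here
    local_node = pbs.get_local_nodename()
    e.vnode_list[local_node].state = pbs.ND_OFFLINE
    e.reject("home directory %s is not mounted on this node" % home)
else:
    e.accept()
```

The hook would then be installed with qmgr -c “create hook home_check event=execjob_begin” followed by qmgr -c “import hook home_check application/x-python default home_check.py” (the hook and file names are placeholders).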

This needs to be tested without PBS, by launching the same application manually on the compute node, unmounting the home directory, and observing the consequences; the same consequences will be channelled through PBS. I am not sure about any autonomic operations at this moment, other than having a server periodic hook to detect that a compute node has lost the home directory and then kill the job, or restart the PBS MoM service on that node, which will requeue the job on another node.
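A minimal manual test along those lines (the application, log file, and path are placeholders; -l performs a lazy unmount):

```bash
# On a compute node, as the user: start the application outside PBS...
./my_app > my_app.log 2>&1 &

# ...then simulate losing the autofs mount and watch the process
sudo umount -l /home/username
tail -f my_app.log   # does the app hang, crash, or keep running?
```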

If user authentication fails, the job will be put into the “H” (hold) state, after PBS retries running the job 21 times. At that point the job has to be resubmitted.
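To spot jobs in that state and see why they are held, something like:

```bash
# Show all jobs with their state and a one-line comment
qstat -s

# Full details for one held job (the job ID is a placeholder)
qstat -f 1234.server | grep -iE "job_state|comment"
```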