I have a PBS Pro cluster with three VMs:
VM1: the PBS Pro server package is installed and the pbs service is started as root. There is also a user “test” with sudo permission.
VM2: the PBS Pro execution package is installed and the pbs service is started as root. There is also a user “test” with sudo permission.
VM3: the PBS Pro client package is installed as root. There is also a user “test” with sudo permission.
Then I performed the following configuration:
VM1’s user root can log in to VM2 without a password via “ssh root@VM2”.
VM2’s user root can log in to VM1 without a password via “ssh root@VM1”.
VM1’s user root can log in to VM3 without a password via “ssh test@VM3”.
VM2’s user root can log in to VM3 without a password via “ssh test@VM3”.
VM3’s user test can log in to VM1 without a password via “ssh root@VM1”.
VM3’s user test can log in to VM2 without a password via “ssh root@VM2”.
When I run 'echo "sleep 60" | qsub' on VM3, the job could not execute, and the MoM log looked like this:
11/08/2018 06:07:00;0080;pbs_mom;Fil;sys_copy;command: /bin/scp -Brvp /var/spool/pbs/spool/13.pbstest1.OU test@pbsproclientserver:/home/STDIN.o13 status=1, try=1
11/08/2018 06:07:31;0080;pbs_mom;Fil;sys_copy;command: /opt/pbs/sbin/pbs_rcp -rp /var/spool/pbs/spool/13.pbstest1.OU test@pbsproclientserver:/home/STDIN.o13 status=1, try=2
11/08/2018 06:07:42;0080;pbs_mom;Fil;sys_copy;command: /bin/scp -Brvp /var/spool/pbs/spool/13.pbstest1.OU test@pbsproclientserver:/home/STDIN.o13 status=1, try=3
11/08/2018 06:08:13;0080;pbs_mom;Fil;sys_copy;command: /opt/pbs/sbin/pbs_rcp -rp /var/spool/pbs/spool/13.pbstest1.OU test@pbsproclientserver:/home/STDIN.o13 status=1, try=4
11/08/2018 06:08:34;0004;pbs_mom;Fil;13.pbstest1.OU;Unable to copy file /var/spool/pbs/spool/13.pbstest1.OU to pbsproclientserver:/home/STDIN.o13
11/08/2018 06:08:34;0004;pbs_mom;Fil;13.pbstest1.OU;pbsproclientserver: Connection refused
11/08/2018 06:08:34;0004;pbs_mom;Fil;13.pbstest1.OU;in.oraclevcn.com, user test, command scp -v -r -p -t /home/STDIN.o13
11/08/2018 06:08:34;0004;pbs_mom;Fil;13.pbstest1.OU;OpenSSH_7.4p1, OpenSSL 1.0.2k-fips 26 Jan 2017
11/08/2018 06:08:34;0004;pbs_mom;Fil;13.pbstest1.OU;debug1: Reading configuration data /etc/ssh/ssh_config
11/08/2018 06:08:34;0004;pbs_mom;Fil;13.pbstest1.OU;debug1: /etc/ssh/ssh_config line 58: Applying options for *
11/08/2018 06:08:34;0004;pbs_mom;Fil;13.pbstest1.OU;debug1: Connecting to pbsproclientserver [10.0.0.15] port 22.
11/08/2018 06:08:34;0004;pbs_mom;Fil;13.pbstest1.OU;debug1: Connection established.
11/08/2018 06:08:34;0004;pbs_mom;Fil;13.pbstest1.OU;debug1: key_load_private_type: No such file or directory
11/08/2018 06:08:34;0004;pbs_mom;Fil;13.pbstest1.OU;debug1: key_load_private_cert: Permission denied
11/08/2018 06:08:34;0004;pbs_mom;Fil;13.pbstest1.OU;debug1: key_load_private_cert: Permission denied
11/08/2018 06:08:34;0004;pbs_mom;Fil;13.pbstest1.OU;debug1: key_load_private_cert: Permission denied
11/08/2018 06:08:34;0004;pbs_mom;Fil;13.pbstest1.OU;debug1: key_load_private_cert: No such file or directory
11/08/2018 06:08:34;0004;pbs_mom;Fil;13.pbstest1.OU;debug1: key_load_private_type: Permission denied
11/08/2018 06:08:34;0004;pbs_mom;Fil;13.pbstest1.OU;debug1: key_load_private_type: Permission denied
11/08/2018 06:08:34;0004;pbs_mom;Fil;13.pbstest1.OU;debug1: key_load_private_type: Permission denied
11/08/2018 06:08:34;0004;pbs_mom;Fil;13.pbstest1.OU;debug1: key_load_private_type: No such file or directory
11/08/2018 06:08:34;0004;pbs_mom;Fil;13.pbstest1.OU;debug1: key_load_cert: No such file or directory
11/08/2018 06:08:34;0004;pbs_mom;Fil;13.pbstest1.OU;debug1: key_load_cert: No such file or directory
11/08/2018 06:08:34;0004;pbs_mom;Fil;13.pbstest1.OU;debug1: key_load_cert: No such file or directory
11/08/2018 06:08:34;0004;pbs_mom;Fil;13.pbstest1.OU;debug1: key_load_cert: No such file or directory
11/08/2018 06:08:34;0004;pbs_mom;Fil;13.pbstest1.OU;debug1: key_load_public: No such file or directory
11/08/2018 06:08:34;0004;pbs_mom;Fil;13.pbstest1.OU;debug1: identity file /home/test/.ssh/id_rsa type 1
11/08/2018 06:08:34;0004;pbs_mom;Fil;13.pbstest1.OU;debug1: key_load_public: No such file or directory
11/08/2018 06:08:34;0004;pbs_mom;Fil;13.pbstest1.OU;debug1: identity file /home/test/.ssh/id_rsa-cert type -1
11/08/2018 06:08:34;0004;pbs_mom;Fil;13.pbstest1.OU;debug1: key_load_public: No such file or directory
11/08/2018 06:08:34;0004;pbs_mom;Fil;13.pbstest1.OU;debug1: identity file /home/test/.ssh/id_dsa type -1
11/08/2018 06:08:34;0004;pbs_mom;Fil;13.pbstest1.OU;debug1: key_load_public: No such file or directory
11/08/2018 06:08:34;0004;pbs_mom;Fil;13.pbstest1.OU;debug1: identity file /home/test/.ssh/id_dsa-cert type -1
11/08/2018 06:08:34;0004;pbs_mom;Fil;13.pbstest1.OU;debug1: key_load_public: No such file or directory
11/08/2018 06:08:34;0004;pbs_mom;Fil;13.pbstest1.OU;debug1: identity file /home/test/.ssh/id_ecdsa type -1
11/08/2018 06:08:34;0004;pbs_mom;Fil;13.pbstest1.OU;debug1: key_load_public: No such file or directory
11/08/2018 06:08:34;0004;pbs_mom;Fil;13.pbstest1.OU;debug1: identity file /home/test/.ssh/id_ecdsa-cert type -1
11/08/2018 06:08:34;0004;pbs_mom;Fil;13.pbstest1.OU;debug1: key_load_public: No such file or directory
11/08/2018 06:08:34;0004;pbs_mom;Fil;13.pbstest1.OU;debug1: identity file /home/test/.ssh/id_ed25519 type -1
11/08/2018 06:08:34;0004;pbs_mom;Fil;13.pbstest1.OU;debug1: key_load_public: No such file or directory
11/08/2018 06:08:34;0004;pbs_mom;Fil;13.pbstest1.OU;debug1: identity file /home/test/.ssh/id_ed25519-cert type -1
11/08/2018 06:08:34;0004;pbs_mom;Fil;13.pbstest1.OU;debug1: Enabling compatibility mode for protocol 2.0
11/08/2018 06:08:34;0004;pbs_mom;Fil;13.pbstest1.OU;debug1: Local version string SSH-2.0-OpenSSH_7.4
11/08/2018 06:08:34;0004;pbs_mom;Fil;13.pbstest1.OU;debug1: Remote protocol version 2.0, remote software version OpenSSH_7.4
11/08/2018 06:08:34;0004;pbs_mom;Fil;13.pbstest1.OU;debug1: match: OpenSSH_7.4 pat OpenSSH* compat 0x04000000
11/08/2018 06:08:34;0004;pbs_mom;Fil;13.pbstest1.OU;debug1: Authenticating to pbsproclientserver:22 as ‘test’
11/08/2018 06:08:34;0004;pbs_mom;Fil;13.pbstest1.OU;debug1: SSH2_MSG_KEXINIT sent
11/08/2018 06:08:34;0004;pbs_mom;Fil;13.pbstest1.OU;debug1: SSH2_MSG_KEXINIT received
11/08/2018 06:08:34;0004;pbs_mom;Fil;13.pbstest1.OU;debug1: kex: algorithm: curve25519-sha256
11/08/2018 06:08:34;0004;pbs_mom;Fil;13.pbstest1.OU;debug1: kex: host key algorithm: ecdsa-sha2-nistp256
11/08/2018 06:08:34;0004;pbs_mom;Fil;13.pbstest1.OU;debug1: kex: server->client cipher: chacha20-poly1305@openssh.com MAC: compression: none
11/08/2018 06:08:34;0004;pbs_mom;Fil;13.pbstest1.OU;debug1: kex: client->server cipher: chacha20-poly1305@openssh.com MAC: compression: none
11/08/2018 06:08:34;0004;pbs_mom;Fil;13.pbstest1.OU;debug1: kex: curve25519-sha256 need=64 dh_need=64
11/08/2018 06:08:34;0004;pbs_mom;Fil;13.pbstest1.OU;debug1: kex: curve25519-sha256 need=64 dh_need=64
11/08/2018 06:08:34;0004;pbs_mom;Fil;13.pbstest1.OU;debug1: expecting SSH2_MSG_KEX_ECDH_REPLY
11/08/2018 06:08:34;0004;pbs_mom;Fil;13.pbstest1.OU;debug1: Server host key: ecdsa-sha2-nistp256 SHA256:2U4Z1gS93Kdl8uaEhp2tY95Np9IsSQgcknRtoglGigs
11/08/2018 06:08:34;0004;pbs_mom;Fil;13.pbstest1.OU;Host key verification failed.
11/08/2018 06:08:34;0004;pbs_mom;Fil;13.pbstest1.OU;lost connection
But I could run the command below on VM2 (the MoM host) successfully:
Thanks for posting your question! First off, you shouldn’t need passwordless access as root between any of your VMs. You will need it for the test user.
The first message that needs investigation is this:
Try using scp to copy a file from the execution host to the system from which you submitted the job. Use the -v option on scp if you need verbose output to help debug any issues.
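A minimal sketch of that check, assuming the job owner is test and the submission host is pbsproclientserver, as in the log above. DEST_DIR and the scp_check.txt file name are made up for illustration. According to the log, MoM runs the copy as the job owner (it looks for /home/test/.ssh/id_rsa), so the test should be run as that user on the execution host, not as root. The -B flag selects batch mode, in which scp fails rather than prompting for a password or for confirmation of an unknown host key. The helper prints the command by default and only executes it when RUN=1 is set.

```shell
#!/bin/sh
# Values below are assumptions read off the MoM log; adjust for your site.
JOB_USER=test
SUBMIT_HOST=pbsproclientserver
DEST_DIR=/home/test          # hypothetical destination directory

# Print the command by default; set RUN=1 to execute it for real.
run() {
    if [ "${RUN:-0}" = 1 ]; then
        "$@"
    else
        printf '%s\n' "$*"
    fi
}

# Copy a small file the way MoM would: as the job owner, in batch mode.
scp_check() {
    run sudo -u "$JOB_USER" scp -B -v /etc/hostname \
        "$JOB_USER@$SUBMIT_HOST:$DEST_DIR/scp_check.txt"
}

scp_check
```

If the real run ends with "Host key verification failed", the job owner's known_hosts on the execution host most likely has no entry for the submission host.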
The other message that seems out of place is:
This message is actually coming from scp, and it indicates a problem with authentication between the two systems involved.
Make sure you don’t have any firewalls blocking packets between the systems. If you need to open ports, do so with caution, of course, and open only the ports you need unless you are on a totally secure network.
I don’t think you have any issues with name resolution or your execution host setup, because the job apparently ran. That’s another common problem we see.
Hi Mike,
Thanks so much for your reply. So passwordless access needs to be configured on all three VMs for the user “test”.
If the user “test” can ssh to any of the other VMs without a password, the workflow should work as designed. Is this correct?
So can I conclude that a PBS Pro cluster needs to have the same user on all the VMs in the cluster? Thanks a lot!
That’s the correct conclusion. However there was one thing I forgot to mention before… now that your test account is configured properly, you want to run the following qmgr command:
set server flatuid = True
That tells PBS that the same accounts exist across your cluster. Please do let us know if that fixes your problem.
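For reference, a sketch of running that qmgr command non-interactively. It assumes you are on the PBS server host and logged in as a PBS manager (typically root). The FLATUID_CMD variable name and the "not found" message are only there for illustration.

```shell
#!/bin/sh
# Run on the PBS server host as a PBS manager (typically root).
FLATUID_CMD='set server flatuid = True'

if command -v qmgr >/dev/null 2>&1; then
    qmgr -c "$FLATUID_CMD"
    # Read the attribute back to confirm the change.
    qmgr -c 'list server flatuid'
else
    echo "qmgr not found; run this on the PBS server host" >&2
fi
```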
Thanks so much, Mike. I will follow your guidance and try again.
Another question, if I may:
How do the server and the execution nodes communicate? Does the communication depend on passwordless ssh? That does not seem secure enough. Could you please help me understand this when you have bandwidth? Thanks a lot!
PBS Pro uses external tools (i.e. rcp or scp) to handle file transfers. Communication between PBS Pro components utilizes sockets directly, so there is no need to configure ssh keys for this purpose. For example, during job launch…
The server contacts the scheduler to begin a scheduling cycle
The scheduler responds to the server with the job IDs to run and the nodes assigned
The server contacts the MoM on the first node assigned to a job (which we call mother superior) and sends it the data it needs, including the full node list, the job script, etc.
Mother superior contacts the other nodes assigned to the job (the sisterhood) and relays the job information
The job begins execution on mother superior
It is only after the job exits that mother superior will invoke scp as the user to copy output back to the node from which the job was submitted. The user can tell PBS not to transfer the output files (keep them on the execution host) by specifying the -k parameter to qsub.
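As a sketch of the -k option, reusing the test job from the first post: -k oe asks PBS to keep both stdout (o) and stderr (e) on the execution host, so there is nothing for mother superior to copy back. As far as I know the retained files land in the job owner's home directory on that host by default. KEEP_OPT is just an illustrative variable name.

```shell
#!/bin/sh
# -k oe: keep both stdout (o) and stderr (e) on the execution host.
KEEP_OPT='oe'

if command -v qsub >/dev/null 2>&1; then
    echo "sleep 60" | qsub -k "$KEEP_OPT"
else
    echo "qsub not found; run this on a host with the PBS client commands" >&2
fi
```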
Thank you both so much for the detailed info, @mkaro @adarsh.
So this is my understanding of PBS Pro:
Passwordless ssh (host-based is recommended) is used only when mother superior invokes scp as the user to copy output back to the node from which the job was submitted.
If server flatuid is set to True, we need to make sure that:
the same user exists on mother superior and on the client host from which the job was submitted;
host-based passwordless ssh works between mother superior and the client host from which the job was submitted.
If we don’t want mother superior to use scp/rcp to copy output back, we don’t need to configure passwordless ssh between any mother superior and any client host from which jobs are submitted.
Could you please check whether my understanding is correct? Thanks a lot!
That sounds correct to me. If you have shared filesystems, you may also want to employ the $usecp configuration parameter for MoM. Please refer to section 15.7 of the PBS Pro 18.2 administrator’s guide. It covers the material discussed in this thread. You may download it here:
There is then a section showing scp -v, which successfully connects but shows lots of key_load_public: No such file or directory errors, although the keys are there, and executing the scp command manually works.
The log ends with this:
02/17/2020 15:45:27;0004;pbs_mom;Fil;339.mgmt.OU;Host key verification failed.
02/17/2020 15:45:27;0004;pbs_mom;Fil;339.mgmt.OU;lost connection
02/17/2020 15:45:27;0100;pbs_mom;Job;339.mgmt;Job files not copied:
Unable to copy file /var/spool/pbs/spool/339.mgmt.OU to login:/nfs/home/<user>/name.out
>>> error from copy
login: Connection refused
Does anyone have any idea what might be the cause and what the workaround/fix might be?
In /etc/ssh/ssh_config, set StrictHostKeyChecking no.
It seems that when users log in they have to accept the host key, and this is blocking the file copy (I am guessing).
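A narrower variant of the same setting, as a sketch: rather than turning the check off for every destination, an ssh_config Host block can limit it to the submission host. The host name login is taken from the log above. Whether this belongs in /etc/ssh/ssh_config or in the job owner's ~/.ssh/config is a site decision. Another option that leaves checking enabled is to add the submission host's key to the job owner's known_hosts on each compute node ahead of time.

```
Host login
    StrictHostKeyChecking no
```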
Please try running this command as that user on the compute node:
/bin/scp -Brvp /var/spool/pbs/spool/339.mgmt.OU <user>@login:/nfs/home/<user>/name.out
If this fails, check which of the -Brvp options is causing the failure, then write a wrapper around /usr/bin/scp (for example scp_wrapper.sh) that corrects this, and point PBS at it by adding PBS_SCP=/usr/bin/scp_wrapper.sh to /etc/pbs.conf.
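A minimal sketch of such a wrapper, assuming the problem turns out to be the host key prompt. It forwards whatever arguments MoM passes and prepends a single ssh option. The scp_wrapped function and the SCP_BIN override are hypothetical names used only to make the script easy to dry-run, and the option it adds is one possible correction, not necessarily the one your site needs.

```shell
#!/bin/sh
# Hypothetical wrapper, e.g. saved as /usr/bin/scp_wrapper.sh (chmod 755) and
# referenced from /etc/pbs.conf as: PBS_SCP=/usr/bin/scp_wrapper.sh
# SCP_BIN is an illustration-only override so the wrapper can be dry-run.
SCP_BIN=${SCP_BIN:-/usr/bin/scp}

scp_wrapped() {
    # Forward MoM's arguments unchanged, with one extra ssh option in front.
    "$SCP_BIN" -o StrictHostKeyChecking=no "$@"
}

# Only act when called with arguments, as MoM would call it.
if [ "$#" -gt 0 ]; then
    scp_wrapped "$@"
fi
```

Running it as SCP_BIN=echo sh scp_wrapper.sh -Brvp src dst prints the argument list that would be handed to scp, without copying anything. MoM probably needs a restart to pick up the pbs.conf change.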
In /etc/ssh/ssh_config, set StrictHostKeyChecking no.
It seems that when users log in they have to accept the host key, and this is blocking the file copy (I am guessing).
I see, this could be it. A prompt to accept the host key would probably stop any script from working, unless the user had already logged in to the node manually and accepted it. I’m surprised the -B option doesn’t take care of this. I’ll try setting this on the compute nodes and see what happens…