Cannot run on an exec node, seems related to ssh/scp

Dear friends,
I have 7 nodes: node 1 is the master node, and nodes 2-7 are execution nodes.
I can submit jobs to node 1 and get the results I want.
However, when I submit a job that designates any of nodes 2-7 (i.e. anything other than the master node), it goes into the running state but does not produce any results.

For example, I designated node 2 to run a simple task, and it did not deliver the results to the working folder. I looked into the MoM logs:

[user1@node02 user1]$ vi /var/spool/pbs/mom_logs/20221214

and found these entries:

12/14/2022 13:51:09;0080;pbs_mom;Fil;sys_copy;command: /opt/pbs/sbin/pbs_rcp -rp /var/spool/pbs/spool/23.node.OU user1@node:/home2/user1/testing.o23 status=1, try=4
12/14/2022 13:51:30;0001;pbs_mom;Fil;copy_file;Job 23.node: sys_copy failed, return value=1
12/14/2022 13:51:30;0004;pbs_mom;Fil;23.node.OU;Unable to copy file /var/spool/pbs/spool/23.node.OU to node:/home2/user1/testing.o23
12/14/2022 13:51:30;0004;pbs_mom;Fil;23.node.OU;node: Connection refused
12/14/2022 13:51:30;0004;pbs_mom;Fil;23.node.OU;ssh host node, user user1, command scp -v -r -p -t /home2/user1/testing.o23
12/14/2022 13:51:30;0004;pbs_mom;Fil;23.node.OU;OpenSSH_8.0p1, OpenSSL 1.1.1k  FIPS 25 Mar 2021
12/14/2022 13:51:30;0004;pbs_mom;Fil;23.node.OU;debug1: Reading configuration data /etc/ssh/ssh_config
12/14/2022 13:51:30;0004;pbs_mom;Fil;23.node.OU;debug1: Reading configuration data /etc/ssh/ssh_config.d/05-redhat.conf
12/14/2022 13:51:30;0004;pbs_mom;Fil;23.node.OU;debug1: Reading configuration data /etc/crypto-policies/back-ends/openssh.config
12/14/2022 13:51:30;0004;pbs_mom;Fil;23.node.OU;debug1: configuration requests final Match pass
12/14/2022 13:51:30;0004;pbs_mom;Fil;23.node.OU;debug1: re-parsing configuration
12/14/2022 13:51:30;0004;pbs_mom;Fil;23.node.OU;debug1: Reading configuration data /etc/ssh/ssh_config
12/14/2022 13:51:30;0004;pbs_mom;Fil;23.node.OU;debug1: Reading configuration data /etc/ssh/ssh_config.d/05-redhat.conf
12/14/2022 13:51:30;0004;pbs_mom;Fil;23.node.OU;debug1: Reading configuration data /etc/crypto-policies/back-ends/openssh.config
12/14/2022 13:51:30;0004;pbs_mom;Fil;23.node.OU;debug1: Connecting to node [10.2.208.101] port 22.
12/14/2022 13:51:30;0004;pbs_mom;Fil;23.node.OU;debug1: Connection established.
12/14/2022 13:51:30;0004;pbs_mom;Fil;23.node.OU;debug1: identity file /home/user1/.ssh/id_rsa type -1
12/14/2022 13:51:30;0004;pbs_mom;Fil;23.node.OU;debug1: identity file /home/user1/.ssh/id_rsa-cert type -1
12/14/2022 13:51:30;0004;pbs_mom;Fil;23.node.OU;debug1: identity file /home/user1/.ssh/id_dsa type -1
12/14/2022 13:51:30;0004;pbs_mom;Fil;23.node.OU;debug1: identity file /home/user1/.ssh/id_dsa-cert type -1
12/14/2022 13:51:30;0004;pbs_mom;Fil;23.node.OU;debug1: identity file /home/user1/.ssh/id_ecdsa type -1
12/14/2022 13:51:30;0004;pbs_mom;Fil;23.node.OU;debug1: identity file /home/user1/.ssh/id_ecdsa-cert type -1
12/14/2022 13:51:30;0004;pbs_mom;Fil;23.node.OU;debug1: identity file /home/user1/.ssh/id_ed25519 type -1
12/14/2022 13:51:30;0004;pbs_mom;Fil;23.node.OU;debug1: identity file /home/user1/.ssh/id_ed25519-cert type -1
12/14/2022 13:51:30;0004;pbs_mom;Fil;23.node.OU;debug1: identity file /home/user1/.ssh/id_xmss type -1
12/14/2022 13:51:30;0004;pbs_mom;Fil;23.node.OU;debug1: identity file /home/user1/.ssh/id_xmss-cert type -1
12/14/2022 13:51:30;0004;pbs_mom;Fil;23.node.OU;debug1: Local version string SSH-2.0-OpenSSH_8.0
12/14/2022 13:51:30;0004;pbs_mom;Fil;23.node.OU;debug1: Remote protocol version 2.0, remote software version OpenSSH_8.0
12/14/2022 13:51:30;0004;pbs_mom;Fil;23.node.OU;debug1: match: OpenSSH_8.0 pat OpenSSH* compat 0x04000000
12/14/2022 13:51:30;0004;pbs_mom;Fil;23.node.OU;debug1: Authenticating to node:22 as 'user1'
12/14/2022 13:51:30;0004;pbs_mom;Fil;23.node.OU;debug1: SSH2_MSG_KEXINIT sent
12/14/2022 13:51:30;0004;pbs_mom;Fil;23.node.OU;debug1: SSH2_MSG_KEXINIT received
12/14/2022 13:51:30;0004;pbs_mom;Fil;23.node.OU;debug1: kex: algorithm: curve25519-sha256
12/14/2022 13:51:30;0004;pbs_mom;Fil;23.node.OU;debug1: kex: host key algorithm: ecdsa-sha2-nistp256
12/14/2022 13:51:30;0004;pbs_mom;Fil;23.node.OU;debug1: kex: server->client cipher: aes256-gcm@openssh.com MAC: <implicit> compression: none
12/14/2022 13:51:30;0004;pbs_mom;Fil;23.node.OU;debug1: kex: client->server cipher: aes256-gcm@openssh.com MAC: <implicit> compression: none
12/14/2022 13:51:30;0004;pbs_mom;Fil;23.node.OU;debug1: kex: curve25519-sha256 need=32 dh_need=32
12/14/2022 13:51:30;0004;pbs_mom;Fil;23.node.OU;debug1: kex: curve25519-sha256 need=32 dh_need=32
12/14/2022 13:51:30;0004;pbs_mom;Fil;23.node.OU;debug1: expecting SSH2_MSG_KEX_ECDH_REPLY
12/14/2022 13:51:30;0004;pbs_mom;Fil;23.node.OU;debug1: Server host key: ecdsa-sha2-nistp256 SHA256:iTVQj0N976KNZrShSECPYEnKggchsu0ZBoNOCuul1L8
12/14/2022 13:51:30;0004;pbs_mom;Fil;23.node.OU;Host key verification failed.
12/14/2022 13:51:30;0004;pbs_mom;Fil;23.node.OU;lost connection
12/14/2022 13:51:30;0004;pbs_mom;Fil;23.node.OU;ication failed.
12/14/2022 13:51:30;0004;pbs_mom;Fil;23.node.OU;lost connection
12/14/2022 13:51:30;0001;pbs_mom;Svr;pbs_mom;No such file or directory (2) in is_child_path, Failed to allocate memory
12/14/2022 13:51:30;0001;pbs_mom;Fil;stage_file;Job 23.node: no wildcards:remote stageout failed for user1 from /var/spool/pbs/spool/23.node.OU to node:/home2/user1/testing.o23

When I trace job 23, it says:

[user1@node02 user1]$ tracejob 23

Job: 23.mybay

12/14/2022 13:49:55  M    Started, pid = 12357
12/14/2022 13:49:55  M    task 00000001 terminated
12/14/2022 13:49:55  M    Terminated
12/14/2022 13:49:55  M    task 00000001 cput=00:00:00
12/14/2022 13:49:55  M    kill_job
12/14/2022 13:49:55  M    node02 cput=00:00:00 mem=0kb
12/14/2022 13:49:55  M    Obit sent
12/14/2022 13:49:56  M    copy file request received
12/14/2022 13:51:30  M    Unable to copy file /var/spool/pbs/spool/23.mybay.OU to mybay:/home2/user1/testing.o23
12/14/2022 13:51:30  M    mybay: Connection refused
12/14/2022 13:51:30  M    ssh host mybay, user user1, command scp -v -r -p -t /home2/user1/testing.o23
12/14/2022 13:51:30  M    OpenSSH_8.0p1, OpenSSL 1.1.1k  FIPS 25 Mar 2021
12/14/2022 13:51:30  M    debug1: Reading configuration data /etc/ssh/ssh_config
12/14/2022 13:51:30  M    debug1: Reading configuration data /etc/ssh/ssh_config.d/05-redhat.conf
12/14/2022 13:51:30  M    debug1: Reading configuration data /etc/crypto-policies/back-ends/openssh.config
12/14/2022 13:51:30  M    debug1: configuration requests final Match pass
12/14/2022 13:51:30  M    debug1: re-parsing configuration
12/14/2022 13:51:30  M    debug1: Reading configuration data /etc/ssh/ssh_config
12/14/2022 13:51:30  M    debug1: Reading configuration data /etc/ssh/ssh_config.d/05-redhat.conf
12/14/2022 13:51:30  M    debug1: Reading configuration data /etc/crypto-policies/back-ends/openssh.config
12/14/2022 13:51:30  M    debug1: Connecting to mybay [10.2.208.101] port 22.
12/14/2022 13:51:30  M    debug1: Connection established.
12/14/2022 13:51:30  M    debug1: identity file /home/user1/.ssh/id_rsa type -1
12/14/2022 13:51:30  M    debug1: identity file /home/user1/.ssh/id_rsa-cert type -1
12/14/2022 13:51:30  M    debug1: identity file /home/user1/.ssh/id_dsa type -1
12/14/2022 13:51:30  M    debug1: identity file /home/user1/.ssh/id_dsa-cert type -1
12/14/2022 13:51:30  M    debug1: identity file /home/user1/.ssh/id_ecdsa type -1
12/14/2022 13:51:30  M    debug1: identity file /home/user1/.ssh/id_ecdsa-cert type -1
12/14/2022 13:51:30  M    debug1: identity file /home/user1/.ssh/id_ed25519 type -1
12/14/2022 13:51:30  M    debug1: identity file /home/user1/.ssh/id_ed25519-cert type -1
12/14/2022 13:51:30  M    debug1: identity file /home/user1/.ssh/id_xmss type -1
12/14/2022 13:51:30  M    debug1: identity file /home/user1/.ssh/id_xmss-cert type -1
12/14/2022 13:51:30  M    debug1: Local version string SSH-2.0-OpenSSH_8.0
12/14/2022 13:51:30  M    debug1: Remote protocol version 2.0, remote software version OpenSSH_8.0
12/14/2022 13:51:30  M    debug1: match: OpenSSH_8.0 pat OpenSSH* compat 0x04000000
12/14/2022 13:51:30  M    debug1: Authenticating to mybay:22 as 'user1'
12/14/2022 13:51:30  M    debug1: SSH2_MSG_KEXINIT sent
12/14/2022 13:51:30  M    debug1: SSH2_MSG_KEXINIT received
12/14/2022 13:51:30  M    debug1: kex: algorithm: curve25519-sha256
12/14/2022 13:51:30  M    debug1: kex: host key algorithm: ecdsa-sha2-nistp256
12/14/2022 13:51:30  M    debug1: kex: server->client cipher: aes256-gcm@openssh.com MAC: <implicit> compression: none
12/14/2022 13:51:30  M    debug1: kex: client->server cipher: aes256-gcm@openssh.com MAC: <implicit> compression: none
12/14/2022 13:51:30  M    debug1: kex: curve25519-sha256 need=32 dh_need=32
12/14/2022 13:51:30  M    debug1: kex: curve25519-sha256 need=32 dh_need=32
12/14/2022 13:51:30  M    debug1: expecting SSH2_MSG_KEX_ECDH_REPLY
12/14/2022 13:51:30  M    debug1: Server host key: ecdsa-sha2-nistp256 SHA256:iTVQj0N976KNZrShSECPYEnKggchsu0ZBoNOCuul1L8
12/14/2022 13:51:30  M    Host key verification failed.
12/14/2022 13:51:30  M    lost connection
12/14/2022 13:51:30  M    ication failed.
12/14/2022 13:51:30  M    lost connection
12/14/2022 13:51:30  M    Job files not copied:---->>>>
12/14/2022 13:51:30  M    Unable to copy file /var/spool/pbs/spool/23.mybay.OU to mybay:/home2/user1/testing.o23
12/14/2022 13:51:30  M    >>> error from copy
12/14/2022 13:51:30  M    mybay: Connection refused
12/14/2022 13:51:30  M    ssh host mybay, user user1, command scp -v -r -p -t /home2/user1/testing.o23
12/14/2022 13:51:30  M    OpenSSH_8.0p1, OpenSSL 1.1.1k  FIPS 25 Mar 2021
12/14/2022 13:51:30  M    debug1: Reading configuration data /etc/ssh/ssh_config
12/14/2022 13:51:30  M    debug1: Reading configuration data /etc/ssh/ssh_config.d/05-redhat.conf
12/14/2022 13:51:30  M    debug1: Reading configuration data /etc/crypto-policies/back-ends/openssh.config
12/14/2022 13:51:30  M    debug1: configuration requests final Match pass
12/14/2022 13:51:30  M    debug1: re-parsing configuration
12/14/2022 13:51:30  M    debug1: Reading configuration data /etc/ssh/ssh_config
12/14/2022 13:51:30  M    debug1: Reading configuration data /etc/ssh/ssh_config.d/05-redhat.conf
12/14/2022 13:51:30  M    debug1: Reading configuration data /etc/crypto-policies/back-ends/openssh.config
12/14/2022 13:51:30  M    debug1: Connecting to mybay [10.2.208.101] port 22.
12/14/2022 13:51:30  M    debug1: Connection established.
12/14/2022 13:51:30  M    debug1: identity file /home/user1/.ssh/id_rsa type -1
12/14/2022 13:51:30  M    debug1: identity file /home/user1/.ssh/id_rsa-cert type -1
12/14/2022 13:51:30  M    debug1: identity file /home/user1/.ssh/id_dsa type -1
12/14/2022 13:51:30  M    debug1: identity file /home/user1/.ssh/id_dsa-cert type -1
12/14/2022 13:51:30  M    debug1: identity file /home/user1/.ssh/id_ecdsa type -1
12/14/2022 13:51:30  M    debug1: identity file /home/user1/.ssh/id_ecdsa-cert type -1
12/14/2022 13:51:30  M    debug1: identity file /home/user1/.ssh/id_ed25519 type -1
12/14/2022 13:51:30  M    debug1: identity file /home/user1/.ssh/id_ed25519-cert type -1
12/14/2022 13:51:30  M    debug1: identity file /home/user1/.ssh/id_xmss type -1
12/14/2022 13:51:30  M    debug1: identity file /home/user1/.ssh/id_xmss-cert type -1
12/14/2022 13:51:30  M    debug1: Local version string SSH-2.0-OpenSSH_8.0
12/14/2022 13:51:30  M    debug1: Remote protocol version 2.0, remote software version OpenSSH_8.0
12/14/2022 13:51:30  M    debug1: match: OpenSSH_8.0 pat OpenSSH* compat 0x04000000
12/14/2022 13:51:30  M    debug1: Authenticating to mybay:22 as 'user1'
12/14/2022 13:51:30  M    debug1: SSH2_MSG_KEXINIT sent
12/14/2022 13:51:30  M    debug1: SSH2_MSG_KEXINIT received
12/14/2022 13:51:30  M    debug1: kex: algorithm: curve25519-sha256
12/14/2022 13:51:30  M    debug1: kex: host key algorithm: ecdsa-sha2-nistp256
12/14/2022 13:51:30  M    debug1: kex: server->client cipher: aes256-gcm@openssh.com MAC: <implicit> compression: none
12/14/2022 13:51:30  M    debug1: kex: client->server cipher: aes256-gcm@openssh.com MAC: <implicit> compression: none
12/14/2022 13:51:30  M    debug1: kex: curve25519-sha256 need=32 dh_need=32
12/14/2022 13:51:30  M    debug1: kex: curve25519-sha256 need=32 dh_need=32
12/14/2022 13:51:30  M    debug1: expecting SSH2_MSG_KEX_ECDH_REPLY
12/14/2022 13:51:30  M    debug1: Server host key: ecdsa-sha2-nistp256 SHA256:iTVQj0N976KNZrShSECPYEnKggchsu0ZBoNOCuul1L8
12/14/2022 13:51:30  M    Host key verification failed.
12/14/2022 13:51:30  M    lost connection
12/14/2022 13:51:30  M    ication failed.
12/14/2022 13:51:30  M    lost connection
12/14/2022 13:51:30  M    >>> end error output
12/14/2022 13:51:30  M    Output retained on that host in: /var/spool/pbs/undelivered/23.mybay.OU
12/14/2022 13:51:30  M    ---->>>>
12/14/2022 13:51:30  M    Staged 0/1 items out over 0:01:34
12/14/2022 13:51:30  M    no active tasks
12/14/2022 13:51:30  M    Obit sent
12/14/2022 13:51:30  M    delete job request received
12/14/2022 13:51:30  M    kill_job
12/14/2022 13:51:30  M    delete job request received

May I know what the problem might be and how to fix it?
Thanks

Best
Austin

Is passwordless ssh working seamlessly for that user?
Could you please disable StrictHostKeyChecking on the master node and the compute nodes and try again?
Could you please share the job script?
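
A minimal sketch of those two checks, to be run as the affected user on the execution node. The names user1, node and node02 are taken from the logs above; the DRY_RUN switch and the run helper are only illustrative and not part of PBS or OpenSSH. By default the script just prints the commands it would run.

```shell
#!/bin/sh
# DRY_RUN=1 (the default here) only prints each command; set DRY_RUN=0 on a real node.
DRY_RUN=${DRY_RUN:-1}
run() {
    if [ "$DRY_RUN" = 1 ]; then echo "+ $*"; else "$@"; fi
}

# Assumed names, taken from the logs above: the job ran on node02 and
# pbs_mom tried to copy the output back to host "node" as user1.
target=user1@node

# 1. BatchMode=yes makes ssh fail instead of prompting, which is roughly
#    what a non-interactive copy started by pbs_mom will experience.
run ssh -o BatchMode=yes -o ConnectTimeout=5 "$target" true

# 2. Same test with host key checking switched off for this one command.
#    If only this one succeeds, the host key is what is missing or stale.
run ssh -o BatchMode=yes -o StrictHostKeyChecking=no "$target" true
```

If only the second command succeeds, one option is to add the master node's host key to the user's known_hosts (for example with ssh-keyscan) rather than leaving the check disabled permanently.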

Dear Adarsh,

Thank you.
How can I set up passwordless ssh? Is there a page I can refer to?

The job script is as below:

#!/bin/sh
#PBS -l nodes=node02:ppn=2
#PBS -l walltime=00:10:00
#PBS -q medium
#PBS -j oe
#PBS -N testing

for i in $(seq 1 20)
do
        date
        echo $i
done
echo "hi i am in my first job"
echo date

Thank you!

Austin

Thank you for sharing the script

Your script looks good to me, but you could switch to the newer resource request syntax, which uses select and ncpus.
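
As a rough sketch, the same request in the select/ncpus form might look like the script below. The host=node02 chunk resource is my assumption of the closest match to nodes=node02:ppn=2; please check it against your site's configuration before relying on it.

```shell
#!/bin/sh
# One chunk with 2 CPUs, pinned to node02 (assumed equivalent of nodes=node02:ppn=2)
#PBS -l select=1:ncpus=2:host=node02
#PBS -l walltime=00:10:00
#PBS -q medium
#PBS -j oe
#PBS -N testing

# Print the current time and the loop counter 20 times
for i in $(seq 1 20)
do
    date
    echo "$i"
done
echo "hi i am in my first job"
```

The #PBS lines are ordinary comments to the shell, so the script can also be run directly with sh to check the body before submitting it with qsub.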

For passwordless ssh, please refer:
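
In outline, the usual setup looks like the sketch below, run once per user. The host names, the RSA key path and the DRY_RUN/run helper are assumptions for illustration; by default the script only prints the commands instead of executing them.

```shell
#!/bin/sh
# DRY_RUN=1 (the default here) only prints each command; set DRY_RUN=0 to execute.
DRY_RUN=${DRY_RUN:-1}
run() {
    if [ "$DRY_RUN" = 1 ]; then echo "+ $*"; else "$@"; fi
}

key="$HOME/.ssh/id_rsa"   # assumed key path; other key types work the same way
hosts="node node02"       # assumed host names; list every node in the cluster

# Create a key pair with an empty passphrase, but only if none exists yet
[ -f "$key" ] || run ssh-keygen -t rsa -b 4096 -N "" -f "$key"

# Install the public key on each host (asks for the password one last time)
for h in $hosts; do
    run ssh-copy-id -i "$key.pub" "$h"
done
```

If the home directories are shared across the nodes, appending the public key to the user's own ~/.ssh/authorized_keys once should be enough; otherwise the steps need repeating on each node the user must connect from.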

Dear Adarsh,

(Q1) Do I have to enable passwordless ssh for each user, so that they can all submit to nodes other than the login node?

(Q2) If so, will enabling passwordless ssh for each user pose dangers, as the users can now ‘hop’ around the nodes freely as they want?

(Q3) On a separate HPC cluster that uses OpenPBS, where I am an ordinary user, I cannot hop to the execution nodes using passwordless ssh.

Best
Austin

Hi @austin, please find the answers to your queries below:

Yes, for all users, passwordless ssh should work between the PBS server and the compute nodes, and between the compute nodes themselves.

Please check the PBS Pro Administrator's Guide and the $restrict_user MoM config directive to disallow users from ssh’ing directly into compute nodes.

Sure, completely understood. Please check the $restrict_user options available to be configured in $PBS_HOME/mom_priv/config (on all the compute nodes).
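
As a hedged sketch, the relevant MoM settings could be prepared like this. The account name adminuser and the value 999 are placeholders, and the path /var/spool/pbs is assumed from the logs earlier in this thread; please confirm the directives against the Administrator's Guide for your PBS version.

```shell
#!/bin/sh
# Write the proposed MoM settings to a scratch file for review first.
fragment=$(mktemp)
cat > "$fragment" <<'EOF'
$restrict_user True
$restrict_user_exceptions adminuser
$restrict_user_maxsysid 999
EOF
cat "$fragment"

# After review, as root on each compute node (not executed here):
#   cat "$fragment" >> /var/spool/pbs/mom_priv/config
#   kill -HUP "$(pgrep -x pbs_mom)"   # ask MoM to re-read its config
```

The exceptions line is meant for accounts that should be left alone even when they have no job on the node, and the maxsysid line is meant to exempt system accounts up to that UID.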

Thank you,
Adarsh

Dear Adarsh,

Thank you. All problems are solved.

BTW, for preventing users from hopping onto the execution nodes, I found a better way than setting “$restrict_user True”, which might be helpful for other readers of this post:

Just create the user on the execution nodes with a different password from the one assigned to the user on the login node. The user will then be unable to log in to the execution nodes with his/her login password. But since we have set up passwordless ssh from the execution nodes to the login node, the user can still ssh to the login node from the execution nodes. It is asymmetric, and that is great.

So, for example, if you are the admin on node2 and want to check everything about a user user1, you can switch to that account without getting killed, unlike in the “$restrict_user True” scenario.

Thank you
Austin