Host key verification failed

Hi, I have a new cluster set up with PBS Pro CE 19.1.3. I am running a simple script, but the job fails with the errors below. What could have gone wrong?
vamshi@master01:~$ pbsnodes -aSj
                                                      mem        ncpus    nmics    ngpus
vnode           state           njobs   run   susp     f/t         f/t      f/t      f/t   jobs
--------------- --------------- ------ ----- ------ ------------ -------- -------- -------- -----
node01          free                 0     0      0  126gb/126gb      8/8      0/0      0/0  --
master01        free                 0     0      0  126gb/126gb      1/1      0/0      3/3  --
node02          free                 0     0      0  126gb/126gb    20/20      0/0      3/3  --
node03          free                 0     0      0  126gb/126gb    20/20      0/0      3/3  --

test1.sh:
#!/bin/bash
#PBS -l walltime=1:00
#PBS -l nodes=4 <-- legacy
#PBS -l select=3:ncpus=1
echo -n "I am on: "
hostname
echo finding ssh-accessible nodes:
for node in $(cat ${PBS_NODEFILE}); do
    echo -n "running on: "
    /usr/bin/ssh $node hostname
done

test1.sh.o1006:
I am on: node01
finding ssh-accessible nodes:
running on: running on: running on:

test1.sh.e1006:
Host key verification failed.
Host key verification failed.
Host key verification failed.

@adarsh Can you please have a look at it.

Hi,
I am not sure, but it looks like you have a problem with host-based authentication.
Did you try connecting via ssh from the command line from one host (e.g. node01) to the others? Were you able to log in without entering a password?
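For example, running this from node01 (hostnames taken from your pbsnodes output) should print the remote hostname with no password prompt and no host-key question:

vamshi@node01:~$ ssh node02 hostname
node02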

Please edit /etc/ssh/ssh_config and set StrictHostKeyChecking to no (please check the correct syntax and capitalization of that option name). Set the same on the server and on the compute nodes.
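For example, a minimal sketch of the lines to add (the Host pattern here is just an assumption to keep the setting limited to your cluster nodes; adjust it, or use Host * if you prefer):

Host master* node*
    StrictHostKeyChecking no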

Yes. I am able to log in without entering a password when I ssh from one node to another.

Yes. I had added that line only on the master node. Now I have added it on the compute nodes too and it is working. My mistake. Thank you.
One more thing. With this line, #PBS -l select=3, I intend to run the job on three nodes, so the output should be something like below:
I am on: node01
finding ssh-accessible nodes:
running on: node01
running on: node02
running on: node03

But it always runs on only one node, like this:
I am on: node01
finding ssh-accessible nodes:
running on: node01
running on: node01
running on: node01

Is something wrong with the configuration below?

vamshi@master01:~$ pbsnodes -a
node01
     Mom = node01.cm.cluster
     ntype = PBS
     state = free
     pcpus = 40
     resources_available.arch = linux
     resources_available.host = node01
     resources_available.mem = 131892920kb
     resources_available.ncpus = 20
     resources_available.vnode = node01
     resources_assigned.accelerator_memory = 0kb
     resources_assigned.hbmem = 0kb
     resources_assigned.mem = 0kb
     resources_assigned.naccelerators = 0
     resources_assigned.ncpus = 0
     resources_assigned.vmem = 0kb
     queue = workq
     resv_enable = True
     sharing = default_shared
     last_state_change_time = Wed Jun 16 23:45:11 2021
     last_used_time = Thu Jun 17 00:03:56 2021

node02
     Mom = node02.cm.cluster
     ntype = PBS
     state = free
     pcpus = 40
     resources_available.arch = linux
     resources_available.host = node02
     resources_available.mem = 131893224kb
     resources_available.ncpus = 20
     resources_available.vnode = node02
     resources_assigned.accelerator_memory = 0kb
     resources_assigned.hbmem = 0kb
     resources_assigned.mem = 0kb
     resources_assigned.naccelerators = 0
     resources_assigned.ncpus = 0
     resources_assigned.vmem = 0kb
     queue = workq
     resv_enable = True
     sharing = default_shared
     last_state_change_time = Wed Jun 16 23:45:11 2021

node03
     Mom = node03.cm.cluster
     ntype = PBS
     state = free
     pcpus = 40
     resources_available.arch = linux
     resources_available.host = node03
     resources_available.mem = 131893232kb
     resources_available.ncpus = 20
     resources_available.vnode = node03
     resources_assigned.accelerator_memory = 0kb
     resources_assigned.hbmem = 0kb
     resources_assigned.mem = 0kb
     resources_assigned.naccelerators = 0
     resources_assigned.ncpus = 0
     resources_assigned.vmem = 0kb
     queue = workq
     resv_enable = True
     sharing = default_shared
     last_state_change_time = Wed Jun 16 23:45:11 2021

Please try these lines
#PBS -l select=3:ncpus=1
#PBS -l place=scatter
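For reference, a rough sketch of how test1.sh would look with these directives (only the legacy nodes= line dropped; the rest is unchanged from your script):

#!/bin/bash
#PBS -l walltime=1:00
#PBS -l select=3:ncpus=1
#PBS -l place=scatter
echo -n "I am on: "
hostname
echo finding ssh-accessible nodes:
for node in $(cat ${PBS_NODEFILE}); do
    echo -n "running on: "
    /usr/bin/ssh $node hostname
done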

Hope this works out for you.

That works. Thank you.
Why did #PBS -l select=3:ncpus=1 alone not work?
Also, if I specify ngpus=2, it shows ngpus available = 0. How do I configure ngpus?

By default PBS uses free or pack placement, so chunks are packed onto one node until it is completely full before the scheduler picks the next node.

Please check the PBS Professional 2021.1 User's Guide, page UG-65, Table 4-3: Placement Modifiers.
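For example (illustrative syntax only, not something your test job needs), the modifiers in that table can be combined, e.g. to scatter the chunks across hosts and also take each host exclusively:

#PBS -l select=3:ncpus=1
#PBS -l place=scatter:excl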

qmgr: create resource ngpus type=long, flag=nh
Add ngpus to the resources: line in $PBS_HOME/sched_priv/sched_config.
Send kill -HUP to the pbs_sched process so the scheduler re-reads sched_config.
qmgr: set node NODENAME resources_available.ngpus=2
where NODENAME is the hostname of the node and 2 is the number of GPUs available on that node.
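Putting those steps together, a rough sketch of the session on the PBS server host (run as a PBS manager; the scheduler PID and the exact resources: line are placeholders to adjust for your setup, and node01 / the GPU count of 2 are just examples):

qmgr -c "create resource ngpus type=long, flag=nh"
# edit $PBS_HOME/sched_priv/sched_config and append ngpus to the resources: line, e.g.
#   resources: "ncpus, mem, arch, host, vnode, aoe, eoe, ngpus"
kill -HUP <pbs_sched PID>
qmgr -c "set node node01 resources_available.ngpus=2"
# repeat the last command for every node that has GPUs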

Thank you very much @adarsh @boboshaq.
You have been really helpful. The issues are resolved.
I am glad to have this community, and I hope I can contribute back.
