Host key verification failed

Hi, I have a new cluster set up with PBS Pro CE 19.1.3. I am running a simple script, but the job fails with the errors below. What could have gone wrong?
vamshi@master01:~$ pbsnodes -aSj
                                                      mem        ncpus    nmics    ngpus
vnode           state           njobs   run   susp     f/t         f/t      f/t      f/t   jobs
--------------- --------------- ------ ----- ------ ------------ -------- -------- -------- -----
node01          free                 0     0      0  126gb/126gb      8/8      0/0      0/0  --
master01        free                 0     0      0  126gb/126gb      1/1      0/0      3/3  --
node02          free                 0     0      0  126gb/126gb    20/20      0/0      3/3  --
node03          free                 0     0      0  126gb/126gb    20/20      0/0      3/3  --

test1.sh:
#!/bin/bash
#PBS -l walltime=1:00
#PBS -l nodes=4 <-- legacy
#PBS -l select=3:ncpus=1
echo -n "I am on: "
hostname
echo finding ssh-accessible nodes:
for node in $(cat ${PBS_NODEFILE}); do
    echo -n "running on: "
    /usr/bin/ssh $node hostname
done

test1.sh.o1006:
I am on: node01
finding ssh-accessible nodes:
running on: running on: running on:

test1.sh.e1006:
Host key verification failed.
Host key verification failed.
Host key verification failed.

@adarsh Can you please have a look at it.

Hi,
I am not sure, but it looks like you have a problem with host-based authentication.
Did you try connecting via ssh from the command line from one host (e.g. node01) to the others? Were you able to log in without entering a password?
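For example, running this from node01 (hostnames taken from your pbsnodes output) should print the remote hostname with no password prompt and no host-key question:

vamshi@node01:~$ ssh node02 hostname
node02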

Please edit /etc/ssh/ssh_config and set StrictHostKeyChecking to no (please check the correct syntax and capitalization of that option name). Set the same on the server and on the compute nodes.
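For example, a minimal sketch of the lines to add (the Host pattern here is just an assumption to keep the setting limited to your cluster nodes; adjust it, or use Host * if you prefer):

Host master* node*
    StrictHostKeyChecking no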

Yes. I am able to log in without entering a password when I ssh from one node to another.

Yes. I had added that line only on the master node. Now I have added it on the compute nodes too and it is working. My mistake. Thank you.
One more thing. With this line, #PBS -l select=3, I intend to run the job on three nodes, so the output should be something like below:
I am on: node01
finding ssh-accessible nodes:
running on: node01
running on: node02
running on: node03

But it always runs on only one node, like this:
I am on: node01
finding ssh-accessible nodes:
running on: node01
running on: node01
running on: node01

Is something wrong with the configuration below?

vamshi@master01:~$ pbsnodes -a
node01
     Mom = node01.cm.cluster
     ntype = PBS
     state = free
     pcpus = 40
     resources_available.arch = linux
     resources_available.host = node01
     resources_available.mem = 131892920kb
     resources_available.ncpus = 20
     resources_available.vnode = node01
     resources_assigned.accelerator_memory = 0kb
     resources_assigned.hbmem = 0kb
     resources_assigned.mem = 0kb
     resources_assigned.naccelerators = 0
     resources_assigned.ncpus = 0
     resources_assigned.vmem = 0kb
     queue = workq
     resv_enable = True
     sharing = default_shared
     last_state_change_time = Wed Jun 16 23:45:11 2021
     last_used_time = Thu Jun 17 00:03:56 2021

node02
     Mom = node02.cm.cluster
     ntype = PBS
     state = free
     pcpus = 40
     resources_available.arch = linux
     resources_available.host = node02
     resources_available.mem = 131893224kb
     resources_available.ncpus = 20
     resources_available.vnode = node02
     resources_assigned.accelerator_memory = 0kb
     resources_assigned.hbmem = 0kb
     resources_assigned.mem = 0kb
     resources_assigned.naccelerators = 0
     resources_assigned.ncpus = 0
     resources_assigned.vmem = 0kb
     queue = workq
     resv_enable = True
     sharing = default_shared
     last_state_change_time = Wed Jun 16 23:45:11 2021

node03
     Mom = node03.cm.cluster
     ntype = PBS
     state = free
     pcpus = 40
     resources_available.arch = linux
     resources_available.host = node03
     resources_available.mem = 131893232kb
     resources_available.ncpus = 20
     resources_available.vnode = node03
     resources_assigned.accelerator_memory = 0kb
     resources_assigned.hbmem = 0kb
     resources_assigned.mem = 0kb
     resources_assigned.naccelerators = 0
     resources_assigned.ncpus = 0
     resources_assigned.vmem = 0kb
     queue = workq
     resv_enable = True
     sharing = default_shared
     last_state_change_time = Wed Jun 16 23:45:11 2021

Please try these lines
#PBS -l select=3:ncpus=1
#PBS -l place=scatter
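For reference, a rough sketch of how test1.sh would look with these directives (only the legacy nodes= line dropped; the rest is unchanged from your script):

#!/bin/bash
#PBS -l walltime=1:00
#PBS -l select=3:ncpus=1
#PBS -l place=scatter
echo -n "I am on: "
hostname
echo finding ssh-accessible nodes:
for node in $(cat ${PBS_NODEFILE}); do
    echo -n "running on: "
    /usr/bin/ssh $node hostname
done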

Hope this works out for you.

That works. Thank you.
Why did #PBS -l select=3:ncpus=1 alone not work?
Also, if I specify ngpus=2, it shows ngpus available = 0. How do I configure ngpus?

By default PBS uses free or pack placement, so chunks are packed onto one node until it is completely full before the scheduler picks the next node.

Please check the PBS Professional 2021.1 User's Guide, page UG-65, Table 4-3: Placement Modifiers.
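For example (illustrative syntax only, not something your test job needs), the modifiers in that table can be combined, e.g. to scatter the chunks across hosts and also take each host exclusively:

#PBS -l select=3:ncpus=1
#PBS -l place=scatter:excl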

qmgr: create resource ngpus type=long, flag=nh
Add ngpus to the resources: line in $PBS_HOME/sched_priv/sched_config.
Send kill -HUP to the pbs_sched process so the scheduler re-reads sched_config.
qmgr: set node NODENAME resources_available.ngpus=2
where NODENAME is the hostname of the node and 2 is the number of GPUs available on that node.
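Putting those steps together, a rough sketch of the session on the PBS server host (run as a PBS manager; the scheduler PID and the exact resources: line are placeholders to adjust for your setup, and node01 / the GPU count of 2 are just examples):

qmgr -c "create resource ngpus type=long, flag=nh"
# edit $PBS_HOME/sched_priv/sched_config and append ngpus to the resources: line, e.g.
#   resources: "ncpus, mem, arch, host, vnode, aoe, eoe, ngpus"
kill -HUP <pbs_sched PID>
qmgr -c "set node node01 resources_available.ngpus=2"
# repeat the last command for every node that has GPUs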

Thank you very much @adarsh @boboshaq.
You have been really helpful. The issues are resolved.
I am glad to have this community, and I hope I can contribute back.
