My jobs stay queued

Deleting jobs as the root user is perfectly fine. Submitting jobs as root should be avoided. The root account on the server node is always considered a manager.

For compute-0-6, please confirm you can communicate with it from the server node. First try to ping it, then try to ssh to it. PBS Pro can’t function properly if your network isn’t working.
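As a rough illustration of that check order, here is a minimal Python sketch to run from the server node. It assumes a Linux `ping` that accepts the `-c` and `-W` flags and that sshd listens on the default port 22; the function names are made up for this example.

```python
import socket
import subprocess


def ping_command(host, count=1, timeout_s=2):
    """Build a ping command (assumes Linux iputils flags -c and -W)."""
    return ["ping", "-c", str(count), "-W", str(timeout_s), host]


def can_ping(host):
    """Return True if the host answers a single ICMP echo request."""
    result = subprocess.run(
        ping_command(host),
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def port_open(host, port=22, timeout_s=3.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        return False


def check_node(host):
    """Ping first, then test the ssh port, mirroring the order above."""
    if not can_ping(host):
        return "no ping reply"
    if not port_open(host, 22):
        return "ping ok, ssh port closed or filtered"
    return "ping ok, ssh port reachable"
```

Calling `check_node("compute-0-6")` from the server node returns a short status string. A reachable port only shows that sshd is listening; it does not prove that a login will succeed.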

For your other compute nodes, please confirm that you can ssh back and forth with the server without a password as the user submitting jobs.
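A minimal sketch of such a check in Python, assuming the OpenSSH `ssh` client is on the PATH. `BatchMode=yes` makes ssh fail instead of prompting for a password, so a zero exit status means key-based login worked. The function names and the `pbs-server` hostname in the usage note are placeholders, not names from your cluster.

```python
import subprocess


def ssh_batch_command(host, remote_command=("true",), timeout_s=5):
    """Build an ssh command that never prompts for a password."""
    return [
        "ssh",
        "-o", "BatchMode=yes",
        "-o", f"ConnectTimeout={timeout_s}",
        host,
        *remote_command,
    ]


def round_trip_command(node, server):
    """Log in to the node, then from the node back to the server."""
    hop_back = ["ssh", "-o", "BatchMode=yes", server, "true"]
    return ssh_batch_command(node, remote_command=hop_back)


def passwordless_ok(command):
    """Run the command as the current user; True means exit status 0."""
    result = subprocess.run(command, capture_output=True, text=True)
    return result.returncode == 0
```

Run it as the user who submits jobs, for example `passwordless_ok(round_trip_command("compute-0-1", "pbs-server"))`. A `False` result means one of the two hops needed a password or failed outright.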

This post may help: Qstat: cannot connect to server amaster (errno=113)

Dear Michael,

I tried pinging the compute nodes (1-6).

Pinging nodes 1-5 was successful.

However, when I try to ssh to, let's say, node 1, I get the message below:

@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@ WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED! @
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
IT IS POSSIBLE THAT SOMEONE IS DOING SOMETHING NASTY!
Someone could be eavesdropping on you right now (man-in-the-middle attack)!
It is also possible that a host key has just been changed.
The fingerprint for the ECDSA key sent by the remote host is
SHA256:1XCBEBwL4CIsAB+XU1uGM8borPm6WR1p+V1isuiNRFE.
Please contact your system administrator.
Add correct host key in /home/testuser/.ssh/known_hosts to get rid of this message.
Offending ECDSA key in /home/testuser/.ssh/known_hosts:26
Password authentication is disabled to avoid man-in-the-middle attacks.
Keyboard-interactive authentication is disabled to avoid man-in-the-middle attacks.
Permission denied (publickey,password,keyboard-interactive).

Vincent Appiah

@vincent718

  1. Please tell us which type of authentication you are using for password-less ssh:
    a. user-ssh key based
    b. hostbased-ssh key based

  2. Please make sure you have static IP addresses / hostnames set up for your cluster and that address resolution happens in the /etc/hosts file first.

  3. The cause of the above issue might be one of the following:
    a. the ssh server was upgraded and the ssh_config and sshd_config files were modified
    b. if not (a), the keys must have changed or become corrupted
    c. the known_hosts file might have to be recreated.
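For (c), instead of recreating the whole file you can usually drop just the stale entry. The sketch below is only an illustration in Python: it pulls the file path and line number out of the "Offending ... key" line of the warning and builds an `ssh-keygen -R` command for that host. The function names and the `compute-0-1` hostname are assumptions for the example, and removing a key only makes sense once you have confirmed that the host key change is legitimate.

```python
import re

# Matches the line ssh prints, e.g.
# "Offending ECDSA key in /home/testuser/.ssh/known_hosts:26"
OFFENDING_RE = re.compile(r"Offending (\S+) key in (\S+):(\d+)")


def find_offending_entry(ssh_stderr):
    """Return (key_type, known_hosts_path, line_number) or None."""
    match = OFFENDING_RE.search(ssh_stderr)
    if match is None:
        return None
    key_type, path, line_no = match.groups()
    return key_type, path, int(line_no)


def removal_command(host, known_hosts_path):
    """Build the ssh-keygen call that drops all keys stored for host."""
    return ["ssh-keygen", "-R", host, "-f", known_hosts_path]
```

With the warning posted above, `find_offending_entry` gives the ECDSA entry on line 26 of /home/testuser/.ssh/known_hosts, and `removal_command("compute-0-1", "/home/testuser/.ssh/known_hosts")` is the command to run as that user. The next ssh to the node will then ask to accept the new key.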

Hope this helps and resolves your issue.

Hi Adarsh and Michael, thank you very much for the support.

Let me take you through our setup.

We have a login node, 2 master nodes and 6 compute nodes.

The 2 master nodes are configured with pacemaker for failover. We use ROCKS provisioning software and PBS for job submission. Users only have access to login node (with static IP address) from which jobs are submitted. A user can only access a compute node if the user’s job is being run on that node.

User accounts are created on the master node and synced to the other nodes.

As I indicated, everything was working perfectly until I restarted the system. Before that, however, the system had been configured to use additional authentication from our Active Directory.

When user accounts are created, a passphrase for the ssh key is requested the first time, but we leave it blank.

To answer the questions:

  1. We use user-based keys. (I am not quite sure about this; the person who set things up cannot be reached.)

  2. We use static IP addresses.

  3. a. The ssh server has not been upgraded. ssh_config and sshd_config have been modified, but we were still able to use the system without any issues until the restart.

I suspect the problem has to do with either b. or c.

Another weird thing I see is that jobs are submitted fine, but they stay queued even though the correct resources are specified in the job submission script.

I would be grateful if you could provide possible solutions.

Thank you

Vincent

Hi Adarsh and Michael,

Thank you for the help. I realized that on the master node, the ROCKS mysql service had stopped, and that was the cause of the problem. I had to restart the two master nodes again and things were back to normal. Thanks once again.

Vincent Appiah
