My job stays queued

Please share the output of the command below:
pbsnodes computenode-0-3.local

Can you please try this command:
qsub -l host=computenode-0-3 – /bin/sleep 100

It seems the “mom name” is not matching the request.

  • The mom name should be the short name; please check.

These are the results.

[user@login ~]# pbsnodes compute-0-3.local
Node: compute-0-3.local, Error: Unknown node
[user@login ~]# pbsnodes compute-0-3
compute-0-3.local
Mom = compute-0-3.local
ntype = PBS
state = free
pcpus = 36
resources_available.arch = linux
resources_available.host = compute-0-3
resources_available.mem = 263727076kb
resources_available.ncpus = 36
resources_available.ngpus = 2
resources_available.vnode = compute-0-3.local
resources_assigned.accelerator_memory = 0kb
resources_assigned.mem = 0kb
resources_assigned.naccelerators = 0
resources_assigned.ncpus = 0
resources_assigned.ngpus = 0
resources_assigned.vmem = 0kb
resv_enable = True
sharing = default_shared

[user@login ~]$ qsub -l host=compute-0-3 – /bin/sleep 100
usage: qsub [-a date_time] [-A account_string] [-c interval]
[-C directive_prefix] [-e path] [-f ] [-h ] [-I [-X]] [-j oe|eo] [-J X-Y[:Z]]
[-k keep] [-l resource_list] [-m mail_options] [-M user_list]
[-N jobname] [-o path] [-p priority] [-P project] [-q queue] [-r y|n]
[-R o|e|oe] [-S path] [-u user_list] [-W otherattributes=value…]
[-S path] [-u user_list] [-W otherattributes=value…]
[-v variable_list] [-V ] [-z] [script | – command [arg1 …]]
qsub --version

The command should be

qsub -l host=compute-0-3 -- /bin/sleep 100

Spelled out: qsub <hyphen><l for london><space>host=compute-0-3<space><hyphen><hyphen><space>/bin/sleep 100
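
The usage message earlier in this thread appears to come from a typographic dash standing in for the two ASCII hyphens, which can happen when a command is copied from a web page. Below is a minimal sketch of a check to run before submitting. The function name has_typographic_dash is hypothetical and not part of PBS.

```sh
# Hypothetical helper (not part of PBS): succeed if the argument contains
# an en dash or em dash, neither of which is the same as "--".
has_typographic_dash() {
    en_dash=$(printf '\342\200\223')   # UTF-8 bytes of U+2013
    em_dash=$(printf '\342\200\224')   # UTF-8 bytes of U+2014
    case "$1" in
        *"$en_dash"*|*"$em_dash"*) return 0 ;;
        *) return 1 ;;
    esac
}

cmd='qsub -l host=compute-0-3 -- /bin/sleep 100'
if has_typographic_dash "$cmd"; then
    echo "replace the typographic dash with two ASCII hyphens (--)"
else
    echo "ok: $cmd"
fi
```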

This is the command and output

qsub -l host=compute-0-3 -- bin/sleep 100
80126.master1.local

Thank you!

  • Did the job run on the requested host?
    Please share the output of:
  • qstat -answ1
  • qstat -fx 80126
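
To compare the requested host with the host the job actually ran on, the two relevant attributes can be pulled out of the qstat -fx output. This is a minimal sketch; the function name job_hosts is hypothetical, and it only assumes the "name = value" layout that qstat -f prints.

```sh
# Hypothetical helper: read "qstat -fx <jobid>" output on stdin and print
# the requested host and the host the job was executed on.
job_hosts() {
    awk -F' = ' '
        { key = $1; gsub(/^[ \t]+/, "", key) }   # drop leading indentation
        key == "Resource_List.host" || key == "exec_host" { print key "=" $2 }
    '
}

# Usage on a PBS system (80126 is the job ID from this thread):
#   qstat -fx 80126 | job_hosts
```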

Yes, it did run. Here are the other outputs:
qstat -answ1

100363.master1.local user workq STDIN 17184 1 1 – – R 00:00:00 compute-0-3
Job run at Tue Dec 11 at 15:58 on (compute-0-3.local:ncpus=1)

qstat -fx 100363
Job Id: 100363.master1.local
Job_Name = STDIN
Job_Owner = user@login.local
resources_used.cpupercent = 0
resources_used.cput = 00:00:00
resources_used.mem = 348kb
resources_used.ncpus = 1
resources_used.vmem = 4316kb
resources_used.walltime = 00:01:40
job_state = F
queue = workq
server = master1.local
Checkpoint = u
ctime = Tue Dec 11 15:58:25 2018
Error_Path = login.local:/home/user/STDIN.e100363
exec_host = compute-0-3.local/0
exec_vnode = (compute-0-3.local:ncpus=1)
Hold_Types = n
Join_Path = n
Keep_Files = n
Mail_Points = a
mtime = Tue Dec 11 16:00:06 2018
Output_Path = login.local:/home/user/STDIN.o100363
Priority = 0
qtime = Tue Dec 11 15:58:25 2018
Rerunable = True
Resource_List.host = compute-0-3
Resource_List.ncpus = 1
Resource_List.nodect = 1
Resource_List.place = pack
Resource_List.select = 1:host=compute-0-3:ncpus=1
stime = Tue Dec 11 15:58:25 2018
session_id = 17184
jobdir = /home/user
substate = 92
Variable_List = PBS_O_HOME=/home/user,PBS_O_LANG=en_US.UTF-8,
PBS_O_LOGNAME=user,
PBS_O_PATH=/opt/apps/intel/compilers_and_libraries_2017.4.196/linux/mp
i/intel64/bin:/opt/apps/intel/compilers_and_libraries_2017.4.196/linux/
bin/intel64:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/opt/ibut
ils/bin:/opt/pbs/bin:/home/user/.local/bin:/home/user/bin,
PBS_O_MAIL=/var/spool/mail/user,PBS_O_SHELL=/bin/bash,
PBS_O_WORKDIR=/home/user,PBS_O_SYSTEM=Linux,PBS_O_QUEUE=workq,
PBS_O_HOST=login.local
comment = Job run at Tue Dec 11 at 15:58 on (compute-0-3.local:ncpus=1) and
finished
etime = Tue Dec 11 15:58:25 2018
run_count = 1
Stageout_status = 1
Exit_status = 0
Submit_arguments = -l host=compute-0-3 -- /bin/sleep 100
executable = <jsdl-hpcpa:Executable>/bin/sleep</jsdl-hpcpa:Executable>
argument_list = <jsdl-hpcpa:Argument>100</jsdl-hpcpa:Argument>
history_timestamp = 1544544006
project = _pbs_project_default

Thank you. It is all working now.
Do you still see any issues?

But what is the syntax for selecting nodes in PBS?
I am using PBS Pro version 17 and I get an error when I use the directive below:
#PBS -l host=compute-0-3:ncpus=10 -l mem=10GB
The error is
Illegal attribute or resource value Resource_List.select

I got it now. The correct syntax is:

#PBS -l host=compute-0-3 -l ncpus=10 -l mem=2GB

Thank you very much, Adarsh.

I have detected an issue.

When I specify the host name in the job submission script, the job is submitted to a different host.
E.g., if I select compute node 5 with #PBS -l host=compute-0-5 -l ncpus=18 -l mem=32GB

the job gets submitted to whichever compute node is free (starting from 1, 2, 3, or higher).

But when I select the node using the interactive option, such as
qsub -I -l host=compute-0-5 -l ncpus=10 -l mem=2GB

it works.

Submit a job to a particular host:

qsub -l select=1:ncpus=2:mem=32gb:host=compute-0-5 -- /bin/sleep 100

Submit a job and let PBS decide where to run:

qsub -l select=1:ncpus=2:mem=32gb -- /bin/sleep 100

To run a 2-CPU job across two nodes (one CPU per node), requesting 16GB on each node:

qsub -l select=2:ncpus=1:mem=16gb -l place=scatter -- /bin/sleep 100

To run a job on two specific hosts:

qsub -l nodes=compute-0-3+compute-0-5 -- /bin/sleep 100

qsub -l select=1:ncpus=1:mem=10gb:host=compute-0-3+1:mem=10gb:host=compute-0-5 -- /bin/sleep 100

Please read the below admin guide section: 5.4.9 Job-wide vs. Chunk Resources
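
The same chunk syntax can also go into a job script as a #PBS directive. This is a sketch that has not been verified on this cluster. The node name and sizes are taken from the request earlier in this thread, and the job name pinned_job is made up for illustration.

```sh
#!/bin/bash
#PBS -N pinned_job
#PBS -l select=1:ncpus=18:mem=32gb:host=compute-0-5

# PBS sets PBS_O_WORKDIR to the directory qsub was run from;
# fall back to the current directory when run outside PBS.
cd "${PBS_O_WORKDIR:-$PWD}" || exit 1

# Record where the job actually ran so it can be compared with the request.
node=$(uname -n)
echo "job running on $node"
```

After the job finishes, the node name printed in the job's output file can be compared with the host named in the select statement.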

Hi Adarsh,
I restarted our HPC system and now all jobs stay queued.
I ran ‘pdsh date’ and got the output below:

ompute-0-2: @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
compute-0-2: @ WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED! @
compute-0-2: @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
compute-0-4: @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
compute-0-2: IT IS POSSIBLE THAT SOMEONE IS DOING SOMETHING NASTY!
compute-0-1: @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
compute-0-4: @ WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED! @
compute-0-2: Someone could be eavesdropping on you right now (man-in-the-middle attack)!
compute-0-3: @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
compute-0-1: @ WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED! @
compute-0-4: @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
compute-0-2: It is also possible that a host key has just been changed.
compute-0-2: The fingerprint for the ECDSA key sent by the remote host is
compute-0-1: @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
compute-0-4: IT IS POSSIBLE THAT SOMEONE IS DOING SOMETHING NASTY!
compute-0-3: @ WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED! @
compute-0-4: Someone could be eavesdropping on you right now (man-in-the-middle attack)!
compute-0-1: IT IS POSSIBLE THAT SOMEONE IS DOING SOMETHING NASTY!
compute-0-2: SHA256:gFtWek5zWZJQCnNQwycLTUtZXiZudM5S9H+xkK860ok.
compute-0-3: @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
compute-0-4: It is also possible that a host key has just been changed.
compute-0-1: Someone could be eavesdropping on you right now (man-in-the-middle attack)!
compute-0-2: Please contact your system administrator.
compute-0-4: The fingerprint for the ECDSA key sent by the remote host is
compute-0-1: It is also possible that a host key has just been changed.
compute-0-3: IT IS POSSIBLE THAT SOMEONE IS DOING SOMETHING NASTY!
compute-0-2: Add correct host key in /root/.ssh/known_hosts to get rid of this message.
compute-0-1: The fingerprint for the ECDSA key sent by the remote host is
compute-0-4: SHA256:o+nbSQpNh7f7YyQwP3myiY9H0LtKhwHWLBTlkXfggkE.
compute-0-3: Someone could be eavesdropping on you right now (man-in-the-middle attack)!
compute-0-2: Offending ECDSA key in /root/.ssh/known_hosts:3
compute-0-1: SHA256:1XCBEBwL4CIsAB+XU1uGM8borPm6WR1p+V1isuiNRFE.
compute-0-4: Please contact your system administrator.
compute-0-2: Password authentication is disabled to avoid man-in-the-middle attacks.
compute-0-3: It is also possible that a host key has just been changed.
compute-0-1: Please contact your system administrator.
compute-0-4: Add correct host key in /root/.ssh/known_hosts to get rid of this message.
compute-0-2: Keyboard-interactive authentication is disabled to avoid man-in-the-middle attacks.
compute-0-3: The fingerprint for the ECDSA key sent by the remote host is
compute-0-1: Add correct host key in /root/.ssh/known_hosts to get rid of this message.
compute-0-4: Offending ECDSA key in /root/.ssh/known_hosts:5
compute-0-3: SHA256:mq67n0i6+BiZPVWWeHt5iORDaYXIiqsrhvzYQjr2YkQ.
compute-0-1: Offending ECDSA key in /root/.ssh/known_hosts:1
compute-0-4: Password authentication is disabled to avoid man-in-the-middle attacks.
compute-0-3: Please contact your system administrator.
compute-0-1: Password authentication is disabled to avoid man-in-the-middle attacks.
compute-0-4: Keyboard-interactive authentication is disabled to avoid man-in-the-middle attacks.
compute-0-3: Add correct host key in /root/.ssh/known_hosts to get rid of this message.
compute-0-1: Keyboard-interactive authentication is disabled to avoid man-in-the-middle attacks.
compute-0-3: Offending ECDSA key in /root/.ssh/known_hosts:4
compute-0-3: Password authentication is disabled to avoid man-in-the-middle attacks.
compute-0-3: Keyboard-interactive authentication is disabled to avoid man-in-the-middle attacks.
compute-0-5: @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
compute-0-5: @ WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED! @
compute-0-5: @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
compute-0-5: IT IS POSSIBLE THAT SOMEONE IS DOING SOMETHING NASTY!
compute-0-5: Someone could be eavesdropping on you right now (man-in-the-middle attack)!
compute-0-5: It is also possible that a host key has just been changed.
compute-0-5: The fingerprint for the ECDSA key sent by the remote host is
compute-0-5: SHA256:Ziq6yZmX0BRFHNwCD5U5425ef0B6ja4q+Q5h1Exs12U.
compute-0-5: Please contact your system administrator.
compute-0-5: Add correct host key in /root/.ssh/known_hosts to get rid of this message.
compute-0-5: Offending ECDSA key in /root/.ssh/known_hosts:6
compute-0-5: Password authentication is disabled to avoid man-in-the-middle attacks.
compute-0-5: Keyboard-interactive authentication is disabled to avoid man-in-the-middle attacks.
compute-0-2: Fri Jan 24 17:38:38 UTC 2020
compute-0-3: Fri Jan 24 17:38:38 UTC 2020
compute-0-1: Fri Jan 24 17:38:39 UTC 2020
compute-0-4: Fri Jan 24 17:38:37 UTC 2020
compute-0-5: Fri Jan 24 17:38:38 UTC 2020
compute-0-6: ssh: connect to host compute-0-6.local port 22: No route to host
pdsh@master: compute-0-6: ssh exited with exit code 255
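
Because pdsh runs the command on all nodes in parallel, lines from different nodes are interleaved in the output above. Here is a small sketch that groups them by node using a stable sort on the host prefix. The function name group_by_node is hypothetical, and -s (stable sort) is an extension found in GNU and BSD sort rather than a POSIX option.

```sh
# Hypothetical helper: group "host: message" lines by host.
group_by_node() {
    # Sort on the "host" field only; -s keeps each node's lines in their
    # original order.
    LC_ALL=C sort -s -t: -k1,1
}

# Usage on the cluster (ssh warnings arrive on stderr, hence 2>&1):
#   pdsh date 2>&1 | group_by_node
```

The dshbak utility that ships with pdsh serves a similar purpose.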

I see two separate problems. One has to do with your ssh configuration:

compute-0-5: Offending ECDSA key in /root/.ssh/known_hosts:6

You should not run jobs as the root user. Also, it appears something has changed (IP address?) to invalidate the existing ssh keys.

The second problem is this:

compute-0-6: ssh: connect to host compute-0-6.local port 22: No route to host

This indicates there is something wrong with your network configuration.

Hi Michael,

What do you suggest I do? Everything was working perfectly until I restarted the server. By the way, we have two master nodes, and one of them acts as a failover (depending on which is available).

Vincent Appiah

On January 24, mkaro wrote:

> I see two separate problems. One has to do with your ssh configuration…
>
> vincent718:
> compute-0-5: Offending ECDSA key in /root/.ssh/known_hosts:6
>
> You should not run jobs as the root user. Also, it appears something has changed (IP address?) to invalidate the existing ssh keys.
>
> The second problem is this…
>
> vincent718:
> compute-0-6: ssh: connect to host compute-0-6.local port 22: No route to host
>
> This indicates there is something wrong with your network configuration.

Hello Michael,

Before I performed the restart, I used qdel $(select) to delete all existing jobs. This was done using the root account. Could that be the cause?

Deleting jobs as the root user is perfectly fine. Submitting jobs as root should be avoided. The root account on the server node is always considered a manager.

For compute-0-6, please confirm you can communicate with it from the server node. First try to ping it, then try to ssh to it. PBS Pro can’t function properly if your network isn’t working.

For your other compute nodes, please confirm that you can ssh back and forth with the server without a password as the user submitting jobs.
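
The passwordless ssh check can be scripted along these lines. It is a minimal sketch: the function name check_ssh and the SSH override variable are hypothetical, while BatchMode and ConnectTimeout are standard OpenSSH client options.

```sh
# Hypothetical helper: try a non-interactive ssh login to each node.
# BatchMode=yes makes ssh fail instead of prompting for a password.
# SSH can be overridden (for example for testing); it defaults to ssh.
check_ssh() {
    for node in "$@"; do
        if ${SSH:-ssh} -o BatchMode=yes -o ConnectTimeout=5 "$node" true >/dev/null 2>&1; then
            echo "$node: ok"
        else
            echo "$node: FAILED"
        fi
    done
}

# Usage, run as the job-submitting user on the server node:
#   check_ssh compute-0-1 compute-0-2 compute-0-3
```

Running the same function from a compute node against the server covers the other direction.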

This post may help: Qstat: cannot connect to server amaster (errno=113)

Dear Michael,

I tried pinging the compute nodes (1-6).

Pinging nodes 1-5 was successful.

However, when I try to ssh to, say, node 1, I get the message below:

@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@ WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED! @
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
IT IS POSSIBLE THAT SOMEONE IS DOING SOMETHING NASTY!
Someone could be eavesdropping on you right now (man-in-the-middle attack)!
It is also possible that a host key has just been changed.
The fingerprint for the ECDSA key sent by the remote host is
SHA256:1XCBEBwL4CIsAB+XU1uGM8borPm6WR1p+V1isuiNRFE.
Please contact your system administrator.
Add correct host key in /home/testuser/.ssh/known_hosts to get rid of this message.
Offending ECDSA key in /home/testuser/.ssh/known_hosts:26
Password authentication is disabled to avoid man-in-the-middle attacks.
Keyboard-interactive authentication is disabled to avoid man-in-the-middle attacks.
Permission denied (publickey,password,keyboard-interactive).

Vincent Appiah

@vincent718

  1. Please tell us which type of authentication you are using for password-less ssh:
    a. user ssh-key based
    b. host-based ssh-key based

  2. Please make sure you have a static IP / hostname setup for your cluster and that all address resolution happens in the /etc/hosts file (first).

  3. The cause of the above issue might be one of the following:
    a. The ssh server was upgraded and the ssh_config and sshd_config files were modified.
    b. If not (a), the keys must have been changed or corrupted.
    c. The known_hosts file might have to be recreated.
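
If the host keys changed for a legitimate reason (for example, the nodes were reinstalled), the stale known_hosts entries can be dropped with ssh-keygen -R. The sketch below uses a hypothetical function, forget_host_keys, and a hypothetical DRY_RUN switch. By default it only prints the commands so that nothing is modified by accident. Do not remove keys until you have confirmed why they changed.

```sh
# Hypothetical helper: print (default) or run the ssh-keygen commands that
# remove stale entries for the given hosts from a known_hosts file.
forget_host_keys() {
    known_hosts=$1
    shift
    for host in "$@"; do
        if [ "${DRY_RUN:-1}" = 1 ]; then
            echo "ssh-keygen -R $host -f $known_hosts"
        else
            ssh-keygen -R "$host" -f "$known_hosts"
        fi
    done
}

# Usage (dry run):
#   forget_host_keys "$HOME/.ssh/known_hosts" compute-0-1 compute-0-2
# To actually remove the entries:
#   DRY_RUN=0 forget_host_keys "$HOME/.ssh/known_hosts" compute-0-1
```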

Hope this helps and resolves your issue

Hi Adarsh and Michael, Thank you very much for the support

Let me take you through our setup.

We have a login node, 2 master nodes and 6 compute nodes

The 2 master nodes are configured with pacemaker for failover. We use the ROCKS provisioning software and PBS for job submission. Users only have access to the login node (with a static IP address), from which jobs are submitted. A user can only access a compute node if the user’s job is running on that node.

User accounts are created on the master node and synced to the other nodes.

As I indicated, everything was working perfectly until I restarted the system. Before that, however, the system had been configured to use additional authentication from our Active Directory.

When a user account is created, a passphrase for the ssh key is requested the first time, but we leave it blank.

To answer the questions

  1. We use user-based keys (I am not quite sure about this; the person who set things up cannot be reached).

  2. We use static IP addresses.

  3. a. The ssh server has not been upgraded. ssh_config and sshd_config have been modified, but we were still able to use the system without any issues until we did the restart.

I suspect the problem has to do with either b. or c.

Another odd thing I see is that jobs are submitted fine, but they stay queued even though the correct resources are specified in the job submission script.

I would be grateful if you could suggest possible solutions.

Thank you

Vincent

Hi Adarsh and Michael,

Thank you for the help. I realized that on the master node the ROCKS mysql service had stopped, and that was the cause of the problem. I had to restart the two master nodes again and things were back to normal. Thanks once again.

Vincent Appiah
