My job stays queued

Hi,

Since today, and I don't know why, every job I submit stays queued.

What should I do to find out what the problem is?

Yesterday everything was running perfectly.

Thanks a lot for your help.

Please share the output of the commands below:

  1. qstat -answ1
  2. pbsnodes -av

Note:

  • Please check whether every node's state is free: pbsnodes -av | grep -e Mom -e state
  • Please check whether the job's resource request can be matched to the compute resources; otherwise the job will stay in the queue.
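As a rough sketch of the first check, the snippet below lists every node whose state is anything other than free. It assumes the usual pbsnodes -av layout (a node name on its own line, followed by "name = value" attribute lines); the function name nodes_not_free is made up for this example.

```shell
#!/bin/bash
# nodes_not_free: read `pbsnodes -av` style text on stdin and print
# "<node> <state>" for every node whose state is not exactly "free".
# (Hypothetical helper name; assumes one "state = ..." line per node.)
nodes_not_free() {
    awk '
        # An unindented, non-empty line without "=" starts a new node record.
        /^[^ \t]/ && !/=/ { node = $1; next }
        # Attribute lines look like "state = free" or "state = state-unknown,down".
        $1 == "state" && $2 == "=" {
            if ($3 != "free") print node, $3
        }
    '
}

# On a host with the PBS commands installed, feed it the live node list.
if command -v pbsnodes >/dev/null 2>&1; then
    pbsnodes -av | nodes_not_free
fi
```

An empty result would mean every node reports free; any line printed names a node the scheduler cannot use right now.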

$ qstat -answ1

returns nothing

$ pbsnodes -av

returns:

centos7
Mom = centos7.home
ntype = PBS
state = state-unknown,down
pcpus = 1
resources_available.host = centos7
resources_available.ncpus = 1
resources_available.vnode = centos7
resources_assigned.accelerator_memory = 0kb
resources_assigned.hbmem = 0kb
resources_assigned.mem = 0kb
resources_assigned.naccelerators = 0
resources_assigned.ncpus = 0
resources_assigned.vmem = 0kb
resv_enable = True
sharing = default_shared

$ pbsnodes -av | grep -e Mom -e state

returns:

 Mom = centos7.home
 state = state-unknown,down

"Please check whether the job's resource request can be matched to the compute resources; otherwise the job will stay in the queue."

I am submitting jobs that I have already submitted before, so the compute resources should be fine.

I do not understand the problem.

So there are no jobs in the queue. Please submit a sample sleep job as below:
qsub -- /bin/sleep 100

The compute node's state is down, which is why the job stays in the queue.

Why is the node down? The cause is a communication issue between the PBS Server and the PBS MoM (compute node). Possible reasons:

  1. The pbs_mom service might not be running on the compute node centos7.home.
  2. A firewall might be blocking the ports (15001 - 15007, 17001) between the head node and the compute node, in either direction. Disable the firewall completely and check.
  3. DNS resolution (forward and reverse) of the compute node and the head node might be failing from the respective systems.
  4. SELinux might still be enabled. Check that it is disabled and that the system was rebooted after disabling it.
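The four checks above can be sketched as one small script. This is only a starting point under a few assumptions: the script name check_mom.sh and the function names are invented, the pbs_mom test only means something when run on the compute node itself, and a closed port is just a hint, since not every port in that range has a listener on every host.

```shell
#!/bin/bash
# Sketch of the four checks above for one node.
# Usage (hypothetical script name): ./check_mom.sh centos7.home

# Ports named in the list above: 15001-15007 plus 17001.
pbs_ports() {
    seq 15001 15007
    echo 17001
}

# Succeeds if a TCP connection to host:port opens within 2 seconds.
port_open() {
    timeout 2 bash -c "exec 3<>/dev/tcp/$1/$2" 2>/dev/null
}

check_node() {
    node=$1

    # 1. Is a pbs_mom process running on this machine?
    if pgrep -x pbs_mom >/dev/null 2>&1; then
        echo "pbs_mom: running on $(hostname)"
    else
        echo "pbs_mom: not running on $(hostname)"
    fi

    # 2. Which of the PBS ports accept connections on the node?
    for port in $(pbs_ports); do
        if port_open "$node" "$port"; then
            echo "port $port: open"
        else
            echo "port $port: closed or filtered"
        fi
    done

    # 3. Forward and reverse name resolution.
    addr=$(getent hosts "$node" | awk '{print $1; exit}')
    echo "forward: $node -> ${addr:-unresolved}"
    if [ -n "$addr" ]; then
        echo "reverse: $addr -> $(getent hosts "$addr" | awk '{print $2; exit}')"
    fi

    # 4. SELinux mode, if the tool is installed.
    if command -v getenforce >/dev/null 2>&1; then
        echo "selinux: $(getenforce)"
    fi
}

# Only run the checks when a node name is given on the command line.
if [ "$#" -ge 1 ]; then
    check_node "$1"
fi
```

Running it once from the head node against the compute node, and once from the compute node against the head node, covers both directions.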

Hi. I also have a problem. When I specify a particular node name in the job submission script, the job stays in the queue:
#PBS -l nodes=compunode-0-3.local

After submitting the job, which stays queued, I run qstat -answ1 and get this error:
Can Never Run: Insufficient amount of resource: host (compunode-0-3.local !=compunode-0-1,compunode-0-2,compunode.-0-3,…

We have the following restrictions on the server for every user (PBS_GENERIC):

max_run=3
max_run_res.ncpus=72
max_run_res.nodect=2
max_queued=2

Please share the output of the command below:
pbsnodes computenode-0-3.local

Can you please try this command:
qsub -l host=computenode-0-3 -- /bin/sleep 100

It seems the "mom name" does not match the request.

  • The mom name should be the short name; please check.
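One way to see which names a host= request can match is to print each node's resources_available.host value, since the "Insufficient amount of resource: host" message compares the request against those values. A minimal sketch, with host_names as a made-up helper name:

```shell
#!/bin/bash
# host_names: from `pbsnodes -av` style text on stdin, print the value of
# resources_available.host for each node, one per line.
# (Hypothetical helper name.)
host_names() {
    awk '$1 == "resources_available.host" && $2 == "=" { print $3 }'
}

# On a host with the PBS commands installed, list the names to use in host=...
if command -v pbsnodes >/dev/null 2>&1; then
    pbsnodes -av | host_names
fi
```

If the list shows short names, the request needs the short name too, not the name with the .local suffix.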

These are the results.

[user@login ~]# pbsnodes compute-0-3.local
Node: compute-0-3.local, Error: Unknown node
[user@login ~]# pbsnodes compute-0-3
compute-0-3.local
Mom = compute-0-3.local
ntype = PBS
state = free
pcpus = 36
resources_available.arch = linux
resources_available.host = compute-0-3
resources_available.mem = 263727076kb
resources_available.ncpus = 36
resources_available.ngpus = 2
resources_available.vnode = compute-0-3.local
resources_assigned.accelerator_memory = 0kb
resources_assigned.mem = 0kb
resources_assigned.naccelerators = 0
resources_assigned.ncpus = 0
resources_assigned.ngpus = 0
resources_assigned.vmem = 0kb
resv_enable = True
sharing = default_shared

[user@login ~]$ qsub -l host=compute-0-3 – /bin/sleep 100
usage: qsub [-a date_time] [-A account_string] [-c interval]
[-C directive_prefix] [-e path] [-f ] [-h ] [-I [-X]] [-j oe|eo] [-J X-Y[:Z]]
[-k keep] [-l resource_list] [-m mail_options] [-M user_list]
[-N jobname] [-o path] [-p priority] [-P project] [-q queue] [-r y|n]
[-R o|e|oe] [-S path] [-u user_list] [-W otherattributes=value…]
[-S path] [-u user_list] [-W otherattributes=value…]
[-v variable_list] [-V ] [-z] [script | -- command [arg1 …]]
qsub --version

The command should use two plain hyphens (--) before the executable, not an en dash (–):

qsub -l host=compute-0-3 -- /bin/sleep 100

Spelled out: qsub <hyphen><l for london><space>host=compute-0-3<space><hyphen><hyphen><space>/bin/sleep 100
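Web pages and word processors often turn two hyphens into a single en dash, so a pasted command can look right and still fail with the usage message. A small sketch that normalizes a pasted line before you run it; fix_dashes is a made-up name and it only rewrites text, it does not submit anything:

```shell
#!/bin/bash
# fix_dashes: replace every en dash in the arguments with two plain hyphens.
# (Hypothetical helper; it prints the corrected command line.)
fix_dashes() {
    printf '%s\n' "$*" | sed 's/–/--/g'
}

# Example with the command that produced the usage message.
fix_dashes 'qsub -l host=compute-0-3 – /bin/sleep 100'
```

The printed line can then be copied back to the prompt, or simply retyped by hand.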

This is the command and output

qsub -l host=compute-0-3 -- /bin/sleep 100
80126.master1.local

Thank you !

Did the job run on the requested host? Please share the output of:

  • qstat -answ1
  • qstat -fx 80126

Yes, it did run. Here are the other outputs:
qstat -answ1

100363.master1.local user workq STDIN 17184 1 1 -- -- R 00:00:00 compute-0-3
Job run at Tue Dec 11 at 15:58 on (compute-0-3.local:ncpus=1)

qstat -fx 100363
Job Id: 100363.master1.local
Job_Name = STDIN
Job_Owner = user@login.local
resources_used.cpupercent = 0
resources_used.cput = 00:00:00
resources_used.mem = 348kb
resources_used.ncpus = 1
resources_used.vmem = 4316kb
resources_used.walltime = 00:01:40
job_state = F
queue = workq
server = master1.local
Checkpoint = u
ctime = Tue Dec 11 15:58:25 2018
Error_Path = login.local:/home/user/STDIN.e100363
exec_host = compute-0-3.local/0
exec_vnode = (compute-0-3.local:ncpus=1)
Hold_Types = n
Join_Path = n
Keep_Files = n
Mail_Points = a
mtime = Tue Dec 11 16:00:06 2018
Output_Path = login.local:/home/user/STDIN.o100363
Priority = 0
qtime = Tue Dec 11 15:58:25 2018
Rerunable = True
Resource_List.host = compute-0-3
Resource_List.ncpus = 1
Resource_List.nodect = 1
Resource_List.place = pack
Resource_List.select = 1:host=compute-0-3:ncpus=1
stime = Tue Dec 11 15:58:25 2018
session_id = 17184
jobdir = /home/user
substate = 92
Variable_List = PBS_O_HOME=/home/user,PBS_O_LANG=en_US.UTF-8,
PBS_O_LOGNAME=user,
PBS_O_PATH=/opt/apps/intel/compilers_and_libraries_2017.4.196/linux/mp
i/intel64/bin:/opt/apps/intel/compilers_and_libraries_2017.4.196/linux/
bin/intel64:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/opt/ibut
ils/bin:/opt/pbs/bin:/home/user/.local/bin:/home/user/bin,
PBS_O_MAIL=/var/spool/mail/user,PBS_O_SHELL=/bin/bash,
PBS_O_WORKDIR=/home/user,PBS_O_SYSTEM=Linux,PBS_O_QUEUE=workq,
PBS_O_HOST=login.local
comment = Job run at Tue Dec 11 at 15:58 on (compute-0-3.local:ncpus=1) and
finished
etime = Tue Dec 11 15:58:25 2018
run_count = 1
Stageout_status = 1
Exit_status = 0
Submit_arguments = -l host=compute-0-3 -- /bin/sleep 100
executable = <jsdl-hpcpa:Executable>/bin/sleep</jsdl-hpcpa:Executable>
argument_list = <jsdl-hpcpa:Argument>100</jsdl-hpcpa:Argument>
history_timestamp = 1544544006
project = _pbs_project_default

Thank you. It is all working now.

Do you still see any issues?

But what is the syntax for selecting nodes in PBS?
I am using PBS Pro version 17 and I get an error when I use the directive below:
#PBS -l host=compute-0-3:ncpus=10 -l mem=10GB
The error is:
Illegal attribute or resource value Resource_List.select

I got it now. The correct syntax is:

#PBS -l host=compute-0-3 -l ncpus=10 -l mem=2GB

Thank you very much, Adarsh.


I have found an issue.

When I put the host name in the job submission script, the job is submitted to a different host. For example, if I select compute node 5:
#PBS -l host=compute-0-5 -l ncpus=18 -l mem=32GB

the job runs on whichever compute node is free (starting from 1, 2, 3, or higher).

But when I select the node with the interactive option, such as
qsub -I -l host=compute-0-5 -l ncpus=10 -l mem=2GB

it works as expected.

Submit a job to a particular host:

qsub -l select=1:ncpus=2:mem=32gb:host=compute-0-5 -- /bin/sleep 100

Submit a job and let PBS decide where to run:

qsub -l select=1:ncpus=2:mem=32gb -- /bin/sleep 100

To run a 2-CPU job spread over two nodes (one CPU per node), requesting 16 GB on each node:

qsub -l select=2:ncpus=1:mem=16gb -l place=scatter -- /bin/sleep 100

To run a job on two specific hosts:

qsub -l nodes=compute-0-3+compute-0-5 -- /bin/sleep 100

qsub -l select=1:ncpus=1:mem=10gb:host=compute-0-3+1:mem=10gb:host=compute-0-5 -- /bin/sleep 100

Please read this section of the admin guide: 5.4.9 Job-wide vs. Chunk Resources
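For a batch script, the same chunk syntax can go into a #PBS directive instead of the command line. The sketch below writes such a script and submits it if qsub is available; the file name, job name, and resource amounts are placeholders, so adjust them to your nodes.

```shell
#!/bin/bash
# Write a small job script that pins the job to one host using chunk syntax,
# then submit it if qsub is available. File and job names are placeholders.
job_script="${TMPDIR:-/tmp}/host_test.pbs"

cat > "$job_script" <<'EOF'
#!/bin/bash
#PBS -N host_test
#PBS -l select=1:ncpus=2:mem=32gb:host=compute-0-5

# Print the execution host so it can be compared with the requested one.
echo "running on $(hostname)"
sleep 100
EOF

if command -v qsub >/dev/null 2>&1; then
    qsub "$job_script"
else
    echo "qsub not found; job script written to $job_script"
fi
```

After the job finishes, the "running on" line in its output file and the exec_host field of qstat -fx should both show the requested host.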

Hi Adarsh,
I restarted our HPC system and now all jobs are stuck in the queue.
I ran the command ‘pdsh date’ and got the output below:

ompute-0-2: @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
compute-0-2: @ WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED! @
compute-0-2: @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
compute-0-4: @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
compute-0-2: IT IS POSSIBLE THAT SOMEONE IS DOING SOMETHING NASTY!
compute-0-1: @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
compute-0-4: @ WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED! @
compute-0-2: Someone could be eavesdropping on you right now (man-in-the-middle attack)!
compute-0-3: @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
compute-0-1: @ WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED! @
compute-0-4: @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
compute-0-2: It is also possible that a host key has just been changed.
compute-0-2: The fingerprint for the ECDSA key sent by the remote host is
compute-0-1: @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
compute-0-4: IT IS POSSIBLE THAT SOMEONE IS DOING SOMETHING NASTY!
compute-0-3: @ WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED! @
compute-0-4: Someone could be eavesdropping on you right now (man-in-the-middle attack)!
compute-0-1: IT IS POSSIBLE THAT SOMEONE IS DOING SOMETHING NASTY!
compute-0-2: SHA256:gFtWek5zWZJQCnNQwycLTUtZXiZudM5S9H+xkK860ok.
compute-0-3: @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
compute-0-4: It is also possible that a host key has just been changed.
compute-0-1: Someone could be eavesdropping on you right now (man-in-the-middle attack)!
compute-0-2: Please contact your system administrator.
compute-0-4: The fingerprint for the ECDSA key sent by the remote host is
compute-0-1: It is also possible that a host key has just been changed.
compute-0-3: IT IS POSSIBLE THAT SOMEONE IS DOING SOMETHING NASTY!
compute-0-2: Add correct host key in /root/.ssh/known_hosts to get rid of this message.
compute-0-1: The fingerprint for the ECDSA key sent by the remote host is
compute-0-4: SHA256:o+nbSQpNh7f7YyQwP3myiY9H0LtKhwHWLBTlkXfggkE.
compute-0-3: Someone could be eavesdropping on you right now (man-in-the-middle attack)!
compute-0-2: Offending ECDSA key in /root/.ssh/known_hosts:3
compute-0-1: SHA256:1XCBEBwL4CIsAB+XU1uGM8borPm6WR1p+V1isuiNRFE.
compute-0-4: Please contact your system administrator.
compute-0-2: Password authentication is disabled to avoid man-in-the-middle attacks.
compute-0-3: It is also possible that a host key has just been changed.
compute-0-1: Please contact your system administrator.
compute-0-4: Add correct host key in /root/.ssh/known_hosts to get rid of this message.
compute-0-2: Keyboard-interactive authentication is disabled to avoid man-in-the-middle attacks.
compute-0-3: The fingerprint for the ECDSA key sent by the remote host is
compute-0-1: Add correct host key in /root/.ssh/known_hosts to get rid of this message.
compute-0-4: Offending ECDSA key in /root/.ssh/known_hosts:5
compute-0-3: SHA256:mq67n0i6+BiZPVWWeHt5iORDaYXIiqsrhvzYQjr2YkQ.
compute-0-1: Offending ECDSA key in /root/.ssh/known_hosts:1
compute-0-4: Password authentication is disabled to avoid man-in-the-middle attacks.
compute-0-3: Please contact your system administrator.
compute-0-1: Password authentication is disabled to avoid man-in-the-middle attacks.
compute-0-4: Keyboard-interactive authentication is disabled to avoid man-in-the-middle attacks.
compute-0-3: Add correct host key in /root/.ssh/known_hosts to get rid of this message.
compute-0-1: Keyboard-interactive authentication is disabled to avoid man-in-the-middle attacks.
compute-0-3: Offending ECDSA key in /root/.ssh/known_hosts:4
compute-0-3: Password authentication is disabled to avoid man-in-the-middle attacks.
compute-0-3: Keyboard-interactive authentication is disabled to avoid man-in-the-middle attacks.
compute-0-5: @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
compute-0-5: @ WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED! @
compute-0-5: @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
compute-0-5: IT IS POSSIBLE THAT SOMEONE IS DOING SOMETHING NASTY!
compute-0-5: Someone could be eavesdropping on you right now (man-in-the-middle attack)!
compute-0-5: It is also possible that a host key has just been changed.
compute-0-5: The fingerprint for the ECDSA key sent by the remote host is
compute-0-5: SHA256:Ziq6yZmX0BRFHNwCD5U5425ef0B6ja4q+Q5h1Exs12U.
compute-0-5: Please contact your system administrator.
compute-0-5: Add correct host key in /root/.ssh/known_hosts to get rid of this message.
compute-0-5: Offending ECDSA key in /root/.ssh/known_hosts:6
compute-0-5: Password authentication is disabled to avoid man-in-the-middle attacks.
compute-0-5: Keyboard-interactive authentication is disabled to avoid man-in-the-middle attacks.
compute-0-2: Fri Jan 24 17:38:38 UTC 2020
compute-0-3: Fri Jan 24 17:38:38 UTC 2020
compute-0-1: Fri Jan 24 17:38:39 UTC 2020
compute-0-4: Fri Jan 24 17:38:37 UTC 2020
compute-0-5: Fri Jan 24 17:38:38 UTC 2020
compute-0-6: ssh: connect to host compute-0-6.local port 22: No route to host
pdsh@master: compute-0-6: ssh exited with exit code 255

I see two separate problems. One has to do with your ssh configuration:

compute-0-5: Offending ECDSA key in /root/.ssh/known_hosts:6

You should not run jobs as the root user. Also, it appears something has changed (IP address?) to invalidate the existing ssh keys.

The second problem is this:

compute-0-6: ssh: connect to host compute-0-6.local port 22: No route to host

This indicates there is something wrong with your network configuration.
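If the host keys changed for a reason you know about (for example, the nodes were reinstalled or re-addressed during the restart), the stale entries can be dropped with ssh-keygen -R. The sketch below only prints the commands unless RUN=1 is set; forget_host_keys is a made-up name, and it acts on the known_hosts file of whichever user runs it. Entries stored under a different name, such as the .local name or an IP address, would need their own ssh-keygen -R call.

```shell
#!/bin/bash
# Print (or, with RUN=1, execute) the ssh-keygen command that removes the
# stale known_hosts entry for each node given as an argument.
# (Hypothetical helper name; dry run by default.)
forget_host_keys() {
    for node in "$@"; do
        if [ "${RUN:-0}" = "1" ]; then
            ssh-keygen -R "$node"
        else
            echo "ssh-keygen -R $node"
        fi
    done
}

# The five nodes that printed the host key warning.
forget_host_keys compute-0-1 compute-0-2 compute-0-3 compute-0-4 compute-0-5
```

The next ssh connection to each node will then ask to confirm the new fingerprint, which you can compare with the SHA256 values in the warnings. This does nothing for compute-0-6, which is unreachable at the network level.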

Hi Michael,

What do you suggest I do? Everything was working perfectly until I restarted the server. By the way, we have two master nodes, and one of them works as a failover (depending on which one is available).

Vincent Appiah


Hello Michael,

Before I performed the restart, I used qdel $(qselect) to delete all existing jobs. This was done using the root account. Could that be the cause?
