Hi,
Since today, and I don’t know why, every job I submit stays queued.
What should I do to find out what the problem is?
Yesterday everything was running perfectly.
Thanks a lot for your help.
Please share the output of the commands below:
Note;
$ qstat -answ1
returns nothing.
$ pbsnodes -av
returns:
centos7
Mom = centos7.home
ntype = PBS
state = state-unknown,down
pcpus = 1
resources_available.host = centos7
resources_available.ncpus = 1
resources_available.vnode = centos7
resources_assigned.accelerator_memory = 0kb
resources_assigned.hbmem = 0kb
resources_assigned.mem = 0kb
resources_assigned.naccelerators = 0
resources_assigned.ncpus = 0
resources_assigned.vmem = 0kb
resv_enable = True
sharing = default_shared
$ pbsnodes -av | grep -e Mom -e state
returns:
Mom = centos7.home
state = state-unknown,down
Please check whether the job's resource requests can be matched to the available compute resources; otherwise the job will stay in the queue.
I am resubmitting jobs that I have already run successfully, so the compute resources should normally be sufficient.
I do not understand the problem.
So there are no jobs in the queue. Please submit a sample sleep job as below:
qsub -- /bin/sleep 100
The compute node's state is down, which is why the job stays in the queue.
Why the node is down: communication issues between the PBS Server and the PBS MoM (compute node).
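To spot down nodes quickly on a larger cluster, something like the sketch below may help. It only filters the text that pbsnodes -av prints; down_nodes is a name I made up for illustration, and the sample input is copied from the output earlier in this thread. It assumes the usual pbsnodes layout, where the vnode name is on a line by itself.

```shell
#!/bin/sh
# Sketch: list vnodes whose state contains "down" or "offline".
# down_nodes is a hypothetical helper; it reads pbsnodes-style text on stdin.
down_nodes() {
    awk '
        # A line with a single field is taken to be the vnode name.
        NF == 1 { node = $1; next }
        # State lines look like: "state = state-unknown,down"
        $1 == "state" && $2 == "=" {
            if ($3 ~ /down|offline/) print node ": " $3
        }
    '
}

# Demo with the output posted above (no PBS installation needed):
down_nodes <<'EOF'
centos7
     Mom = centos7.home
     ntype = PBS
     state = state-unknown,down
EOF

# On a real server host the call would be:
#   pbsnodes -av | down_nodes
```

If a node shows up here, the usual next step is to check that pbs_mom is running on that node and that the server and the node can resolve and reach each other by hostname.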
Hi. I also have a problem. When I write a job submission script and specify a particular node name, the job stays in the queue:
#PBS -l nodes=compunode-0-3.local
After submitting the job, which stays in the queue, I run qstat -answ1 and get this error:
Can Never Run: Insufficient amount of resource: host (compunode-0-3.local !=compunode-0-1,compunode-0-2,compunode.-0-3,…
We have the following restrictions on the server for every user (PBS_GENERIC):
max_run=3
max_run_res.ncpus=72
max_run_res.nodect=2
max_queued=2
Please share the output of the command below:
pbsnodes computenode-0-3.local
Can you please try this command:
qsub -l host=computenode-0-3 -- /bin/sleep 100
It seems the “mom name” does not match the request.
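One way to check which name the request has to match is to print the resources_available.host value next to each vnode name. The sketch below does that by filtering pbsnodes -av text; vnode_hosts is a hypothetical helper name, and the sample input assumes a node whose vnode name carries a .local suffix while its host resource does not.

```shell
#!/bin/sh
# Sketch: print "<vnode> <resources_available.host>" for every vnode.
# vnode_hosts is a hypothetical helper; it reads pbsnodes-style text on stdin.
vnode_hosts() {
    awk '
        # A line with a single field is taken to be the vnode name.
        NF == 1 { node = $1; next }
        $1 == "resources_available.host" && $2 == "=" { print node, $3 }
    '
}

# Demo with assumed sample data (no PBS installation needed):
vnode_hosts <<'EOF'
compute-0-3.local
     Mom = compute-0-3.local
     state = free
     resources_available.host = compute-0-3
EOF

# On a real server host the call would be:
#   pbsnodes -av | vnode_hosts
```

The second column is the value to use in a host= request, for example -l host=compute-0-3 rather than the .local form.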
These are the results.
[user@login ~]# pbsnodes compute-0-3.local
Node: compute-0-3.local, Error: Unknown node
[user@login ~]# pbsnodes compute-0-3
compute-0-3.local
Mom = compute-0-3.local
ntype = PBS
state = free
pcpus = 36
resources_available.arch = linux
resources_available.host = compute-0-3
resources_available.mem = 263727076kb
resources_available.ncpus = 36
resources_available.ngpus = 2
resources_available.vnode = compute-0-3.local
resources_assigned.accelerator_memory = 0kb
resources_assigned.mem = 0kb
resources_assigned.naccelerators = 0
resources_assigned.ncpus = 0
resources_assigned.ngpus = 0
resources_assigned.vmem = 0kb
resv_enable = True
sharing = default_shared
[user@login ~]$ qsub -l host=compute-0-3 – /bin/sleep 100
usage: qsub [-a date_time] [-A account_string] [-c interval]
[-C directive_prefix] [-e path] [-f ] [-h ] [-I [-X]] [-j oe|eo] [-J X-Y[:Z]]
[-k keep] [-l resource_list] [-m mail_options] [-M user_list]
[-N jobname] [-o path] [-p priority] [-P project] [-q queue] [-r y|n]
[-R o|e|oe] [-S path] [-u user_list] [-W otherattributes=value…]
[-S path] [-u user_list] [-W otherattributes=value…]
[-v variable_list] [-V ] [-z] [script | -- command [arg1 …]]
qsub --version
The command should be (two plain hyphens before the command, with no space between them):
qsub -l host=compute-0-3 -- /bin/sleep 100
qsub < hyphen >< l for london >< space >host=compute-0-3< space >< hyphen >< hyphen >< space >/bin/sleep 100
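Forum software and word processors often turn two hyphens into a single typographic dash, and qsub then prints its usage message because it no longer sees the -- separator. A small check like the sketch below can catch this before submitting; has_fancy_dash is a name invented for this example, and it assumes the pasted text is UTF-8.

```shell
#!/bin/sh
# Sketch: detect a typographic dash in a pasted command line.
en_dash=$(printf '\342\200\223')   # U+2013 (en dash) encoded as UTF-8
em_dash=$(printf '\342\200\224')   # U+2014 (em dash) encoded as UTF-8

# has_fancy_dash is a hypothetical helper: it succeeds if $1 contains either dash.
has_fancy_dash() {
    case "$1" in
        *"$en_dash"*|*"$em_dash"*) return 0 ;;
        *) return 1 ;;
    esac
}

# The command as it was pasted from the forum page:
cmd='qsub -l host=compute-0-3 – /bin/sleep 100'
if has_fancy_dash "$cmd"; then
    echo "non-ASCII dash found: retype it as two plain hyphens (--)"
fi
```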
This is the command and output
qsub -l host=compute-0-3 -- /bin/sleep 100
80126.master1.local
Thank you !
Yes it did run. Here are the other outputs
qstat -answ1
100363.master1.local user workq STDIN 17184 1 1 -- -- R 00:00:00 compute-0-3
Job run at Tue Dec 11 at 15:58 on (compute-0-3.local:ncpus=1)
qstat -fx 100363
Job Id: 100363.master1.local
Job_Name = STDIN
Job_Owner = user@login.local
resources_used.cpupercent = 0
resources_used.cput = 00:00:00
resources_used.mem = 348kb
resources_used.ncpus = 1
resources_used.vmem = 4316kb
resources_used.walltime = 00:01:40
job_state = F
queue = workq
server = master1.local
Checkpoint = u
ctime = Tue Dec 11 15:58:25 2018
Error_Path = login.local:/home/user/STDIN.e100363
exec_host = compute-0-3.local/0
exec_vnode = (compute-0-3.local:ncpus=1)
Hold_Types = n
Join_Path = n
Keep_Files = n
Mail_Points = a
mtime = Tue Dec 11 16:00:06 2018
Output_Path = login.local:/home/user/STDIN.o100363
Priority = 0
qtime = Tue Dec 11 15:58:25 2018
Rerunable = True
Resource_List.host = compute-0-3
Resource_List.ncpus = 1
Resource_List.nodect = 1
Resource_List.place = pack
Resource_List.select = 1:host=compute-0-3:ncpus=1
stime = Tue Dec 11 15:58:25 2018
session_id = 17184
jobdir = /home/user
substate = 92
Variable_List = PBS_O_HOME=/home/user,PBS_O_LANG=en_US.UTF-8,
PBS_O_LOGNAME=user,
PBS_O_PATH=/opt/apps/intel/compilers_and_libraries_2017.4.196/linux/mp
i/intel64/bin:/opt/apps/intel/compilers_and_libraries_2017.4.196/linux/
bin/intel64:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/opt/ibut
ils/bin:/opt/pbs/bin:/home/user/.local/bin:/home/user/bin,
PBS_O_MAIL=/var/spool/mail/user,PBS_O_SHELL=/bin/bash,
PBS_O_WORKDIR=/home/user,PBS_O_SYSTEM=Linux,PBS_O_QUEUE=workq,
PBS_O_HOST=login.local
comment = Job run at Tue Dec 11 at 15:58 on (compute-0-3.local:ncpus=1) and
finished
etime = Tue Dec 11 15:58:25 2018
run_count = 1
Stageout_status = 1
Exit_status = 0
Submit_arguments = -l host=compute-0-3 -- /bin/sleep 100
executable = <jsdl-hpcpa:Executable>/bin/sleep</jsdl-hpcpa:Executable>
argument_list = <jsdl-hpcpa:Argument>100</jsdl-hpcpa:Argument>
history_timestamp = 1544544006
project = _pbs_project_default
Thank you. It is all working now.
Do you still see any issues?
But what is the syntax for selecting nodes in PBS?
I am using PBS Pro version 17 and I get an error when I use the directive below:
#PBS -l host=compute-0-3:ncpus=10 -l mem=10GB
The error is:
Illegal attribute or resource value Resource_List.select
I got it now. The correct syntax is:
#PBS -l host=compute-0-3 -l ncpus=10 -l mem=2GB
Thank you very much, Adarsh.
I have detected an issue.
When I enter the host name in the job submission script, the job is submitted to a different host.
E.g. if I select compute node 5:
#PBS -l host=compute-0-5 -l ncpus=18 -l mem=32GB
the job gets submitted to whichever compute node is free (starting from 1, 2, 3 or higher).
But when I select the node using the interactive option, such as
qsub -I -l host=compute-0-5 -l ncpus=10 -l mem=2GB
it works.
Submit a job to a particular host:
qsub -l select=1:ncpus=2:mem=32gb:host=compute-0-5 -- /bin/sleep 100
Submit a job and let PBS decide where to run:
qsub -l select=1:ncpus=2:mem=32gb -- /bin/sleep 100
To run a job with two 1-CPU chunks on two different nodes, requesting 16 GB on each node:
qsub -l select=2:ncpus=1:mem=16gb -l place=scatter -- /bin/sleep 100
To run a job on two specific hosts:
qsub -l nodes=compute-0-3+compute-0-5 -- /bin/sleep 100
qsub -l select=1:ncpus=1:mem=10gb:host=compute-0-3+1:mem=10gb:host=compute-0-5 -- /bin/sleep 100
Please read this section of the admin guide: 5.4.9 Job-wide vs. Chunk Resources
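Since the original question was about the job script rather than the command line, here is a sketch of the same chunk request written as a directive. The file name sleep_job.pbs, the job name and the resource values are only examples, and I have not tested this on your version, so treat it as a starting point. The snippet just writes the script to disk; nothing is submitted.

```shell
#!/bin/sh
# Sketch: write a job script that pins one chunk to a specific host.
# sleep_job.pbs is a hypothetical file name; adjust host, ncpus and mem to your site.
cat > sleep_job.pbs <<'EOF'
#!/bin/bash
#PBS -N sleep_test
#PBS -l select=1:ncpus=10:mem=2gb:host=compute-0-5
/bin/sleep 100
EOF

# Submit it afterwards with:
#   qsub sleep_job.pbs
```

Keeping host, ncpus and mem together inside one select chunk mirrors the qsub -l select=... examples above.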
Hi Adarsh,
I restarted our HPC system and now all jobs are stuck in the queue.
I ran ‘pdsh date’ and got the output below:
ompute-0-2: @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
compute-0-2: @ WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED! @
compute-0-2: @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
compute-0-4: @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
compute-0-2: IT IS POSSIBLE THAT SOMEONE IS DOING SOMETHING NASTY!
compute-0-1: @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
compute-0-4: @ WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED! @
compute-0-2: Someone could be eavesdropping on you right now (man-in-the-middle attack)!
compute-0-3: @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
compute-0-1: @ WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED! @
compute-0-4: @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
compute-0-2: It is also possible that a host key has just been changed.
compute-0-2: The fingerprint for the ECDSA key sent by the remote host is
compute-0-1: @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
compute-0-4: IT IS POSSIBLE THAT SOMEONE IS DOING SOMETHING NASTY!
compute-0-3: @ WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED! @
compute-0-4: Someone could be eavesdropping on you right now (man-in-the-middle attack)!
compute-0-1: IT IS POSSIBLE THAT SOMEONE IS DOING SOMETHING NASTY!
compute-0-2: SHA256:gFtWek5zWZJQCnNQwycLTUtZXiZudM5S9H+xkK860ok.
compute-0-3: @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
compute-0-4: It is also possible that a host key has just been changed.
compute-0-1: Someone could be eavesdropping on you right now (man-in-the-middle attack)!
compute-0-2: Please contact your system administrator.
compute-0-4: The fingerprint for the ECDSA key sent by the remote host is
compute-0-1: It is also possible that a host key has just been changed.
compute-0-3: IT IS POSSIBLE THAT SOMEONE IS DOING SOMETHING NASTY!
compute-0-2: Add correct host key in /root/.ssh/known_hosts to get rid of this message.
compute-0-1: The fingerprint for the ECDSA key sent by the remote host is
compute-0-4: SHA256:o+nbSQpNh7f7YyQwP3myiY9H0LtKhwHWLBTlkXfggkE.
compute-0-3: Someone could be eavesdropping on you right now (man-in-the-middle attack)!
compute-0-2: Offending ECDSA key in /root/.ssh/known_hosts:3
compute-0-1: SHA256:1XCBEBwL4CIsAB+XU1uGM8borPm6WR1p+V1isuiNRFE.
compute-0-4: Please contact your system administrator.
compute-0-2: Password authentication is disabled to avoid man-in-the-middle attacks.
compute-0-3: It is also possible that a host key has just been changed.
compute-0-1: Please contact your system administrator.
compute-0-4: Add correct host key in /root/.ssh/known_hosts to get rid of this message.
compute-0-2: Keyboard-interactive authentication is disabled to avoid man-in-the-middle attacks.
compute-0-3: The fingerprint for the ECDSA key sent by the remote host is
compute-0-1: Add correct host key in /root/.ssh/known_hosts to get rid of this message.
compute-0-4: Offending ECDSA key in /root/.ssh/known_hosts:5
compute-0-3: SHA256:mq67n0i6+BiZPVWWeHt5iORDaYXIiqsrhvzYQjr2YkQ.
compute-0-1: Offending ECDSA key in /root/.ssh/known_hosts:1
compute-0-4: Password authentication is disabled to avoid man-in-the-middle attacks.
compute-0-3: Please contact your system administrator.
compute-0-1: Password authentication is disabled to avoid man-in-the-middle attacks.
compute-0-4: Keyboard-interactive authentication is disabled to avoid man-in-the-middle attacks.
compute-0-3: Add correct host key in /root/.ssh/known_hosts to get rid of this message.
compute-0-1: Keyboard-interactive authentication is disabled to avoid man-in-the-middle attacks.
compute-0-3: Offending ECDSA key in /root/.ssh/known_hosts:4
compute-0-3: Password authentication is disabled to avoid man-in-the-middle attacks.
compute-0-3: Keyboard-interactive authentication is disabled to avoid man-in-the-middle attacks.
compute-0-5: @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
compute-0-5: @ WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED! @
compute-0-5: @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
compute-0-5: IT IS POSSIBLE THAT SOMEONE IS DOING SOMETHING NASTY!
compute-0-5: Someone could be eavesdropping on you right now (man-in-the-middle attack)!
compute-0-5: It is also possible that a host key has just been changed.
compute-0-5: The fingerprint for the ECDSA key sent by the remote host is
compute-0-5: SHA256:Ziq6yZmX0BRFHNwCD5U5425ef0B6ja4q+Q5h1Exs12U.
compute-0-5: Please contact your system administrator.
compute-0-5: Add correct host key in /root/.ssh/known_hosts to get rid of this message.
compute-0-5: Offending ECDSA key in /root/.ssh/known_hosts:6
compute-0-5: Password authentication is disabled to avoid man-in-the-middle attacks.
compute-0-5: Keyboard-interactive authentication is disabled to avoid man-in-the-middle attacks.
compute-0-2: Fri Jan 24 17:38:38 UTC 2020
compute-0-3: Fri Jan 24 17:38:38 UTC 2020
compute-0-1: Fri Jan 24 17:38:39 UTC 2020
compute-0-4: Fri Jan 24 17:38:37 UTC 2020
compute-0-5: Fri Jan 24 17:38:38 UTC 2020
compute-0-6: ssh: connect to host compute-0-6.local port 22: No route to host
pdsh@master: compute-0-6: ssh exited with exit code 255
I see two separate problems. One has to do with your ssh configuration:
compute-0-5: Offending ECDSA key in /root/.ssh/known_hosts:6
You should not run jobs as the root user. Also, it appears something has changed (IP address?) to invalidate the existing ssh keys.
The second problem is this:
compute-0-6: ssh: connect to host compute-0-6.local port 22: No route to host
This indicates there is something wrong with your network configuration.
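For the ssh part, if you are sure the host keys changed for a legitimate reason (for example the nodes were reinstalled), the stale entries can be removed with ssh-keygen -R and fresh ones collected with ssh-keyscan. The sketch below only prints the commands so you can review them first; print_key_refresh is a made-up helper name and the host list is an assumption taken from your pdsh output.

```shell
#!/bin/sh
# Sketch: print (do not run) the commands that refresh known_hosts entries.
# print_key_refresh is a hypothetical helper; pass it the affected host names.
print_key_refresh() {
    for host in "$@"; do
        # Remove the outdated key for this host from known_hosts.
        echo "ssh-keygen -R $host"
        # Fetch the current key (hashed); verify the fingerprint before trusting it.
        echo "ssh-keyscan -H $host >> ~/.ssh/known_hosts"
    done
}

print_key_refresh compute-0-1 compute-0-2 compute-0-3 compute-0-4 compute-0-5
```

Compare the new fingerprints with the ones shown on each node's console before accepting them, since the warning exists to catch exactly the case where the change was not legitimate.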
Hi Michael,
What do you suggest I do? Everything was working perfectly until I restarted the server. By the way, we have two master nodes, and one of them acts as a failover (depending on which one is available).
Vincent Appiah
January 24
I see two separate problems. One has to do with your ssh configuration…
vincent718:
compute-0-5: Offending ECDSA key in /root/.ssh/known_hosts:6
You should not run jobs as the root user. Also, it appears something has changed (IP address?) to invalidate the existing ssh keys.
The second problem is this…
vincent718:
compute-0-6: ssh: connect to host compute-0-6.local port 22: No route to host
This indicates there is something wrong with your network configuration.
Hello Michael,
Before I performed the restart, I used qdel $(qselect) to delete all existing jobs. This was done using the root account. Could that be the cause?