Job sent on OpenPBS hangs in R status

Greetings,

I have installed OpenPBS on a workstation running Ubuntu 20.04.
Everything works fine until I submit a job to a queue. The status in the queue is “R”, but the time stays fixed at 0:00 and nothing seems to happen. If I delete the job, an error file is generated with the following:

Host key verification failed.^M
[mpiexec@Precision-7920-Tower] HYDU_sock_write (utils/sock/sock.c:256): write error (Bad file descriptor)
[mpiexec@Precision-7920-Tower] HYD_pmcd_pmiserv_send_signal (pm/pmiserv/pmiserv_cb.c:178): unable to write data to proxy
[mpiexec@Precision-7920-Tower] ui_cmd_cb (pm/pmiserv/pmiserv_pmci.c:77): unable to send signal downstream
[mpiexec@Precision-7920-Tower] HYDT_dmxu_poll_wait_for_event (tools/demux/demux_poll.c:77): callback returned error status
[mpiexec@Precision-7920-Tower] HYD_pmci_wait_for_completion (pm/pmiserv/pmiserv_pmci.c:196): error waiting for event
[mpiexec@Precision-7920-Tower] main (ui/mpich/mpiexec.c:336): process manager error waiting for completion

I found similar posts, but none of them solves my issue. From what I gathered, the problem seems to lie in the hostname defined in /etc/hosts and /etc/pbs.conf.
In my case:
/etc/hosts

127.0.0.1 localhost
127.0.1.1 sysadmin-Precision-7920-Tower

# The following lines are desirable for IPv6 capable hosts

::1 ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters

/etc/pbs.conf:

PBS_SERVER=Precision-7920-Tower
PBS_START_SERVER=1
PBS_START_SCHED=1
PBS_START_COMM=1
PBS_START_MOM=1
PBS_EXEC=/opt/pbs
PBS_HOME=/var/spool/pbs
PBS_CORE_LIMIT=unlimited
PBS_SCP=/usr/bin/scp

I tried to change sysadmin-Precision-7920-Tower to Precision-7920-Tower in /etc/hosts, but then I receive an error when I try to submit a job.

Thanks in advance,

Giuliano

The hostname in /etc/hosts

127.0.1.1 sysadmin-Precision-7920-Tower

and PBS_SERVER in /etc/pbs.conf

PBS_SERVER=Precision-7920-Tower

do not match.

Please note:

  1. DNS
  2. hostname resolution / reverse resolution to the same IP address is important

As I mentioned before, I tried changing the server name in /etc/hosts to match the one in /etc/pbs.conf.
After I put

127.0.1.1 Precision-7920-Tower

in /etc/hosts, I cannot execute any PBS command:

sysadmin@Precision-7920-Tower:~/testMDX/test1$ qstat -q
Connection refused
qstat: cannot connect to server Precision-7920-Tower (errno=15010)
sysadmin@Precision-7920-Tower:~/testMDX/test1$ qsub opbs.job
Connection refused
qsub: cannot connect to server Precisi (errno=15010)

Take note that these commands work perfectly with the mismatched server name.
I also tried restarting the PBS services, but it does not seem to make any difference.

Could you please elaborate on this?

Thank you very much!

  1. Could you please browse this document: https://resources.altair.com/resfile_web_path/file-en/PBSWorks_Whitepaper_PBSProfessional_VirtualTestCluster_Ubuntu_02202020.pdf

  2. State=unknown, down. PBSr20 on CentOS8.1 - #2 by adarsh

  3. In your setup what is the output of


      hostname
      hostname -A
      hostname -i
      source /etc/pbs.conf
      pbs_hostn -v $PBS_SERVER
      ping $PBS_SERVER

Thanks for the assistance.

  1. I tried to click on the link, but it is just a blank screen with the message “Sorry, no access.”

  2. There is no SELinux as far as I can tell, and the ports are open:

sysadmin@Precision-7920-Tower:~/testMDX/test1$ sudo lsof -i -P -n | grep LISTEN
[sudo] password for sysadmin:
systemd-r 836 systemd-resolve 13u IPv4 37898 0t0 TCP 127.0.0.53:53 (LISTEN)
cupsd 866 root 6u IPv6 34676 0t0 TCP [::1]:631 (LISTEN)
cupsd 866 root 7u IPv4 34677 0t0 TCP 127.0.0.1:631 (LISTEN)
sshd 1023 root 3u IPv4 40175 0t0 TCP *:22 (LISTEN)
sshd 1023 root 4u IPv6 40177 0t0 TCP *:22 (LISTEN)
slurmctld 1059 slurm 4u IPv4 18135 0t0 TCP *:6817 (LISTEN)
postgres 1079 postgres 3u IPv4 23967 0t0 TCP 127.0.0.1:5432 (LISTEN)
sendmail- 2243 root 4u IPv4 37499 0t0 TCP 127.0.0.1:25 (LISTEN)
sendmail- 2243 root 5u IPv4 37500 0t0 TCP 127.0.0.1:587 (LISTEN)
sshd 7914 sysadmin 10u IPv6 76126 0t0 TCP [::1]:6010 (LISTEN)
sshd 7914 sysadmin 11u IPv4 76127 0t0 TCP 127.0.0.1:6010 (LISTEN)
pbs_comm 8268 root 13u IPv4 76161 0t0 TCP *:17001 (LISTEN)
pbs_mom 8278 root 5u IPv4 66390 0t0 TCP *:15002 (LISTEN)
pbs_mom 8278 root 6u IPv4 66391 0t0 TCP *:15003 (LISTEN)
postgres 8380 postgres 5u IPv4 70249 0t0 TCP *:15007 (LISTEN)
postgres 8380 postgres 6u IPv6 70250 0t0 TCP *:15007 (LISTEN)
pbs_serve 8415 root 7u IPv4 76164 0t0 TCP *:15001 (LISTEN)

The pbs_* services are running:
sysadmin@Precision-7920-Tower:~/testMDX/test1$ ps -ef | grep pbs_
root 8268 1 0 19:48 ? 00:00:00 /opt/pbs/sbin/pbs_comm
root 8278 1 0 19:48 ? 00:00:00 /opt/pbs/sbin/pbs_mom
root 8290 1 0 19:48 ? 00:00:00 /opt/pbs/sbin/pbs_sched
root 8355 1 0 19:48 ? 00:00:00 /opt/pbs/sbin/pbs_ds_monitor monitor
postgres 8414 8380 0 19:48 ? 00:00:00 postgres: postgres pbs_datastore 10.80.135.53(35966) idle
root 8415 1 0 19:48 ? 00:00:00 /opt/pbs/sbin/pbs_server.bin
sysadmin 8804 7916 0 20:06 pts/0 00:00:00 grep --color=auto pbs_

Also:
sysadmin@Precision-7920-Tower:~/testMDX/test1$ /etc/init.d/pbs status
pbs_server is not running
pbs_mom is pid 8278
pbs_sched is not running
pbs_comm is not running

Not sure about the DNS thing, but I found this:

sysadmin@Precision-7920-Tower:~/testMDX/test1$ host Precision-7920-Tower
Precision-7920-Tower has address 10.80.135.53
sysadmin@Precision-7920-Tower:~/testMDX/test1$ host 10.80.135.53
53.135.80.10.in-addr.arpa domain name pointer Precision-7920-Tower.

  1. I think I did not source the pbs.conf file earlier so now some of the results are different.

hostname and hostname -A give the same result:
Precision-7920-Tower

hostname -i gives the following:
10.80.135.53

sysadmin@Precision-7920-Tower:~/testMDX/test1$ pbs_hostn -v $PBS_SERVER
primary name: Precision-7920-Tower (from gethostbyname())
aliases: -none-
address length: 4 bytes
address: 10.80.135.53 (898060298 dec) name: Precision-7920-Tower
sysadmin@Precision-7920-Tower:~/testMDX/test1$

sysadmin@Precision-7920-Tower:~/testMDX/test1$ ping $PBS_SERVER
PING Precision-7920-Tower (10.80.135.53) 56(84) bytes of data.
64 bytes from Precision-7920-Tower (10.80.135.53): icmp_seq=1 ttl=64 time=0.030 ms
64 bytes from Precision-7920-Tower (10.80.135.53): icmp_seq=2 ttl=64 time=0.032 ms
64 bytes from Precision-7920-Tower (10.80.135.53): icmp_seq=3 ttl=64 time=0.033 ms
64 bytes from Precision-7920-Tower (10.80.135.53): icmp_seq=4 ttl=64 time=0.032 ms

Since $PBS_SERVER now apparently resolves to 10.80.135.53 (which is the workstation’s own address, the one I use to connect over ssh), I changed the /etc/hosts file (I left the previous 127.0.1.1 line as a comment):

sysadmin@Precision-7920-Tower:~/testMDX/test1$ cat /etc/hosts
127.0.0.1 localhost
#127.0.1.1 sysadmin-Precision-7920-Tower
10.80.135.53 Precision-7920-Tower

#The following lines are desirable for IPv6 capable hosts
::1 ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters

So now we have a match between /etc/hosts and /etc/pbs.conf.
However, this does not solve the issue I wrote the post about: my job still hangs in R status, the clock does not advance, and if I delete the job I get the error file with:

Host key verification failed.^M
[mpiexec@Precision-7920-Tower] HYDU_sock_write (utils/sock/sock.c:256): write error (Bad file descriptor)
[mpiexec@Precision-7920-Tower] HYD_pmcd_pmiserv_send_signal (pm/pmiserv/pmiserv_cb.c:178): unable to write data to proxy
[mpiexec@Precision-7920-Tower] ui_cmd_cb (pm/pmiserv/pmiserv_pmci.c:77): unable to send signal downstream
[mpiexec@Precision-7920-Tower] HYDT_dmxu_poll_wait_for_event (tools/demux/demux_poll.c:77): callback returned error status
[mpiexec@Precision-7920-Tower] HYD_pmci_wait_for_completion (pm/pmiserv/pmiserv_pmci.c:196): error waiting for event
[mpiexec@Precision-7920-Tower] main (ui/mpich/mpiexec.c:336): process manager error waiting for completion

Sorry for the confusion. Let me know if you need more information and thanks again for the patience and the help.

To avoid this issue:

  1. Edit /etc/ssh/ssh_config and add the line below to that file:

StrictHostKeyChecking no

Whenever you run an MPI job using multiple nodes, the data should be accessible by all the nodes, and the user running the job should have sufficient access to that data from each of the participating nodes.

Make sure passwordless SSH for that user works seamlessly across the nodes (server to nodes, nodes to server, and between nodes) of the cluster. If this does not work, the MPI jobs will not run.
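If you prefer not to touch the system-wide /etc/ssh/ssh_config, the same setting can go in the submitting user's own SSH config. A minimal sketch, assuming the hostname from this thread; the Host pattern is an assumption, so scope it to your own machines:

```shell
# Per-user alternative to editing /etc/ssh/ssh_config (a sketch):
mkdir -p ~/.ssh
cat >> ~/.ssh/config <<'EOF'
Host Precision-7920-Tower localhost 127.0.0.1
    StrictHostKeyChecking no
EOF
chmod 600 ~/.ssh/config
```

ssh reads ~/.ssh/config automatically, so no service restart is needed.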

After making the /etc/hosts update, were the pbs services running?

/etc/init.d/pbs status 
 ps -ef | grep pbs_

If they are running, then as a standard user can you submit a simple job and run the following commands:

 qsub -- /bin/sleep 1000
 qstat -answ1
 pbsnodes -aSjv

I am running OpenPBS on a single machine, a workstation with 20 cores. I access the workstation remotely through ssh, but then I submit the job with qsub from inside the machine. There are no external nodes involved. That is also why I do not understand why there is a host key error, since I am sending the job to the machine from the machine itself.

It seems only Mom is running:

sysadmin@Precision-7920-Tower:~$ /etc/init.d/pbs status
pbs_server is not running
pbs_mom is pid 8278
pbs_sched is not running
pbs_comm is not running
sysadmin@Precision-7920-Tower:~$ ps -ef | grep pbs_
root 8268 1 0 19:48 ? 00:00:00 /opt/pbs/sbin/pbs_comm
root 8278 1 0 19:48 ? 00:00:00 /opt/pbs/sbin/pbs_mom
root 8290 1 0 19:48 ? 00:00:00 /opt/pbs/sbin/pbs_sched
root 8355 1 0 19:48 ? 00:00:01 /opt/pbs/sbin/pbs_ds_monitor monitor
postgres 8414 8380 0 19:48 ? 00:00:00 postgres: postgres pbs_datastore 10.80.135.53(35966) idle
root 8415 1 0 19:48 ? 00:00:00 /opt/pbs/sbin/pbs_server.bin
sysadmin 10541 10418 0 23:07 pts/0 00:00:00 grep --color=auto pbs_

sysadmin@Precision-7920-Tower:~$ qsub -- /bin/sleep 1000
24.Precision-7920-Tower
sysadmin@Precision-7920-Tower:~$ qstat -answ1

Precision-7920-Tower:
                                                              Req'd  Req'd   Elap
Job ID                  Username Queue Jobname SessID NDS TSK Memory Time  S Time
----------------------- -------- ----- ------- ------ --- --- ------ ----- - -----
24.Precision-7920-Tower sysadmin workq STDIN   10554  1   1   --     --    R 00:00 precision-7920-tower/0
   Job run at Thu Jan 20 at 23:08 on (Precision-7920-Tower:ncpus=1)
sysadmin@Precision-7920-Tower:~$ pbsnodes -aSjv
                                                mem   ncpus nmics ngpus
vnode           state    njobs run susp     f/t   f/t   f/t   f/t jobs
--------------- -------- ----- --- ---- ------- ----- ----- ----- ----
precision-7920- Stale        0   0    0 93gb/93gb 20/20  0/0   0/0 --
Precision-7920- free         1   1    0 93gb/93gb 19/20  0/0   0/0 24

It looks like the job is running without problems.

OK, here is an update:

Serial jobs run without problems. I can execute jobs and put more jobs in the queue, so I think access to the server and the queue is fine.
The issue I reported happens whenever I try to run a job in parallel. In fact, the error specifically refers to a problem with mpiexec.

I think the issue is related to the resources assigned to the server and the queue. In fact, if I do:

sysadmin@Precision-7920-Tower:~/testMDX/test3$ qmgr
Max open servers: 49
Qmgr: list server Precision-7920-Tower
Server Precision-7920-Tower
server_state = Active
server_host = precision-7920-tower
scheduling = True
total_jobs = 0
state_count = Transit:0 Queued:0 Held:0 Waiting:0 Running:0 Exiting:0 Begun:0
managers = root@Precision-7920-Tower
default_queue = workq
log_events = 511
mailer = /usr/sbin/sendmail
mail_from = adm
query_other_jobs = True
resources_default.ncpus = 1
default_chunk.ncpus = 1
resources_max.mpiprocs = 20
resources_max.ncpus = 20
resources_assigned.mpiprocs = 0
resources_assigned.ncpus = 0
resources_assigned.nodect = 0
scheduler_iteration = 600
resv_enable = True
node_fail_requeue = 310
max_array_size = 10000
pbs_license_min = 0
pbs_license_max = 2147483647
pbs_license_linger_time = 31536000
license_count = Avail_Global:1000000 Avail_Local:1000000 Used:0 High_Use:0
pbs_version = 20.0.0
eligible_time_enable = False
max_concurrent_provision = 5
max_job_sequence_id = 9999999

The same goes for the queue:

Qmgr: list queue test
Queue test
queue_type = Execution
total_jobs = 0
state_count = Transit:0 Queued:0 Held:0 Waiting:0 Running:0 Exiting:0 Begun:0
resources_max.mpiprocs = 20
resources_max.ncpus = 20
resources_max.walltime = 24:00:00
resources_default.ncpus = 1
resources_default.nodect = 1
resources_default.nodes = 1
resources_assigned.mpiprocs = 0
resources_assigned.ncpus = 0
resources_assigned.nodect = 0
enabled = True
started = True

I tried to increase the assigned resources but without success. Could this be the problem?

P.S.

I tried to change the assigned ncpus with qmgr, but to no avail:
Qmgr: set server Precision-7920-Tower resources_assigned.mpiprocs = 20
qmgr obj=Precision-7920-Tower svr=Precision-7920-Tower: Cannot set attribute, read only or insufficient permission resources_assigned.mpiprocs

I also tried the same on the node:
Qmgr: set node precision-7920-tower resources_assigned.mpiprocs = 20
qmgr obj=precision-7920-tower svr=default: Cannot set attribute, read only or insufficient permission resources_assigned.mpiprocs

(My node is precision-7920-tower, like the server but lowercase.) Note that I executed qmgr with sudo. I managed to change other attributes, but not the assigned_resources.

Ignore my P.S. I realized that these values change whenever a job is submitted: once I submitted a job with 20 CPUs, resources_assigned.mpiprocs was changed (correctly) to 20.
The issue still remains that parallel jobs are not executed, and once I kill the job I receive those mpiexec messages.

  1. Please check that passwordless SSH works between server2node, node2node, and node2server for the respective users; it should work seamlessly without asking for a password.

  2. Please try to run mpirun with a populated hostfile without using OpenPBS; if this runs successfully, then it should run fine using OpenPBS as well.
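For point 2, on this single 20-core box the hostfile test could look like the sketch below. The hostname and core count are taken from this thread, and ./a.out is a placeholder for your MPI binary:

```shell
# Build a one-entry hostfile for the local machine (MPICH/Hydra syntax;
# Open MPI would instead use "Precision-7920-Tower slots=20"):
printf 'Precision-7920-Tower:20\n' > hostfile
cat hostfile
# Then, outside of PBS:
#   mpirun -n 20 -f hostfile ./a.out
```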

Thanks for the reply.
I am still confused about the passwordless thing. I am using OpenPBS on a single machine; I am not using external nodes or servers. I connect to the machine with SSH and a password, but then I submit the job using OpenPBS on the machine, from the machine itself.
Maybe I am missing something, but I do not think I need to ssh to the server or to different nodes. I only use SSH to access the workstation remotely.
Regarding the second point, I am able to run mpirun with all the available cores (20) without problems.

  • MPI uses either rsh or ssh as its launcher; if your MPI is compiled with the OpenPBS TM libraries, then PBS will take care of the internode communication.
    Openmpi support - #5 by adarsh

Please check this link: c - MPI programs hanging up - Stack Overflow

Please share the batch script and env variables used .

The Stack Overflow link solved it. I can’t believe it.
I used -launcher fork and now I can run jobs in parallel.
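For anyone landing here later, a minimal job script with that workaround might look like the sketch below. The job name, core count, and ./a.out are assumptions based on this thread:

```shell
# Write a single-node PBS job script using "-launcher fork": Hydra then
# starts its proxies directly instead of spawning them over ssh, so no
# host key verification is involved on a single machine.
cat > opbs_fork.job <<'EOF'
#!/bin/bash
#PBS -N mpi_test
#PBS -l select=1:ncpus=20:mpiprocs=20
cd "$PBS_O_WORKDIR"
mpiexec -launcher fork -n 20 ./a.out
EOF
# Submit with: qsub opbs_fork.job
```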

Thank you very much!

Very good! Thank you