I have installed OpenPBS on a workstation with Ubuntu 20.04.
Everything works fine until I submit a job to a queue. The job's status in the queue is “R”, but the time stays fixed at 0:00 and nothing seems to happen. If I delete the job, an error file is generated with the following:
Host key verification failed.^M
[mpiexec@Precision-7920-Tower] HYDU_sock_write (utils/sock/sock.c:256): write error (Bad file descriptor)
[mpiexec@Precision-7920-Tower] HYD_pmcd_pmiserv_send_signal (pm/pmiserv/pmiserv_cb.c:178): unable to write data to proxy
[mpiexec@Precision-7920-Tower] ui_cmd_cb (pm/pmiserv/pmiserv_pmci.c:77): unable to send signal downstream
[mpiexec@Precision-7920-Tower] HYDT_dmxu_poll_wait_for_event (tools/demux/demux_poll.c:77): callback returned error status
[mpiexec@Precision-7920-Tower] HYD_pmci_wait_for_completion (pm/pmiserv/pmiserv_pmci.c:196): error waiting for event
[mpiexec@Precision-7920-Tower] main (ui/mpich/mpiexec.c:336): process manager error waiting for completion
I found similar posts, but none of them seems to solve my issue. From what I understand, the problem seems to be the hostname defined in /etc/hosts and /etc/pbs.conf.
In my case:
/etc/hosts
As I mentioned before, I tried changing the server name in /etc/hosts to match the one in /etc/pbs.conf
After I put
127.0.1.1 Precision-7920-Tower
in /etc/hosts
I cannot execute any pbs command:
sysadmin@Precision-7920-Tower:~/testMDX/test1$ qstat -q
Connection refused
qstat: cannot connect to server Precision-7920-Tower (errno=15010)
sysadmin@Precision-7920-Tower:~/testMDX/test1$ qsub opbs.job
Connection refused
qsub: cannot connect to server Precisi (errno=15010)
Note that these commands work perfectly with the mismatched server name.
I also tried restarting the PBS services, but it did not seem to make any difference.
Also:
sysadmin@Precision-7920-Tower:~/testMDX/test1$ /etc/init.d/pbs status
pbs_server is not running
pbs_mom is pid 8278
pbs_sched is not running
pbs_comm is not running
Not sure about the DNS thing, but I found this:
sysadmin@Precision-7920-Tower:~/testMDX/test1$ host Precision-7920-Tower
Precision-7920-Tower has address 10.80.135.53
sysadmin@Precision-7920-Tower:~/testMDX/test1$ host 10.80.135.53
53.135.80.10.in-addr.arpa domain name pointer Precision-7920-Tower.
I think I did not source the pbs.conf file earlier, so some of the results are now different.
hostname and hostname -A give the same result:
Precision-7920-Tower
sysadmin@Precision-7920-Tower:~/testMDX/test1$ ping $PBS_SERVER
PING Precision-7920-Tower (10.80.135.53) 56(84) bytes of data.
64 bytes from Precision-7920-Tower (10.80.135.53): icmp_seq=1 ttl=64 time=0.030 ms
64 bytes from Precision-7920-Tower (10.80.135.53): icmp_seq=2 ttl=64 time=0.032 ms
64 bytes from Precision-7920-Tower (10.80.135.53): icmp_seq=3 ttl=64 time=0.033 ms
64 bytes from Precision-7920-Tower (10.80.135.53): icmp_seq=4 ttl=64 time=0.032 ms
…
Since apparently the address for $PBS_SERVER is now 10.80.135.53 (which is also the address of the workstation, the one I use to connect via ssh), I changed the /etc/hosts file (I left the previous 127.0.1.1 line in as a comment):
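#127.0.1.1    Precision-7920-Tower
10.80.135.53  Precision-7920-Tower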
#The following lines are desirable for IPv6 capable hosts
::1 ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
So now we have a match between /etc/hosts and /etc/pbs.conf.
However, this does not solve the issue I wrote the post about: my job still hangs in “R” status, the clock does not change, and if I delete the job I get the error file with:
Host key verification failed.^M
[mpiexec@Precision-7920-Tower] HYDU_sock_write (utils/sock/sock.c:256): write error (Bad file descriptor)
[mpiexec@Precision-7920-Tower] HYD_pmcd_pmiserv_send_signal (pm/pmiserv/pmiserv_cb.c:178): unable to write data to proxy
[mpiexec@Precision-7920-Tower] ui_cmd_cb (pm/pmiserv/pmiserv_pmci.c:77): unable to send signal downstream
[mpiexec@Precision-7920-Tower] HYDT_dmxu_poll_wait_for_event (tools/demux/demux_poll.c:77): callback returned error status
[mpiexec@Precision-7920-Tower] HYD_pmci_wait_for_completion (pm/pmiserv/pmiserv_pmci.c:196): error waiting for event
[mpiexec@Precision-7920-Tower] main (ui/mpich/mpiexec.c:336): process manager error waiting for completion
Sorry for the confusion. Let me know if you need more information and thanks again for the patience and the help.
Whenever you are running an MPI job using multiple nodes, the data should be accessible by all the nodes, and the user running the job should have sufficient access to that data from each of the participating nodes.
Make sure passwordless SSH for that user works seamlessly across the nodes of the cluster (server to nodes, nodes to server, and between nodes). If this does not work, the MPI jobs will not run.
Also edit /etc/ssh/ssh_config and add the line below to that file:
StrictHostKeyChecking no
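On a single-machine setup like yours, a minimal way to get passwordless SSH to the node itself could look like this (user and host names are taken from your posts; adjust as needed):
# generate a key for the submitting user (accept defaults, empty passphrase)
ssh-keygen -t rsa
# authorize it on this same host
ssh-copy-id sysadmin@Precision-7920-Tower
# this should now print the hostname without any password or host-key prompt
ssh Precision-7920-Tower hostname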
After making the /etc/hosts update, were the PBS services running?
/etc/init.d/pbs status
ps -ef | grep pbs_
If they are running, then as a standard user, can you submit a simple job and run the usual status commands?
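For example (just an illustrative test job and the usual status commands):
echo "sleep 60" | qsub
qstat -answ
pbsnodes -aSjv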
I am running OpenPBS on a single machine, a workstation with 20 cores. I access the workstation remotely through ssh, but then I submit the job with qsub from inside the machine. There are no external nodes involved. That is also why I do not understand why there is a host key error, since I am sending the job to the machine from the machine itself.
Precision-7920-Tower:
                                                              Req'd  Req'd   Elap
Job ID                  Username Queue Jobname SessID NDS TSK Memory Time  S Time
24.Precision-7920-Tower sysadmin workq STDIN    10554   1   1    --    --  R 00:00
   precision-7920-tower/0
   Job run at Thu Jan 20 at 23:08 on (Precision-7920-Tower:ncpus=1)
sysadmin@Precision-7920-Tower:~$ pbsnodes -aSjv
mem ncpus nmics ngpus
vnode state njobs run susp f/t f/t f/t f/t jobs
Serial jobs run without problems. I can execute jobs and put more jobs in the queue, so I think access to the server and the queue is fine.
The issue I reported happens whenever I try to run a job in parallel; in fact, the error specifically refers to a problem with mpiexec.
I think the issue is related to the resources assigned to the server and the queue. I tried to increase the assigned resources, but without success. Could this be the problem?
P.S.
I tried to change the assigned ncpus with qmgr but to no avail:
Qmgr: set server Precision-7920-Tower resources_assigned.mpiprocs = 20
qmgr obj=Precision-7920-Tower svr=Precision-7920-Tower: Cannot set attribute, read only or insufficient permission resources_assigned.mpiprocs
I also tried the same on the node:
Qmgr: set node precision-7920-tower resources_assigned.mpiprocs = 20
qmgr obj=precision-7920-tower svr=default: Cannot set attribute, read only or insufficient permission resources_assigned.mpiprocs
(My node is precision-7920-tower, like the server name but in lower case.) Note that I executed qmgr with sudo; I managed to change other attributes, but not resources_assigned.
Ignore my PS. I realized that these values change whenever a job is sent: once I sent a job with 20 cpus, resources_assigned.mpiprocs was (correctly) changed to 20.
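So resources_assigned apparently just reflects the running jobs and is computed by the server, which would explain the "read only" error above; the attribute that can actually be set by hand appears to be resources_available, e.g.:
qmgr -c "set node precision-7920-tower resources_available.ncpus = 20"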
The issue still remains that parallel jobs are not executed, and once I kill the job I receive those mpiexec messages.
Please check that passwordless SSH works seamlessly for the respective users between server and nodes (server to node, node to node, node to server), without asking for a password.
Please try to run mpirun with a populated hostfile without using OpenPBS. If this runs successfully, then it should run fine using OpenPBS as well.
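For example, something along these lines (MPICH/Hydra syntax; with OpenMPI the hostfile flag is --hostfile):
# hosts.txt just lists the local node, which runs all the ranks
echo "Precision-7920-Tower" > hosts.txt
mpiexec -f hosts.txt -n 20 hostname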
Thanks for the reply.
I am still confused about the passwordless thing. I am using openPBS on a single machine. I am not using external nodes or servers. I am connecting to the machine with SSH and password, but then I am sending the job using openPBS on the machine from the machine itself.
Maybe I am missing something but I do not need to do ssh on the server or different nodes. I only use SSH to access the workstation remotely.
Regarding the second point, I am able to run mpirun with all the available cores (20) without problems.
MPI uses either rsh or ssh as its launcher. If your MPI is compiled with the OpenPBS TM libraries, then PBS takes care of the inter-node communication. See: Openmpi support - #5 by adarsh
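For reference, with OpenMPI that typically means rebuilding it with TM support, roughly like this (the PBS install prefix below is just an example):
# OpenMPI: enable PBS TM support at configure time
./configure --with-tm=/opt/pbs
make -j && sudo make install
mpirun will then launch the ranks through PBS instead of ssh.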