If pbs_mom cannot resolve the FQDN of the pbs_server, this issue occurs. In that case, the pbs_mom logs show entries like the following:
08/02/2018 02:50:34;0100;pbs_mom;Job;6.BLRLAP796;allowed brema to access window station and desktop, User brema passworded
08/02/2018 02:50:34;0001;pbs_mom;Svr;pbs_mom;No error (0) in finish_exec, cannot open qsub sock for 6.BLRLAP796
08/02/2018 02:50:34;0008;pbs_mom;Job;6.BLRLAP796;cannot open qsub sock for 6.BLRLAP796
08/02/2018 02:50:35;0100;pbs_mom;Job;6.BLRLAP796;task 00000001 cput= 0:00:00
08/02/2018 02:50:35;0008;pbs_mom;Job;6.BLRLAP796;kill_job
08/02/2018 02:50:35;0100;pbs_mom;Job;6.BLRLAP796;spark cput= 0:00:00 mem=0kb
08/02/2018 02:50:35;0100;pbs_mom;Job;6.BLRLAP796;Obit sent
08/02/2018 02:50:35;0100;pbs_mom;Req;;Type 6 request received from brema@192.168.10.1:15001, sock=1
08/02/2018 02:50:35;0080;pbs_mom;Job;6.BLRLAP796;delete job request received
08/02/2018 02:50:35;0008;pbs_mom;Job;6.BLRLAP796;kill_job
Interestingly, there is no job directory or file related to this job on node0115:
[node0115 ~]# ls -ltr /cm/local/apps/pbspro-ce/var/spool/mom_priv/jobs/29679
ls: cannot access /cm/local/apps/pbspro-ce/var/spool/mom_priv/jobs/29679: No such file or directory
bremanandjk suggested “If pbs_mom could not resolve the FQDN of the pbs_server, this issue will happen.”
I had the same error, with only interactive jobs failing, and it turned out I had the wrong IP address for the login node in the /etc/hosts file on the head node/PBS server. Hence the FQDN lookup returned the wrong address.
I did find that the login node's IP in its own /etc/hosts was inconsistent with the IP listed in the PBS server's /etc/hosts and in the compute nodes' /etc/hosts files.
I went ahead and corrected this error in the login node's /etc/hosts file, but still received the 'apparently deleted' error.
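One quick way to check for this kind of inconsistency is to compare what each node actually resolves for the server's name. A minimal sketch, assuming a standard PBS Pro install where the server name is set via `PBS_SERVER` in `/etc/pbs.conf`:

```shell
# On each node (head node, login node, compute node), check what the
# PBS server name resolves to, using the same lookup path pbs_mom uses.
server=$(grep '^PBS_SERVER' /etc/pbs.conf | cut -d= -f2)
getent hosts "$server"

# The IP printed here should be identical on every node; a mismatch
# means one of the /etc/hosts files (or DNS) is out of sync.
```

If the addresses differ between nodes, fix the stale /etc/hosts entry and restart pbs_mom on the affected compute nodes.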
Maybe the pbs_mom on this compute node needs to be restarted? I’ll try that next and check over my /etc/hosts afresh…
If that does not work I’d suggest trying the following (which mimics essentially what PBS is doing to make the connection in the interactive job):
Submit an interactive job that we know will not run (give it a really high ncpus request so that it will remain queued).
While that job is queued, look at it in qstat -f and note the full value of PBS_O_HOST in the Variable_List attribute.
On the same host where qsub -I is waiting for the job to start, run netstat -anp | grep qsub | grep LISTEN and note the port number (after the “0.0.0.0:”).
Now log into node0115 as root and issue the command “telnet X Y”, where X is the PBS_O_HOST value from qstat -f and Y is the port number from netstat.
Do you get “Escape character is ‘^]’.”, or something else?
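The steps above can be sketched as a shell session. The ncpus value, job ID, and node name are placeholders; substitute your own:

```shell
# 1. On the submit node, request an interactive job that will stay queued
#    (an absurdly high ncpus request guarantees it cannot be scheduled).
qsub -I -l select=1:ncpus=9999

# 2. In a second shell on the same host, find PBS_O_HOST in the queued
#    job's Variable_List (replace <jobid> with the ID qsub reported).
qstat -f <jobid> | grep -o 'PBS_O_HOST=[^,]*'

# 3. Still on the submit host, find the port the waiting qsub listens on.
netstat -anp | grep qsub | grep LISTEN

# 4. From the compute node, test the connection back to qsub:
#    X = the PBS_O_HOST value, Y = the port from the netstat output.
telnet X Y
```

If telnet prints “Escape character is '^]'.”, the mom can reach qsub's listener; any other result (timeout, “No route to host”, “Connection refused”) points at routing, firewall, or name-resolution problems on that path.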
I work with @sijisaula and am assisting with this as well.
I followed your instructions and did not receive the standard telnet response of “Escape character is ‘^]’.” Instead, I got the following (leaving out the actual address):
Trying [$PBS_O_HOST address] …
telnet: connect to address [$PBS_O_HOST address]: No route to host
However, the same $PBS_O_HOST can be pinged successfully from node0115 (and other compute nodes).
I think I see where you’re going with this. The port that the qsub process listens on for each interactive job has to be open on the firewall? I opened that specific port while the job was still queued and telnet now returns the expected response.
However, it looks like for each new interactive job, qsub listens on a different random high-numbered port. If I were to create a general firewall rule, what range of ports should I include to guarantee that interactive qsub processes are always accepted?
This problem seems to be the same issue I currently have. The recommendation was to make sure a range of ports was open. Nothing has changed on our setup, other than a reboot, but interactive jobs stopped working.
Hi, in my case it was sufficient to open the ports only on the submit node (the head node, i.e. PBS_O_HOST).
The ephemeral port range on CentOS 8 is 32768-60999.
You can check it on your Linux system with: sudo sysctl net.ipv4.ip_local_port_range
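As a sketch, assuming the submit node runs firewalld (the default on CentOS 8; adjust for iptables or other firewalls), opening the whole ephemeral range looks like:

```shell
# Confirm the ephemeral port range qsub's listener will be assigned from.
sysctl net.ipv4.ip_local_port_range
# e.g. net.ipv4.ip_local_port_range = 32768 60999

# On the submit node, open that TCP range so compute nodes can connect
# back to the waiting qsub process for interactive jobs.
firewall-cmd --permanent --add-port=32768-60999/tcp
firewall-cmd --reload
```

Opening the range only on the submit node is enough here because the connection for an interactive job is initiated from the compute node back to qsub on PBS_O_HOST.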
Regards!