Communication Failure

Hello Everyone,

I’ve been using OpenPBS for some weeks, and it was running fine. But I installed OpenPBS on another machine and set every service to 0 in /etc/pbs.conf in order to use it just as a job submitter, and I keep getting this error. It happens with any pbs or q command that I run:

[image: screenshot of the error]

Current config for the client machine:

[root@bcn-pbsClient ~]# cat /etc/pbs.conf
PBS_SERVER=bcn-pbsserver
PBS_START_SERVER=0
PBS_START_SCHED=0
PBS_START_COMM=0
PBS_START_MOM=0
PBS_EXEC=/opt/pbs
PBS_HOME=/var/spool/pbs
PBS_CORE_LIMIT=unlimited
PBS_SCP=/usr/bin/scp
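
A minimal sketch of how a submit-only configuration like the one above could be sanity-checked. The function name check_client_conf is hypothetical (it is not part of OpenPBS), and it assumes that every PBS_START_* entry should be an explicit 0, as in the file shown:

```shell
# check_client_conf FILE   (hypothetical helper, not an OpenPBS tool)
# Prints one line per problem found. Returns 0 when FILE looks like a
# submit-only client (all PBS_START_* set to 0, PBS_SERVER set), else 1.
check_client_conf() {
    awk -F= '
        /^[[:space:]]*#/ || NF < 2 { next }     # skip comments and blank lines
        { conf[$1] = $2 }                       # remember KEY=VALUE pairs
        END {
            bad = 0
            n = split("PBS_START_SERVER PBS_START_SCHED PBS_START_COMM PBS_START_MOM", keys, " ")
            for (i = 1; i <= n; i++) {
                k = keys[i]
                if (conf[k] != "0") {
                    printf "%s is \"%s\", expected 0\n", k, conf[k]
                    bad = 1
                }
            }
            if (conf["PBS_SERVER"] == "") {
                print "PBS_SERVER is not set"
                bad = 1
            }
            exit bad
        }
    ' "$1"
}
```

Running check_client_conf /etc/pbs.conf on the client prints nothing and returns 0 for the file shown above. It only inspects the file; it does not contact the server.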

SELinux is not enforcing (permissive mode) and firewalld is inactive:

[root@bcn-pbsClient ~]# getenforce
Permissive
[root@bcn-pbsClient ~]# systemctl status firewalld | grep inactive
Active: inactive (dead)

and I have the mappings in /etc/hosts:

[root@bcn-pbsClient ~]# cat /etc/hosts
127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4
::1 localhost localhost.localdomain localhost6 localhost6.localdomain6
10.10.44.112 bcn-pbsclient
10.10.44.120 bcn-pbsserver
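
A small sketch for confirming what a hosts file maps a name to. The helper name hosts_lookup is hypothetical. It matches names case-insensitively, since hostnames are case-insensitive (the prompt above shows bcn-pbsClient while the file has bcn-pbsclient):

```shell
# hosts_lookup FILE NAME   (hypothetical helper)
# Prints every IP address that FILE maps NAME to, one per line.
# More than one line of output means the name has several entries.
hosts_lookup() {
    awk -v name="$2" '
        { sub(/#.*/, "") }                      # drop trailing comments
        NF >= 2 {
            for (i = 2; i <= NF; i++)
                if (tolower($i) == tolower(name)) { print $1; break }
        }
    ' "$1"
}
```

For example, hosts_lookup /etc/hosts bcn-pbsserver prints 10.10.44.120 for the file above. To see what the system resolver actually returns, which may also consult DNS, getent hosts bcn-pbsserver is the more direct check.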

The logs on the server are:
12/15/2020 17:37:57;0080;Server@bcn-pbsserver;Req;req_reject;Reject reply code=15056, aux=0, type=0, from root@bcn-pbsclient
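
The log line above is a set of semicolon-separated fields, with the reply code and the requester inside the last field. A rough sketch for pulling those out of a server log; the function name reject_summary is hypothetical, and it assumes the message text has the same shape as the line shown:

```shell
# reject_summary   (hypothetical helper)
# Reads server log lines on stdin and prints
# "<timestamp> code=<reply code> from=<requester>" for each req_reject entry.
reject_summary() {
    awk -F';' '
        $5 == "req_reject" {
            code = $6
            sub(/.*reply code=/, "", code)      # keep what follows "reply code="
            sub(/,.*/, "", code)                # cut at the first comma
            from = $6
            sub(/.* from /, "", from)           # keep what follows " from "
            print $1, "code=" code, "from=" from
        }
    '
}
```

It would be fed from the server's log file, for example reject_summary < "$PBS_HOME/server_logs/20201215", assuming the usual one-file-per-day layout under the server's PBS_HOME.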

The other machines are working fine, but every new machine on which I install OpenPBS shows the same Communication Failure issue.
Any idea what could be happening?

Please check the version of the PBS clients; they should all be the same version.
Any version mismatch would give you this issue.

Please make sure SELinux is disabled on the PBS server. If you have only just disabled it, reboot the server.

Hello Adarsh,

All the machines have the same PBS version:

[root@bcn-pbsServer ~]# qstat --version
pbs_version = 20.0.0

SELinux is disabled on all the machines. I’ve rebooted them again just in case, and the issue still persists.

Any other ideas or things to check?

Thank you very much for your help

Thank you @baugarcia

Could you please share the /etc/pbs.conf of bcn-pbsserver and the output of the following?

  1. ping bcn-pbsserver from pbsClient
  2. ping bcn-pbsserver from bcn-pbsserver
  3. qstat -Bf (edit: run this command on the bcn-pbsserver)
  4. Please check this : Pbs_iff: error returned: 15031 / No Permission. / qstat: cannot connect to server host (errno+15007)

Hello Adarsh,

The client hostname is different this time (bcn-envsub), but the issue is the same.

1:
[root@bcn-envsub openpbs]# ping bcn-pbsserver
PING bcn-pbsserver (10.10.44.120) 56(84) bytes of data.
64 bytes from bcn-pbsserver (10.10.44.120): icmp_seq=1 ttl=64 time=0.381 ms
64 bytes from bcn-pbsserver (10.10.44.120): icmp_seq=2 ttl=64 time=0.331 ms

2:
[root@bcn-pbsserver ~]# ping bcn-pbsserver
PING bcn-pbsserver (10.10.44.120) 56(84) bytes of data.
64 bytes from bcn-pbsserver (10.10.44.120): icmp_seq=1 ttl=64 time=0.024 ms
64 bytes from bcn-pbsserver (10.10.44.120): icmp_seq=2 ttl=64 time=0.038 ms

3:
Server: bcn-pbsserver
server_state = Active
server_host = bcn-pbsserver
scheduling = True
total_jobs = 6878
state_count = Transit:0 Queued:0 Held:0 Waiting:0 Running:1 Exiting:0 Begun:0
acl_roots = username@*
operators = username@*
default_queue = workq
log_events = 511
mail_from = adm
query_other_jobs = True
resources_default.ncpus = 1
default_chunk.ncpus = 1
resources_assigned.ncpus = 1
resources_assigned.nodect = 1
scheduler_iteration = 600
flatuid = True
FLicenses = 20000000
resv_enable = True
node_fail_requeue = 310
max_array_size = 10000
pbs_license_min = 0
pbs_license_max = 2147483647
pbs_license_linger_time = 31536000
license_count = Avail_Global:10000000 Avail_Local:10000000 Used:0 High_Use:0
pbs_version = 20.0.0
eligible_time_enable = False
job_history_enable = True
max_concurrent_provision = 5
power_provisioning = False
max_job_sequence_id = 9999999

4: I’ve checked the permissions, and the SUID bit seems to be in place.

Thank you

[image: screenshot of the error]
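
For item 4, the SUID check can be sketched as below. The helper name check_suid is hypothetical, and the path /opt/pbs/sbin/pbs_iff in the usage note is an assumption based on PBS_EXEC=/opt/pbs from the client config. The function only tests the set-user-ID bit, not the file owner:

```shell
# check_suid FILE   (hypothetical helper)
# Reports whether FILE has the set-user-ID bit.
# Returns 0 if set, 1 if not set, 2 if FILE does not exist.
check_suid() {
    if [ ! -e "$1" ]; then
        echo "$1: not found"
        return 2
    elif [ -u "$1" ]; then
        echo "$1: set-user-ID bit is set"
        return 0
    else
        echo "$1: set-user-ID bit is NOT set"
        return 1
    fi
}
```

On the client this would be run as check_suid /opt/pbs/sbin/pbs_iff, with ls -l on the same file to confirm that the owner is root.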

The snapshot says it cannot connect to server bcn-pbs (errno=15031).
Is the server hostname truncated there?

I cannot think of anything else here. As a last resort, if you have not tried it already, uninstalling and reinstalling the pbspro-client-20.0*.rpm might help.

Yeah, I noticed the truncated name… I am guessing it is supposed to be this way, since any other name I type, or the IP address, also gets truncated.

And yeah, I am in the process of setting it up again and everything works, so I kind of wanted to know where the problem came from in case I face this issue in the future.

Thank you very much for your help!

Sorry to bump this, but I got the new machines into the old setup using the same OpenPBS build that was installed on the existing machines.
My guess is that the package in the repo changed while still reporting the same version, so the new install was broken even though everything appeared to be fine.

Just leaving this as FYI, in case someone else faces this issue.
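
Since the version string alone did not reveal the difference here, one way to compare two installs is by file checksums. This is only a sketch: the helper name differing_files is hypothetical, it assumes paths without spaces, and it expects two non-empty manifests in sha256sum output format:

```shell
# differing_files MANIFEST_A MANIFEST_B   (hypothetical helper)
# Each manifest is sha256sum output: "<hash>  <path>" per line.
# Prints, sorted, every path whose hash differs between the two
# manifests or that appears in only one of them.
differing_files() {
    awk '
        NR == FNR { a[$2] = $1; next }          # first manifest: hash per path
        {
            seen[$2] = 1
            if ((a[$2] "") != ($1 ""))          # force a string comparison
                print $2
        }
        END {
            for (p in a)
                if (!(p in seen)) print p       # path missing from second manifest
        }
    ' "$1" "$2" | sort
}
```

A manifest could be produced on each machine with something like sha256sum /opt/pbs/bin/* /opt/pbs/sbin/* > manifest.txt (the /opt/pbs prefix is taken from PBS_EXEC above). Copying a working machine's manifest and a failing machine's manifest to one host and running differing_files on them lists the binaries that are not identical.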