qstat: "invalid credential" caused by name resolution

I’ve been using OpenPBS 19.1.1 for a long time to build Azure HPC solutions without issues. Recently we started using an Azure Private DNS zone that our scheduler VM belongs to. This has introduced communication issues, resulting in errors such as “invalid credential” when running qstat, or errors when launching qmgr, whether they are run from the scheduler VM or from a login node.
Our setup uses the short name for the scheduler (named scheduler).
After some analysis, it appears that because of the Azure Private DNS zone there are two records resolving the IP address of the scheduler, and they are not always returned in the same order, as shown below.

[root@scheduler server_logs]# nslookup 10.174.0.21
21.0.174.10.in-addr.arpa name = scheduler.internal.cloudapp.net.
21.0.174.10.in-addr.arpa name = scheduler.hpc.azure.

Authoritative answers can be found from:

[root@scheduler server_logs]# nslookup 10.174.0.21
21.0.174.10.in-addr.arpa name = scheduler.hpc.azure.
21.0.174.10.in-addr.arpa name = scheduler.internal.cloudapp.net.

Authoritative answers can be found from:
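
As a rough way to spot this situation on other hosts, the reverse-lookup output above can be parsed to list every PTR name returned for an address. This is only a minimal sketch: the helper name `ptr_names` is hypothetical, and it assumes the `nslookup` output format shown above.

```python
import re


def ptr_names(nslookup_output):
    """Return the PTR names found in `nslookup <ip>` output, in the order listed."""
    names = []
    for line in nslookup_output.splitlines():
        # Matches lines such as:
        # 21.0.174.10.in-addr.arpa name = scheduler.hpc.azure.
        match = re.search(r"in-addr\.arpa\s+name\s*=\s*(\S+)", line)
        if match:
            names.append(match.group(1).rstrip("."))
    return names


# Sample taken from the first lookup above.
sample = """\
21.0.174.10.in-addr.arpa name = scheduler.internal.cloudapp.net.
21.0.174.10.in-addr.arpa name = scheduler.hpc.azure.

Authoritative answers can be found from:
"""

names = ptr_names(sample)
print(names)
print("more than one PTR name:", len(set(names)) > 1)
```

If more than one name comes back, the name seen for a given connection may depend on which record is returned first.
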
Looking at the server_logs, it appears that some connections are made under the internal.cloudapp.net domain and others under the hpc.azure domain, and requests end up being rejected, as shown below.

12/28/2022 18:16:57;0040;Server@scheduler;Svr;scheduler.hpc.azure;Scheduler sent command 3
12/28/2022 18:16:57;0040;Server@scheduler;Svr;scheduler.hpc.azure;Scheduler sent command 0
12/28/2022 18:16:57;0100;Server@scheduler;Req;;Type 21 request received from Scheduler@scheduler.internal.cloudapp.net, sock=16
12/28/2022 18:16:57;0100;Server@scheduler;Req;;Type 81 request received from Scheduler@scheduler.internal.cloudapp.net, sock=16
12/28/2022 18:16:57;0100;Server@scheduler;Req;;Type 71 request received from Scheduler@scheduler.internal.cloudapp.net, sock=16
12/28/2022 18:16:57;0100;Server@scheduler;Req;;Type 58 request received from Scheduler@scheduler.internal.cloudapp.net, sock=16
12/28/2022 18:16:57;0080;Server@scheduler;Req;req_reject;Reject reply code=15064, aux=0, type=58, from Scheduler@scheduler.internal.cloudapp.net
12/28/2022 18:17:12;0040;Server@scheduler;Svr;scheduler.hpc.azure;Scheduler sent command 3
12/28/2022 18:17:12;0040;Server@scheduler;Svr;scheduler.hpc.azure;Scheduler sent command 0
12/28/2022 18:17:12;0100;Server@scheduler;Req;;Type 21 request received from Scheduler@scheduler.internal.cloudapp.net, sock=16
12/28/2022 18:17:12;0100;Server@scheduler;Req;;Type 81 request received from Scheduler@scheduler.internal.cloudapp.net, sock=16
12/28/2022 18:17:12;0100;Server@scheduler;Req;;Type 71 request received from Scheduler@scheduler.internal.cloudapp.net, sock=16
12/28/2022 18:17:12;0100;Server@scheduler;Req;;Type 58 request received from Scheduler@scheduler.internal.cloudapp.net, sock=16
12/28/2022 18:17:12;0080;Server@scheduler;Req;req_reject;Reject reply code=15064, aux=0, type=58, from Scheduler@scheduler.internal.cloudapp.net
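
To see at a glance which host names the server records and which requests it rejects, the log lines above can be summarized with a short script. This is a sketch only: it assumes the semicolon-separated layout and the message wording visible in this excerpt, and `summarize` is a hypothetical helper, not part of PBS.

```python
import re
from collections import Counter

REQUEST_RE = re.compile(r"Type (\d+) request received from (\S+)@([^,\s]+)")
REJECT_RE = re.compile(r"Reject reply code=(\d+), aux=\d+, type=(\d+), from (\S+)@(\S+)")


def summarize(log_text):
    """Count requests per client host and collect rejects as (code, type, host)."""
    requests = Counter()
    rejects = []
    for line in log_text.splitlines():
        # The first five fields are metadata; the sixth is the message text.
        fields = line.split(";", 5)
        if len(fields) < 6:
            continue
        message = fields[5]
        match = REQUEST_RE.search(message)
        if match:
            requests[match.group(3)] += 1
            continue
        match = REJECT_RE.search(message)
        if match:
            rejects.append((int(match.group(1)), int(match.group(2)), match.group(4)))
    return requests, rejects


# A few lines copied from the excerpt above.
sample_log = """\
12/28/2022 18:16:57;0040;Server@scheduler;Svr;scheduler.hpc.azure;Scheduler sent command 3
12/28/2022 18:16:57;0100;Server@scheduler;Req;;Type 21 request received from Scheduler@scheduler.internal.cloudapp.net, sock=16
12/28/2022 18:16:57;0100;Server@scheduler;Req;;Type 58 request received from Scheduler@scheduler.internal.cloudapp.net, sock=16
12/28/2022 18:16:57;0080;Server@scheduler;Req;req_reject;Reject reply code=15064, aux=0, type=58, from Scheduler@scheduler.internal.cloudapp.net
"""

requests, rejects = summarize(sample_log)
print(dict(requests))
print(rejects)
```
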

I’ve tried to use scheduler.hpc.azure in pbs.conf, without success.

How can I always use the short name instead of the FQDN in this case, or force the same FQDN to be used for all communications?

Thank you
Xavier Pillons
Principal Technical Program Manager
Azure Specialized Workloads HPC/AI - Customer Solutions and Incubation
Microsoft Corporation

Please check the PBS Professional 2022.1 Administrator’s Guide, page AG-422
Section : Table 9-1: Parameters in pbs.conf
Parameters: PBS_SERVER_HOST_NAME, PBS_LEAF_NAME

Thank you for the reply. I’ve now set up a new OpenPBS 22 installation, but it’s still unclear what I need to set in these variables.
My scheduler’s IP resolves to two names, and the order in which they are returned is random. Here is what I’ve set and tried:

  • PBS_SERVER_HOST_NAME=scheduler.internal.cloudapp.net, PBS_LEAF_NAME=scheduler.hpc.azure, PBS_SERVER=scheduler => KO
  • PBS_SERVER_HOST_NAME=scheduler.hpc.azure, PBS_LEAF_NAME=scheduler.internal.cloudapp.net, PBS_SERVER=scheduler => KO
  • PBS_SERVER_HOST_NAME=scheduler, PBS_LEAF_NAME=scheduler, PBS_SERVER=scheduler => Service is not starting at all
  • PBS_SERVER_HOST_NAME=scheduler, PBS_LEAF_NAME=scheduler.hpc.azure, PBS_SERVER=scheduler => Service is not starting at all

KO means that qstat run from the scheduler will sometimes return “Invalid credential”.
So it’s unclear how to solve this, or whether it can be solved with PBS configuration settings at all.

Thank you @xpillons for the above information.

Please check the man page of pbs_sched:

   -c <clientsfile>
                Add clients to this scheduler's list of known clients.  The clientsfile contains single-line entries of the form
                    $clienthost <hostname>

                Each hostname is added to the list of hosts allowed to connect to this scheduler.  If clientsfile cannot be opened, this scheduler aborts.  Path can be absolute or relative.  If relative, it is relative to PBS_HOME/sched_priv.

**Please try this:**
1. Create a file $PBS_HOME/sched_priv/myclientsfile on the PBS server.
2. Put the following contents in this file (myclientsfile):
    $clienthost  scheduler.hpc.azure
    $clienthost  scheduler.internal.cloudapp.net
3. kill -9 pid_of_pbs_scheduler
4. $PBS_EXEC/sbin/pbs_sched -c $PBS_HOME/sched_priv/myclientsfile
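
If the list of names changes, the contents for step 2 can be generated rather than typed by hand. The sketch below only builds the text in the `$clienthost <hostname>` form quoted from the man page. The helper name `clients_file_text` is hypothetical, and writing the result to $PBS_HOME/sched_priv/myclientsfile is left to the administrator.

```python
def clients_file_text(hostnames):
    """Return clientsfile contents: one `$clienthost <hostname>` line per unique name."""
    unique = []
    for name in hostnames:
        # Drop surrounding whitespace and the trailing dot that DNS tools print.
        name = name.strip().rstrip(".")
        if name and name not in unique:
            unique.append(name)
    return "".join(f"$clienthost {name}\n" for name in unique)


# Names taken from the reverse lookups earlier in the thread.
text = clients_file_text(["scheduler.hpc.azure", "scheduler.internal.cloudapp.net."])
print(text, end="")
```
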

Thank you @adarsh, I’ve tried what you suggested, without success.
It looks like the connection is rejected not by the scheduler but by the server process, as I can see connections being rejected in the server_logs, as shown below. Here my scheduler VM is named openpbs. The same errors happen with a different local user too.

01/03/2023 12:55:38;0100;Server@openpbs;Req;;Type 0 request received from root@openpbs.internal.cloudapp.net, sock=18
01/03/2023 12:55:38;0100;Server@openpbs;Req;;Type 95 request received from root@openpbs.internal.cloudapp.net, sock=20
01/03/2023 12:55:38;0100;Server@openpbs;Req;;Type 98 request received from root@openpbs.internal.cloudapp.net, sock=18
01/03/2023 12:55:38;0100;Server@openpbs;Req;;Type 98 request received from root@openpbs.hpc.azure, sock=18
01/03/2023 12:55:38;0080;Server@openpbs;Req;req_reject;Reject reply code=15004, aux=0, type=98, from root@openpbs.hpc.azure
01/03/2023 12:55:38;0002;Server@openpbs;Sched;default;scheduler disconnected