New PBS Nodes failing to register

I previously had OpenPBS running fine in a cloud environment, but I’m now unable to get new nodes to register. It seems like the Scheduler isn’t learning about the new nodes at all.

On the compute node we don’t see anything suspicious in the mom_logs.

On the scheduler:

[root@ip-10-102-101-252 comm_logs]# qstat
Job id            Name             User              Time Use S Queue
----------------  ---------------- ----------------  -------- - -----
1062.ip-10-102-1* hello-world      my-user                 0 Q test
[root@ip-10-102-101-252 comm_logs]# cat /etc/pbs.conf
PBS_SERVER=ip-10-102-101-252
PBS_START_SERVER=1
PBS_START_SCHED=1
PBS_START_COMM=1
PBS_START_MOM=0
PBS_EXEC=/opt/pbs
PBS_HOME=/var/spool/pbs
PBS_CORE_LIMIT=unlimited
PBS_SCP=/usr/bin/scp
[root@ip-10-102-101-252 comm_logs]# pbsnodes -av
pbsnodes: Server has no node list

Nothing on the Scheduler looks obviously suspicious in the server_logs, mom_logs or comm_logs.

In the server logs we see requests like this come in (from compute nodes):

11/30/2023 22:32:58;0100;Server@ip-10-102-101-252;Req;;Type 19 request received from user@ip-10-102-101-133.cloud.domain, sock=16

I’ve been able to manually create the node with a command like this:

qmgr -c "create node ip-10-102-101-133.cloud.domain"

But previously, this would have been done automatically.

Note that we have a setup where the FQDN is the same as the canonical name, i.e. hostname will return something like ip-10-102-101-133.cloud.domain. This setup has worked for us in the past.

Are there other places to look to investigate why the compute nodes don’t seem to register with the Scheduler on their own?

The node should be manually added to the PBS Server using:
qmgr -c "create node nodename"

Nodes are not automatically added to the PBS Server. It is a manual process.
Usually, if you are using a cloud, there is a script (from the cloud vendor or a custom one) that adds each node once it is online and ready in the cloud environment.
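
As a rough illustration of what such a script might look like, here is a minimal sketch. The function name register_node is hypothetical, and it assumes it runs on the server host (or as a user with PBS manager privileges) and that pbsnodes exits non-zero for a node the server does not know about. It only wraps the qmgr command shown above, and it reports a failure instead of hiding it.

```shell
#!/bin/sh
# Hypothetical helper: register a compute node with the PBS server
# if the server does not already know about it.
register_node() {
    node="$1"
    if [ -z "$node" ]; then
        echo "usage: register_node <hostname>" >&2
        return 2
    fi

    # Assumption: pbsnodes exits non-zero for a node the server does not know.
    if pbsnodes "$node" >/dev/null 2>&1; then
        echo "node $node already registered"
        return 0
    fi

    # Same command as the manual step; report failure loudly so a broken
    # registration step does not go unnoticed.
    if ! qmgr -c "create node $node"; then
        echo "ERROR: failed to create node $node" >&2
        return 1
    fi
    echo "node $node created"
}
```

It could be called from whatever provisioning hook you already have, for example register_node ip-10-102-101-133.cloud.domain, with the hook checking the return code.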


Thank you! This was a key piece that I hadn’t fully appreciated.

It turns out that our script that was invoking qmgr had been broken but I didn’t notice it previously.
