I previously had OpenPBS running fine in a cloud environment, but I’m now unable to get new nodes to register. It seems like the Scheduler isn’t learning about the new nodes at all.
On the compute node we don’t see anything suspicious in the mom_logs.
On the scheduler:
[root@ip-10-102-101-252 comm_logs]# qstat
Job id            Name             User              Time Use S Queue
----------------  ---------------- ----------------  -------- - -----
1062.ip-10-102-1* hello-world      my-user                  0 Q test
[root@ip-10-102-101-252 comm_logs]# cat /etc/pbs.conf
PBS_SERVER=ip-10-102-101-252
PBS_START_SERVER=1
PBS_START_SCHED=1
PBS_START_COMM=1
PBS_START_MOM=0
PBS_EXEC=/opt/pbs
PBS_HOME=/var/spool/pbs
PBS_CORE_LIMIT=unlimited
PBS_SCP=/usr/bin/scp
[root@ip-10-102-101-252 comm_logs]# pbsnodes -av
pbsnodes: Server has no node list
Nothing on the Scheduler looks obviously suspicious in the server_logs, mom_logs or comm_logs.
In the server logs we see requests like this come in (from compute nodes):
11/30/2023 22:32:58;0100;Server@ip-10-102-101-252;Req;;Type 19 request received from user@ip-10-102-101-133.cloud.domain, sock=16
I’ve been able to manually create the node with a command like this:
qmgr -c "create node ip-10-102-101-133.cloud.domain"
But previously, this would have been done automatically.
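As a stopgap while the automatic registration is broken, the manual step above could be scripted roughly like this. It is only a sketch: the create_nodes function and the DRY_RUN switch are names made up for this example (they are not PBS settings), and the host name passed in is just the one from the log line above. By default it only prints the qmgr commands so they can be reviewed before anything is run.

```shell
#!/bin/sh
# Sketch only: wraps the manual "qmgr -c 'create node ...'" step from this post.
# DRY_RUN is a hypothetical switch for this example, not a PBS variable.
# DRY_RUN=1 (default) prints the commands; DRY_RUN=0 actually runs qmgr.
DRY_RUN="${DRY_RUN:-1}"

create_nodes() {
    for host in "$@"; do
        if [ "$DRY_RUN" = "1" ]; then
            # Show what would be executed, one command per host.
            printf 'qmgr -c "create node %s"\n' "$host"
        else
            # Must be run on the server host by a user with manager rights.
            qmgr -c "create node $host"
        fi
    done
}
```

Calling create_nodes ip-10-102-101-133.cloud.domain in the default dry-run mode prints the same qmgr command shown above. This does not explain why the nodes stopped registering on their own; it only automates the workaround.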
Note that we have a setup where the FQDN is the same as the canonical name, i.e. hostname will return something like ip-10-102-101-133.cloud.domain. This setup has worked for us in the past.
Are there other places to look to investigate why the compute nodes don’t seem to register with the Scheduler on their own?