Warewulf cluster connection to OpenPBS failure

Hi,
I have an issue that is blocking further progress. I installed OpenPBS on a Warewulf cluster, and after finishing the installation and connecting some cluster nodes to the OpenPBS server, a problem occurred. I cannot submit any jobs because of the error shown in the attachments.


titan-n1
     Mom = titan-n1.ece.local
     Port = 15002
     pbs_version = 23.06.06
     ntype = PBS
     state = free
     pcpus = 32
     resources_available.arch = linux
     resources_available.host = titan-n1

The OpenPBS services run fine; only when booting from PXE do they sometimes have to be started manually. The system is Rocky Linux 9.6 (Blue Onyx).

Best regards,
Grzegorz

To make sure I understand the above correctly:

  • The compute nodes are booted (PXE boot).
  • The pbs_mom service is running; if it is not up, you start it manually
    [or you can make sure the pbs_mom service is started last, once all the system services are up and running].
  • pbsnodes -av # shows that all nodes are connected to the PBS server
  • A job is then submitted [does this job run fine, or does it fail?]
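For the pbsnodes step, a small helper can make it easier to see at a glance which nodes the server considers usable. This is only a sketch: node_states is a name I made up, and it assumes the usual pbsnodes layout where each node name sits on its own line and the attributes follow as "name = value" lines.

```shell
#!/usr/bin/env bash
# Sketch: reduce `pbsnodes -av` output to one "<node> <state>" line per node.
# Assumption: node names are lines without " = "; attributes are "name = value".

# node_states: reads pbsnodes output on stdin (hypothetical helper name).
node_states() {
    awk '
        NF == 0 { next }                       # skip blank separator lines
        !/ = /  { node = $1; next }            # a line without " = " is a node name
        $1 == "state" {                        # the state attribute of that node
            sub(/^[ \t]*state = /, "")
            print node, $0
        }
    '
}

# Example: pbsnodes -av | node_states
```

Any node that does not report "free" (for example "down" or "state-unknown,down") is worth checking first; on that node, `pgrep -x pbs_mom` shows whether the MoM daemon is running at all.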

Could you please share the job script? Also, did you try running a simple test job like the ones below?

qsub -- /bin/hostname
qsub -- /bin/sleep 10

Make sure that no permissions set on /var on the compute nodes are interfering with this.
In the screenshot above, the file copy (cp or scp, depending on the configuration) failed because the file did not exist.
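To check the /var permissions on a compute node, something like the following could be used. It is a sketch under assumptions: PBS_HOME is taken to be the default /var/spool/pbs (the real value is in /etc/pbs.conf), the function names are made up for illustration, and `stat -c` is the GNU coreutils form available on Rocky Linux.

```shell
#!/usr/bin/env bash
# Sketch: show mode and owner of the PBS directories under /var on a node.
# Assumption: PBS_HOME defaults to /var/spool/pbs (check /etc/pbs.conf).

# dir_mode PATH: print the octal permission bits of PATH (GNU stat).
dir_mode() {
    stat -c '%a' "$1"
}

# report_pbs_dirs [PBS_HOME]: print "mode owner:group path" for each directory
# that exists, or "missing: path" for each one that does not.
report_pbs_dirs() {
    local home=${1:-/var/spool/pbs} d
    for d in "$home" "$home/spool" "$home/undelivered" "$home/mom_priv"; do
        if [ -d "$d" ]; then
            stat -c '%a %U:%G %n' "$d"
        else
            echo "missing: $d"
        fi
    done
}

# Example: report_pbs_dirs /var/spool/pbs
```

As far as I know, on a default install the spool and undelivered directories are world-writable with the sticky bit set (mode 1777). On a Warewulf image it is worth comparing the output from a PXE-booted node with that of a node where jobs work, since a missing or differently owned directory would explain a failed copy.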

Please check that the ports are not blocked; see: PBS Implementation on AWS - #4 by adarsh
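A quick way to test the ports from the other side is sketched below. It assumes the default PBS ports (15001 server, 15002 MoM, 15003 MoM resource manager, 15004 scheduler, 17001 pbs_comm); if pbs.conf overrides them, adjust the list. The function names are made up for illustration, and the check uses bash's /dev/tcp feature, so bash must be installed.

```shell
#!/usr/bin/env bash
# Sketch: test whether the default PBS TCP ports answer on a given host.
# Assumption: default ports are in use; edit the list if pbs.conf changes them.

# check_port HOST PORT: return 0 if a TCP connection opens within 3 seconds.
check_port() {
    timeout 3 bash -c "exec 3<>/dev/tcp/$1/$2" 2>/dev/null
}

# check_pbs_ports HOST: print one status line per default PBS port.
check_pbs_ports() {
    local host=$1 port
    for port in 15001 15002 15003 15004 17001; do
        if check_port "$host" "$port"; then
            echo "port $port: open"
        else
            echo "port $port: closed or filtered"
        fi
    done
}

# Example, run on the server against a compute node: check_pbs_ports titan-n1
```

Not every port is expected to be open on every host: a compute node normally listens only on the MoM ports, while the server, scheduler and pbs_comm ports belong to the server host. So run it from the server against a node and from a node against the server, and compare with the firewalld rules on both.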