Hello Everyone,
I’m using OpenPBS as the job scheduler in an Azure HPC environment (AZHop).
We recently had an autoscaling problem: our PBS log clearly showed a certain level of usage over a weekend, but far more compute nodes were allocated than were actually used.
When I asked Microsoft for a reimbursement, after a long analysis and discussion they told me that the problem is due to a known OpenPBS issue where the hooks may simply stop firing for the allocated resources at some point, and they suggested updating to the latest CycleCloud version.
Can anyone explain what this known issue is?
OpenPBS does not scale by itself. There has to be middleware that is aware of the load (jobs in the cloudq, or in the queue(s) associated with the cloud) and that then instantiates additional resources in the cloud. If you could share an overview of your setup and your workflow to the cloud, that would help community members understand the problem and share their experience.
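To make the idea concrete, here is a minimal sketch of the kind of decision such middleware makes. It is not CycleCloud's actual logic; the function name, its parameters, and the one-queue model are all assumptions made up for illustration.

```python
def nodes_to_add(queued_node_requests, idle_nodes, busy_nodes, max_nodes):
    """Return how many new cloud nodes a hypothetical autoscaler would request.

    queued_node_requests: node counts requested by jobs waiting in the cloud queue
    idle_nodes:           nodes already running but not executing any job
    busy_nodes:           nodes currently executing jobs
    max_nodes:            upper limit on the size of the cluster
    """
    demand = sum(queued_node_requests)
    # Idle nodes can absorb part of the demand before anything new is started.
    shortfall = max(0, demand - idle_nodes)
    # Never grow beyond the configured cluster limit.
    headroom = max(0, max_nodes - (idle_nodes + busy_nodes))
    return min(shortfall, headroom)


if __name__ == "__main__":
    # Two queued jobs asking for 2 and 3 nodes, 1 idle node, 4 busy nodes.
    print(nodes_to_add([2, 3], idle_nodes=1, busy_nodes=4, max_nodes=10))
```

If the component that feeds this kind of decision stops reporting (for example, the scheduler hooks), the middleware can end up with a picture of demand that no longer matches what the PBS log shows.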
I also use OpenPBS with Azure for HPC. OpenPBS and CycleCloud work together to create VMs when PBS jobs are submitted.
You may want to check autoscale.json (it might be located at /opt/cycle/pbspro/autoscale.json) and look at the “idle_timeout” value. It should be a reasonable value in seconds, e.g. 300 for 5 minutes. After the idle timeout the node will shut down, and you will save money.
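A quick way to run that check is a small script like the one below. It is only a sketch: it assumes "idle_timeout" is a top-level key in the JSON file, and the helper name and the 1800-second warning threshold are arbitrary choices of mine, not anything defined by CycleCloud.

```python
import json


def check_idle_timeout(path, max_reasonable=1800):
    """Read autoscale.json and return a short message about idle_timeout.

    Assumes idle_timeout is a top-level key holding a number of seconds.
    max_reasonable is an arbitrary threshold chosen for this sketch.
    """
    with open(path) as f:
        config = json.load(f)

    value = config.get("idle_timeout")
    if value is None:
        return "idle_timeout is not set"
    if isinstance(value, bool) or not isinstance(value, (int, float)) or value <= 0:
        return f"idle_timeout has an unexpected value: {value!r}"
    if value > max_reasonable:
        return f"idle_timeout is {value} s, which looks high"
    return f"ok: idle_timeout is {value} s"


# Example (the path may differ on your installation):
# print(check_idle_timeout("/opt/cycle/pbspro/autoscale.json"))
```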