We are in the fortunate position of having a group that wants to give us money for HPC equipment and staff. However, we are being asked how much “spare capacity” we have. I can get my head around disk capacity, but how do you quantify “spare capacity” for an HPC? We have wait times for jobs at busy periods, but at other times the queue is empty. What are the metrics we need to record?
You may ask why I don’t ask our HPC staff! For six months I have been the “HPC staff” due to “budget reasons” (because “I know Linux”). I desperately want those new staff, and need to look like I know what I am doing. Once those staff are on board, I am out of there and back to server support.
As I understand it, spare capacity means free resources within the HPC that could be given to a department, a group of people, or a project for a period of time.
If you have long wait times and many jobs in the queue most of the time, then you do not have spare capacity. You would need to add resources (compute nodes, GPU nodes, disk space, networking, etc.) to meet the demand of the jobs, bring down the wait time of queued jobs, or finish projects on time.
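To put a number on that, here is a rough Python sketch of one way to do it: compare the core-hours your jobs actually occupied against the core-hours the cluster could have delivered in the same period, and look at the typical queue wait alongside it. The record layout (submit, start, end, cores) and the function name `capacity_summary` are my own assumptions for illustration; you would fill the records from whatever accounting data your scheduler keeps.

```python
from datetime import datetime
from statistics import median


def capacity_summary(jobs, total_cores, window_start, window_end):
    """Summarise utilization and queue wait for one reporting window.

    jobs is a list of (submit, start, end, cores) tuples. This layout is
    hypothetical; adapt it to your scheduler's accounting export.
    """
    window_hours = (window_end - window_start).total_seconds() / 3600
    available = total_cores * window_hours  # core-hours the cluster could deliver
    used = 0.0
    waits = []
    for submit, start, end, cores in jobs:
        # Count only the part of the job that falls inside the window.
        s = max(start, window_start)
        e = min(end, window_end)
        if e > s:
            used += cores * (e - s).total_seconds() / 3600
        # Queue wait for jobs submitted inside the window.
        if window_start <= submit < window_end:
            waits.append((start - submit).total_seconds() / 3600)
    return {
        "utilization": used / available,
        "spare_core_hours": available - used,
        "median_wait_hours": median(waits) if waits else 0.0,
    }


# Made-up example: a 100-core cluster over one day.
day_start = datetime(2024, 1, 1, 0, 0)
day_end = datetime(2024, 1, 2, 0, 0)
jobs = [
    (datetime(2024, 1, 1, 0, 0), datetime(2024, 1, 1, 0, 0), datetime(2024, 1, 1, 12, 0), 50),
    (datetime(2024, 1, 1, 10, 0), datetime(2024, 1, 1, 12, 0), datetime(2024, 1, 1, 18, 0), 100),
]
summary = capacity_summary(jobs, 100, day_start, day_end)
print(summary)
```

Run per day or per week, this shows both sides of the picture from the question: the busy periods show up as high utilization and long waits, the quiet periods as spare core-hours. An average over a long window can hide the busy periods, so it is probably worth reporting both.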
What are the metrics we need to record?
You need to find out:
used vs. requested cores, memory, walltime, and disk space
cores used and unused per node on a daily basis
whether you need all the resources of the HPC running all the time, or whether some can be turned off to save energy
any other data that would show whether you have spare capacity or an under-resourced HPC
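As a rough illustration of the "used vs. requested" point, the sketch below computes simple per-job efficiency ratios. The `JobUsage` fields are hypothetical names, not any particular scheduler's output, and it assumes every job ran for a non-zero time and requested non-zero resources.

```python
from dataclasses import dataclass


@dataclass
class JobUsage:
    # All field names are illustrative assumptions; map them to the
    # accounting fields your scheduler actually reports.
    cores_requested: int
    cpu_hours_used: float            # CPU time summed over all cores
    elapsed_hours: float             # wall-clock run time
    mem_requested_gb: float
    mem_peak_gb: float
    walltime_requested_hours: float


def efficiency(job):
    """Return used/requested ratios for one job (1.0 means fully used)."""
    return {
        "cpu": job.cpu_hours_used / (job.cores_requested * job.elapsed_hours),
        "memory": job.mem_peak_gb / job.mem_requested_gb,
        "walltime": job.elapsed_hours / job.walltime_requested_hours,
    }


# Made-up job: asked for 16 cores, 64 GB, and 8 hours; ran for 4 hours.
example = JobUsage(
    cores_requested=16,
    cpu_hours_used=32.0,
    elapsed_hours=4.0,
    mem_requested_gb=64.0,
    mem_peak_gb=16.0,
    walltime_requested_hours=8.0,
)
ratios = efficiency(example)
print(ratios)
```

Consistently low ratios suggest that users are over-requesting, which means some of the apparent demand could be met by better-sized jobs rather than by new hardware.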