PP-824: Use cases for Cray power ramp rate limiting

Hello,

This topic is to inform the community that a Use Cases and Requirements document for power ramp rate limiting on Cray systems via PBS is now available at:

https://pbspro.atlassian.net/wiki/spaces/PD/pages/53080870/PP-824%3A+Cray+-+Ramp+rate+limiting

Please review and post your comments here.

Thanks,
Ashwath

Thanks for the UCR, Sam. I have a few comments on the requirements.

  1. There shall be a switch to allow an admin to enable/disable ramp rate limiting.
    Q. Where do we want this switch: on the server or on the node? The server already has power_provisioning. Can that be used as the main switch, with each node getting a new ramp rate limiting switch?

  2. There shall be a minimum C-state of C-1 set. A core’s power needs can go up and down, but the core shall normally drop only to C-1.
    Q. Should the admin be able to change this value? Where should it be set: on the server, which means the same value for the entire cluster, or on individual nodes?

  3. If PBS determines that a node/set of nodes will be idle for a time PBS shall step the node(s) down to C-6.
    Q. I presume the idle time is defined by the admin here, right? That is, idle time is how long it has been since a job or reservation last ran on that node.

  4. When PBS determines that a node/set of nodes will be needed for a job PBS shall step the node(s) back up to C-1.

  5. PBS shall step the nodes between the minimum C-state and C-6 in random sleep intervals.
    Q. Do we need a switch to choose between fixed-interval and random-interval sleeps?
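
To make requirements 3 and 4 concrete, here is a minimal sketch of the decision they describe. It is not PBS code: the function name, the parameters, and the admin-defined idle threshold are all hypothetical, and where the threshold would actually live is exactly the open question above.

```python
def target_cstate(needed_for_job, expected_idle_seconds, idle_threshold_seconds,
                  min_cstate="C-1", deep_cstate="C-6"):
    """Pick the C-state a node should be stepped toward.

    All names here are illustrative assumptions, not PBS attributes.
    """
    # Requirement 4: a node needed for a job goes back up to the minimum C-state.
    if needed_for_job:
        return min_cstate
    # Requirement 3: a node expected to stay idle long enough steps down to C-6.
    if expected_idle_seconds >= idle_threshold_seconds:
        return deep_cstate
    # Otherwise the node stays at the minimum C-state (requirement 2).
    return min_cstate
```

For example, with a hypothetical 600-second threshold, a node expected to be idle for 900 seconds would be stepped toward C-6, while one expected to be idle for 60 seconds would stay at C-1.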

Thanks,
Ashwath

  1. These are different types of capabilities. The power awareness features attempt to save power. The ramp rate limiting feature attempts to smooth out power changes at the expense of potentially using more power. I believe there should be a separate switch for ramp rate limiting on the server.

  2. I believe for the first release we should just stick to C-1 (see also the limitations of KNL power states). C-1 was suggested by Cray during the discussion of the “simplified” approach.

  3. I’d pictured something similar to what we do for the current on/off functionality.

4./5. I do not believe we need a switch. The presentation made it clear (to me) that, for Domain #1 changes, “in-sync” steps made for prettier graphs, but “random” steps were better for the overall goal of reducing spikes in energy usage:

  • Domain #1: Below min power cap

    • Solution: C-State limiting
      • Performed before a real user application is running
      • Perhaps as early as system boot time
  • Domain #2: Between min power cap and pinned C0

    • Solution: C0 pinned and sliding power caps
      • Performed before a real user application is running
      • Note: KNL has no 2nd domain
  • Domain #3: Above pinned C0

    • Solution: C0 pinned, sliding power caps [, and artificial workload]
      • A real user application may or may not be running
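
As a rough illustration of how the three domains partition the power range, here is a small sketch. The function name, the watt-based parameters, and the handling of values that fall exactly on a boundary are my assumptions; the presentation only names the domains.

```python
def classify_domain(power_watts, min_power_cap_watts, pinned_c0_watts):
    """Return 1, 2, or 3 for the ramp rate limiting domain of a power level.

    Thresholds and boundary handling are illustrative assumptions.
    """
    if min_power_cap_watts > pinned_c0_watts:
        raise ValueError("min power cap must not exceed pinned-C0 power")
    if power_watts < min_power_cap_watts:
        return 1  # Domain #1: below min power cap (C-state limiting)
    if power_watts < pinned_c0_watts:
        return 2  # Domain #2: between min power cap and pinned C0
    return 3      # Domain #3: at or above pinned C0


# One possible way to model a platform with no second domain (such as KNL)
# is to pass equal thresholds, which leaves Domain #2 empty.
```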

Domain #1 Solution

  • Simplest solution is stepping all nodes through in sync
    • Smoother curves can be formed by
      • Stepping subsets of nodes through the same steps
      • Randomizing sleep intervals between each step
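
The two smoothing ideas above (stepping subsets of nodes, with randomized sleeps between steps) can be sketched as a schedule builder. This is only a sketch under stated assumptions: the function name, the subset size, the sleep bounds, and the intermediate C-state in the usage example are hypothetical, and nothing here talks to PBS or to the Cray power interfaces.

```python
import random


def build_step_schedule(nodes, cstates, subset_size, min_sleep, max_sleep, seed=None):
    """Return (time_offset_seconds, node_subset, cstate) events sorted by time.

    Each subset walks through the same C-state steps, but with its own
    randomized sleep before every step, so subsets drift out of sync.
    """
    rng = random.Random(seed)
    # Split the node list into fixed-size subsets (the last one may be smaller).
    subsets = [nodes[i:i + subset_size] for i in range(0, len(nodes), subset_size)]
    events = []
    for subset in subsets:
        elapsed = 0.0
        for cstate in cstates:
            # Random sleep interval before this subset takes its next step.
            elapsed += rng.uniform(min_sleep, max_sleep)
            events.append((elapsed, tuple(subset), cstate))
    # Merge all subsets into one timeline.
    events.sort(key=lambda event: event[0])
    return events


# Hypothetical usage: five nodes, stepped up from C-6 to C-1 in pairs.
schedule = build_step_schedule(
    ["nid1", "nid2", "nid3", "nid4", "nid5"],
    ["C-6", "C-3", "C-1"],  # intermediate state chosen only for illustration
    subset_size=2, min_sleep=1.0, max_sleep=5.0, seed=42,
)
```

Stepping all nodes in sync would correspond to a single subset containing every node and a fixed sleep (min_sleep equal to max_sleep).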

Further comment: Power ramp rate limiting/band management is an all-or-nothing prospect. We’re trying to prevent power spikes on the system. Having to also set a node-level switch to turn it on is redundant. I’d propose that there be a server switch that enables the feature everywhere. An enhancement might be to add a flag on a node that says “don’t limit C-states on this node”, but I don’t expect there to be much need for it. Perhaps include it in the design and get feedback from Cray.
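
To show how the proposed server switch and the optional per-node opt-out flag would combine, here is a minimal sketch. Both names are hypothetical placeholders for illustration, not existing or proposed PBS attributes.

```python
def nodes_to_limit(server_switch_on, node_exempt_flags):
    """Return the sorted names of nodes that ramp rate limiting applies to.

    server_switch_on:  hypothetical server-wide enable switch.
    node_exempt_flags: maps node name -> True if the node carries the
                       hypothetical "don't limit C-states on this node" flag.
    """
    # With the server switch off, the feature is off everywhere.
    if not server_switch_on:
        return []
    # With it on, every node is limited unless it is explicitly exempted.
    return sorted(name for name, exempt in node_exempt_flags.items() if not exempt)
```

The point of this shape is that a node never has to be switched on individually; the per-node flag can only opt a node out.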