External design document for PP-824: Cray - Ramp rate limiting


Here is the design document for the power ramp rate limiting feature on the Cray platform:

Refer to the UCR here: https://pbspro.atlassian.net/wiki/spaces/PD/pages/53126214/Ramp+Rate+Limiting+Use+Cases+and+Requirements

Please review and let me know how this EDD looks.


Thanks for posting this. Regarding the node attribute “ND_ATR_last_busy_time”: the word “busy” has a specific meaning in PBS (“busy” and “job-busy” are well defined node states), and “busy” as it is used here does not always mean the same thing (assuming that this attribute gets updated when a node leaves state “job-exclusive” or “resv-exclusive”, for example). What about calling this “ND_ATR_last_used_time” instead?

Also, the current EDD says “This new node attribute will be updated with time stamp at the end of job or reservation”, but what if there is more than one job on a node? I realize that current Cray XC vnodes are exclusively allocated, but this attribute may be useful in other scenarios and it would be good to clarify in the EDD whether that means “updated when ANY JOB ends” or “updated when THE LAST JOB ON THE NODE ends, leaving the vnode completely empty” (same for reservations, of course).

The data format of this attribute should be specified as well (seconds since epoch?).

Finally, assuming this is indeed seconds since epoch, it would be useful to be able to use this value in a node_sort_key, but this is planned to be an attribute rather than a read-only resource and so not available as a node_sort_key, correct?

I kept this name from the perspective of the node being busy with jobs or reservations. I do not see much difference between “busy” and “used” here. This value gets updated every time a job or reservation ends. At least in this feature, the value is just a factor used to compute whether the node is idle or not. Do you have an example or use case where this name could cause confusion?

I have updated the EDD to say attribute gets updated for ANY job or reservation.
Value is stored as seconds since epoch, but for readability I have made it display as MON DD YYYY HH:MM:SS in pbsnodes output.
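For illustration only (not part of the EDD), here is a minimal Python sketch of the epoch-to-display conversion described above; the helper name and the use of UTC are assumptions, since pbsnodes presumably formats in local time:

```python
import time

def format_last_used_time(epoch_seconds):
    """Render seconds-since-epoch as 'MON DD YYYY HH:MM:SS'.

    Hypothetical helper; gmtime (UTC) is used here so the example is
    deterministic, whereas the real display would likely be local time.
    """
    return time.strftime("%b %d %Y %H:%M:%S", time.gmtime(epoch_seconds))

print(format_last_used_time(0))  # Jan 01 1970 00:00:00
```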

Yes. If you feel such a use case could be useful, we can think about making this attribute a resource.

Thanks for the feedback. Let me know if you have more comments.

The case where I worry that busy vs. used might be confusing is that the new “last busy time” attribute will be updated even in cases where the node was never actually “busy” (nor “job-busy”, etc.). I still feel that “used” is less confusing.

Thanks for updating the format of the attribute, what you have is great.

As for making it a resource, I don’t actually think it SHOULD be a resource; an attribute is much more fitting. But it would be a good addition to node_sort_key (one which we have implemented using hooks and custom resources at customer sites in the past, though it does not scale well on large systems with lots of job turnover, as the hook needed to contact the server to update the node’s resource). What about special-casing it for node_sort_key, just like we currently special-case the node attribute “Priority” to be exposed as “sort_priority” in node_sort_key (see find_node_amount() in src/scheduler/sort.c and under “/* resource names for sorting special cases */” in src/scheduler/config.h)?
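For concreteness, if the attribute were special-cased, it might then be usable in sched_config roughly like this (the key name “last_used_time” and sort direction are assumptions, mirroring how “sort_priority” exposes the Priority attribute):

```
# sched_config -- hypothetical sketch, assuming last_used_time is
# special-cased for node_sort_key like sort_priority is:
# prefer the nodes that have been idle longest (smallest timestamp first)
node_sort_key: "last_used_time LOW"
```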

I am not purposefully trying to introduce scope creep here :slight_smile:, but I think it is worth discussing at the same time that we are introducing this new interface.

When a new vnode is added and has not yet run any jobs or reservations, what will the value of this time stamp be? I only mean when a vnode newly appears in the node list since I assume that this time stamp data is persistent between mom and server restarts. The obvious choice to me is the time at which the pbs_mom (or “first pbs_mom” in the case of a Cray system not using vnode_pool) first starts reporting the vnode to the server as “free”. I think the behavior should be explicitly stated in the EDD.

Finally, on this same interface, we should make sure that this new time stamp interface gets updated when vnodes are released early from a running job (recently implemented PP-339 and PP-647). The current wording is “end of any job”, which could be interpreted in 2 ways since now a job can “end” in an important sense on a sister node well before the job actually “ends” completely.

Questions / Comments on other parts of the EDD aside from interface 5:

Interface 8: the summary leads me to believe that this is to introduce an entirely new PBS_ hook, not to add a new event trigger to an existing hook. Even though I can understand that the fact that it uses the same existing script (with added scripting, of course) may be an implementation detail, I still find it confusing. What about “Interface 8: Server periodic action added to PBS hook PBS_Power” for a summary?

Interface 9: I think this interface is trying to do 2 things: introduce a new hook event (power_provisioning) AND say that the existing PBS_power hook will be modified to utilize this new event. If that understanding is correct, then it needs to be broken up into 2 interfaces: one similar to interface 8 that says “power_provisioning action added to PBS hook PBS_Power”, and another that talks only about the new hook event.

On the topic of the above 2 points: is it proper to actually describe things in this way in the EDD? Isn’t it an implementation detail (internal design) that ramping the nodes up and down is accomplished with the PBS_power hook by adding new periodic actions to the hook? Saying something like “at most $SVR_ATR_max_ramprate_limit nodes will be ramped down every $freq seconds” and “at most $SVR_ATR_max_ramprate_limit nodes will be ramped up in anticipation of use” (with more detail, of course) more accurately describes the external behavior that can be relied on.
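To illustrate that externally visible behavior (not the actual hook implementation), a rate-limited per-cycle selection could look like the sketch below; all names here are hypothetical:

```python
def pick_nodes_to_ramp_down(idle_nodes, last_used_time, now,
                            idle_threshold, max_ramprate_limit):
    """Pick at most max_ramprate_limit nodes that have been idle for
    at least idle_threshold seconds, longest-idle first.

    Hypothetical sketch of the behavior discussed above; the real
    logic would live in the PBS_power hook / server.
    """
    candidates = [n for n in idle_nodes
                  if now - last_used_time[n] >= idle_threshold]
    # Ramp down the longest-idle nodes first, capped per cycle.
    candidates.sort(key=lambda n: last_used_time[n])
    return candidates[:max_ramprate_limit]
```

Each periodic cycle would then ramp down at most `max_ramprate_limit` nodes, which is exactly the external guarantee suggested above, independent of how the hook implements it.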

Interface 9: The interface for adding the new hook event itself needs to include more detail on what various PBS attributes/objects are available within the hook event, caveats of use, when it is triggered, etc. (like in the “PP-425 to PP-434” EDD, for example).

Interface 10: line 3 in the table should say “Nodes are being ramped up”, not “down”.

I have addressed your comments. Please have another look at the document and let me know what you think.

Thank you! I have a few more comments on the current version that were not addressed from the comments above:

  1. Please add a new interface for exposing last_used_time as a node_sort_key.

  2. Please specify a value for last_used_time for a newly added vnode that has never had a job/reservation end (suggested value found in a comment above).

  3. For consistency and simplicity, please move the details from interface 9 into interface 7 (which follows the model of interface 6 that states which hook event is being utilized to achieve the functionality, but is not actually a separate interface). Interface 9 can then be removed.


Thanks for the comments. Hope I have addressed all the comments this time.

last_used_time gets updated for a new node only when ND_ATR_power_ramprate_enable is set, not when the node is created.


Also posted this comment on UCR discussion:
Power ramp rate limiting/band management is an all or nothing prospect. We’re trying to prevent power spikes on the system. Having to also set a node level switch to turn it on is redundant. I’d propose that there be a server switch that enables the feature everywhere. An enhancement might be that we add a flag on a node that says “don’t limit c-states on this node” but I don’t expect there to be much need for it. Perhaps include it in the design and get feedback from Cray.

"An enhancement might be that we add a flag on a node that says “don’t limit c-states on this node” but I don’t expect there to be much need for it. "

How is the above statement different from having a switch at the node level? My idea of having this switch is to give the admin control over deciding whether or not to limit c-states on that node. What are your thoughts on this?

Reading your initial words I am leaning towards not having node level switch. It makes more sense to have control at system level than node level.

The difference is that there is nothing the admin would need to “enable” on each node. It would only be necessary as an exception, so much less of a burden on an admin. As you say it is just a “nice to have”, not a show stopper.

I have removed the interface pertaining to node level switch for having ramp rate limit. Please have a look.

Please send more comments on this EDD if you have any. If I do not get any more comments before 14th November, I will freeze this EDD and consider it stable.

FYI @smgoosen @jon


Looks good overall. I have a few suggestions/questions.

In interface 6 you are creating a new node state called “sleeping”. I don’t believe that any other states use the “-ing” form. Would it make sense to say “asleep” or “inactive”?

In interface 7 you define rampup. Should this be changed to ramp-up to match the other states (i.e. job-busy, state-unknown)?

You have defined an interface for ramp-up but I don’t see one for ramp-down. Should this be added so that we can tell when a node is being ramped down?

In interface 8 you define a new hook event called power_provision. Is this required? Can we use a server periodic hook event for this?

In interface 9 you define some log messages. For section 2, what would the logs look like? For example, would I see a single line in the server logs at the log info level?

Cray: init
Cray: connect
Cray: ramping up the node
Job;power_ramp_up;launch: /opt/cray/capmc/default/bin/capmc set_sleep_state_limit --nids 24-25 --limit 0
Job;power_ramp_up;launch: finished
Cray: disconnect

If this is the case, I would suggest that the lines that have “Cray:” be at debug 3 level for all of your log messages in sections 2 and 3.

In section 4 I would change the log messages to

power_ramp_limit: nodes to ramp up: <node_list>
power_ramp_limit: nodes to ramp down: <node_list>

Thank you for your comments. I addressed a few of them; please find my answers below on the remaining ones.

My idea was to ramp down the nodes periodically through a server periodic hook, and since this hook would be responsible for identifying the right nodes and putting them to sleep, we did not need another intermediate state. The original idea was to reduce load on the server. But as we discussed offline, while the server tries to put these nodes to sleep, there is a chance that the scheduler could schedule jobs on them, and we might end up with a race condition. It appears to me we will need to share some of this work with the server: the server will identify the nodes to ramp down and mark them with an intermediate node state, and the periodic hook will do the rest of the job.
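As a sketch of that handoff (all state names and functions here are hypothetical), the idea is that the hook only touches nodes the server has already marked, so the scheduler never races with a node that is about to sleep:

```python
# Hypothetical two-phase ramp-down handoff, as discussed above.
FREE, RAMPING_DOWN, ASLEEP = "free", "ramping-down", "asleep"

def server_mark_for_ramp_down(node_states, candidates):
    """Phase 1 (server): mark idle candidate nodes with an intermediate
    state so the scheduler stops placing jobs on them."""
    for n in candidates:
        if node_states[n] == FREE:
            node_states[n] = RAMPING_DOWN

def hook_complete_ramp_down(node_states):
    """Phase 2 (periodic hook): put only the pre-marked nodes to sleep."""
    for n, state in node_states.items():
        if state == RAMPING_DOWN:
            node_states[n] = ASLEEP
```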

About interface 8: we will need the power_provision hook to ramp up the nodes. This hook works similarly to our provision hook, running just before the job starts. As we discussed earlier, on Cray the MoM doesn’t go down, and it’s really the compute nodes that we provision. So I guess we can consider having this work done by a launch or prologue hook. This would avoid having another provisioning-type hook.

Share your thoughts on this.


@ashwathraop, can we not use the generic provisioning hook to accomplish the work the power_provision hook is doing (i.e. ramp up)? Also, it seems like the ramp down should be quite possible via the server periodic hook.

I looked at the design doc, and here are my comments:

  • Drop the code-specific “SVR_ATR_” and “ND_ATR_” prefixes, as this is an external interface
    document that will be seen by users.
  • Under interface 1, specify Python type, which in this case will be “bool”.
    These attributes can be read via the pbs.server() interface (e.g.
    s=pbs.server(); print s.power_ramprate_enable).
  • Interfaces 2 and 3 are missing the PBS type and Python type, likely
    a PBS “long” and a Python int.
  • Under interface 3 (ramprate_limit), what does it mean “to drop to C-6”?
  • Under interface 5 (last_used_time) specify the PBS and Python type.
  • Under interface 6 (asleep), the phrase “nodes up when required to run jobs or reservations.” should be “nodes up when required to run jobs or for reservations.” Also, when it says “A server periodic hook runs every $freq seconds and takes list of vnodes to power ramp down the nodes and marks them in new asleep node state.”, will such a periodic hook be provided as part of this RFE?
  • Now, if you go with a new hook event, under interface 7 (new power_provision hook event), what are the new hook parameters, along the lines of pbs.event().<parameter_name>? Perhaps vnode, as it says “Hook will have access to name of vnode to be provisioned.” How about a list of
    vnodes that are currently “asleep” or ramped up? Also, need to specify the
    Python event constant, likely pbs.HOOK_EVENT_POWER_PROVISION.
  • Be sure to include some examples.

Yeah. Allowing multiple provisioning hooks should avoid the introduction of a similar hook.

For ramp-down I am considering only a server periodic hook. But we have to decide how and where we identify nodes to ramp down. Will it be done entirely in the hook, or do we need to identify them before sending to the hook (maybe at the server or scheduler)?

Hello Al, thank you for your suggestions. Please see my replies inline:

Let me know if you have more comments.

Looks good so far. Thanks for making the changes.

Hi Ashwath,

In the last_used_time interface, you mention power_ramprate_enable as a node attribute, but I think power_ramprate_enable is only available at the server level. So you might want to change the behaviour definition of last_used_time for nodes.