PP-735 PBSPro Power Awareness

Hello,

This topic is to inform the community that the design document for the Power Awareness functionality within PBS is now available at:
https://pbspro.atlassian.net/wiki/display/PD/PBSPro+Power+Awareness

Please review and post your comments here.

Thanks,
Ashwath

Hi Ashwath

Please find some queries on the EDD below.

On A. Interface changes

Interface #3

  1. How frequently will the resources_used.energy value be updated in qstat -f output?
    Does updating the resources_used.energy value depend on a hook?

  2. Will the unit also be visible in qstat -f output along with the resources_used.energy value?
    For example, in qstat -f output:
    resources_used.energy = 64.2
    or
    resources_used.energy = 64.2 kWh

  3. If a job gets requeued, will this new resources_used value be recorded in the ‘R’ record of the accounting log?

Interface #5

  1. Could you please provide an example of how the scheduler can run a job requesting an eoe on vnodes with a current_eoe value that matches the job's eoe?

Interface #15
The message given in the example looks more like information than a LOG_WARNING:
11/19/2014 17:24:18;0006;pbs_python;Hook;pbs_python; 21.bigcray;launch: finished in 156 seconds

Is the objective to log this as LOG_WARNING or LOG_INFO?

Interface #48

  1. For any Python exception, will it log the message below?
    socket.error: [Errno 111] Connection refused

C. User’s instructions

  1. Monitor power usage of a job.
    Use qstat to see the resources_used.energy value as the job runs.

Once the job has finished, will it be seen in qstat -fx output if job history is enabled?

Regards,
Lovely

On A. Interface changes

Interface #3

How frequently will the resources_used.energy value be updated in qstat -f output?
Does updating the resources_used.energy value depend on a hook?

Ans: On every periodic hook run, which by default happens every 300 seconds. Yes, the value is updated by the hook itself.

Will the unit also be visible in qstat -f output along with the resources_used.energy value?
For example, in qstat -f output:
resources_used.energy = 64.2
or
resources_used.energy = 64.2 kWh

Ans: The unit is not displayed in either qstat -f output or the accounting logs.

If a job gets requeued, will this new resources_used value be recorded in the 'R' record of the accounting log?

Ans: Yes

Interface #5

  1. Could you please provide an example of how the scheduler can run a job requesting an eoe on vnodes with a current_eoe value that matches the job's eoe?
    Ans: An example is given in the User instructions.
    qsub -leoe=low -lncpus=20 lackadaisical.sh
    qsub -lselect=4:eoe=high:ncpus=8 zoomjob
    If current_eoe is already set to the “low” profile, PMI won’t activate the profile again. Otherwise it will activate the requested profile and set current_eoe to it, provided, of course, that the profile is in the list of available eoe values.
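
The activation rule described in this answer can be sketched roughly as follows. This is only an illustration, not the PBS implementation: the vnode is modelled as a plain dict, and the function name `activate_eoe` and the key `available_eoe` are hypothetical names chosen for the example.

```python
def activate_eoe(vnode, requested_eoe):
    """Return True if a profile switch was needed, False if it was skipped.

    `vnode` is a plain dict standing in for a vnode (hypothetical shape):
        {"available_eoe": ["low", "high"], "current_eoe": "low"}
    """
    # The requested profile must be one of the profiles the vnode offers.
    if requested_eoe not in vnode.get("available_eoe", []):
        raise ValueError("eoe %r is not available on this vnode" % requested_eoe)

    # Profile already active: nothing to do, so skip the activation.
    if vnode.get("current_eoe") == requested_eoe:
        return False

    # Otherwise activate the requested profile and record it as current.
    vnode["current_eoe"] = requested_eoe
    return True
```

With a vnode whose current_eoe is already "low", a job requesting eoe=low causes no activation, while a job requesting eoe=high switches the profile and updates current_eoe.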

Interface #15
The message given in the example looks more like information than a LOG_WARNING:
11/19/2014 17:24:18;0006;pbs_python;Hook;pbs_python; 21.bigcray;launch: finished in 156 seconds

Is the objective to log this as LOG_WARNING or LOG_INFO?
Ans: It is a warning. Launch makes a call to capmc, and that call shouldn’t take that long.
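
To illustrate why a slow call ends up as a warning rather than plain information, here is a minimal sketch. The threshold value, the function name `timed_launch` and the level strings are assumptions made for this example; the EDD does not state the actual threshold.

```python
import time

# Hypothetical threshold; the real value is not given in this thread.
WARN_AFTER_SECONDS = 30.0


def timed_launch(launch, clock=time.monotonic):
    """Run `launch` and return (result, level, message).

    The level is LOG_WARNING when the call took longer than the
    threshold, and LOG_DEBUG otherwise.
    """
    start = clock()
    result = launch()
    elapsed = clock() - start
    level = "LOG_WARNING" if elapsed > WARN_AFTER_SECONDS else "LOG_DEBUG"
    return result, level, "launch: finished in %d seconds" % elapsed
```

The `clock` parameter exists only so the timing can be replaced in a test; a hook would use the default.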

Interface #48

  1. For any Python exception, will it log the message below?
    socket.error: [Errno 111] Connection refused
    Ans: Not the same message every time. The hook will log whatever system or internal error it captures. In the above example, it is a socket error with a “Connection refused” message.
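
A minimal sketch of that pattern, assuming a generic `log` callable rather than the real hook logging call (the name `run_and_log` is hypothetical): the text that gets logged is simply the exception type followed by the exception's own message, so it varies with the error.

```python
def run_and_log(operation, log):
    """Call `operation`; on any exception, log its type and message.

    Returns the operation's result, or None if it raised.
    """
    try:
        return operation()
    except Exception as exc:
        # The logged text depends entirely on which error was raised.
        log("%s: %s" % (type(exc).__name__, exc))
        return None
```

Note that under Python 3, socket.error is an alias of OSError, so a refused connection is reported with the class name ConnectionRefusedError rather than the "socket.error" prefix shown in the example above.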

C. User’s instructions

Monitor power usage of a job.
Use qstat to see the resources_used.energy value as the job runs. 

Once the job has finished, will it be seen in qstat -fx output if job history is enabled?
Ans: Yes

Thanks,
Ashwath

Hi Ashwath,

My initial set of comments -

General -

  1. The documents are lengthy, so I would suggest that we map the use cases with the requirements and interfaces, it will help in understanding the feature in a better way.

UCR -

  1. It is not mentioned if the power usage information will be seen in tracejob as well.
  2. As you mention that the power calculations can be assumed to be precise only when the node sharing is set to “exclusive”, shouldn’t this be a mandatory step if power provisioning is enabled? I would like to see that all nodes and vnodes are automatically set to be exclusively used by a job/reservation when power provisioning is enabled. What do you think?
  3. Name the “Use Cases and Requirements”.

External Design -

  1. A.1.d - All the APIs should have the information about the parameters. We mention the names of the parameters in the explanation, but not while providing the name of the API.
  2. A.1.d.i - We need not have the hosts parameter at all, as it can be derived from the job parameter.
  3. A.1.d.i.3,4 - Let us elaborate on “where it is appropriate”.
  4. A.1.d.iv - Why do we need the parameter if it can have only one value?
  5. A.3.d.iii - What will be the value of energy if the operations are not allowed on the vnodes used by a job?
  6. A.4.d.vi - rephrasing needed to make it clear.
  7. A.4.d.xii - “prempt” should be “preempt”.
  8. A.5.d.iii - “unsupported” should be “not allowed”.
  9. A.5.d.vi - Why do we unset the eoe?
  10. A.10.d.i - “will have a default of unset” should be “will be unset by default”.
  11. A.10.d.ii - “power_provisioning” should be “power_enable”.
  12. A.10.d.iii - “set True” should be “set to True” and “set False” should be “set to False”.
  13. B.1.d.i - is not clear
  14. B.1.e - should have an explanation similar to B.1.d.i
  15. B.1.g - What if the eoe values are not reported the second time as well?
  16. B.2.a - Why so?
  17. B.4.a - “set True” should be “set to True”.

I have not gone through the new logs thoroughly.

Thanks,
Prakash

Thank you Prakash for the comments. Please find the replies inline below.

My initial set of comments -

General -
1) The documents are lengthy, so I would suggest that we map the use cases with the requirements and interfaces, it will help in understanding the feature in a better way.

I have updated the document with a traceability matrix that maps the UCR to the interfaces.

UCR -
1) It is not mentioned if the power usage information will be seen in tracejob as well.
Since we update accounting and server logs, power usage will be visible in tracejob output.
2) As you mention that the power calculations can be assumed to be precise only when the node sharing is set to “exclusive”, shouldn’t this be a mandatory step if power provisioning is enabled? I would like to see that all nodes and vnodes are automatically set to be exclusively used by a job/reservation when power provisioning is enabled. What do you think?
On Cray, nodes are set to exclusive by default. I believe not making this mandatory will give admins more control.
@smgoosen can you answer these questions?

3) Name the “Use Cases and Requirements”.
Fixed it.

External Design -
1) A.1.d - All the APIs should have the information about the parameters. We mention the names of the parameters in the explanation, but not while providing the name of the API.
Fixed it.

2) A.1.d.i - We need not have the hosts parameter at all, as it can be derived from the job parameter.
Fixed it.

3) A.1.d.i.3,4 - Let us elaborate on “where it is appropriate”.
Since we are working with tools and interfaces external to PBS, over which PBS has no control, we cannot say exactly when and where an error can occur. Hence it is written “where it is appropriate”.

4) A.1.d.iv - Why do we need the parameter if it can have only one value?
Query is a generic request. If we later have to extend the feature to make other requests to the vendor power interface, this is the best way to do it. Hence we keep the argument.

5) A.3.d.iii - What will be the value of energy if the operations are not allowed on the vnodes used by a job?
The job's energy attribute won’t be seen in that case.

6) A.4.d.vi - rephrasing needed to make it clear.
Done.

7) A.4.d.xii - “prempt” should be “preempt”.
Done.

8) A.5.d.iii - “unsupported” should be “not allowed”.
Done.

9) A.5.d.vi - Why do we unset the eoe?
Since PBS changed the node state, it is good to reset the node once the job has finished.

10) A.10.d.i - “will have a default of unset” should be “will be unset by default”.
Done.

11) A.10.d.ii - “power_provisioning” should be “power_enable”.
Done.

12) A.10.d.iii - “set True” should be “set to True” and “set False” should be “set to False”.
Done.

13) B.1.d.i - is not clear
The hooks check whether the power provisioning flags are enabled before doing any power-related operations on the node. If the flags are disabled while a job is running, power profiles may not be deactivated and energy may not be updated.

14) B.1.e - should have an explanation similar to B.1.d.i
Done.

15) B.1.g - What if the eoe values are not reported the second time as well?
Checking the MoM logs with debug level enabled for both MoM and the hook would be a good start.

16) B.2.a - Why so?
As per admin guide,
Prologue and Epilogue Limitations and Caveats
•The prologue cannot be used to modify the job environment or to change limits on the job.
•If any execjob_prologue hooks exist, they are run, and the prologue is not run.
•If any execjob_epilogue hooks exist, they are run, and the epilogue is not run.

17) B.4.a - “set True” should be “set to True”.
Done.

I have not gone through the new logs thoroughly.

Let me know if you have any more questions.

I still see the hosts parameter.

Shall we update the EDD to mention this?

Rest looks good.

Thanks,
Prakash

I am not sure why you see this. Could you please refresh the page and check again?

Done.

Hi Ashwath,

The changes made in v.22 look good to me.

Regards,
Lovely

@ashwathraop To answer your question from 5/8 - I agree that making it non-mandatory gives more flexibility. There will likely come a time when the power utilities (e.g. capmc or pprs) will be able to capture per job power usage and so we will be able to supply accurate power usage even while sharing nodes.

@ashwathraop, the EDD looks good to me now. Thanks for making the changes.

Quite a complex enhancement – thanks for the detailed design. A few questions/suggestions (referring to v.22 of the External Design at https://pbspro.atlassian.net/wiki/pages/viewpage.action?pageId=51024324):

  • Please update the name of the page on https://pbspro.atlassian.net/wiki/display/PD/Project+Documentation to start with the JIRA ID, e.g., PP-735 PBS Pro …

  • I suggest removing the name/prefix “pmi”, as it will likely be confusing in the near future – there is a separate, well-known project called PMIx (https://github.com/pmix), and PBS Pro is likely to start supporting PMIx in the future (see PP-316). Is it even needed since the module already has the name pbs.Power (e.g., pbs.Power.connect instead of pbs.Power.pmi_connect, …)?

  • I’m not sure how to say this best … the design has two parts: low-level implementation (info on the pbs.Power module that is used to implement power management in A.1 and A.11) and PBS Pro level design (the rest). I almost feel like it would be best to remove A.1 and A.11 (or put them in a section called “internal design”). Since PBS Pro is open source, the pbs.Power interface is certainly available for others to see and augment (as an internal interface)… I’m just not sure it’s worth the effort to “officially support” it as an external interface (with all the usual testing, docs, backward compatibility requirements, etc.).

  • For turning on/off this capability A.2 uses the name “power_provisioning” and A.10 uses the name “power_enable”. Why not use the same name, since it’s the same capability that is being turned on/off at the server/node level? Also, should “changeable by an administrator” be “manager” (in both places)?

Thanks!

Thank you for the comments @billnitzberg. I have updated the document with your suggestions and the name is prefixed with the Jira ID too.

I can understand about the possible confusion when PMIx is introduced. Removed pmi prefix from all routine names.

Exposing how to add new vendor interfaces to power management is one of the requirements, so I kept A.1 and A.11 as part of the EDD. But as you said, with the code open sourced, it doesn’t seem to be necessary. I moved these two sections to the bottom of the page under a new section named Internal design.

Done. Now both the server and the node have power_provisioning as the switch to turn the feature on or off.

Thanks,
Ashwath

Regarding power_provisioning…

Thanks @ashwathraop!

Oh! missed that one earlier. Fixed it now. Thanks.

@smgoosen I have now fixed the SGI/HPE naming conventions in the latest version of the document (both UCR and EDD). Please have a look.

Thanks for updating the SGI/HPE naming. The doc looks good

Updated the power EDD to show the changed hook order for pbshooks. Please review.

What is the reason to add the ability for an administrator to change PBS Hooks order?

The idea behind PBS hooks (versus site hooks) is to allow core developers to more efficiently implement and deploy core code, by leveraging Python and the plugin framework, versus forcing all core changes to be done at the C language level. Besides the ability to enable/disable PBS hooks, all other controls should use existing PBS Pro configuration mechanisms (e.g., qmgr settings for non-hook objects). The idea of having a PBS hook order is fine as an internal option (set at compile time or packaging time); the idea that it is an external interface that the administrator can change breaks the idea that PBS hooks are purely internal.

But, like all design choices… it could be a trade-off and depends on the goal. Can you provide more background on the goal and why exposing this as an admin option is desirable?

Thx!

Along the same lines as Bill’s concern: by giving the admin the ability to modify the order of PBS hooks, we are restricting ourselves. Right now we can enforce that one hook runs before another. If we create two hooks and the output of one is the input to the other, we can make sure they run in the right order. If we allow admins to change the order of PBS hooks, we can no longer be sure those two hooks will run in the right order. It could force us to write more complex hooks.

An example is the Cray translate hook. It takes a job submitted in the old Cray language and converts it to the new select/place language. This hook really should run first, because any other hook would need the resources to be translated before it can work with them.
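
The ordering dependency can be sketched with two toy hooks. Everything here is hypothetical and only for illustration: hooks are plain dicts with an `order` value, and `translate` and `consumer` are stand-ins, not the real Cray translate hook or any PBS API.

```python
def run_hooks(hooks, job):
    """Run hooks in ascending 'order'; each hook takes and returns a job dict."""
    for hook in sorted(hooks, key=lambda h: h["order"]):
        job = hook["run"](job)
    return job


def translate(job):
    # Stand-in for a translation hook: turn an old-style request
    # into a 'select' entry.
    job = dict(job)
    if "old_request" in job:
        job["select"] = "translated(%s)" % job.pop("old_request")
    return job


def consumer(job):
    # Stand-in for any later hook that relies on 'select' being present.
    job = dict(job)
    job["saw_select"] = "select" in job
    return job


hooks = [
    {"name": "translate", "order": 1, "run": translate},
    {"name": "consumer", "order": 2, "run": consumer},
]
```

With the orders as shipped, the consumer sees the translated request. If the order of the translate hook is raised above the consumer's, the consumer runs first and finds no select value, which is the failure mode described above.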

Bhroam

In terms of background, allowing the order of PBS hooks to be modified dates back to when the hook infrastructure was updated in support of cgroups. At that time, during design discussions, it was agreed that we would let all PBS hook attributes be settable, and the only actions that we thought should not be allowed were modifying the content of a PBS hook and creating or deleting a PBS hook via qmgr. We didn’t really explore the side effect of being able to change individual hook attributes like ‘order’…