PP-706: Automatically create KNL specific information

Hi All,

Please review the EDD and design documents created for the KNL (Intel Knights Landing) compute node on the Cray platform.
Here is the link to the documents on Confluence: https://pbspro.atlassian.net/wiki/spaces/PD/pages/68845606/PP-706+Automatically+create+KNL+specific+information.

Thanks for posting this.
Not very familiar with interface 2. What would happen if we populated PBScrayseg with the actual value of numa nodes? Do we not do this because the system interface does not provide it? Or is it for some other reason?

Looking at interface 3, I think it would be wise to introduce a new node attribute instead of using current_aoe. When we reboot the system we are not changing the application operating environment (aoe). Can we use something more along the lines of an infrastructure operating environment (ioe)? In short, I would suggest that we use current_ioe instead of current_aoe.

There are two ways of creating vnodes on a Cray cluster.

  1. Create a vnode for each segment of a Cray compute node (i.e., when vnode_per_numa_node=true).
  2. Create a single vnode for each Cray compute node (i.e., when vnode_per_numa_node=false). The resources of such a vnode in PBS are the sum of the resources of each segment in the compute node.

For example, node 28 below has two segments.

<Node node_id="28" name="xxxx" architecture="XT" role="BATCH" state="UP">
       <SegmentArray>
        <Segment ordinal="0">
         <ComputeUnitArray>
          <ComputeUnit ordinal="0">
...
...
          </ComputeUnit>
         </ComputeUnitArray>
         <MemoryArray>
          <Memory type="OS" page_size_kb="4" page_count="4194304"/>
         </MemoryArray>
         <LabelArray/>
        </Segment>
        <Segment ordinal="1">
         <ComputeUnitArray>
          <ComputeUnit ordinal="0">
...
...
          </ComputeUnit>
         </ComputeUnitArray>
         <MemoryArray>
          <Memory type="OS" page_size_kb="4" page_count="4194304"/>
         </MemoryArray>
         <LabelArray/>
        </Segment>
       </SegmentArray>
</Node>

If vnode_per_numa_node is set to true in PBS, then PBS will create two vnodes from the above node: vnode_28_0 and vnode_28_1. If vnode_per_numa_node is set to false, then PBS will create a single vnode named vnode_28.

Since we provision a KNL compute node into various memory modes, PBS will create only a single vnode for it, even if there are multiple segments in the node and vnode_per_numa_node is set to true. If vnode_per_numa_node=true, the created vnode will be named hostname_xx_0 and will have the cumulative resources of all segments.
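The naming and resource rules described above can be sketched in Python. This is only an illustration, not the PBS implementation: `build_vnodes`, `segment_mem_kb`, the `is_knl` flag and the `prefix` argument are hypothetical names, and the XML is a trimmed copy of the node 28 example with the ComputeUnit details left out.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the node 28 inventory above (ComputeUnit details omitted).
NODE_XML = """
<Node node_id="28" name="xxxx" architecture="XT" role="BATCH" state="UP">
  <SegmentArray>
    <Segment ordinal="0">
      <MemoryArray>
        <Memory type="OS" page_size_kb="4" page_count="4194304"/>
      </MemoryArray>
    </Segment>
    <Segment ordinal="1">
      <MemoryArray>
        <Memory type="OS" page_size_kb="4" page_count="4194304"/>
      </MemoryArray>
    </Segment>
  </SegmentArray>
</Node>
"""


def segment_mem_kb(segment):
    """Memory of one segment in KB: page size times page count."""
    return sum(int(m.get("page_size_kb")) * int(m.get("page_count"))
               for m in segment.iter("Memory"))


def build_vnodes(node_xml, vnode_per_numa_node, is_knl=False, prefix="vnode"):
    """Return {vnode_name: mem_kb} for one compute node (hypothetical helper)."""
    node = ET.fromstring(node_xml)
    node_id = node.get("node_id")
    segments = node.findall("./SegmentArray/Segment")
    if vnode_per_numa_node and not is_knl:
        # One vnode per segment, each carrying only its own memory.
        return {f"{prefix}_{node_id}_{s.get('ordinal')}": segment_mem_kb(s)
                for s in segments}
    # Single vnode with the cumulative memory of all segments.
    total = sum(segment_mem_kb(s) for s in segments)
    name = f"{prefix}_{node_id}_0" if vnode_per_numa_node else f"{prefix}_{node_id}"
    return {name: total}


print(build_vnodes(NODE_XML, vnode_per_numa_node=True))
print(build_vnodes(NODE_XML, vnode_per_numa_node=False))
print(build_vnodes(NODE_XML, vnode_per_numa_node=True, is_knl=True))
```

The third call shows the KNL case as described at this point in the thread: one vnode with a `_0` suffix that holds the memory of both segments.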

I like the idea of supporting a new type of provisioning, “ioe”. But I really feel that we should implement this as a separate RFE. We should encompass more use cases for this feature, so that it can be generic for both Cray and non-Cray platforms. We also need to figure out how it fits with the existing provisioning types, eoe and aoe.
What are your thoughts?

I think it makes sense to add ioe in addition to aoe and eoe. But we would need to make them have an ordering when run. For example if a job required more than one of these to be changed at a time, we would first do the infrastructure changes, then aoe, and finally eoe.
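A minimal sketch of the ordering suggested above, assuming a fixed order of ioe, then aoe, then eoe. The `ioe` resource exists only as a proposal in this thread, and `provisioning_steps` and the sample values are hypothetical.

```python
# Proposed order: infrastructure first, then the application environment,
# then the energy environment. "ioe" is only a proposal in this thread.
PROVISION_ORDER = ("ioe", "aoe", "eoe")


def provisioning_steps(requested, current):
    """Return the (resource, value) changes a node needs, in run order."""
    steps = []
    for res in PROVISION_ORDER:
        want = requested.get(res)
        # Skip resources the job did not request or that already match.
        if want is not None and want != current.get(res):
            steps.append((res, want))
    return steps


# Hypothetical values, for illustration only.
job = {"eoe": "low_power", "ioe": "quad_100", "aoe": "sles12"}
node = {"ioe": "a2a_50", "aoe": "sles12", "eoe": "normal"}
print(provisioning_steps(job, node))
```

Here the node already runs the requested aoe, so only the ioe and eoe changes are returned, with ioe first.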

I don’t see how ioe would be platform specific, since eoe and aoe are not. It does not make sense to change the infrastructure using aoe, just as it didn’t make sense to use aoe to change the energy state; hence the addition of eoe. As for use cases other than KNL, sites may want to reconfigure the network before starting the job. Or they may want to update the settings on a GPU, or change the number of GPUs attached to a system if they are using network-attached GPUs. All of these are examples of changing the infrastructure of the node and not the application operating environment. Can we do this by using aoe? Yes. But what happens if sites decide they want to configure the infrastructure one way while using different OS versions? This is the scenario that I would like to avoid.

Anyway, these are my thoughts that I would like to be considered in the design. If you feel that the best thing to do is implement this as a separate RFE, then I am fine with that. However, before you make that decision, please consider how much additional work (dev, QA, documentation, feature deprecation, and training) would be required to do this later versus in the initial check-in.

Thanks for your consideration of my input.

I have a few comments:
First off, please number your interfaces so it is easier to talk about them. I’ll number them in my comments starting with the first one being 1.

Interface 1: Are you sure you want to hardcode the name of a chip in vntype? Before knl, there was knc. I’m sure knl won’t be the current chip forever. Maybe use a more generic term.
Interface 2: I don’t think there is a problem here, but keep in mind that AOE provisioning does not support multi-vnoded machines. If vnode_per_numa_node is true, you’ll end up with multiple vnodes and won’t be able to provision your KNL node.
Interface 3: Don’t name this current_aoe. The value of current_aoe is just the current value. These values will be in resources_available.aoe on that node. When someone requests an aoe, the value of current_aoe will change. This is how provisioning works.
Interface 5: This is internal. You don’t need to name it in an external design. If you want to name it, don’t mark it stable.
Interface 6: Most of these log messages are not stable. Actually most log messages are not stable. Always make a conscious decision on every interface before you mark it stable. Think of the consumers of the interface and if they need it to get their job done. Stable interfaces are expensive to PBS. Every stable interface needs to be tested. If we want to change one, it requires a 1 year deprecation period. Do you really want to go through all that to change a message that malloc() failed? Please reevaluate all of the log messages and only mark the ones which need to be stable as stable.

As for the ioe suggestion, I’m not sure I’m fond of the proliferation of hardcoded Xoe resources. We should try and abstract the concept out rather than compounding the issue. If we really do like the idea of having N Xoe resources, then maybe we create a new type of custom resource that marks a resource as a provisioning resource. This means people could create aoe, eoe, ioe, etc, but it’s one single code path to PBS.
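The single-code-path idea can be sketched as follows. This is purely illustrative: PBS has no such resource flag today as far as this thread states, and `ResourceDef`, its `provisioning` and `order` fields, and `plan` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ResourceDef:
    name: str
    provisioning: bool = False  # hypothetical flag: requesting it may reprovision the node
    order: int = 0              # hypothetical: lower values are applied first


def plan(resource_defs, requested, current):
    """One code path for every resource flagged as a provisioning resource."""
    flagged = sorted((r for r in resource_defs if r.provisioning),
                     key=lambda r: r.order)
    return [(r.name, requested[r.name]) for r in flagged
            if r.name in requested and requested[r.name] != current.get(r.name)]


# A site could define any number of Xoe-style resources this way.
defs = [
    ResourceDef("ncpus"),
    ResourceDef("eoe", provisioning=True, order=2),
    ResourceDef("aoe", provisioning=True, order=1),
    ResourceDef("ioe", provisioning=True, order=0),
]
print(plan(defs,
           {"ncpus": 4, "aoe": "sles12", "ioe": "quad_100"},
           {"aoe": "sles11", "ioe": "quad_100"}))
```

Only flagged resources are considered, so `ncpus` never triggers provisioning, and the `order` field gives sites a way to say which change runs first.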

Bhroam

I agree with @bhroam that we should NOT be hardcoding the [code]name of the chip. Intel refers to these processors as Xeon Phi. There have been several releases: Knights Ferry (KNF), Knights Corner (KNC), Knights Landing (KNL)… and I know of more “names” coming in the future. Perhaps you can consider using xeon_phi?

Please note that after Knights Landing is Knights Hill (KNH) and then Knights Mill (KNM).

I cannot disclose the NDA material on this open forum, but I would advise you to consider that the Xeon Phi MCDRAM High Bandwidth Memory (see “MCDRAM as High-Bandwidth Memory (HBM) in Knights Landing Processors: Developer’s Guide”, Colfax Research) may have different attributes that its predecessor(s) may or may not have.

Good point. I like this idea as long as there is a way to specify the order in which each type is run.

I agree that hardcoding knl doesn’t make sense, since a new processor architecture is released every couple of years. How about we keep the vntype as cray_compute itself but add another boolean node attribute for a bootable host processor, like bootable_processor=true/false? This would make it generic for both KNL and non-KNL nodes and for future bootable processor series.

This is required since the system query reports only the current aoe attribute of the node, and not all the aoe available on that processor. Hence we need to set it to what system query has in the response.

<Nodes role="interactive" state="up" speed="1200" numa_nodes="1" dies="1" compute_units="68" cpus_per_cu="4" page_size_kb="4" page_count="25165874"\
numa_cfg="quad" ="16584" hbm_cache_pct="100">
40-47
</Nodes>

The XML above is an excerpt of the system query response. The numa_cfg attribute reports the currently provisioned mode of the processor. Therefore I am setting current_aoe instead of aoe. The admin will set aoe to all the possible values of numa_cfg.
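A minimal sketch of deriving current_aoe from such a system query element. Several things here are assumptions: the sample XML is a cleaned-up variant of the excerpt (the unnamed attribute is left out), the `<numa_cfg>_<hbm_cache_pct>` value format is inferred from the `a2a_50` example elsewhere in this thread, comma-separated rank lists are assumed to be allowed, and `expand_ids` and `current_aoe_by_node` are hypothetical helpers.

```python
import xml.etree.ElementTree as ET

# Cleaned-up variant of the excerpt above; the unnamed attribute is omitted.
SAMPLE = """
<Nodes role="interactive" state="up" numa_nodes="1" compute_units="68"
       numa_cfg="quad" hbm_cache_pct="100">
40-47
</Nodes>
"""


def expand_ids(text):
    """Expand a rank list such as "40-47" or "40-42,50" into a list of ints."""
    ids = []
    for part in text.strip().split(","):
        part = part.strip()
        if not part:
            continue
        lo, _, hi = part.partition("-")
        ids.extend(range(int(lo), int(hi or lo) + 1))
    return ids


def current_aoe_by_node(nodes_xml):
    """Map each node id in the element to its currently provisioned mode."""
    elem = ET.fromstring(nodes_xml)
    # Assumed format: <numa_cfg>_<hbm_cache_pct>, e.g. "quad_100".
    aoe = f"{elem.get('numa_cfg')}_{elem.get('hbm_cache_pct')}"
    return {nid: aoe for nid in expand_ids(elem.text)}


print(current_aoe_by_node(SAMPLE))
```

Each of the eight nodes 40 through 47 ends up with the same current_aoe, because the query reports one mode for the whole group.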

I will update the EDD with the above suggestion.

This is exactly why I believe we should do this enhancement separately. I feel we need to revamp some parts of the provisioning infrastructure so that it is generic and can be extended easily.

Regards
Dilip

I agree with @dilip-krishnan on this. The provisioning infrastructure changes should be a separate RFE.

Making the same comment here that I did on the pull request: I’m sorry, I only now noticed that you mention PBScrayseg in your details (and EDD). Originally PBS would create a vnode per NUMA node and would set a value for PBScrayseg. However, since PP-586 was merged, PBS no longer creates a vnode per NUMA node and no longer sets PBScrayseg by default (i.e. vnode_per_numa_node is not set by default).
This new feature should not set a value for PBScrayseg by default on the KNL vnodes. And the vnode names for the KNL vnodes should not include the _segmentnumber suffix.

Hi @lisa-altair,
I have modified the EDD and the source code changes in the PR as per your suggestion.

Warm regards
Dilip

Thank you @dilip-krishnan. I also noticed in the discussion that you suggested renaming the new vntype “cray_compute_knl”, but I noticed your design document hasn’t changed. What is the new vntype going to be called?

Hi Lisa,
Yes, it was just a suggestion, and I haven’t heard back on it from the reviewer, so I kept the design unchanged.
This design change will not be possible now. If we really want to rename the vntype, I would suggest that we create another ticket and handle it separately.

Warm regards
Dilip

@dilip-krishnan, I do not know who you are identifying as a ‘reviewer’. I don’t believe I am your reviewer, because I was someone in the community that noticed a topic that was interesting and provided some comments on what I see in the community and with partners.

Glad you agree that hardcoding knl is NOT a good idea. Besides, Intel has said to reference the processor as Xeon Phi and to not expose the codename in any external interfaces.

I am not sure we need “bootable_processor” as an attribute for the node for describing KNL (or any other Xeon Phi). If I am not mistaken, the “aoe” attribute, which is a node-level job request, will make the distinction on which node can satisfy the job’s request. Correct?

IOW, if I ask for -l select=2:aoe=a2a_50, then the scheduler will identify the node(s) that have the aoe attribute and provision (if necessary). Right?
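The matching described above can be sketched as follows. This is an illustration of the idea, not the scheduler's actual logic: `pick_nodes`, the node dictionaries and the preference for nodes that are already in the requested mode are all assumptions.

```python
def pick_nodes(nodes, wanted_aoe, count):
    """Pick `count` nodes that offer `wanted_aoe`.

    Returns (chosen_names, names_needing_provisioning), or two empty
    lists when fewer than `count` capable nodes exist.
    """
    capable = [n for n in nodes if wanted_aoe in n["aoe"]]
    # Stable sort: nodes already in the requested mode come first (assumed policy).
    capable.sort(key=lambda n: n["current_aoe"] != wanted_aoe)
    chosen = capable[:count]
    if len(chosen) < count:
        return [], []
    to_provision = [n["name"] for n in chosen if n["current_aoe"] != wanted_aoe]
    return [n["name"] for n in chosen], to_provision


# Hypothetical inventory: aoe lists what a node can be provisioned to,
# current_aoe is what the system query reported.
nodes = [
    {"name": "n40", "aoe": ["quad_100", "a2a_50"], "current_aoe": "quad_100"},
    {"name": "n41", "aoe": ["quad_100", "a2a_50"], "current_aoe": "a2a_50"},
    {"name": "n50", "aoe": [], "current_aoe": None},
]
print(pick_nodes(nodes, "a2a_50", 2))
```

With this inventory a request for two `a2a_50` nodes selects n41 and n40, and only n40 needs to be provisioned before the job starts.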

I know several sites already rely on the vntype attribute being cray_compute_knl because of the current implementation in 13.0.40x. Those sites with Xeon Phi nodes that upgrade from 13.0.40x to 18.x will already have the vntype attribute of cray_compute_knl, and their scripts will continue to work (assuming we will NOT be rewriting the vntype on upgrades). The vntype is an attribute of a node, and a site can add multiple strings (string_array); so the site could add a value of cray_compute_xeon_phi, cray_compute_green, etc.

IIRC, there is a PBS hook that checks for cray_compute_knl… So the use of cray_compute_knl will need to be reviewed to see implications.

Your opportunity to change the vntype value is NOW as you are moving the 13.0.40x features to the master branch of PBS Professional; so you can properly clean up and make the implementation more consistent with PBS designs.

I agree with @scott and @bhroam that we should not use “knl” as part of the PBS-assigned vntype. What if you just use “cray_compute”? Because current_aoe and hbmem exist, a user can request those resources to give a particular specification when they want to use the new KNL vnodes. Thus I don’t believe a built-in boolean resource is necessary…

Hi @scott ,@lisa-altair,
I have made changes as per your suggestion, and renamed vntype for KNL node as cray_compute.

Warm regards
Dilip

@dilip-krishnan, have you identified who your reviewer is? I want to make sure they are aware so that this discussion can be driven to closure and you can wrap up your work. (Believe me… I know I get discouraged when it feels like something continues to drag on.)

Hi @scott,
As far as I understand, the EDD is open to the community for review and there is no assigned reviewer. For me, whoever goes through the EDD and has a suggestion to improve or correct something in it is a reviewer. I might be wrong; please feel free to correct me.

Warm regards
Dilip

One thing to think about: if the vntype is simply cray_compute and we rely on aoe being requested to guide jobs that need Xeon Phi to those nodes, then jobs that do not require Xeon Phi (that simply request vntype=cray_compute but no aoe) may run on Xeon Phi nodes, unless the admin takes steps to prevent that from happening.
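The concern can be illustrated with a small sketch. The `matches` function and the node dictionaries are hypothetical and much simpler than real scheduler matching.

```python
def matches(node, request):
    """True when every requested resource equals the node's value."""
    return all(node.get(key) == value for key, value in request.items())


# Hypothetical nodes: both carry the same vntype once "knl" is dropped from it.
xeon = {"name": "n10", "vntype": "cray_compute"}
phi = {"name": "n40", "vntype": "cray_compute", "current_aoe": "quad_100"}

plain_job = {"vntype": "cray_compute"}
phi_job = {"vntype": "cray_compute", "current_aoe": "quad_100"}

print([n["name"] for n in (xeon, phi) if matches(n, plain_job)])
print([n["name"] for n in (xeon, phi) if matches(n, phi_job)])
```

The plain request matches both nodes, so nothing in the request itself keeps it off the Xeon Phi node.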

True @scc. Do you see that as something that the design should handle now?