PP-339 and PP-647: release vnodes early from running jobs

Hey Al,
Thanks for making the updates. My only comment is about the API call. I don’t think we have any other call that takes a space-delimited format, and I’d rather not add one if possible. I see two possible alternatives: qstat can take multiple job ids as a comma-delimited list, or we could take the format given to qrun -H, which is a ‘+’-delimited list of vnodes.

What do you think?
Bhroam

I can do the plus (+) separated list of vnodes.
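
On the caller side that would just mean building the vnode argument along these lines (a rough sketch; the names and the call itself are illustrative only, not the final API):

    # Illustrative only: the vnodes to release are joined into the
    # "+"-delimited string the release-nodes API call would take;
    # the job id stays a separate argument.
    vnodes = ["nodeA[0]", "nodeA[1]", "nodeB"]
    node_list = "+".join(vnodes)        # -> "nodeA[0]+nodeA[1]+nodeB"
    # e.g. release_nodes(conn, "123.server", node_list)   # hypothetical call name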

Thanks Al, the changes look good.

  1. In the event that a node is released from a job early, it is good that the EDD now states that the execjob_end event will be triggered. However, it does not explicitly state that the execjob_end hook would never run on such a host, which differs from what the current documentation says about execjob_end hooks, so I think this needs to be specifically called out in the EDD.

  2. In interface 2 I do not see that this comment was addressed:

“this will do an equivalent of ‘pbs_release_nodes -a’ for releasing all the sister vnodes when…” I believe that should say “for releasing all the sister nodes when…”, in keeping with the -a functionality.

  3. There are two items both labeled as “Interface 5”. They both relate to accounting log changes, but I assume the duplicate interface numbers are not intentional?

Good point. I’ll add a note to say that in this case, the execjob_end hook will not execute.
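
To illustrate, a minimal execjob_end hook along these lines (the cleanup body here is purely hypothetical) would simply never fire on a host whose vnodes were released early:

    # Minimal execjob_end hook sketch; the scratch-cleanup message is hypothetical.
    # On a host released early via pbs_release_nodes, this hook will not run.
    import pbs

    e = pbs.event()
    j = e.job
    pbs.logmsg(pbs.LOG_DEBUG, "execjob_end: cleaning up scratch for %s" % j.id)
    e.accept()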

I must have missed addressing this. I’ll update.

It should not be duplicated. I’ll fix.

You can take a look at v15 now of the EDD where I made the changes.

Thanks Al! I have no further comments, looks great to me.

Maybe I am missing something, but I believe that we already use a space-delimited format. We use it for the qstat, qdel, and pbsnodes CLIs. I would rather not use the “+” delimiter on the CLI. Now, if this is only for the internal API calls, then please disregard my comment.

It’s only for the API call, Jon.

Bob Ciotti has chimed in on 2 additional types of action for pbs_release_nodes he’d like to see:

  1. Support specifying the number of nodes that should be kept.

  2. Support specifying a select statement that is a subset of the job in its current form.

These provide a more natural expression of what to keep in the job. The select statement, in particular, will allow users to avoid having to play games with parsing PBS_NODEFILE or job attributes to know which nodes they want to keep or toss.

Interesting…so something like
pbs_release_nodes -k <number_of_sister_nodes>

which requests the release of <number_of_nodes> - 1 sister nodes. I’ll have to check to make sure <number_of_nodes> is not more than the actual number of nodes assigned to the job, not including the mother superior node.
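
The check could be roughly along these lines (a sketch only; the function name and the exec_host parsing are just illustrative):

    # Sketch: make sure the requested count does not exceed the number of
    # sister nodes assigned to the job (mother superior excluded).
    # exec_host has the usual "ms/0*8+sis1/0*8+sis2/0*8" form.
    def validate_keep_count(exec_host, number_of_nodes):
        hosts = [chunk.split("/")[0] for chunk in exec_host.split("+")]
        sister_count = len(set(hosts)) - 1   # first host is the mother superior
        if number_of_nodes > sister_count:
            raise ValueError("job only has %d sister nodes" % sister_count)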

I’m confused with the ‘select’ part. Can you give me an example?

Suppose I get a job by asking for:

qsub -l select=15:model=abc:mpiprocs=5+4:model=abc:bigmem=true:mpiprocs=1+30:model=def:mpiprocs=32

Then I want to release most of the nodes, but keep 2 of the non-bigmem abc nodes, and 2 of the bigmem abc nodes. It’d be much easier for the user to provide a new select statement:

pbs_release_nodes -k select=2:model=abc:mpiprocs=5+2:model=abc:bigmem=true:mpiprocs=1

than to have to determine via PBS_NODEFILE/qstat/pbsnodes which nodes have to be individually specified to pare back the job to that smaller select statement description.
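
Just to spell out the “games” a user has to play today, it’s roughly this before they can even start assembling the pbs_release_nodes command line (a rough sketch):

    # Rough sketch of what a user must do by hand today: count ranks per host
    # from PBS_NODEFILE, then cross-reference qstat/pbsnodes to decide which
    # hosts match model=abc, which are bigmem, etc., and list the rest
    # explicitly on the pbs_release_nodes command line.
    import os
    from collections import Counter

    with open(os.environ["PBS_NODEFILE"]) as f:
        ranks = [line.strip() for line in f if line.strip()]
    procs_per_host = Counter(ranks)    # host -> number of MPI ranks placed there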

I see, now I understand. I can see how this is a good addition to the release nodes feature.

Thanks Greg and Al, I have filed PP-724 and PP-725 to capture these new user stories. Do we all agree that these proposed capabilities should be looked at as possible future enhancements, rather than extending the work currently being done for PP-339 and PP-647?

It’s fine with me…

Let’s assume for now they are future enhancements, and we can clarify with Ciotti at the next Altair/NAS meeting.

I just updated the node ramp down design document. It’s now v18. You might want to go to “Page History” and compare v.15 against v.18 (current). It contains some minor updates to the output of some examples, and the error message for doing pbs_release_nodes on a cpuset node or a Cray XC node has changed slightly to be more consistent with the format used in other commands. Finally, I’ve added a new private, experimental server attribute that allows qstat -f to show the values of the internal attributes created by the PBS server to implement this feature. The option is just an aid in debugging PBS.

Node Ramp Down Design

Looking to the future, we are likely to want to support adding resources to a job (in addition to releasing resources). To extend PBS Pro to add resources to jobs (in the future), it would be valuable to design this feature (today) to be easily extendable (syntax and accounting) and to fit naturally with that future direction.

For the “add resources” CLI/API, the natural opposite of pbs_release_nodes would be pbs_add_nodes, but that seems very unnatural and un-PBS-y. A better choice would be to use the existing PBS resource request language (-l select=XYZ) to either:

  • add resources, e.g., pbs_add_resources -l select=XYZ …
  • or replace the job’s resources, e.g., pbs_resize_resources -l select=XYZ …

The latter obviously could be used for either adding or releasing, and has the design advantages of:
(a) using the existing PBS syntax,
(b) having PBS handle the calculation of which chunks are on which nodes, and
(c) naturally supporting both node-level and job-wide resources (e.g., expensive software licenses).
(Note: “pbs_release_nodes -a” does have all the above design advantages.)

I particularly don’t like the idea of forcing the user to do work that could be automated by PBS, so (b) strikes me as particularly onerous. (Allowing node names as an option is good; having them as the only option is not so nice.)

Even more than the CLI/API, I feel it is really important to ensure the accounting records will handle the add case (in addition to the release case), as the whole PBS Pro ecosystem builds tools that depend on and consume the accounting records. Here I am not sure whether it would work or not… would the proposed new u, c, and e records also work for adding/resizing in general? Would we need any changes in them to support adding resources?

Thanks!

I believe the new u, c, and e records could easily be extended to deal with adding nodes. Each record can reflect the exact exec_vnode, exec_host, and resources_used.* values consistent with the action, whether nodes are added or released. Regarding using pbs_resize_nodes as opposed to pbs_release_nodes, I’m not sure we should change the command name now. I know we have a ticket filed to eventually support -l select syntax in pbs_release_nodes, and maybe we can just add to that ticket to also rename “pbs_release_nodes” to “pbs_resize_nodes” when we support that…
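
For example, a consumer of the accounting log could read the proposed records the same way regardless of whether a given record reflects released or, someday, added nodes. A minimal sketch, assuming a default-install accounting path and the ‘u’, ‘c’, and ‘e’ record types from the EDD:

    # Sketch: parse PBS accounting records ("date time;type;id;key=val ...").
    def parse_record(line):
        timestamp, rtype, jobid, message = line.rstrip("\n").split(";", 3)
        attrs = dict(kv.split("=", 1) for kv in message.split() if "=" in kv)
        return timestamp, rtype, jobid, attrs

    with open("/var/spool/pbs/server_priv/accounting/20170601") as acct:
        for line in acct:
            ts, rtype, jobid, attrs = parse_record(line)
            if rtype in ("u", "c", "e"):
                # exec_vnode/exec_host/resources_used.* reflect the job as of
                # this phase, whether nodes were just released or added.
                print(jobid, rtype, attrs.get("exec_vnode"))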

Hi @bayucan I’m sorry to be chiming in so late, but @bhroam asked me some questions and pointed me to a portion of your code, which led me to your external design… and well, here I am with questions.

  1. Can you please add a link from your external design to this forum discussion?
  2. the design says "pbs_release_nodes is not currently supported with nodes/vnodes that are tied to Cray XC systems, as the ALPS reservation cannot be modified right now. "
    It isn’t really clear what “tied to Cray XC systems” means. Looking at the code, though, it appears you plan to check for CRAY_LOGIN and CRAY_COMPUTE. But Cray login nodes do not have an ALPS reservation… if a job had requested resources on more than one login node, theoretically all but one login node could also be freed. I understand that you might not want to allow it anyway, but I thought I’d point it out so you could clarify.
  3. PBS can manage a combination of linux systems together with a Cray X* series system. Would you expect those linux systems to be able to participate in node ramp down?
    3a) If so, those linux systems can potentially be labeled by the admin with a vntype of “cray_compile” (in fact, it’s an example in our guides), in which case @bhroam’s code suggestion of strstr for “cray” won’t work.
    3b) If not, how will you prevent those associated linux systems from participating in node ramp down, since their vntype won’t be set to either cray_login or cray_compute and might not be set to “cray” anything?
  4. PBS supports more than the XC Cray system. You should probably update the design and code to state “Cray X* series” in all the places it says XC.

Will do.

Thanks for clarifying this. I realize now that only cray_compute nodes are involved in an ALPS reservation, but I think it’s still a good idea not to allow cray_login nodes to be ramped down. I just have to update the EDD to specifically say that vntypes of “cray_login” and “cray_compute” are not allowed to be released early.

Yes.

That’s why I did not want to do strstr of “cray” in the first place.

I’ll also update the EDD to say “Cray X* series” instead of XC.

I understand that. However, @bhroam still has a good point: we know there will be another vntype for Cray X* series systems in the near future. I can easily see all of us missing the fact that the new vntype was not added to the checks for what is not allowed to do node ramp down, and who knows what would happen then…
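
To make the trade-off concrete, the two kinds of check being discussed look roughly like this (any vntype values beyond cray_login/cray_compute are just examples):

    # Substring check: also blocks unrelated vntypes such as an admin-assigned
    # "cray_compile" on plain linux nodes, which is why it was avoided.
    def blocked_by_substring(vntype):
        return "cray" in (vntype or "")

    # Explicit list: precise today, but a new Cray X* series vntype introduced
    # later silently slips past it unless someone remembers to extend the list.
    BLOCKED_VNTYPES = {"cray_login", "cray_compute"}
    def blocked_by_list(vntype):
        return vntype in BLOCKED_VNTYPES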