I can only see bad things happening if someone released a node that was part of a Cray reservation. PBS would think it was free and Cray wouldn’t. PBS would try to use it again and Cray would reject it. We’d end up with a lot of held jobs.
What should we do about it? I don’t know. I don’t think there are any good answers. The only one I see is what Lisa says: look for “cray_” and document that if you want to use the node ramp down feature, don’t put “cray_” in the vntype of any node you want to use with the feature.
First, just an implementation idea as an alternative to checking for “cray_” in vntype (which is not now mandated): can we have the server issue the release nodes request to the job’s primary execution host and have any pbs_mom that is running with $alps_client set reject the request?
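(For concreteness: on a Cray, the login-node mom is the one configured with a line like the following in its mom_priv/config; the exact path to the ALPS/BASIL client is site-specific, so treat this purely as an illustrative sketch:

$alps_client /opt/cray/alps/default/bin/apbasil

The proposed check would simply be “does this mom have $alps_client set?”.)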
Interface 2 is silent on whether or not it is supported for jobs involving Cray XC nodes, and I think it needs to be explicit since it is wholly separate from interface 1 from a user/admin perspective (though of course the back end code is in common). I believe it would not work on such jobs at present since my understanding is that the ALPS reservation persists until after file stageout completes (I may be wrong about that, though, please feel free to set me straight if I am!).
In the future I think it may be worth investigating if we can continue to disallow incremental shrinking of a job while it is still running for jobs running on Cray XC systems (since we can’t modify the ALPS reservation) but allow the “release_nodes_on_stageout” feature to be used by destroying the ALPS reservation before file stageout (and also freeing the compute nodes in PBS of course). In my mind this may be possible to support since we know the job is done with the compute nodes by the time file stageout happens, whereas with interface 1 it can be called at any time in the job’s lifecycle. This would be out of scope for the current work in my opinion, though.
This is possible, although will Cray login nodes also have $alps_client set on the mom side, or will it only be set for Cray compute nodes? (@lisa-altair) I think it’s best not to ramp down vnodes with vntype ‘cray_compute’, ‘cray_login’, or ‘cray_compile’, and perhaps any future vnode with a ‘cray_*’ vntype.
There is no pbs_mom running on the hosts represented as “vntype=cray_compute”; those vnodes are represented and accessed through the pbs_moms running on the hosts represented by the vnodes with vntype=cray_login, where $alps_client must be set. (And remember, those vntype values CAN be changed arbitrarily; those are just the default values.)
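To illustrate (vnode names here are made up, output trimmed), an admin can see the default values and even override them, which is why a check keyed off the literal “cray_” string is only as good as the site leaving the defaults alone:

% pbsnodes login1 | grep vntype
     resources_available.vntype = cray_login
% qmgr -c "set node login1 resources_available.vntype = my_login_type"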
It is certainly possible (though as far as I know uncommon) for a cray_login type node to serve as the primary execution host for a job that does not actually use any cray_compute type nodes, where a node release would technically be possible but disallowed by this approach. If others see this as an unacceptable drawback please speak up.
An unstated benefit to this proposed approach is it keeps all of the Cray-specific code in the mom, which is (I believe) still the only place we have anything truly specific to this platform.
I see, ok this approach makes sense then. Should I explicitly state in the EDD that for Cray X* series nodes, it will be those managed by a pbs_mom with $alps_client set, or is that too much implementation detail?
I’d like to hear from @lisa-altair and/or @bhroam (and maybe @mkaro) on their views of this implementation idea before we go much further, but as for how to state this in the EDD I’d like to be as precise as possible with something like “this interface is not supported for jobs which have a primary execution host running a pbs_mom with $alps_client set, and will return this error …”.
Interesting point. This would also work for a pbs_release_nodes -a (just not for a partial release).
Crays are a funny beast. If a job doesn’t request a Cray mom node as its first chunk, it will be assigned one (but I don’t think it shows up as MS). While it is common practice that the login nodes are also the mom nodes, do they have to be? Could a site do something like put their mom on the sdb node? Should we be assuming the moms are on login nodes?
As for what the EDD says, I’d use language we use in the docs. I seem to remember the term ‘inventory mom’ used somewhere. I might be misremembering though.
@scc: pbs_release_nodes will not allow a node to be released that is managed by the mother superior, which is the primary execution host. There’s a specific error for this that is mentioned in the EDD:
EDD excerpt:
pbs_release_nodes will report an error if any of the nodes specified are managed by a mother superior mom.
Example:
% pbs_release_nodes -j 241 borg[0]
pbs_release_nodes: Can’t free ‘borg[0]’ since it’s on an MS host
I can just add a note that nodes tied to Cray X* series systems are those managed by a mom with $alps_client set, and pbs_release_nodes will return the appropriate, existing message about not allowing sister nodes tied to Cray X* series systems to be released.
existing EDD excerpt:
% pbs_release_nodes -j 253 cray_node
"pbs_release_nodes: not currently supported on Cray X* series nodes: <cray_node>"
Thanks, Al. I am not sure if this is what you are implying with your reply or not, but the primary execution host name of the job will appear in the Mom = line of all of the X* execution hosts, so if that is the mechanism the code is using to determine whether or not a vnode is managed by a mother superior mom, then it may already be rejected, no additional $alps_client checking required. Even if this is already the case, the additional check in the code and more explicit detailing in the EDD may still be beneficial, though.
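For example (hypothetical hostnames, output trimmed), a compute vnode of such a job shows the login host that manages it:

% pbsnodes cray_node
cray_node
     Mom = login1.example.com
     resources_available.vntype = cray_compute
     ...

and in the usual Cray setup that Mom value is the job’s primary execution host.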
Also, while re-reading this, it stuck out at me that “since it’s on an MS host” is probably not what we want in the error message; instead I think we should use “primary execution host”. I think it is OK (though not ideal) to use the two terms interchangeably in the text of the EDD, but for actual log messages in the software I would very much prefer that we stick to “primary” and “secondary” rather than “MS” and “sister”.
Scott, yes, I’m implying “MS host” to be the primary execution host, and the one appearing as the “Mom” vnode attribute value.
I see in the admin guide that we interchangeably use “mother superior” and “primary execution host” as well as “secondary execution host” and “sister”, although “secondary” is not really defined in our reference guide; instead, we mention “subordinate mom”. I’ll keep the use of “mother superior” and “sister” in the EDD, but I’ll go ahead and change “MS” to “primary execution host” in the error message pbs_release_nodes returns when releasing a mother superior vnode.
Here are the definitions in the PBS reference guide:
"Mother Superior
Mother Superior is the MoM on the head or first host of a multihost job. Mother
Superior controls the job, communicates with the server, and controls and consolidates
resource usage information. When a job is to run on more than one execution
host, the job is sent to the MoM on the primary execution host, which then starts the
job. Moved
Primary Execution Host
The execution host where a job’s top task runs, and where the MoM that manages the
job runs.
Sister
Any MoM that is not on the head or first host of a multihost job. A sister is directed
by the Mother Superior. Also called a subordinate MoM.
Subordinate MoM
Any MoM that is not on the head or first host of a multihost job. A subordinate
MoM is directed by the Mother Superior. Also called a sister.
You’re the second one who has suggested not making the -j <job ID> option to pbs_release_nodes required; rather, if it’s not given, pbs_release_nodes is likely being called inside a job, where it can just get the job ID from the $PBS_JOBID environment variable. Initially, I didn’t want to do this because pbs_release_nodes may not just apply to running jobs but also, later, to reservations via a new option, say -r. But I’m getting convinced that we should allow what you suggested. Unless someone objects, I’ll go ahead and make the EDD change.
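If we do make that change, then inside a job script the call could be as simple as the sketch below (node names are hypothetical), with the job ID picked up from $PBS_JOBID:

#PBS -l select=3:ncpus=1
...
pbs_release_nodes nodeB nodeC

rather than requiring “pbs_release_nodes -j $PBS_JOBID nodeB nodeC”.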
As mom holds onto the data for the job, it will also be saved on disk in the internal job files. So if mom is restarted, it will just recover the data from the job file.
Yes, the trouble is in releasing individual vnodes managed by a cpuset mom. We can enhance pbs_release_nodes to work with cpuset-ed moms on the next release. It’s not targeted for this initial version.
Yes, I think the clause “until the entire cgroup is released for the job” should be added.
The information can be obtained from the server_logs, much like how it works with other PBS commands like qrun executed by non-root. This reminds me, I need to put in the EDD that if pbs_release_nodes fails with “Unauthorized User”, then the server_logs would show a message like:
6/27/2017 15:13:45;0020;Server@corretja;Job;15.corretja;Unauthorized Request, request type: 90, Object: Job, Name: 15.corretja, request from: pbsuser@corretja.pbspro.com
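An admin could also pull that entry for a particular job with tracejob (output below is approximate and trimmed):

% tracejob 15
...
06/27/2017 15:13:45  S    Unauthorized Request, request type: 90, Object: Job, Name: 15.corretja, request from: pbsuser@corretja.pbspro.com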
Good point. I’ll replace “these nodes” with “node(s)” so the error message can apply to both a single node and multiple nodes in the list.
As with the other cases, all vnodes specified in pbs_release_nodes must be releasable; if any one of them fails a check (such as the Cray check), then none are released.
Yes, that would be a nice option. We can add this enhancement on the next release of node ramp down feature.
Of course qstat will not be called automatically by pbs_release_nodes! It’s not meant to be implied that way.
The release-vnode-early request from pbs_release_nodes (i.e. IM_DELETE_JOB2) is different from a normal delete job request (i.e. qdel/IM_DELETE_JOB): the former happens while the entire job has not yet ended, whereas with the latter the job is at its end. So for the former the execjob_epilogue hook executes, while for the latter the execjob_end hook does. So yes, sites will be made aware of this via our documentation.
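For example, a site that wants to be notified of the early release could attach a hook to that event in the usual way (the hook name and script file here are made up):

% qmgr -c "create hook relnotify event=execjob_epilogue"
% qmgr -c "import hook relnotify application/x-python default relnotify.py"

That epilogue hook would then run on a sister when its vnodes are released early via pbs_release_nodes.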
I’m just being consistent with the other PBS api functions. All the ‘extend’ parameters are of “char *” type. Perhaps it will be an infrastructure project in PBS later to convert all the types to “void *”.
Ok, will fix this.
I’ve actually highlighted them in different colors: blue, green, red…
No, it’s only when there’s a release nodes action. I’ll need to clarify that the new accounting records appear as a result of release node action.
Yes, this needs to be defined exactly in the EDD.
Refers to a PBS facility.
It’s supposed to be a period (.). I’ll fix.
I’ll change it to “cpu(s)” so it can be applied to both cases.
It’s a private, experimental interface, showing some internal attributes. I’ve listed what’s there so far, but more could be added later.
It might be good to raise the issue of the two terms having very similar meanings with our documentation department. Seems like some explanation about usage context might help.
In thinking about this some more and trying it code-wise, this goes a different route from how node ramp down is currently implemented. The pbs_release_nodes request is sent to the server; the server figures out which sister nodes are allowed to be released and releases them, modifying the appropriate internal attributes and structures, and then tells the primary execution mom that this is how the assigned vnodes look now and to go ahead and update its own internal tables/structures. It would be a major change to move this logic entirely to the primary mom side, and many issues and subtleties arise in doing that.
So I’ll have to go back to Bhroam’s proposal to look for the “cray_” string in the vnode’s vntype value to determine that the vnode is not allowed to be released. This will be added to the EDD to replace the note about “vnodes managed by mom with $alps_client set in config file”.
Rather than have the releasability of a node special-cased for certain hard-coded vntypes, why shouldn’t releasability be a vnode attribute all by itself?
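Purely as a sketch of what that might look like (the resource name “releasable” is hypothetical, not an existing attribute): a host-level boolean custom resource that the admin sets per vnode, which pbs_release_nodes would honor instead of pattern-matching on vntype:

% qmgr -c "create resource releasable type=boolean, flag=h"
% qmgr -c "set node cray_node resources_available.releasable = False"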