PP-339 and PP-647: release vnodes early from running jobs

bhroam · January 31, 2017, 11:42pm

Something occurred to me when reading the document again. Does it make sense to require a user to use the -j option if they are in their job script? The mom can easily enough figure out where the request is coming from, right? If you do require it, I’d add an example of how to use pbs_release_nodes from a job script. You need to use $PBS_JOBID to get the job id.

I don’t think you need the _orig attributes. All these values are derived from the Resource_List.select. On a qrerun, they can be derived the same way again. No reason to clutter up qstat -f if we don’t need to.

I was wondering your reasons for modifying the schedselect? It is originally used to place the job. Now that the job is running, it will not be used (except on a qrerun, but you put it back). It won’t be used when resuming a preempted job. The exec_vnode is used for that. If you do decide to continue to modify it, you need to take care when doing so. Each chunk in the schedselect maps one for one to the exec_vnode. When you remove a node out of the middle of the exec_vnode, you’ll need to remove the correct chunk from the middle of the schedselect.

bayucan · February 1, 2017, 12:59am

Yes, I do require it. Okay, I’ll give an example of pbs_release_nodes -j $PBS_JOBID

I needed some way to go back to the original Resource_List values, before the first pbs_release_nodes was ever called, so that I can put back the job in the original state on a qrerun. Otherwise, the Resource_List* values reflect the effect of pbs_release_nodes already.

The internal nodes table inside the mother superior is determined by the schedselect value. That was the main reason, and yes, I’ve taken care of making sure that if schedselect is modified, exec_vnode, exec_host are modified as well, otherwise, the job will not run properly. It’s actually a 1:1:1 among schedselect, exec_vnode, and exec_host.

billnitzberg · February 1, 2017, 1:01am

@bayucan – Nice!

A couple suggestions:

The API(s) corresponding to this command should be documented too – every command line should have an equivalent API (or series of APIs).
- Also, will there be an equivalent PBS Hook call?
Interface 3 (new u accounting record): Consider breaking up u into two separate records: one that is the “partial end” record and another that is the new “continuation record” – this would allow you to not invent all the new names (updated_exec_host, updated_exec_vnode, updated_Resource_List, and resources_used_incr), and might allow existing scripts to re-use their logic from (existing) Start and End records. (By the way, the design hides these new keywords deep inside an example – it would be better to call them out as explicit additions if they are kept).
Interface 4 (change in E accounting record semantics): changing the semantics of the E record is going to cause a lot of confusion for people who have older tools. I suggest leaving the E record as-is, and inventing a new “end of job” record for the case when nodes are released. So, old accounting processing tools work without change, but do not “see” these new types of jobs. New tools work perfectly. (An alternative would be to craft the new E record semantics so that old tools give at least the usage data correctly, and also add a new additional record with the data since last release.)
Interface 5: Suggest hiding all these names if at all possible. Dumping this data into the logs is fine, but exposing all these new names (e.g., via qstat) is problematic (even with the warning).
What does qstat show while the job is running and at the various stages of the job, including in the qstat history? (My suggestion would be nothing different than an existing qstat for maximum backward compatibility, but that may not be possible.)
If the release nodes action takes some time, then it might be useful to have a new job substate
The design should be explicit about what happens to any processes, temp directories, etc., on nodes (vnodes) that have been released. (I assume they are killed/deleted, but I didn’t see that in the design.)

Again, nice!

bhroam · February 1, 2017, 1:28am

I think I may have been confusing. I didn’t mean to ask if the current design required the -j option. I meant to suggest that we change the design. My suggestion is that if the pbs_release_nodes command is invoked from within a job script, the -j option is not needed.

This brings up another question in my mind. Where does the command need to be run? Is it only from the mom? Can it be run anywhere like other PBS commands? If there is a limitation of where the command needs to be run, then that should be spelled out in the document.

Once again, I may have been confusing about what I was suggesting. My suggestion is to remove these attributes. They are storing values that are originally derived from the Resource_List.select. Why can’t we re-derive them when the job is requeued?

bayucan · February 1, 2017, 5:56pm

Yes, there should be an API equivalent to pbs_release_nodes. I’ll add to the EDD.

I like this idea. Ok, I’ll update the design.

This makes sense. Ok, I’ll update.

I agree. I’ll have to figure out how to block these internal attributes from showing up in qstat display. I’ll just remove interface 5 altogether.

It should be the same with the exception that at every successful pbs_release_nodes call, the exec_host, exec_vnode, Resource_List* will show the updated values sans the released vnodes in the assignment. I’ll make this info explicit.

I don’t think speed is an issue with pbs_release_nodes. But I’ll keep this in mind.

Ah yes, these should be spelled out. I’ll do.

Cool!

bayucan · February 1, 2017, 6:03pm

Ok, noted.

pbs_release_nodes can be run in the command line, outside of the job, or inside the job script or via qsub -I.

I’ll be doing some implementation adjustments and I’ll keep this in mind. I just want to make sure the regeneration process does not add more to server processing time. I will also try to figure out how to hide all the internal job attributes from showing up in qstat so as to not clutter the qstat display.

gmatthew · February 3, 2017, 10:53pm

So, which alternative will you pick, Al? If at all possible I’d like the design of end-of-job (and end-of-phase for that matter) to remain such that the log parser can be stateless - the parser can look at each record in isolation and know what to do without having to look at other records. I think Bill’s suggestion in parentheses might break this.

bayucan · February 4, 2017, 5:13pm

I was going to leave the end record as is and introduce a new “last phase” record, not the alternative Bill suggested in parentheses.

He commented that instead of having these new keywords resources_used_incr, next_exec_vnode, next_exec_host, next_Resource_List, it might be better to break up the accounting records, keeping the current keywords as is; their meaning would then depend on the record type.

So in summary, I see something like:

A ‘u’ accounting record that has the assigned and used values for that phase of the job:
;u;exec_vnode=… exec_host=… Resource_List.*=… resources_used.*=…

And then introduce a “continue” phase record that shows the next assigned exec_vnode, exec_host, Resource_List. I’m not sure if adding resources_used* values here makes sense, unless you want a snapshot of the resources_used* values of the job at that stage of the job?

;c;exec_vnode=… exec_host=… Resource_List.*=…

Then the ‘E’ (end) record continues as before, showing the job’s values in total at the end of the job:
;E;exec_vnode=… exec_host=… Resource_List.*=… resources_used.*=…

But before the ‘E’ record, there’ll be a last phase record written, say the ‘e’ record, that shows the assigned resources and used resources in the last phase of the job:
;e;exec_vnode=… exec_host=… Resource_List.*=… resources_used.*=…

–
Would this work for you? In regards to statelessness, I can foresee the log parser to just gather up all the ‘u’ and ‘e’ records of the job to derive the total values and either aggregate them or average them out.

gmatthew · February 16, 2017, 6:08pm

We had a dedicated time and unplanned power outage that robbed me of a week or so - thus the delay in responding.

I don’t have a need for resources_used* values here.

I don’t think this ends up being stateless in the presence of a mix of jobs where some jobs have nodes released and others don’t. The log parser has to know to pay (different) attention to the E record for jobs that did not have nodes released than for those jobs that did have nodes released.

bayucan · February 16, 2017, 6:26pm

Ok.

True. For jobs with released nodes (presence of ‘u’ records), the parser has to look for the ‘e’ records, while for jobs without released nodes (no ‘u’ records), the parser pays attention to the ‘E’ record.
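A minimal Python sketch of that rule, for illustration only. It assumes the usual semicolon-separated accounting line layout (timestamp;record type;job id;space-separated key=value pairs) and the ‘u’ and ‘e’ record types proposed in this thread, which are not final. The function names are made up. Note that it has to group records by job before it can decide which ones to use, which is exactly the state being discussed.

```python
from collections import defaultdict


def parse_record(line):
    """Split 'timestamp;type;job_id;key=value ...' into type, job id, attrs."""
    _timestamp, rtype, job_id, message = line.rstrip("\n").split(";", 3)
    # Assumption: values contain no spaces; real logs may need more care.
    attrs = dict(item.split("=", 1) for item in message.split() if "=" in item)
    return rtype, job_id, attrs


def billing_records(lines):
    """Return {job_id: [attrs, ...]} with the records to charge from.

    Jobs that released nodes (have 'u' records) use their 'u' and 'e'
    phase records; all other jobs use their 'E' record.
    """
    by_job = defaultdict(lambda: defaultdict(list))
    for line in lines:
        rtype, job_id, attrs = parse_record(line)
        by_job[job_id][rtype].append(attrs)

    result = {}
    for job_id, records in by_job.items():
        if records["u"]:
            result[job_id] = records["u"] + records["e"]
        else:
            result[job_id] = records["E"]
    return result
```

With this shape, the per-phase values of a job with released nodes can be summed or averaged from the returned list, while older jobs still come from a single ‘E’ record.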

scc · February 20, 2017, 1:27pm

I have a few questions/comments here:

Interface 1:

Can you give an example showing releasing some vnodes (but not all) from a single host?
Is releasing some but not all vnodes on a host supported on systems running the cpuset mom (will the cpuset be resized)?
Is releasing some but not all vnodes on a host supported on systems running the cgroup hook (will the cgroup be resized)?
Is there no option to release all vnodes aside from the first, analogous to releasing “all sister nodes” with -a?

Interface 2:

“this will do an equivalent of ‘pbs_release_nodes -a’ for releasing all the sister vnodes when…” I believe that should say “for releasing all the sister nodes when…”, in keeping with the -a functionality.
Do we need to explicitly specify whether or not this will work with default_qsub_arguments?
Do we need to explicitly specify the specifics of how/ when this new attribute will appear and be settable via hooks?

Interface 5:

Minor: I don’t understand the wording of this: “Taking from previous example, support there’s the following release of vnode:”

mkaro · February 20, 2017, 4:50pm

With regard to the cgroups hook, if the normal post-job hooks are run on the nodes being released then the associated cgroups should get cleaned up.

With a cpuset mom, I would expect the cgroups hook to be disabled. Or at least the cpuset subsystem should be disabled.

scc · February 20, 2017, 6:52pm

Hi @mkaro, thanks for the response. This brings up an important point that I missed previously: the EDD currently appears to be silent on which (if any) end of job hook events get triggered on a node that is being completely released, and also which (if any) end of job hooks run on a multi-vnoded node that is being partially released.

If end of job hook events are triggered, do hook writers need a way to detect whether or not the node it is running on is being completely released from a running job, or if the job is completely finished so that they can alter their behavior?

Since many execjob_end/execjob_epilogue hooks are concerned with job cleanup, I think something needs to run so that site specific extra cleanup can take place and we don’t introduce an inadvertent loophole that may leave nodes “dirty”.

Back to the original question… Assuming for a minute that the necessary end of job hook events do run on a node being partially released from a job, could the cgroups hook handle shrinking a cgroup to match the new exec_vnode for the node while job processes may still be running inside it? My guess is “not currently”. This is a case where the hook would need to know whether it is being run at true end of job for the node, or if the job is being shrunk on the node (and also it’d need to know what the new exec_vnode looks like). This is beginning to sound like a user story for a new “execjob_shrink/release/resize” hook event…

Any thoughts?

bayucan · February 21, 2017, 12:59am

I’ll use an example from my EDD where ‘federer[1]’ is just one vnode from host ‘federer’ being released, with the other 2 vnodes ‘federer[0]’ and ‘federer’ continue to be assigned after pbs_release_nodes call:

% qstat 241 | grep "exec|Resource_List|select"
exec_host = borg[0]/0*0+federer/0*0+lendl/0*2
exec_vnode = (borg[0]:mem=1048576kb:ncpus=1+borg[1]:mem=1048576kb:ncpus=1+borg[2]:ncpus=1)+(federer:mem=1048576kb:ncpus=1+federer[0]:mem=1048576kb:ncpus=1+federer[1]:ncpus=1)+(lendl:ncpus=2:mem=2097152kb)
Resource_List.mem = 6gb
Resource_List.ncpus = 8
Resource_List.nodect = 3
Resource_List.place = scatter
Resource_List.select = ncpus=3:mem=2gb+ncpus=3:mem=2gb+ncpus=2:mem=2gb
schedselect = 1:ncpus=3:mem=2gb+1:ncpus=3:mem=2gb+1:ncpus=2:mem=2gb

% pbs_release_nodes -j 241 federer[1] lendl

% qstat 241 | grep "exec|Resource_List|select"
exec_host = borg[0]/0*0+federer/0*0 ← no lendl as all assigned vnodes in lendl have been cleared.
exec_vnode = (borg[0]:mem=1048576kb:ncpus=1+borg[1]:mem=1048576kb:ncpus=1+borg[2]:ncpus=1)+(federer:mem=1048576kb:ncpus=1+federer[0]:mem=1048576kb:ncpus=1) ← federer[1] and lendl removed.
Resource_List.mem = 4194304kb ← minus 2gb (from lendl)
Resource_List.ncpus = 5 ← minus 3 cpus (1 from federer[1] and 2 from lendl)
Resource_List.nodect = 2 ← minus 1 chunk (when lendl was taken out, its entire chunk assignment disappeared)
Resource_List.place = scatter
schedselect = 1:mem=2097152kb:ncpus=3+1:mem=2097152kb:ncpus=2
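For anyone who wants to check the arithmetic in this example, here is a minimal Python sketch (not PBS code) that parses an exec_vnode string, drops the released vnodes, and recomputes the chunk count and the ncpus and mem totals. The function names are made up for illustration, and it assumes every mem value carries a kb suffix.

```python
import re


def parse_exec_vnode(exec_vnode):
    """Split exec_vnode into chunks of (vnode_name, {resource: value})."""
    chunks = []
    for group in re.findall(r"\(([^)]*)\)", exec_vnode):
        chunk = []
        for part in group.split("+"):
            name, *resources = part.split(":")
            chunk.append((name, dict(r.split("=", 1) for r in resources)))
        chunks.append(chunk)
    return chunks


def release(chunks, released):
    """Drop the released vnodes; a chunk left with no vnodes disappears."""
    kept = [[v for v in chunk if v[0] not in released] for chunk in chunks]
    return [chunk for chunk in kept if chunk]


def format_exec_vnode(chunks):
    """Rebuild the exec_vnode string from the parsed chunks."""
    return "+".join(
        "(" + "+".join(
            ":".join([name] + [f"{k}={v}" for k, v in res.items()])
            for name, res in chunk
        ) + ")"
        for chunk in chunks
    )


def total_ncpus(chunks):
    return sum(int(res.get("ncpus", 0)) for chunk in chunks for _, res in chunk)


def total_mem_kb(chunks):
    total = 0
    for chunk in chunks:
        for _, res in chunk:
            mem = res.get("mem", "0kb")
            if not mem.endswith("kb"):
                raise ValueError(f"sketch only handles kb values, got {mem}")
            total += int(mem[:-2])
    return total
```

Feeding it the exec_vnode from before the release and the set {"federer[1]", "lendl"} should give two chunks, 5 ncpus and 4194304kb, matching the updated Resource_List values.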

As it stands, it should not be supported. Currently, if one or more (but not all) of the vnodes from a mom host assigned to the job are released, nothing happens (the cpuset is not resized) except that the internal tables on the mom and server side are updated, allowing the server to reassign those vnodes to other jobs if the vnodes have been configured as “shared”. Only when all the vnodes from that mom host assigned to the job are released will the job be completely removed from that mom host, causing the cpuset created for the job to be cleared and an epilogue hook to be executed.
Perhaps in a later version, we can extend this feature to allow execution of a hook, say an epilogue hook, whenever a vnode has been taken out of the job. Then that hook can take care of resizing the cpuset.

I’d say in the initial implementation, this should not be supported. If one or more (but not all) of the vnodes from a mom host assigned to the job are released, nothing happens (no resizing of the cgroup) except that the internal tables on the mom and server side are updated, allowing the server to reassign those vnodes to other jobs if the vnodes have been configured as “shared”. Only when all the vnodes from that mom host assigned to the job are released will the job be completely removed from that mom host, causing the execution of an epilogue hook, which will clean up the cgroup.
Again, in a later version, we can extend this feature to allow execution of a hook, say an epilogue hook, whenever a vnode has been taken out of the job. This hook can take care of resizing cgroups.

No, the ‘-a’ is the only other way that we have right now. The vnodes to be released would have to be named.

Ok.

Yes, this should work also with default_qsub_arguments. I’ll add this info.

Yes, that should be mentioned as well.

Let me improve the wording.

bayucan · February 21, 2017, 1:18am

I’ll add the info. If all vnodes assigned to the job from that mom host are released, then an epilogue hook will run as part of the action of deleting the job completely from the mom host. No hooks currently get executed when vnodes are partially released. Only the internal tables of the server and mom are updated, allowing those vnodes that are set to be “shared” to be assigned to other jobs.

When all vnodes get released, only the epilogue hook executes and not the end hook. If we allow a hook to execute every time a vnode gets released, then we might need to add this capability. Or maybe we can introduce a new kind of hook event. I feel this should be in a future version of this feature, though.

scc:

Since many execjob_end/execjob_epilogue hooks are concerned with job cleanup, I think something needs to run so that site specific extra cleanup can take place and we don’t introduce an inadvertent loophole that may leave nodes “dirty”.

Back to the original question… Assuming for a minute that the necessary end of job hook events do run on a node being partially released from a job, could the cgroups hook handle shrinking a cgroup to match the new exec_vnode for the node while job processes may still be running inside it? My guess is “not currently”. This is a case where the hook would need to know whether it is being run at true end of job for the node, or if the job is being shrunk on the node (and also it’d need to know what the new exec_vnode looks like). This is beginning to sound like a user story for a new “execjob_shrink/release/resize” hook event…

Any thoughts?

I feel like for the initial version of this feature, we should not support cpusetted mom and cgroups yet, given the problems with partial release of vnodes from a mom host, much like what we say about Cray:
“pbs_release_nodes is not currently supported with nodes/vnodes that are tied to Cray XC systems, as the ALPS reservation cannot be modified right now.”

bhroam · February 22, 2017, 1:35am

Even if we don’t support cpuset resizing (and maybe cgroups), I think it’ll provide a very nasty way for a user/admin to shoot themselves in the foot. If I am not mistaken, a cpuset is tied to specific node boards exclusively. If the server shows the new nodes as free, the scheduler will run a new job on them. When the new job gets to the mom, a new cpuset will not be able to be created and the job will be rejected. This will happen 20 times until it’s put on hold and the next job starts the same sequence of events.

I don’t know what will happen to cgroups. From what I understand about them, they’re not as tied to specific hardware like a cpuset is. We might end up leaving them around if the correct hook events don’t get called though.

I have no problem with restrictions on the initial implementation, but this is a user-run command. Is just telling people that it is not supported enough? The user won’t necessarily know of the restrictions. We should make sure nothing bad happens when the command is run in these situations. I think rejecting the request is a fine idea.

billnitzberg · February 23, 2017, 12:47am

Hi @bayucan,

I was just looking back at this design and did not see the API and hook interfaces.

Also, what vnodes does “-a” actually release? In particular, if I have a job that has 4 vnodes on Mother Superior and 4 vnodes on a sister MOM, does it release 7 vnodes or only 4 vnodes? (And, is there a programmatic way to figure out which vnodes are on Mother Superior so they can be released (or not) as desired?)

Thanks!

bayucan · February 23, 2017, 8:00am

It’s coming. It’s the last thing I need to add before I announce a new EDD version.

billnitzberg:

Also, what vnodes does “-a” actually release? In particular, if I have a job that has 4 vnodes on Mother Superior and 4 vnodes on a sister MOM, does it release 7 vnodes or only 4 vnodes? (And, is there a programmatic way to figure out which vnodes are on Mother Superior so they can be released (or not) as desired?)

It will only release vnodes whose parent mom is not on the mother superior (MS) host. So in this case, only the 4 vnodes on the sister mom are released.
Calling pbs_release_nodes on a vnode managed by the MS host would return:
“Can’t free <vnode_name> since it’s on a MS host”
To figure out if a vnode is on an MS host, one can look into ‘pbsnodes -av’ output, and trace back the parent mom host via the ‘Mom’ attribute value.
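As a rough illustration of that lookup, the Python sketch below maps each vnode to its ‘Mom’ value from text shaped like ‘pbsnodes -av’ output (a vnode name on its own line, followed by indented "attribute = value" lines) and lists the vnodes that are not on a given MS host. The function names and the sample host names are hypothetical, real output carries many more attributes, and the caller is assumed to already know the MS host name in the same form that ‘Mom’ reports it.

```python
def vnode_to_mom(pbsnodes_output):
    """Map each vnode name to the value of its 'Mom' attribute."""
    moms = {}
    current = None
    for line in pbsnodes_output.splitlines():
        if not line.strip():
            continue
        if not line[0].isspace():
            # An unindented line starts a new vnode stanza.
            current = line.strip()
        elif current is not None:
            key, sep, value = line.strip().partition(" = ")
            if sep and key == "Mom":
                moms[current] = value
    return moms


def vnodes_off_ms_host(moms, ms_hostname):
    """Vnodes whose parent mom is not the MS host, in listing order."""
    return [vnode for vnode, mom in moms.items() if mom != ms_hostname]
```

Only the vnodes returned by the second function would be candidates for release; anything on the MS host would get the error above.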

bayucan · February 23, 2017, 8:11am

bhroam:

Even if we don’t support cpuset resizing (and maybe cgroups), I think it’ll provide a very nasty way for a user/admin to shoot themselves in the foot. If I am not mistaken, a cpuset is tied to specific node boards exclusively. If the server shows the new nodes as free, the scheduler will run a new job on them. When the new job gets to the mom, a new cpuset will not be able to be created and the job will be rejected. This will happen 20 times until it’s put on hold and the next job starts the same sequence of events.

I don’t know what will happen to cgroups. From what I understand about them, they’re not as tied to specific hardware like a cpuset is. We might end up leaving them around if the correct hook events don’t get called though.

I have no problem with restrictions on the initial implementation, but this is a user-run command. Is just telling people that it is not supported enough? The user won’t necessarily know of the restrictions. We should make sure nothing bad happens when the command is run in these situations. I think rejecting the request is a fine idea.

I was actually going to implement something where if pbs_release_nodes is called on a vnode whose parent mom has resources_available.arch = “linux_cpuset”, or vnode contains resources_available.PBScray* (ALPS restriction), then an error is returned. I need to add this info to the EDD.
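The check I have in mind could look roughly like the Python sketch below. It is only an illustration of the rule, not the server code: the function name is made up, and it assumes the relevant resources_available values are handed in as a plain dictionary keyed by resource name.

```python
def release_rejected(resources_available):
    """Return True if a release request naming this vnode should be refused.

    Assumption: resources_available is a dict such as
    {"arch": "linux_cpuset", "ncpus": "8"} for the vnode / parent mom.
    """
    # cpuset moms: the cpuset cannot be resized on a partial release.
    if resources_available.get("arch") == "linux_cpuset":
        return True
    # Cray: any PBScray* resource marks a vnode tied to an ALPS reservation.
    return any(name.startswith("PBScray") for name in resources_available)
```

A request naming any vnode for which this returns True would be rejected with an error instead of silently updating the internal tables.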

bayucan · March 1, 2017, 9:04am

I now have version 13 of the node rampdown design. You might want to go to “Page History” and compare v.8 against v.13 (current). It incorporates all the comments from Feb 1.

Node Rampdown Design (v13)
