PP-734: Ability to release limited resources when a job is suspended

Hi,

I’ve posted a design proposal for PP-734 to add support in PBS for releasing a limited number of resources when a job is suspended.

Please have a look at the design proposal and provide your feedback.

Thanks,
Arun

@arungrover, here are my comments:

  1. It seems the phrase “In most cases suspended job holds on to the memory it would have consumed and just releases ncpus …” contradicts the statement in the sentence before it “… releases all the consumable resources requested by the job when it is suspended”.
  2. Are dynamic consumable resources affected too?
  3. Are there restrictions on what resource operations (e.g. setting, unsetting, altering) could be done while a job that is using the resource gets suspended?
  4. Please add a definition of terms. You mentioned at some point that suspension could be by qsig (-s suspend?) or preemption. And are all preemption methods (checkpointing, suspension, or requeueing) affected?
  5. Would it make sense to also be able to specify which resources can’t be released after a job is suspended?

Hey Arun,
Here are my comments:

  1. In your intro paragraph you say that jobs usually hold onto their memory. I don’t think this is the use case this feature is targeting. In a normal unix system, a job’s memory will be swapped out when needed. The use case that is being targeted here is when there is no swap space. A job’s memory has nowhere to go, so it remains in use.
  2. I think res_released_on_susp needs to be a server attribute. When multisched goes in, there would be N res_released_on_susp attributes and that would cause issues.
  3. Change the 4th to the last bullet of interface one to say, “If unset…” and then change 3rd to the last bullet to say the attribute is unset by default.
  4. The last bullet of interface 2 is internal. I wouldn’t say it in an external design. Since you did mention it, I’d rather see the server handle setting the attribute when a job is suspended by either preemption or qsig. If the scheduler sets it via preemption and the server sets it via qsig, there could be inconsistencies.
  5. Interface 3: I think the resource type is resource_list, not resource.
  6. For interface 3 you should probably define consumable. I think it means a resource with the n or q flag.
  7. Interfaces 4 and 5 are debug messages. I wouldn’t make them public stable.
  8. How does this feature interact with the admin suspend feature? It probably doesn’t. The reason for an admin-suspended job is to mark the node in maintenance. It doesn’t really matter which resources are actually released on the server side.
  9. How does this feature interact with limits? What happens if I have a limit on memory and only release ncpus? Will I still hit the memory limit when a job is preempted? I suspect so.

Those are my comments. I hope they help.
Bhroam

@vccardenas Thanks for reviewing the document. I’ve made some changes to the document based on the comments I received.

I’ve reworded it

I guess dynamic consumable resources aren’t something that one can make available on server/queue/node objects. In that sense, even if these resources are added to the res_released_on_susp list, it wouldn’t make any difference because nothing will get released. This is probably applicable even today (without this change).

[quote=“vccardenas, post:2, topic:504”]
3. Are there restrictions on what resource operations (e.g. setting, unsetting, altering) could be done while a job that is using the resource gets suspended?
[/quote] Well, the resource can’t really be altered if there is a job busy on it. It works the same way as it does today.

[quote=“vccardenas, post:2, topic:504”]
4. Please add a definition of terms. You mentioned at some point that suspension could be by qsig (-s suspend?) or preemption. And are all preemption methods (checkpointing, suspension, or requeueing) affected?
[/quote] The change is only about when a job is suspended. Suspension can only happen via a qsig command or using preemption.

I don’t know what benefit we would get out of making two lists. And then there will be cases where some resources don’t exist in either list, and what do we do then?
I thought about creating only a list of resources that the job would retain when it is suspended, but I decided against it while writing this design proposal because in that case admins would have to list all the consumable resources that they don’t want to release. Some sites have hundreds of custom resources, and changing the list for every new resource you add isn’t very user friendly in my opinion.
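
To make the trade-off between a release list and a retain list concrete, here is a minimal Python sketch. It is not PBS code: the function names and the custom resource `scratch_gb` are hypothetical and exist only for illustration. It assumes both list styles are applied to the same set of consumable resources.

```python
from typing import Iterable, List


def released_with_release_list(consumables: Iterable[str],
                               release_list: Iterable[str]) -> List[str]:
    """Release only the consumables the admin explicitly listed."""
    return sorted(set(consumables) & set(release_list))


def released_with_retain_list(consumables: Iterable[str],
                              retain_list: Iterable[str]) -> List[str]:
    """Release every consumable the admin did not list as retained."""
    return sorted(set(consumables) - set(retain_list))


# Both styles agree while the site only defines ncpus and mem.
site = ["ncpus", "mem"]
before_release = released_with_release_list(site, ["ncpus"])
before_retain = released_with_retain_list(site, ["mem"])

# After a new custom consumable appears, only the retain-list result changes,
# unless the admin also edits the retain list.
site.append("scratch_gb")  # hypothetical custom resource
after_release = released_with_release_list(site, ["ncpus"])
after_retain = released_with_retain_list(site, ["mem"])
```

With a release list the outcome stays the same when new custom resources are created; with a retain list every new consumable is released until the admin adds it to the list.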

@bhroam Thanks for reviewing the document.
I’ve made the suggested changes.

I’ve reworded it.

I’ve changed it to be part of server attributes now.

Done!

Done!

Done!

Well, a consumable resource is clearly defined in our guides. I didn’t want to restate what makes a resource consumable. I hope that is okay.

Interface 5 is deleted, as resources_released is now set by the server. I marked interface 4 as experimental.

You are right, it shouldn’t matter because it isn’t checking who is suspending the job. Every suspend will have the same behavior.

In one of the points in interface 2 I mention that it will also release resources on server/queue objects. In that case the resources that are not released would still affect the limit checking.
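
A small Python sketch of that limit interaction, under stated assumptions: usage counted against a limit is taken to be what each job requested minus what it released on suspension. The dictionary layout, the helper names, and the numbers (mem in GB) are hypothetical and are not taken from PBS.

```python
from typing import Dict, List

Resources = Dict[str, int]


def counted_usage(jobs: List[dict], resource: str) -> int:
    """Sum what each job still holds: requested minus released on suspend."""
    total = 0
    for job in jobs:
        requested = job["requested"].get(resource, 0)
        released = (job.get("released") or {}).get(resource, 0)
        total += requested - released
    return total


def fits_limit(jobs: List[dict], new_request: Resources,
               resource: str, limit: int) -> bool:
    """True if the new request fits under the limit given current usage."""
    return counted_usage(jobs, resource) + new_request.get(resource, 0) <= limit


# A suspended job that released only ncpus and kept its memory.
suspended = {"requested": {"ncpus": 4, "mem": 12}, "released": {"ncpus": 4}}
incoming = {"ncpus": 4, "mem": 8}

mem_ok = fits_limit([suspended], incoming, "mem", limit=16)
ncpus_ok = fits_limit([suspended], incoming, "ncpus", limit=4)
```

In this example the incoming job fits the ncpus limit because the suspended job released its ncpus, but it still exceeds the mem limit because the suspended job's memory continues to count.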

@arungrover Neat feature. I have a few comments:
I believe that it is possible for “res_released_on_susp” to change even while a job is suspended. And that’s probably why you have “resources_released”, so PBS can know what to give back to the job when the job is ready to be resumed. I think it will help to clarify interface 2, sub-bullet 4 by adding the bold text: “This job attribute is populated at the time of job suspension only if “res_released_on_susp” server attribute…” Otherwise it seems like this job attribute can appear and disappear as res_released_on_susp changes even while a job is suspended. And that would be weird (and wrong).
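
The snapshot-at-suspension idea can be sketched in Python as follows. This is only an illustration, not the PBS implementation: the `Job` class and `snapshot_released` helper are hypothetical, and treating an unset list (`None`) as "release everything" is an assumption based on the current behavior described in this thread. Whether the job attribute is shown at all in the unset case is not modeled here.

```python
from typing import Dict, Iterable, Optional


def snapshot_released(requested: Dict[str, int],
                      res_released_on_susp: Optional[Iterable[str]]) -> Dict[str, int]:
    """Compute what a job releases, using the server list as it is right now."""
    if res_released_on_susp is None:
        # Assumption: an unset list means every requested resource is released.
        return dict(requested)
    listed = set(res_released_on_susp)
    return {name: amount for name, amount in requested.items() if name in listed}


class Job:
    def __init__(self, requested: Dict[str, int]) -> None:
        self.requested = dict(requested)
        self.resources_released: Optional[Dict[str, int]] = None

    def suspend(self, res_released_on_susp: Optional[Iterable[str]]) -> None:
        # Populated once, at suspension time; later server changes are ignored.
        self.resources_released = snapshot_released(self.requested,
                                                    res_released_on_susp)

    def resume(self) -> Dict[str, int]:
        # Give back exactly what was recorded at suspension time.
        given_back = self.resources_released or {}
        self.resources_released = None
        return given_back


server_list = ["ncpus"]
job = Job({"ncpus": 4, "mem": 8})
job.suspend(server_list)
server_list.append("mem")  # the admin edits the list while the job is suspended
snapshot_after_change = dict(job.resources_released)
given_back = job.resume()
```

Because the job records its own copy, the resources returned on resume match what was actually released, regardless of later edits to the server list.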

Are interfaces #2 and #3 really for “Public” consumption? I know everyone will be able to see them, but do we want people to rely on them? If not, Private may be a better option.

For interface #4, please give the log event level (e.g. PBSEVENT_DEBUG2) so folks know what to include in their server log_events setting in order to see the log message.

@lisa-altair Thanks for your review comments. I’ve made suggested changes to the document.

I think interface 3 can be marked as private since it is mostly consumed internally while releasing resources, but interface 2 is probably for public consumption. Since res_released_on_susp could change at any time, “resources_released” is the only readable format from which users can identify which resources were released on each of the nodes upon suspension.
I’ve marked interface 3 as private for now.

Thanks for making the changes. Looks good to me.

It originally seemed to me when configuring preemption it would be easier to list out what to retain (e.g. “mem”) vs listing out everything that should be released. It seems more intuitive for backward compatibility as well, if nothing is specified to retain then everything is released, just like today.

On the other hand, given that SIGSTOP only really causes ncpus to be “released” by the kernel, I’ve been told many customers just expect that to be PBS’s behavior (plus nppus) as well, so I guess making that the default behavior and allowing an admin to specify anything else they’d like to release would be OK.

Thanks @smgoosen for reviewing the document!

My opinion was that there won’t be many things that we would want to release, and I wanted to spare admins the pain of modifying the list of resources to retain on the job every time a new consumable custom resource is created.
Reading your comment, I’m guessing you are probably ok with the “res_released_on_susp” approach as specified in the document, right?

@arungrover, in interface 2 “resources_released” you mention that “This attribute is set by server whenever it preempts a job using suspension.” So what happens for a “qsig -s suspend”? Does this job attribute not show up?

@vccardenas Even with the “qsig -s suspend” command, the “resources_released” attribute will be populated. If it is confusing, I can reword the line.

@arungrover, yes that would clarify things.

@vccardenas I’ve modified the document based on your comments. Please have a look.
@lisa-altair Thanks for your sign-off! The document has undergone a slight change (interface 2, last bullet). If possible, please have a look again.

Thanks!

@arungrover, it still looks good to me.

@arungrover, thanks for the clarification. The design looks good to me.

You are correct, I am ok with the “res_released_on_susp” approach as specified in the document

One additional thought is that rather than relying on “unset” there should be a keyword that means “everything”, and we should ship with that as the default so that out of the box customers unambiguously get the same behavior as they have previously.

That’s an interesting thought. I imagine it will be a little problematic for an attribute that is supposed to store names of resources to also store a keyword.
If we go that route, we will also have to make sure that no resource named “everything” (or whatever keyword we decide on) is allowed. And what would the behavior be if “res_released_on_susp” is unset or cleared?

Using other attributes as examples, if unset they default to the “default”, which would be “everything”, that is, it acts as if the keyword were still there.

@smgoosen it sounds like you are suggesting to have PBS SET res_released_on_susp to “everything” when someone unsets it (qmgr -c “unset server res_released_on_susp”) or when the admin doesn’t set it. That seems weird to me, and it doesn’t match the behavior of other server attributes when they are not set. For example, by setting res_released_on_susp to “everything” it will cause res_released_on_susp to appear in the output of qmgr -c “p s”.

It would also make the design of the job attributes more intricate and confusing, since we’d either have to special-case the code to not print resources_released and resource_released_list if res_released_on_susp is set to “everything”, or, if we print them, explain why the word “everything” doesn’t show up and why we’d instead have duplicate values of existing job attributes (e.g. the value of resources_released would be the same as the value of exec_vnode).

PBS often has a default behavior, without actually setting the attribute. res_released_on_susp should match the existing behavior of other server attributes.
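
As a rough illustration of a default that applies without the attribute being stored, here is a Python sketch. The `Server` class and its methods are hypothetical stand-ins and are not PBS internals. It assumes that only explicitly set attributes are printed, which is the behavior implied above for qmgr -c “p s”.

```python
from typing import Dict, Iterable, List

ATTR = "res_released_on_susp"


class Server:
    def __init__(self) -> None:
        # Only attributes an admin has explicitly set are stored here.
        self._explicit: Dict[str, List[str]] = {}

    def set(self, name: str, value: Iterable[str]) -> None:
        self._explicit[name] = list(value)

    def unset(self, name: str) -> None:
        self._explicit.pop(name, None)

    def printed_attributes(self) -> Dict[str, List[str]]:
        # Stand-in for a "print server" listing: explicit values only.
        return dict(self._explicit)

    def is_released_on_suspend(self, resource: str) -> bool:
        value = self._explicit.get(ATTR)
        if value is None:
            # Default behavior with nothing stored: release everything.
            return True
        return resource in value


server = Server()
default_releases_mem = server.is_released_on_suspend("mem")
printed_when_unset = server.printed_attributes()

server.set(ATTR, ["ncpus"])
restricted_releases_mem = server.is_released_on_suspend("mem")

server.unset(ATTR)
printed_after_unset = server.printed_attributes()
```

The default lives in the lookup logic rather than in a stored keyword, so no reserved resource name is needed and nothing extra appears in the printed configuration.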