PP-928: Reliable Job Startup

This note is to inform the community of work being done to introduce the reliable job startup feature.

Reliable Job Startup Design

Please feel free to provide feedback on this topic.

Thanks for posting this. It looks like a nice feature. Comments below.

Interface 1 -> Log/Error

  • 1: I would suggest changing it to “Only PBS managers and operators are allowed to set rjselect”
  • 6: Does this cause PBS to retry the TM request or will the user/application have to handle this?

For the queuejob hook example I would recommend that we come up with one that gets the initial select and then builds the rjselect statement from it.
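
A minimal sketch of what such a queuejob hook might look like (purely illustrative; it assumes “rjselect” has already been created as a resource the hook may set, and it uses a naive “pad every chunk after the first by one” policy that a site would replace with its own):

    # Illustrative queuejob hook sketch: derive rjselect from the submitted select.
    # Assumes a resource named "rjselect" exists; the padding policy is arbitrary.
    import pbs

    e = pbs.event()
    j = e.job

    sel = j.Resource_List["select"]
    if sel is None:
        e.accept()    # nothing to do if the job did not specify a select

    padded = []
    for i, chunk in enumerate(str(sel).split("+")):
        count, sep, rest = chunk.partition(":")
        try:
            n = int(count)
        except ValueError:
            n, sep, rest = 1, ":", chunk    # no leading count means one chunk
        if i > 0:
            n += 1                          # pad every chunk except the first
        padded.append("%d%s%s" % (n, sep, rest))

    j.Resource_List["rjselect"] = "+".join(padded)
    e.accept()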

Interface 3 -> Log/Error

  • 1: Can we also provide the node name(s) that did not join in the logs?

Ok, will update.

The user/application will have to handle this.

Good idea. I’ll come up with one.

Interface 3 -> Log/Error

That’s a fine thing to add. Will do.

Re: PP-928 v.6

“Reliable Job Start” is a great goal – thanks for working on this enhancement! In order to add a bit more context, my understanding is that the main use case is focused on:

  • Big (wide) jobs often have significant queue waiting times, as other jobs must finish and relinquish their resources to make room for a “big” job. (Generally, this is the result of common large-site policies and how they trade-off the goals of minimizing waiting time and maximizing utilization. This is generally the right approach, and works well in most cases.)
  • When PBS Pro detects a “bad” node, any job assigned to that “bad” node is terminated, and is (generally) re-queued. (Generally, as this is again a policy decision left to the user and site).
  • In the unfortunate situation a node is detected as “bad” during the startup of a “big” job, this will result in the “big” job being re-queued. If there aren’t sufficient additional nodes available, the “big” job will wait (again) a significant time for another chance to start.
  • This enhancement is proposing to eliminate the additional waiting time (caused by the “big” job being re-queued and not having enough additional nodes).

(Please let me know if I got this wrong or missed additional nuances.)

A big suggestion…

Since a lot of effort is left to the SysAdmin, and it seems the core issue is that PBS Pro kills a job when it detects a “bad” node, how about adding a new feature to PBS Pro that allows a job to continue running, even if a node is detected as “bad”?

Allowing a job to continue running (despite detecting “bad” nodes) is a generally useful feature, not just for this enhancement, and has been requested, e.g., to support fault-tolerant MPI. With this capability (plus some recovery bits), a SysAdmin could implement the use case for Reliable Job Startup, e.g.

  • A queuejob hook copies the user-submitted “select” into a new custom resource “requested_select”, then adds chunks to the “select” itself (a rough sketch of this part follows the list)
  • An execjob_launch or execjob_prologue hook gets final information on any newly detected “bad” nodes (perhaps using the proposed Interface 5), and then uses the new Node Ramp Down feature (recently added to PBS Pro) to “free” all the “bad” nodes plus any additional unwanted nodes. (The hook would also need to update the PBS_NODEFILE. Perhaps PBS Pro could supply a hook routine that generates a PBS_NODEFILE from a select to make that easy too.)
  • Heck, one could even keep extra nodes around until the job has been running for 10 minutes, then free the extras – that would also handle the case when something in the application startup itself causes a “bad” node to be detected.
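
A rough, hypothetical sketch of the queuejob half of that alternative (it assumes a custom string resource “requested_select” has been defined, and uses a deliberately naive padding policy; the node-release half would live in an execjob_prologue or execjob_launch hook using whatever release interface is ultimately available):

    # Hypothetical queuejob hook for the alternative approach: remember the
    # user's original select, then pad the real select with spare capacity.
    # Assumes a custom string resource "requested_select" has been defined.
    import pbs

    e = pbs.event()
    j = e.job

    orig = j.Resource_List["select"]
    if orig is None:
        e.accept()

    # keep the original request so a later hook can prune back to it
    j.Resource_List["requested_select"] = str(orig)

    # naive illustration: duplicate the last chunk specification as spare capacity
    chunks = str(orig).split("+")
    j.Resource_List["select"] = pbs.select("+".join(chunks + [chunks[-1]]))
    e.accept()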

I guess I’m struggling with the current design, as it forces the SysAdmin to do a lot of the bookkeeping work (something that, in principle, PBS Pro is really good at doing), and it doesn’t address the “whole” problem (e.g., what if Mother Superior is “bad”, and what if a “bad” node is detected in the first 30 seconds of the job). I’m assuming this approach is being proposed to create something useful as a first step, while minimizing development effort. I feel like it should be possible to find a better way to test this direction without adding baggage to the overall PBS Pro design…

If the above is not compelling, then …

  • Suggest renaming “rjselect”. Generally, common practice for PBS Pro has been to avoid abbreviations whenever possible. Alternative suggestions: “select_with_padding”, “select_startup_pad”, “select_reliable_job_startup”, …

  • Suggest adding more details on which hooks run before and after the select is updated. Is all the reliable job startup stuff run before any exec job hooks? Which ones run before and which ones run after?

  • Suggest making the new “s” record in addition to the existing “S” (though I’m not positive about this). The idea is to ensure as much backward compatibility with existing accounting tools as possible. In any case, please explicitly define whether it is in addition or not.

Thanks again!

Hi Bill,
Yes, your understanding of the feature’s motivation is correct.

It’s an interesting thought. The only question is how the user’s application will handle such a situation. The job assumes all of its nodes are good, so when some of the nodes start failing, the user’s application might also start failing, causing the job to be aborted.

If we use the pbs_release_nodes functionality, then that in itself takes care of re-generating the PBS_NODEFILE.

The current design will allow jobs to continue running even when bad sister nodes are detected (as determined by a begin, prologue, or launch hook executing on a sister mom), as long as the remaining nodes continue to satisfy the job’s original request. PBS will automatically shrink the job’s assigned node resources to match the minimum list (i.e. the user’s original request). These actions happen after waiting for the join job requests to come back, up to a maximum wait time (mom config join_job_alarm), or after waiting for the execjob_prologue and execjob_launch hooks to execute, also up to a time limit (mom config job_launch_delay).

What I sense from your suggestion is for PBS to not automatically shrink the job’s assigned node resources, but to let an admin take care of it via a prologue or launch hook that calls pbs_release_nodes on some subset of failed nodes and also on the extra nodes. So it’s actually the sysadmin doing the bookkeeping in this case, whereas in the current design PBS tries to do all of that internally, keeping things consistent with the user’s original request. Yes, the disadvantage is that the current design only allows sister nodes to fail; a failing MS node will requeue the job.

The current design goes the other way: it preserves the original select request and makes the job always see things consistent with the original request, so as not to surprise the executing application.

Hi – thanks for the reply. I feel we are not yet having the same conversation…

Regarding: How about adding a new feature to PBS Pro that allows a job to continue running, even if a node is detected as “bad”?

There are already user applications that can handle node failures and happily continue executing (without aborting), so this feature would be useful immediately. It’s also needed to support Exascale jobs (with millions of cores) where node failures during execution will be common (as opposed to today, where node failures are exceptions). PBS Pro will need this feature; if not today, then soon.

Two things:

  • Both approaches require the SysAdmin to do a lot of bookkeeping. The current design requires the SysAdmin to figure out how to add to the rjselect; the alternative suggestion requires the SysAdmin to figure out which nodes to free.
  • The current design misses a huge opportunity to support additional use cases (beyond job launch) with very little change in effort (by either the SysAdmin or the Developer). As long as PBS Pro is forcing the SysAdmin to do a lot of work, why not get more value from this enhancement by making it more general?

Finally, don’t forget about the other parts:

Thanks again (again)!

I’ll talk to you some time today on this.

Ok, good suggestions on the alternatives.

Will do.

It’s a new accounting record. ‘S’ stays as is. Will make this explicit.

Bill suggested that most of this functionality could be implemented in hooks. This wouldn’t be the first time we’d implement core functionality this way; we’re implementing cgroups like this. If we could implement this feature with hooks, it would add greater flexibility for sites in how they want to add and release nodes from jobs. That is really nice.

What would be the limitations on the feature if it was implemented in hooks? I see there are several timeouts when MS is waiting for sisters to reply. Is that sort of flexibility available in a hook based solution?

Would any new hook events need to be implemented?

Other comment: I wouldn’t make the assumption that the first chunk is just a mother superior chunk. There is nothing wrong with qsub -l select=4:ncpus=16. Or even: qsub -l select=64:ncpus=1. Just because the MS will be allocated from the first chunk, doesn’t mean there aren’t others.

Also, should this feature encompass making the MS reliable as well? If not, it’s still a single point of failure. It’s true that in a 4000-node job, it’s just one node out of 4000, but the possibility of failure is still there. I understand it goes against PBS’s mom-server architecture, but it is something to think about.

Bhroam

Not everything in this feature can be implemented in hooks. The part where mom has to update its internal tables when nodes are released, taking away failed nodes and replacing them with good ones, still needs to be done internally in mom.

Not at this time, but we will move the decision of pruning a job’s assigned nodes into the prologue hooks (or even launch hooks) via a new interface, rather than mom automatically making that decision internally. This adds flexibility.

Ok.

This feature is currently targeting only failure of sister nodes. If mother superior mom goes bad, then the job will still get requeued. Perhaps it can be extended later to include handling MS failure.

I’ve just revamped the Reliable Job Startup design based on comments received on the first version. It’s now v1.7:

Reliable Job Startup Design v1.7

I made the following updates to the EDD:

  • Even in the presence of the ‘select_reliable_startup’ resource, the original ‘select’ and ‘Resource_List’ values are retained, reflecting what the user originally requested.
  • ‘select_requested’ is actually a new job attribute and not a new resource.
  • Noted that pbs.event().job.release_nodes() is callable only in execjob_prologue and execjob_launch hooks, catching the job when it has just started running, and that an ‘s’ accounting record is generated. Also, it makes sense to call it inside an ‘if pbs.event().job.in_ms_mom()’ clause (see the sketch after this list).
  • pbs.event().job.release_nodes() actually returns the modified pbs.job object.
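
Putting those points together, a minimal execjob_launch hook might look roughly like this (a sketch only; the failure-handling convention, e.g. release_nodes() returning None on failure, is an assumption rather than the documented API):

    # Sketch of an execjob_launch hook using release_nodes() as described above.
    import pbs

    e = pbs.event()
    j = e.job

    if j.in_ms_mom():
        # prune the job's assigned nodes back toward the original request;
        # per the EDD, release_nodes() returns the modified pbs.job object
        pruned = j.release_nodes()
        if pruned is None:
            # assumed failure convention (not in the EDD): reject so the job
            # is not launched on a bad set of nodes
            e.reject("could not prune job %s to a good set of nodes" % (j.id,))

    e.accept()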

Here’s v1.8 of the Reliable Job Startup design:

Reliable Job Startup Design v1.8

I’ve made more updates to the EDD to simplify things further. These include:

  • Dropped the ‘select_reliable_startup’ resource and introduced a job attribute ‘tolerate_node_failures’ which, when set to true, will allow jobs to run even if assigned nodes have failed (a small illustration follows this list).
  • Dropped the new ‘select_requested’ resource as it is not necessary.
  • So that the primary mom does not have to wait the full ‘job_launch_delay’ time before executing the execjob_launch hook, sister moms are now allowed to send an acknowledgement to the primary mom once they have executed their respective prologue hooks. This tells the primary mom when all prologue hook executions are done.
  • Improved the logged messages for when the pbs.event().job.release_nodes() call fails to prune the job.
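
As a quick illustration of how a site might opt jobs into this (a hypothetical policy, and it assumes the new boolean attribute is settable from a queuejob hook):

    # Hypothetical queuejob hook: opt wide jobs into tolerating node failures.
    # Assumes the new boolean job attribute 'tolerate_node_failures' from v1.8.
    import pbs

    e = pbs.event()
    j = e.job

    sel = j.Resource_List["select"]
    # arbitrary example policy: only jobs requesting four or more chunk specs opt in
    if sel is not None and len(str(sel).split("+")) >= 4:
        j.tolerate_node_failures = True

    e.accept()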

The updated design is in:
Reliable Job Startup Design v11

Hey,
Thanks for the updates. The doc looks pretty good. I mostly had issues with a log message here or there.

  1. In interface 1’s examples, the qalter example says <job script>. I think you mean <jobid>.
  2. You list 3 log messages under interface 1. Interface 1 is marked as stable. This will make the log messages stable as well. Two of the log messages print under DEBUG3. DEBUG3 is usually saved for deep debug messages. If that’s the case, they should be unstable.
  3. I disagree we need to create a new accounting log record for calling pbs_release_nodes(). The pbs_release_nodes RFE has a perfectly fine accounting log record (‘c’ I think) for this.
  4. The log messages listed in interface 3 have two #1’s.
  5. I dislike interface 3 log message 1-2 (the second #1). I don’t think it’s required. It’s basically going to be printed for every job.
  6. Interface 4 also has two #1’s.
  7. I similarly dislike interface 4 second #1 log message for the same reason.
  8. Maybe extend interface 4 #2’s log message a little to say what is about to happen. It says that not all of the prologue hooks completed, but not what will happen. The job gets put into execution, right?
  9. In interface 5’s example, you don’t need to say .keys(). If you put the dictionary in the for loop directly, it will iterate over the keys.
  10. Interface 5 log message #1. Is this a new message? Offlining nodes isn’t really a new thing.
  11. I have a thought about interface 7. You give the ability to increase all but the first chunk, or all chunks. What if a site wants to modify the first chunk differently than the rest? I suspect they’ll still be able to do this by parsing out the select spec and making the changes manually.
  12. Interface 8 second log message bullet: “There’s” should be “There’ll”. Of course I dislike this message for the same reason I disliked the previous ones.
  13. Interface 8’s log messages include one that says not all updates to sister moms completed. You then say it’ll continue in the background. I find this confusing. You say that they didn’t complete, but they then go on and complete in the background.
  14. Interface 8 log bullet 3: You list the exec_vnode. Those can be HUGE. Are you sure you want to log it?
  15. Interface 8 log bullet 4: Can you make this message a little more explanatory? What updates? Also there is a bunch of space between the explanation and the log message itself. Is this intentional?
  16. Interface 8 log bullet 5: I find it strange that you get error messages after a successful release nodes call.
  17. In the big example at the bottom:
  • You say you need to have the job tolerate node failures. Why does a job need to be tolerant of node failures? What if the job isn’t tolerant of node failures? I’m sure some MPIs will crap out if they lose nodes.
  • You save ‘site’ into ‘site’. I think you want to save ‘select’ into ‘site’

Bhroam

Releasing nodes for the purpose of reliably starting a job needs to be logged with a new accounting record, hence the new ‘s’ record. We don’t want the ‘c’ record used for both purposes.

It would be nice if interface 7 could handle the use cases I’ve described in my presentations about reliable job startup. At the moment that means changing interface 7 in two ways:

  1. Alter or replace first_chunk with something that does what Bhroam describes. Illustrative examples, assuming the hook calls pbs.event().job.Resource_List[“select”].increment_chunks(2):
  • old_select = 1:ncpus=8+2:ncpus=4
    new_select = 1:ncpus=8+4:ncpus=4
  • old_select = 10:ncpus=8+3:ncpus=4
    new_select = 12:ncpus=8+5:ncpus=4
  2. Support percent-increase in addition to a strict number-of-chunks increase. Need to decide about rounding up/down/etc.

The effect of job attribute tolerate_node_failures is too broad. We will want PBS to distinguish between tolerating job startup failures, and tolerating failures once a job is already running.

-Greg

This is what happens right now when increment_chunks() is called, as it is equivalent to doing:

    pbs.event().job.Resource_List["select"].increment_chunks(2, first_chunk=False)

where ‘first_chunk=False’ is the default, which says don’t modify the first chunk but modify the rest.

And this can be done by doing:

    pbs.event().job.Resource_List["select"].increment_chunks(2, first_chunk=True)

which means to also modify the first chunk in addition to the other chunks.

This could be done. Please take your pick: round up or round down.

I can just introduce a new job attribute: ‘tolerate_node_failures_at_startup’ to only tolerate node failures during job startup, leaving ‘tolerate_node_failures’ as is. The latter will take precedence if both are specified.

I’d like increment_chunks() to have the smarts internally to handle these, so the hook doesn’t have to worry about whether the first chunk has a count of 1 or 2+.

Let’s go with round up.

I don’t follow the idea of precedence here. Once PBS distinguishes between the two options, a job should be able to pick neither, both, or just one.

True, it should be obvious to PBS.

Changed <job script> to <jobid>

I’ve removed the unstable interface log messages as under Public visibility,
we no longer document such messages.

Fixed in next version.

Log message #1 is actually shown automatically upon mom startup or kill -HUP, like any other config file value. It would not get printed for every job. I’ve updated the EDD to mention this fact.
I’ve taken out log message #2 from the EDD.

Fixed in next version.

I’ve taken it out.

I’ve modified the log message #2 to read:

“not all prologue hooks to sister moms completed, but job will proceed to execute”

Yes, you’re right. *.keys() is unnecessary.

It’s not a new thing. It’s not necessary to have this info.

In the next EDD update, I have expanded interface 7 to allow individual chunks to be incremented, and they can now be increased either by a number or by a percentage, as suggested by Greg.
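
For illustration only (the exact calling forms are whatever the updated EDD specifies; the percent string and per-chunk dictionary shown here are assumptions based on this thread):

    # Illustration of the expanded increment_chunks() described above; the exact
    # syntax and return behavior are per the EDD and may differ from this sketch.
    import pbs

    sel = pbs.event().job.Resource_List["select"]

    # increase chunks by a fixed number
    new_sel = sel.increment_chunks(2)

    # increase chunks by a percentage instead (rounding up, as agreed)
    new_sel = sel.increment_chunks("20%")

    # per-chunk control: leave chunk 0 alone, add 2 chunks to chunk 1, 30% to chunk 2
    new_sel = sel.increment_chunks({0: 0, 1: 2, 2: "30%"})

    # assuming increment_chunks() returns a new select value rather than modifying
    # the resource in place
    pbs.event().job.Resource_List["select"] = new_sel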

Ok, I’ve removed the message from the EDD.

I’ve improved the verbiage.

Implementation-wise, it would only print up to the first LOG_BUF_SIZE (4096) bytes of the value, as it’s only outputting the log_buffer content.

I’ll add more explanation. The extra space was not intentional. Fixed in next version.

Yeah, it’s because there could have been an outstanding request on the stream when the release_nodes() happened.

I’m thinking of it as being up to the site or user. Bill mentioned that modern MPIs out there
can actually be tolerant of this.

Fixed in next version.

Another followup on interface 7: if my original select is

1:ncpus=5

then increment_chunks(2) would give me

3:ncpus=5

but do those additional 2 chunks provide any value? Isn’t it the case that the mother superior will only land on the very first chunk, so any additional chunks cannot be used (because the mother superior isn’t reassignable to another chunk if the very first chunk ends up having a problem)?

-Greg

That is correct, Greg. Adding chunks to the first one in the spec list would not add much value, given that the first chunk is the one that will always land on the MS node. That was the reason I had a ‘first_chunk=False’ argument to increment_chunks() in the first place, and made it the default. However, with the latest update, I’ve taken out the first_chunk parameter. The hook writer would need to write it out:

    increment_chunks({0: 0, 1: 2, 2: 2, …, : 2})

so that the first chunk gets 0 added, while the rest each get 2 additional chunks.