"Reliable Job Start" is a great goal; thanks for working on this enhancement! To add a bit more context, my understanding is that the main use case is focused on:
Big (wide) jobs often have significant queue waiting times, as other jobs must finish and relinquish their resources to make room for a "big" job. (Generally, this is the result of common large-site policies and how they trade off the goals of minimizing waiting time and maximizing utilization. This is usually the right approach, and works well in most cases.)
When PBS Pro detects a "bad" node, any job assigned to that "bad" node is terminated, and is (generally) re-queued. (Again, this is a policy decision left to the user and site.)
In the unfortunate situation where a node is detected as "bad" during the startup of a "big" job, the "big" job will be re-queued. If there aren't sufficient additional nodes available, the "big" job will wait a significant time (again) for another chance to start.
This enhancement proposes to eliminate the additional waiting time caused by the "big" job being re-queued without enough additional nodes available.
(Please let me know if I got this wrong or missed additional nuances.)
A big suggestion…
Since a lot of effort is left to the SysAdmin, and it seems the core issue is that PBS Pro kills a job when it detects a "bad" node, how about adding a new feature to PBS Pro that allows a job to continue running, even if a node is detected as "bad"?
Allowing a job to continue running (despite detecting "bad" nodes) is a generally useful feature, not just for this enhancement, and has been requested, e.g., to support fault-tolerant MPI. With this capability (plus some recovery bits), a SysAdmin could implement the Reliable Job Startup use case, e.g.:
A queuejob hook copies the user-submitted "select" into a new custom resource "requested_select", then adds chunks to the "select" itself.
An execjob_launch or execjob_prolog hook gets final information on any newly detected "bad" nodes (perhaps using the proposed Interface 5), then uses the new Node Ramp Down feature (recently added to PBS Pro) to "free" all the "bad" nodes plus any additional unwanted nodes; see the sketch below. (The hook would also need to update the PBS_NODEFILE. Perhaps PBS Pro could supply a hook routine that generates a PBS_NODEFILE from a select to make that easy too.)
Heck, one could even keep extra nodes around until the job has been running for 10 minutes, then free the extras; that would also handle the case when something in the application startup itself causes a "bad" node to be detected.
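A minimal sketch of the queuejob half, assuming the admin has already created a custom string resource named "requested_select" (the pad-by-duplicating-the-last-chunk policy is purely illustrative):

    import pbs

    e = pbs.event()
    j = e.job
    if j.Resource_List["select"] is not None:
        # preserve what the user actually requested
        j.Resource_List["requested_select"] = str(j.Resource_List["select"])
        # crude illustrative padding: append a copy of the last chunk spec
        chunks = str(j.Resource_List["select"]).split("+")
        chunks.append(chunks[-1])
        j.Resource_List["select"] = pbs.select("+".join(chunks))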
I guess I'm struggling with the current design, as it forces the SysAdmin to do a lot of the bookkeeping work (something that, in principle, PBS Pro is really good at doing), and it doesn't address the "whole" problem (e.g., what if Mother Superior is "bad", and what if a "bad" node is detected in the first 30 seconds of the job). I'm assuming this approach is being proposed to create something useful as a first step, while minimizing development effort. I feel like it should be possible to find a better way to test this direction without adding baggage to the overall PBS Pro design…
If the above is not compelling, then…
Suggest renaming "rjselect". Generally, common practice for PBS Pro has been to avoid abbreviations whenever possible. Alternative suggestions: "select_with_padding", "select_startup_pad", "select_reliable_job_startup", …
Suggest adding more details on which hooks run before and after the select is updated. Does all the reliable-job-startup processing run before any execjob hooks? Which ones run before, and which ones run after?
Suggest making the new "s" record in addition to the existing "S" (though I'm not positive about this). The idea is to ensure as much backward compatibility with existing accounting tools as possible. In any case, please explicitly define whether it is in addition or not.
Hi Bill,
Yes, your understanding of the feature's motivation is correct.
It's an interesting thought. The only question is how the user's application will handle such a situation. The job believes all of its nodes are good, so when some of the nodes start failing, the user's application might also start failing, causing the job to be aborted.
If we use the pbs_release_nodes functionality, then that in itself takes care of re-generating the PBS_NODEFILE.
The current design will allow jobs to continue running even when bad sister nodes are detected (as determined by a begin, prologue, or launch hook executing on a sister mom), as long as the remaining nodes continue to satisfy the job's original request. PBS will automatically shrink the job's assigned node resources to match the minimum list (i.e., the user's original request). These actions happen after waiting for the join job request to come back, up to a maximum wait time (mom config join_job_alarm), or after waiting for the execjob_prologue and execjob_launch hooks to execute, also up to a time limit (mom config job_launch_delay). What I sense from your suggestion is for PBS to not automatically shrink the job's assigned node resources, but to let an admin take care of it via a prologue or launch hook that calls pbs_release_nodes on some subset of failed nodes plus the extra nodes. So it's actually the sysadmin doing the bookkeeping in that case, whereas in the current design PBS tries to do all of that internally, keeping things consistent with the user's original request. Yes, the disadvantage is that the current design only allows sister nodes to fail; a failing MS node will requeue the job.
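For reference, the two mom timeouts mentioned above would presumably be set in mom's config file; assuming they take the usual "$option value" form, something like:

    # mom_priv/config -- illustrative values only; these option names come
    # from this design discussion, not from an existing release
    $join_job_alarm 30
    $job_launch_delay 60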
The current design goes the other way: it preserves the original select request and lets the job always see things consistent with the original request, so as not to surprise the executing application.
Hi - thanks for the reply. I feel we are not yet having the same conversation…
Regarding: How about adding a new feature to PBS Pro that allows a job to continue running, even if a node is detected as "bad"?
There are already user applications that can handle node failures and happily continue executing (without aborting), so this feature would be useful immediately. It's also needed to support Exascale jobs (with millions of cores), where node failures during execution will be common (as opposed to today, where node failures are exceptions). PBS Pro will need this feature; if not today, then soon.
Two things:
Both approaches require the SysAdmin to do a lot of bookkeeping. The current design requires the SysAdmin to figure out what to add to the rjselect; the alternative suggestion requires the SysAdmin to figure out which nodes to free.
The current design misses a huge opportunity to support additional use cases (beyond job launch) with very little extra effort (by either the SysAdmin or the Developer). As long as PBS Pro is forcing the SysAdmin to do a lot of work, why not get more value from this enhancement by making it more general?
Bill suggested that most of this functionality could be implemented in hooks. This wouldn't be the first time we'd implement core functionality this way; we're implementing cgroups like this. If we could implement this feature in hooks, it would give sites greater flexibility in how they want to add and release nodes from jobs. That is really nice.
What would be the limitations on the feature if it were implemented in hooks? I see there are several timeouts where MS is waiting for sisters to reply. Is that sort of flexibility available in a hook-based solution?
Would any new hook events need to be implemented?
Other comment: I wouldn't make the assumption that the first chunk is just a mother superior chunk. There is nothing wrong with qsub -l select=4:ncpus=16, or even qsub -l select=64:ncpus=1. Just because the MS will be allocated from the first chunk doesn't mean there aren't others.
Also, should this feature encompass making MS reliable as well? If not, it's still a single point of failure. It's true that in a 4000-node job it's just one out of 4000, but the possibility of failure is still there. I understand it goes against PBS's mom-server architecture, but it is something to think about.
Not everything in this feature can be implemented in hooks. The part where mom has to update its internal tables when nodes are released, taking away failed nodes and replacing them with good ones, still needs to be done internally in mom.
Not at this time, but we will move the decision of pruning a job's assigned nodes into the prologue hooks (or even launch hooks) using a new interface, rather than mom automatically making that decision internally. This adds flexibility.
Ok.
This feature is currently targeting only failure of sister nodes. If mother superior mom goes bad, then the job will still get requeued. Perhaps it can be extended later to include handling MS failure.
Even in the presence of the "select_reliable_startup" resource, the original "select" and "Resource_List" values are retained, reflecting what the user originally requested.
"select_requested" is actually a new job attribute, not a new resource.
Noted that pbs.event().job.release_nodes() is callable only in execjob_prologue and execjob_launch hooks, catching the job when it has just started running, at which point the "s" accounting record is generated. Also, it makes sense to call it under an "if pbs.event().job.in_ms_mom()" clause.
pbs.event().job.release_nodes() actually returns the modified pbs.job object.
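A minimal sketch of that pattern (the keep_select parameter and the None-on-failure return are my assumptions; only the event types, in_ms_mom(), and release_nodes() come from the design discussion):

    import pbs

    e = pbs.event()   # execjob_prologue or execjob_launch
    j = e.job
    if j.in_ms_mom():
        # prune the job's assigned nodes back to the user's original request;
        # "requested_select" is a hypothetical custom resource saved at queuejob time
        rj = j.release_nodes(keep_select=j.Resource_List["requested_select"])
        if rj is None:
            # pruning failed; reject so the job is requeued
            e.reject("could not prune job back to its original select")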
I've made more updates to the EDD to simplify things even further. This includes:
Dropped the "select_reliable_startup" resource and introduced a job attribute "tolerate_node_failures" which, when set to true, allows jobs to run even if assigned nodes have failed (see the sketch after this list).
Dropped the new "select_requested" resource, as it is not necessary.
So that the primary mom does not always wait the full "job_launch_delay" time before executing the execjob_launch hook, sister moms now send an acknowledgement to the primary mom once they have executed their respective prologue hooks. This tells the primary mom when all prologue hook executions are done.
Improved the messages logged when the pbs.event().job.release_nodes() call fails to prune the job.
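As a quick illustration of the first item above, a queuejob hook could opt jobs in like this (a sketch; that the attribute is writable from a queuejob hook is my assumption):

    import pbs

    e = pbs.event()
    # opt the job in to tolerating node failures; a real site would
    # probably check the job's width first
    e.job.tolerate_node_failures = True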
Hey,
Thanks for the updates. The doc looks pretty good. I mostly had issues with a log message here or there.
In interface 1's examples, the qalter example says . I think you mean .
You list 3 log messages under interface 1. Interface 1 is marked as stable. This will make the log messages stable as well. Two of the log messages print under DEBUG3. DEBUG3 is usually reserved for deep debug messages. If that's the case, they should be unstable.
I disagree that we need to create a new accounting log record for calling pbs_release_nodes(). The pbs_release_nodes RFE has a perfectly fine accounting log record ("c", I think) for this.
The log messages listed in interface 3 have two #1's.
I dislike interface 3 log message 1-2 (the second #1). I don't think it's required. It's basically going to be printed for every job.
Interface 4 also has two #1's.
I similarly dislike interface 4's second #1 log message, for the same reason.
Maybe extend interface 4 #2's log message a little to say what is about to happen. It says that not all of the prologue hooks completed, but not what will happen. The job gets put into execution, right?
In interface 5's example, you don't need to say .keys(). If you put the dictionary in the for loop directly, it will iterate over the keys.
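That is, these loops are equivalent (vnode_list is just a stand-in name for whatever dict the example iterates):

    # the .keys() call is redundant; iterating a dict yields its keys
    for vn in pbs.event().vnode_list.keys():
        pbs.logmsg(pbs.LOG_DEBUG, "checking %s" % vn)
    for vn in pbs.event().vnode_list:
        pbs.logmsg(pbs.LOG_DEBUG, "checking %s" % vn)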
Interface 5 log message #1: is this a new message? Offlining nodes isn't really a new thing.
I have a thought about interface 7. You give the ability to increase all but the first chunk, or all chunks. What if a site wants to modify the first chunk differently than the rest? I suspect they'll still be able to do this by parsing out the select spec and making the changes manually.
Interface 8, second log message bullet: There's=There'll. Of course, I dislike this message for the same reason I disliked the previous ones.
Interface 8's log messages include one that says not all updates to sister moms completed. You then say it'll continue in the background. I find this confusing: you say they didn't complete, but they then go on and complete in the background.
Interface 8 log bullet 3: You list the exec_vnode. Those can be HUGE. Are you sure you want to log it?
Interface 8 log bullet 4: Can you make this message a little more explanatory? What updates? Also there is a bunch of space between the explanation and the log message itself. Is this intentional?
Interface 8 log bullet 5: I find it strange that you get error messages after a successful release nodes call.
In the big example at the bottom:
You say you need to have the job tolerate node failures. Why does a job need to be tolerant of node failures? What if the job isn't tolerant of node failures? I'm sure some MPIs will crap out if they lose nodes.
You save "site" into "site". I think you want to save "select" into "site".
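Presumably the intended line is something like this hypothetical fragment of the example's queuejob hook:

    # save the original select so a later hook can prune back to it
    e.job.Resource_List["site"] = str(e.job.Resource_List["select"])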
Releasing nodes for the purpose of reliably starting a job needs to be logged with a new accounting record, hence the new "s" record. We don't want the "c" record used for both purposes.
It would be nice if interface 7 could handle the use cases I've described in my presentations about reliable job startup. At the moment that means changing interface 7 in two ways:
Alter or replace first_chunk with something that does what Bhroam describes. Illustrative examples, assuming the hook calls pbs.event().job.Resource_List["select"].increment_chunks(2):
Support percent-increase in addition to a strict number-of-chunks increase. Need to decide about rounding up/down/etc.
The effect of job attribute tolerate_node_failures is too broad. We will want PBS to distinguish between tolerating job startup failures, and tolerating failures once a job is already running.
This is what happens right now when increment_chunks() is called, as it is equivalent to doing:
    pbs.event().job.Resource_List["select"].increment_chunks(2, first_chunk=False)
where first_chunk=False is the default, which says don't modify the first chunk but modify the rest.
And this can also be done explicitly:
    pbs.event().job.Resource_List["select"].increment_chunks(2, first_chunk=True)
which means to also modify the first chunk in addition to the other chunks.
This could be done. Please take your pick: round up or round down.
I can just introduce a new job attribute, "tolerate_node_failures_at_startup", to only tolerate node failures during job startup, leaving "tolerate_node_failures" as is. The latter will take precedence if both are specified.
I'd like increment_chunks() to have the smarts internally to handle these, so the hook doesn't have to worry about whether the first chunk has a count of 1 or 2+.
Let's go with round up.
I don't follow the idea of precedence here. Once PBS distinguishes between the two options, a job should be able to pick neither, both, or just one.
I've removed the unstable interface log messages; under Public visibility, we no longer document such messages.
Fixed in next version.
Log message #1 is actually shown automatically upon mom startup or kill -HUP, like any other config file value. It would not get printed for every job. I've updated the EDD to mention this fact.
I've taken out log message #2 from the EDD.
Fixed in next version.
I've taken it out.
I've modified the log message #2 as:
"not all prologue hooks to sister moms completed, but job will proceed to execute"
Yes, you're right. *.keys() is unnecessary.
It's not a new thing. It's not necessary to have this info.
In the next EDD update, I have expanded interface 7 to allow individual chunks to be incremented, and chunks can now be increased either by a number or by a percentage, as suggested by Greg.
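For illustration, with the round-up semantics chosen above, a percent increment might behave like this (a sketch; the string-percent form is my reading of the updated interface):

    # hypothetical: grow every chunk except the first by 50%, rounding up
    pbs.event().job.Resource_List["select"].increment_chunks("50%")
    # e.g. 1:ncpus=5+3:ncpus=2 -> 1:ncpus=5+5:ncpus=2  (ceil(3 * 1.5) = 5)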
Ok, I've removed the message from the EDD.
I've improved the verbiage.
Implementation-wise, it would only print up to the first LOG_BUF_SIZE (4096) bytes of the value, as it's only outputting the log_buffer content.
I'll add more explanation. The extra space was not intentional; fixed in the next version.
Yeah, it's because there could have been an outstanding request on the stream when the release_nodes() happened.
I'm thinking of it as being up to the site or user. Bill mentioned that modern MPIs out there can actually be tolerant of this.
Another followup on interface 7: if my original select is
1:ncpus=5
then increment_chunks(2) would give me
3:ncpus=5
but do those additional 2 chunks provide any value? Isn't it the case that the mother superior will only land on the very first chunk, so any additional chunks cannot be used (because mother superior isn't reassignable to another chunk if the very first chunk ends up having a problem)?
That is correct, Greg. Adding chunks to the first one in the spec list would not add much value, given that it's the first one that will always land on the MS node. That was the reason I had a first_chunk=False argument to increment_chunks() in the first place, and made it the default. However, with the latest update, I've taken out the first_chunk parameter. The hook writer would need to write it out:
    increment_chunks({0: 0, 1: 2, 2: 2, ...})
so that the first chunk gets 0 added, while each of the remaining chunks gets 2 additional chunks…
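Putting that together, a hook fragment might look like this (a sketch; it assumes increment_chunks() returns the new select value, and counts chunks by splitting the spec on "+"):

    import pbs

    e = pbs.event()
    sel = e.job.Resource_List["select"]

    # leave chunk 0 (where mother superior lands) alone; pad the rest by 2
    nchunks = len(str(sel).split("+"))
    pad = {0: 0}
    for i in range(1, nchunks):
        pad[i] = 2
    e.job.Resource_List["select"] = sel.increment_chunks(pad)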