Proposal of interface regarding hold/release of subjob(s) and job array

During development and testing, I realized that a qrls on a held subjob would still require another qrls on the job array before the entire array could recover. The two qrls calls are therefore redundant, so there is no need to let qrls release a held subjob directly. Instead, I will modify qrls so that a held subjob is released indirectly when its held parent is released. I have also added a new line to the second interface: the held job array now gets a comment with a convenient message that makes it easy to identify the subjob that breached the retry limit.
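To make the intended flow concrete, here is a rough sketch of the proposed behaviour (the job ID 123[], the server name, and the comment wording are hypothetical and only illustrative of what the EDD describes):

```sh
# A subjob breaches the retry limit; the server holds the parent array and
# adds a comment pointing at the offending subjob (wording is illustrative).
qstat -f "123[].server" | grep -E "job_state|comment"
#   job_state = H
#   comment = subjob 123[7].server breached the retry limit; array held

# Proposed interface: a single qrls on the parent releases the array and
# indirectly releases the held subjob; no separate qrls on 123[7] is needed.
qrls "123[].server"
```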

I have modified the EDD accordingly.

Thanks @Shrini-h. Sorry about the late-breaking questions.

Will the run_count values of individual X state subjobs which were not held be available in qstat -xft after the array as a whole has finished, or will they be faked from the parent?

It is not clear to me from your comment above at Proposal of interface regarding hold/release of subjob(s) and job array, nor from the EDD.

What about seeing the actual run_counts after the entire array finishes and the subjobs are listed in history in F state?

Also, can you please confirm that the accounting log E records for the subjobs will record the correct run_count?

The run_count value of each individual subjob in X|F state (whether it was previously held or not) will be available in qstat -xft, i.e., in the job_history_enable == True case.
However, in the job_history_enable == False case, the run_count attribute shown by qstat -ft is faked from the parent, because each subjob (including its individual attributes) is purged after completion (in X state).
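As a rough illustration (the job and subjob IDs are hypothetical; the behaviour is as described above):

```sh
# job_history_enable=True: finished subjobs are kept in history, so their real
# per-subjob run_count is still queryable after the whole array has finished.
qstat -xft "123[5].server" | grep run_count

# job_history_enable=False: the subjob and its attributes are purged when it
# completes, so qstat -ft can only report a run_count faked from the parent.
qstat -ft "123[5].server" | grep run_count
```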

Yes, the actual run_count values will be seen for subjobs in F state.

I confirm that the accounting log E records for all the subjobs will record the correct run_count. Furthermore, this is true for all values of job_history_enable: whether job history is enabled or not, the E records will carry the correct run_count value for each subjob.
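For example, something along these lines should show it (the accounting log path, date format, and array ID are hypothetical; this assumes run_count appears among the attributes printed in each subjob's E record, as confirmed above):

```sh
# Each subjob gets its own E (end) record in the day's accounting log, and the
# run_count recorded there is the subjob's own value, not the parent's.
grep ";E;123\[" /var/spool/pbs/server_priv/accounting/$(date +%Y%m%d) \
    | grep -o "run_count=[0-9]*"
```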

Thank you @Shrini-h!

Looking back up at @arwild01’s use case from Feb 4th, can you confirm that the correct subjob run_count is also present in applicable mom (sub)job hook events?

@scc Yes, I confirm that the correct subjob run_count is also present in applicable mom (sub)job hook events!


Sorry to make this point a little late, but I have a concern about how we are solving this issue. As far as I understand, the aim was to reduce the CPU time and log space wasted due to failing subjobs.

With the current approach we put a hold on the parent job only when a single subjob reaches a run_count of 20. If an array job is large, this approach can still run into the very issue we are trying to solve. IMO we should hold the parent job when the cumulative failures across its subjobs reach 20. I remember a discussion on array jobs long back where we agreed that an array job as a whole should be treated like a normal job, but not the subjobs. In that scenario, every subjob failure is actually a failure counted against the parent job and should be considered when deciding whether to hold the parent job or not.

If we go with the current approach, we may still end up wasting resources. For example, an array job of 1000 subjobs can fail 19,000 times and still not be held, with each subjob failing but none having reached 20 yet. And this number grows as the array job size grows.


Just a suggestion… what if we maintain a cumulative fail count for the array itself? Treat subjobs like regular jobs and hold them if they fail 20 attempts. Hold the entire array once the cumulative fail count reaches 10% of total subjob count or some configurable threshold.
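Just to put numbers on the two policies for the 1000-subjob example discussed above (a purely illustrative calculation; the 20-retry limit and the 10% figure are the values mentioned in this thread):

```sh
subjobs=1000
per_subjob_limit=20   # current rule: hold only when a single subjob reaches 20 runs
cumulative_pct=10     # suggested rule: hold at a cumulative failure count of 10% of subjobs

# Worst case under the current rule: every subjob can fail 19 times with no hold.
echo "current rule, worst-case failures before a hold: $(( subjobs * (per_subjob_limit - 1) ))"
# Suggested rule: the whole array is held after this many cumulative failures.
echo "suggested rule, failures before a hold:          $(( subjobs * cumulative_pct / 100 ))"
```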

Hi @mkaro,
That's an interesting approach, but what's the point of letting subjobs run if we know some of them are having issues? The job will still be unfinished, and if it's a script issue then resubmitting the job seems to be the only option. In that scenario, letting some subjobs keep running while many others have hit the run_count threshold may still not be appropriate. And if we make this threshold configurable, then we are effectively distinguishing between an array job and a normal job, which is what we have tried to avoid so far.

I think we should approach this in a more conservative way: irrespective of what is causing subjobs to fail, if the failures have cumulatively crossed the 20 mark, let's hold the parent job and let the admin decide what to do with it.


The real aim is to avoid locking up the mom and its resources forever.

Why? You are fixating on holding the parent job when all the subjobs still have a few lives left.

As time passes, things change. Since the merge of https://github.com/PBSPro/pbspro/pull/590, the subjobs have grown into teenagers who have moved out of the parent's house. It's the motherly emotions making you talk :rofl:

My previous points answer this: PBS gives them one last chance to play.

+1, maybe we can consider this based on field feedback from admins.

So, is there a thought-out threshold for how long mom resources can be wasted on failing jobs?

Because an array job is incomplete even if all but one subjob has completed. And it is a total waste of resources if that one subjob failed due to unrecoverable issues (like a script problem, an input problem, or permission issues).

Thanks for pointing out the previous RFE, but I was expecting a much more mature reply :stuck_out_tongue:.

Sorry, I didn't get which previous point you mean. Where is the line that decides how long resources are allowed to be wasted, and how are we going to confine it?

AFAIK, mom resources are allocated once the scheduler decides the node on which a subjob will run, and they stay allocated until the subjob runs. Or, in this bug's scenario, they stay allocated until the subjob has retried 20 times. I don't know if there is a thought-out threshold; maybe the number 20 indirectly answers that.

Just having an incomplete array job doesn't mean there is a waste of resources.

You know me now. I like to keep my life fun.

Not sure if I have a precise answer, but as I said before, the number 20 might answer it.

And that's what I raised a concern about: if it's 20, then with the proposed solution we can end up with far more than 20 subjob failures, and this failure count grows in proportion to the array size. So indirectly we still haven't solved the issue of avoiding unnecessary wastage of mom resources.

It's not that an incomplete array job is a waste of resources; not being able to put a pin in a failing job within a given failure threshold is.

@dilip-krishnan: I’m thinking of the case where there are a few black-hole nodes in the complex. The array jobs will run just fine on any nodes other than the black-hole nodes. It’s unlikely the scheduler is going to select the same node to run a failed job 19 times in a row if the complex is of reasonable size. I think that was the point of choosing 20 retry attempts in the first place. If we detect a problematic array job, we want to hold it so that other jobs can utilize the resources. IMHO, a problematic array is one where 10% of all run attempts end in failure. Someone else might think 20% is a better value, so we should probably make it configurable.

@dilip-krishnan I see what you are trying to say, but I think your example of 19,000 subjob failures before the array parent is held is probably an extreme case of what can happen.
On a normal operational complex, the scheduler stops looking at the array parent as soon as it finds the first subjob that could not run. So for an array job with 1000 subjobs, if the scheduler runs the first n subjobs and finds that the (n+1)th could not run, it moves on to the next job instead of the next subjob of the same array.
Now, with Shrini's proposed fix, the parent gets held as soon as one subjob hits a run_count of 20, which would probably happen sooner than retrying all the subjobs in every cycle. Apart from that, an admin can actually set the run_count for array jobs in a queuejob hook to a lower number of retries before the array is held.

In the solution you mentioned, where we keep a failure count on the array parent (incremented whenever a subjob fails to run), I think there is a greater chance of the parent being held because of intermittent subjob failures (and the probability increases with more subjobs in the array).


@arungrover thanks for explaining conclusively what I failed to.

@dilip-krishnan I hope Arun's explanation answers all your doubts.

@nithinj and All.

And with that, I hope we have reached the conclusion that the current design is good to go. (Implemented at https://github.com/PBSPro/pbspro/pull/1119)

Thank you all for your contributions in shaping the design.

Hi @arungrover,
Thanks for explaining the scheduling aspect of array jobs; I wasn't aware of this. I can now see how the example I gave would be a very rare case. Given that, the suggested fix seems fine.

Thanks again for the explanation. :slight_smile: