Proposal of interface regarding hold/release of subjob(s) and job array

Currently the pbs commands qhold and qrls can be used on jobs and job arrays, but not on subjobs or ranges of subjobs.

I propose we allow qhold and qrls on subjobs.

Interface Design Doc is here: https://pbspro.atlassian.net/wiki/spaces/PD/pages/1049854035/Changes+to+run+count+attribute+hold+and+release+operation+of+subjob+s+in+a+Job+Array

Thank you

Seeing this reminded me of a use case my company has… we have an application that's building "workflows" inside PBS by submitting multiple jobs with dependencies. Some of the "jobs" inside this workflow are actually array jobs. For various reasons the developers have elected to use "afterok" dependencies (rather than "afterany"). We now have cases where some of the tasks of the array fail to complete their work (say the user underestimated the walltime for the array and not all of the tasks have exactly the same runtime). As it stands, the workflow essentially "stops" execution and abandons the jobs in the queue with unsatisfied dependencies.

What the developers have decided to do is leverage a hook that I have to run a "user epilogue" (as opposed to a system epilogue); if that script exits with a certain status, my hook will reject the job, but with a requeue. All of this works. The next issue is that this can essentially create an infinite loop of tasks being requeued forever. So the developers decided they want to enhance their "user epilogue" to check the run count, and if the count is above a certain threshold they won't request the requeue… but as of our current version of PBS Pro (13.0.407), tasks don't have a run_count at all… so they can't do this.

So long story short… we really want to see array tasks have a run_count that goes up with each re-execution… and I think our use case is a little bit different than what's described above.
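To make the requeue-threshold idea concrete, here is a minimal Python sketch of the decision the "user epilogue" described above would need to make, assuming subjobs exposed a run_count. The names `should_requeue`, `requeue_status`, and `RETRY_THRESHOLD` are hypothetical, chosen for this illustration; this is not actual PBS hook code.

```python
# Hypothetical decision logic for the user-epilogue/requeue workflow
# described above. Not PBS API: all names here are illustrative.

RETRY_THRESHOLD = 5  # hypothetical cutoff chosen by the workflow developers


def should_requeue(epilogue_exit_status, run_count,
                   requeue_status=99, threshold=RETRY_THRESHOLD):
    """Requeue only when the user epilogue asked for it AND the task
    has not already been retried too many times. Without a per-task
    run_count, the second check is impossible and the loop is infinite."""
    return epilogue_exit_status == requeue_status and run_count < threshold
```

Inside a real epilogue hook, the result of such a check would decide whether to reject the job with a requeue or let it fail permanently.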

Since I'm not seeing any immediate critical opinions, I'm assuming it's safe for me to start with the design of the proposed interface. Of course, meanwhile this forum will remain open for down votes, suggestions, or feedback on the proposal.

@arwild01, thank you for chipping in with the use case. Making the server increment a subjob's run_count on each rerun is the core of the proposed solution, so I feel this will definitely help your use case.

Thanks @Shrini-h. Your proposal looks good to me. I have a few questions on the design.

  • I am not clear on the use case behind holding the whole parent array if any of the subjobs is held. Is this only true for system hold, or for user hold as well?

  • What happens if the release fails on any of the subjobs in the case of qrls? Will it throw an error?

  • Will the subjob comment and the parent job array comment be set to explain why they are held? Also, will the job comment be set only for the held subjob or for all subjobs?

  • " * qhold can be supplied with subjob id range as argument to apply any type of Hold on all running subjobs in the range" -> why only running subjobs?

  • Please mention the format for specifying subjob ranges.

@anamika Thank you for the questions, please find my answers below.

Yes, this is only for system hold. The use case is an erroneous execution hook: without holding the parent job, the server and scheduler will continue instantiating further subjobs, each of which also gets held after 20 retries. By putting a system hold on the parent array, the server and scheduler can skip the whole array until the admin or user corrects the hook.
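A minimal simulation of the proposed behavior, assuming the existing 20-run limit carries over from regular jobs. The `SubjobSim`/`ArraySim` classes and state letters are illustrative only (state letters follow qstat conventions); this is not PBS server code.

```python
# Illustrative simulation of the proposal: each rerun bumps the subjob's
# run_count; breaching the limit system-holds the subjob AND its parent
# array, so the scheduler stops instantiating further subjobs.

MAX_RUN_COUNT = 20  # PBS's existing retry limit for regular jobs


class SubjobSim:
    def __init__(self):
        self.run_count = 0
        self.state = 'Q'


class ArraySim:
    def __init__(self, n):
        self.state = 'B'  # array has begun
        self.subjobs = [SubjobSim() for _ in range(n)]

    def rerun(self, idx):
        sj = self.subjobs[idx]
        sj.run_count += 1
        if sj.run_count > MAX_RUN_COUNT:
            sj.state = 'H'
            self.state = 'H'  # server/scheduler now skip the whole array
```

With today's behavior, run_count is never incremented between retries, so the `if` branch is never reached and the rerun loop continues forever.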

The error will be the same as with a regular job. I have now added this info to the doc.

Yes, the comments will be set as with a regular job. The comment will be set only for the held subjob; any uninstantiated queued subjobs will reflect the comment set on the parent, which is how subjobs are displayed. I have now added this info to the doc.

I meant the ones that have not yet finished. Thanks for catching this. I have now edited this in the doc.

Done. I have now added it as a note at the bottom

Got it, and it makes sense. Can you also add a line under system hold explicitly mentioning that if a user, other, or bad-password hold (any hold other than system) is applied on subjobs, then the parent job array will not be held?

A few nits:
"more like to fail to launch" should be "more likely to fail to launch"
"finsished" should be "finished"

  • Thanks for adding the error message for qrls. Could you also update it to mention that the subjob will remain in the H state?

Hey @Shrini-h,
I have one major concern. When you apply an action to a job array, you apply that action to the whole job array (running and queued subjobs). When a running job is held, it is checkpointed. This means that if you apply a system hold on the entire job array, you'll checkpoint the running subjobs. Either that, or you meant to hold just the queued subjobs, which would not be consistent with other applications of actions on the whole job array.

Also, your design document uses the old design document template. Consider using the new template, since it gives a better idea of what you are planning to do.
https://pbspro.atlassian.net/wiki/spaces/DG/pages/293076995/PBS+Pro+Design+Document+Guidelines

I have now realized that to fix the "bug" I just need to enable system hold/release of a single subjob, so I have reduced the interfaces and moved the explanation to the design doc page. Please let me know your opinions.

Won't be addressing these, as the interface stands deleted.

Done

Sorry, I might not have explained it right. I didn't mean to hold the entire array (all subjobs), but only the parent job array.

Done. (Sorry, the new design template was not reflected when creating the blank page.)

@Shrini-h
I understand that you want to put a hold on the parent job array, but I am not sure that is consistent with how we've treated job arrays in the past. In the past, when you act on 1234[], you are acting on all of the subjobs in 1234[]. You are putting a hold on 1234[], so shouldn't this hold all of the queued subjobs and checkpoint all of the running subjobs? Won't it be confusing to see the job array in state H while there are still running subjobs? That being said, your intention is good: if one subjob is rejected by a hook, then most likely all of them will be, since they are identical. The thing is that we usually don't get into the business of second-guessing hooks. They are meant for customers to do whatever they want. What if the begin hook rejected a subjob for a transient issue? A subsequent subjob may still be able to run.

I also think the -t option to qrls is unnecessary. We can directly act upon the subjob ids themselves. The reason we have the -t option to qstat is that it would be super-messy to print all the subjobs all the time. We added the option to print them if they were wanted.

Bhroam

It is in fact consistent. For example, when we do qhold 1234[], its running subjobs continue to run and its queued subjobs stay put in the 'Q' state (kind of equivalent to a "hold"). Since only the parent job is set to the 'H' state, I described it as holding only the parent job, and I don't intend to change this behavior.
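The existing qhold semantics on a job array described above can be sketched as follows. The `JobArray` class is purely illustrative (state letters follow qstat conventions); it is not PBS source code.

```python
# Sketch of today's qhold behavior on a job array: only the parent
# moves to 'H'; running subjobs keep running, queued subjobs stay 'Q'
# and simply stop being instantiated while the parent is held.

class JobArray:
    def __init__(self, subjob_states):
        self.state = 'B'              # array has begun
        self.subjobs = subjob_states  # e.g. {1: 'R', 2: 'Q'}

    def qhold(self):
        # Subjob states are untouched on purpose: holding the parent
        # only stops further spawning.
        self.state = 'H'
```

This is the behavior the proposal builds on when it system-holds the parent array after a subjob breaches its retry limit.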

Checkpointing is not supported on subjobs (which I feel should be enabled in the future).

Since we are holding only after 20 retries, it is safe to assume the failure is not due to a transient or ephemeral issue. I feel the benefit of holding the parent job array (imagine a very big array job) outweighs the benefit of being tolerant of a transient condition. Holding the array for this condition also acts as a way to notify the admin, who can release the hold with the qrls interface.

The first point of the interface introduces a way to act upon the subjob id directly.
The second point (the "-t" option) is similar to qstat, where we give the admin a way to remove the hold on held subjobs and their parent all in one go.
We already have a way to unhold a held job array: qrls 1234[]. This we can use to unhold the parent without touching the held subjobs (i.e., the other queued subjobs can continue to run). I have added this point to the interface now.

Interesting. I thought doing a qhold on the parent array would act on all of the subjobs. Cool, it is consistent.

OK, so you are saying that if any one subjob failed to run more than 20 times, we'll hold the parent job array. Sorry, I must have misread the document. This seems consistent with what we do today with normal jobs. If the use case behind this change is to fix the infinite loop of running a job on a black hole node, then do we need to hold individual subjobs? It sounds like having the run_count per subjob is enough, since we're going to hold the parent array when we reach 20 times.

I'm not sure I agree with the first point. Why can't you act on 1234[1] to act on the subjob directly?
As for the second point, I'm still not sure I see the need to hold an individual subjob. They're supposed to be identical. Why would we need to differentiate between them this way? We'd also have to make the scheduler understand more about individual subjobs. Right now it knows that if a job array is in the B state, it has control to create and run subjobs based on the indices remaining. If we make each of them more special, it has to have more special-case code to handle this case.

Bhroam

Not clear if you are agreeing or disagreeing.
Subjobs already have run_count, but since it is not incremented between retries, it never reaches the limit of 20, making the server and MoM loop infinitely (e.g., on an execjob_begin hook error). What I'm proposing with the parent job is: once a subjob's run_count breaches the limit, immediately hold the subjob and its parent array (not after the parent array itself reaches 20 runs). Hope this clarifies my point.

Because that is currently not allowed: qrls can be used on jobs and job arrays, but not on subjobs or ranges of subjobs.
So this first point enables exactly what you are asking for.

Firstly, I'm guessing you might have missed the context here. All the points under Interface 3 propose changes to the qrls command, which is used to release holds. Either that or I'm missing the context of your query. Restating the second point here: when qrls is invoked with a system-held job array id and the new "-t" option, all of its system-held subjobs are released prior to releasing the hold on the array itself.
Anyway, assuming we are talking about releasing the hold on a subjob, I feel your concern about more special-case code doesn't hold up, as the subjobs in question have already been instantiated and broken away from the indices remaining.

If we still don't agree on the proposal of the "-t" option for qrls, I don't mind dropping it. It made more sense when I initially proposed changes to qhold to enable holding multiple subjobs in a subjob range.

I will wait a few days before dropping this, in case anybody else backs the proposal. Without this option, the admin will have to issue multiple qrls commands: once for the parent array in 'H' and again for each subjob in 'H'.

I'm sorry I wasn't clear. I am agreeing with you. I think we should hold the parent job array if any one of the subjobs failed to run 20 times.

I'd rather you allow qrls to act on a subjob than add an option to allow it to act on a subjob. The new option doesn't seem necessary.

I don't understand why you can't just act on 1234[1] or 1234 without adding the new option. Like I said before, qstat has the -t option for cleanliness. If you had many 10000-way job arrays, the qstat output would be very messy if the subjobs were always printed. That is why we only print the job array itself unless the subjobs are asked for. I don't see the same use case for qrls.

Maybe I'm misunderstanding how array_indices_remaining will be used going forward. Right now it contains all subjobs that are not running or done. Are you saying that when a subjob is instantiated, it'll be removed from the list? The attribute loses some of its meaning if that is the case: it would then hold the queued subjobs that are still part of the array, which isn't all of the queued subjobs.

In any case, in most commands when you type 1234, you mean the entire job array and all of the subjobs. I'd rather this continue to be the case: do a qrls on 1234, and the parent job and all the held subjobs get released.

Of course, I still don't see the use case for holding a specific subjob. They are identical jobs, so why would you want to treat one differently?

Bhroam

The subjob and the parent job array, i.e., we will see both the subjob and its parent job array in the H state in qstat -t output.

I believe you are agreeing that qrls can be supplied with a subjob id as argument to release a system hold on it, but not with the point introducing the "-t" option. No worries, I have deleted it now; anyway, it's not necessary for my bug fix. I had proposed it for the logical completeness I had in mind: based on the current usage of qhold and qrls on a job array, I logically associated them with stopping and re-enabling the spawning of queued subjobs, and I wanted to preserve that meaning.

We already have the latter, which is now being modified; the former is being introduced in the line "qrls can be supplied with a subjob id as argument to release system hold on it".

As stated before, I have done away with the new "-t" option.

Please feel free to skip this paragraph if you don't want my rationale behind the "-t" option. I proposed it to preserve the logical meaning of the current hold and release operations on a job array. I also envisioned the possibility of enabling checkpointing of subjobs in the future; then qhold of a job array with -t could have meant hold the running subjobs as well, whereas without -t (the existing behavior) it would mean don't spawn further subjobs but let running subjobs continue. My now-deleted proposal of the -t option was analogous to this design: qrls without the -t option (kind of the existing behavior) would re-enable spawning of new subjobs while not touching the held subjobs, whereas qrls with the -t option would also re-enable spawning and release all of the held subjobs.

array_indices_remaining contains all the subjobs in the Q state, including instantiated subjobs that are still in Q. Since I'm proposing that instantiated subjobs go to the 'H' state, they won't be included in array_indices_remaining.

I agree, I have modified the interface to reflect this.

That's because the specific instantiated subjob encountered the problem. Holding it preserves the job object that failed to launch on the MoM for further perusal by the admin or user.

Thanks for explaining, @Shrini-h. I have one last minor comment and a question. If we're keeping the held subjob around so the admin can look at it, how will they? Via qstat? Will it look different from other subjobs? I'm not opposed to keeping it around, but I just want to make sure we're doing it for a valid reason.

The question has to do with what happens to the non-instantiated queued subjobs of a held job array. Since the queued subjobs are basically a copy of the job array, will they show up as held? I think this is perfectly fine, I'm just wondering.

Thanks,
Bhroam

@bhroam
The valid reason is to persist the run_count value of the subjob. As part of the implementation of interface 1, I will be removing many of the job purge operations specific to subjobs. With this, a subjob in the H state will be an independent job object with its own set of attributes, so yes, qstat can be used on the held subjob. The other subjobs (those not yet spawned) will stay in the Q state, and a qstat on those will fake their attributes from the parent, which is the current behavior and I plan to keep it as it is.

That makes sense. Thanks for answering my questions. I'm happy with the document now. Thanks for making all the changes.

Bhroam

Thank You @bhroam for helping me shape the design.
I will start with the dev work needed.

Meanwhile, I'm thinking it would be prudent to set a new comment on the job array when it gets held due to a subjob retrying more than 20 times.

Also, I'm wondering if it's better to have the job array held when a subjob fails to run due to provisioning failures or bad passwords, i.e., JOB_EXEC_FAILUID, JOB_EXEC_FAIL_PASSWORD, JOB_EXEC_FAIL_SECURITY.