Seeing this reminded me of a use case my company has… we have an application that's building "workflows" inside PBS by submitting multiple jobs with dependencies. Some of the "jobs" inside this workflow are actually array jobs. For various reasons the developers have elected to use "afterok" dependencies (rather than "afterany"). We now have cases where some of the tasks of the array fail to complete their work (say the user underestimated the walltime for the array and not all of the tasks have exactly the same runtime). As it stands now, the workflow essentially "stops" execution and abandons the jobs in the queue with unsatisfied dependencies. What the developers have decided to do is to leverage a hook that I have to run a "user epilogue" (as opposed to a system epilogue); if that script exits with a certain status, my hook will reject the job, but with a requeue. All of this works. The next issue is that this can essentially create an infinite loop of tasks being requeued forever. So the developers decided they want to enhance their "user epilogue" to check the run count, and if the count is above a certain threshold they won't request the requeue… but as of our current version of PBSPro (13.0.407), tasks don't have a run_count at all, so they can't do this.
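For concreteness, the hook logic is roughly like the sketch below. This is only a minimal sketch, assuming an execjob_epilogue hook; the script path, threshold, and exit-status convention are placeholders, not our real hook. The run_count check near the end is exactly the part we cannot do today for array tasks.

```
# Minimal sketch, assuming an execjob_epilogue hook. USER_EPILOGUE and
# MAX_RERUNS are hypothetical placeholders.
import subprocess
import pbs

USER_EPILOGUE = "/path/to/user_epilogue.sh"   # hypothetical script location
MAX_RERUNS = 5                                # hypothetical threshold

e = pbs.event()
j = e.job

status = subprocess.call([USER_EPILOGUE, j.id])
if status == 0:
    e.accept()                     # epilogue happy, let the job finish normally
else:
    # This is the part that does not work for array tasks today: they carry
    # no run_count, so there is nothing to compare against the threshold.
    if int(j.run_count or 0) >= MAX_RERUNS:
        e.accept()                 # give up instead of requeueing forever
    else:
        j.rerun()                  # ask the server to requeue the job
        e.reject("user epilogue exited with %d; requeueing" % status)
```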
So long story short… we really want to see array tasks have a run_count that goes up with each re-execution… and I think our use case is a little bit different from what's described above.
Since I'm not seeing any immediate critical opinions, I'm assuming it's safe for me to start with the design of the proposed interface. Of course, meanwhile this forum remains open for down votes, suggestions, or feedback on the proposal.
@arwild01, thank you for chipping in with the use case. Making a subjob's run_count increment for each rerun by the server is the core of the proposed solution, so I feel this will definitely help your use case.
Thanks @Shrini-h, your proposal looks good to me. I have a few questions on the design.
I am not clear on the use case behind holding the whole parent array if any of its subjobs is held. Is this only true in the case of a system hold, or for a user hold as well?
What happens if the release fails on any of the subjobs in the case of qrls? Will it throw an error?
Will the subjob comment and parent job array comment be set to say why they are held? Also, will the job comment be set only for the held subjob or for all subjobs?
" * qhold can be supplied with subjob id range as argument to apply any type of Hold on all running subjobs in the range" -> why only running subjobs?
Please also mention the format for specifying subjob ranges.
@anamika Thank you for the questions; please find my answers below.
Yes, this is only for the system hold. The use case is when there is an erroneous execution hook: without holding the parent job, the server and scheduler will continue instantiating further subjobs, which also get held after 20 retries. Hence, by putting a system hold on the parent array, the server and scheduler can skip the whole array until the admin or user corrects the hook.
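For example, a begin hook as trivially broken as the sketch below would trigger that loop: every launch attempt is rejected, the subjob is requeued, and because the subjob's run_count never moves, the 20-retry safeguard never kicks in. (Illustration only, not a real hook.)

```
# Illustration only: an execjob_begin hook that (erroneously) rejects every
# job. Each rejected subjob is requeued and retried; without a per-subjob
# run_count that increments, the hold-after-20-retries limit is never reached.
import pbs
pbs.event().reject("misconfigured begin hook: rejecting every job")
```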
The error will be the same as with a regular job. I have now added this info in the doc.
Yes, the comments will be set as with a regular job. The comment will be set only for the held subjob; however, any uninstantiated queued subjobs will reflect the comment set on the parent, which is how a subjob is displayed. I have now added this info in the doc.
I meant the ones that have not yet finished. Thanks for catching this. I have now edited this info in the doc.
Got it, and it makes sense. Can you also add a line under system hold explicitly mentioning that if a user, other, or bad-password hold (any hold other than system) is applied on subjobs, then the parent job array will not be held?
A few nits:
"more like to fail to launch" should be "more likely to fail to launch"
"finsished" should be "finished"
Thanks for adding the error message for qrls. Could you also update it to mention that the subjob will remain in the H state?
Hey @Shrini-h,
I have one major concern. When you apply an action to a job array, you apply that action to the whole job array (running and queued subjobs). When a running job is held, it is checkpointed. This means that if you apply a system hold on the entire job array, you'll checkpoint the running jobs. Either that, or you meant to hold just the queued subjobs, which would not be consistent with how other actions apply to the whole job array.
I have now realized that to fix the "bug" I just need to enable releasing a system hold on a single subjob, so I have reduced the interfaces and moved the explanation to the design doc page. Please let me know your opinions.
@Shrini-h
I understand that you want to put a hold on the parent job array, but I am not sure that is consistent with how we've treated job arrays in the past. In the past, when you act on 1234[], you are acting on all of the subjobs in 1234[]. You are putting a hold on 1234[], so shouldn't this hold all of the queued subjobs and checkpoint all of the running subjobs? Won't it be confusing to see the job array in state H while there are running subjobs? That being said, your intention is good: if one job is rejected by a hook, then most likely all of them will be, since they are identical. The thing is that we usually don't get in the business of second-guessing hooks. They are meant for customers to do whatever they want. What if the begin hook rejected a subjob for a transient issue? A subsequent subjob may still be able to run.
I also think the -t option to qrls is unnecessary. We can act directly upon the subjob ids themselves. The reason we have the -t option to qstat is that it would be super messy to print all the subjobs all the time; we added the option to print them only when they are wanted.
It is in fact consistent. For example, when we do qhold 1234[], its running subjobs continue to run and queued subjobs stay put in the 'Q' state (kind of equivalent to a "hold"). Since only the parent job is set to the 'H' state, I described it as holding the parent job only, and I don't intend to change this behavior.
Checkpointing is not supported on subjobs (which I feel should be enabled in the future).
Since we are holding only after 20 retries, it is safe to assume the failure is not due to a transient or ephemeral issue. I feel the benefit of holding the parent job array (assume a very big array job) outweighs the benefit of being tolerant of a transient condition. Holding the array also acts as a way to notify the admin, who can release the hold with the qrls interface.
The first point of the interface introduces a way to act upon the subjob id directly.
The second point (the "-t" option) is similar to qstat, in that it gives the admin a way to remove the hold on held subjobs and their parent all in one go.
We already have a way to release a held job array: qrls 1234[]. We can use this to release the parent without touching the held subjobs (i.e., the other queued subjobs can continue to run). I have added this point to the interface now.
Interesting. I thought doing a qhold on the parent array would act on all of the subjobs. Cool, it is consistent.
OK, so you are saying that if any one subjob failed to run more than 20 times, we'll hold the parent job array. Sorry, I must have misread the document. This seems consistent with what we do today with normal jobs. If the use case behind this change is to fix the issue with the infinite loop of running a job on a black-hole node, then do we need to hold individual subjobs? It sounds like having the run_count per subjob is enough, since we're going to hold the parent array when we reach 20 times.
I'm not sure I agree with the first point. Why can't you act on 1234[1] to act on the subjob directly?
As for the second point, I'm still not sure I see the need to hold an individual subjob. They're supposed to be identical. Why would we need to differentiate between them this way? We'd also have to make the scheduler understand more about individual subjobs. Right now it knows that if a job array is in the B state, it has control to make and run subjobs based on the indices remaining. If we make each of them more special, it has to have more special-case code to handle this.
Not clear if you are agreeing or disagreeing.
Subjobs already have a run_count, but since it is not incremented between retries, it never reaches the limit of 20, making the server and MoM loop infinitely (e.g., on an exec begin hook error). What I'm trying to propose with the parent job is: once a subjob's run_count breaches the limit, immediately hold the subjob and its parent array (not after the parent array itself reaches 20 retries). Hope this clarifies my point.
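In other words, the intended server-side behavior is roughly the following. This is illustrative Python pseudocode only; the real change lives in the server's C code, and the helper names here are made up.

```
# Illustrative pseudocode of the proposal: increment the subjob's own
# run_count on every rerun, and once it crosses the limit put a system hold
# on both the subjob and its parent array instead of requeueing again.
RETRY_LIMIT = 20

class Job(object):
    def __init__(self, jobid):
        self.id = jobid
        self.run_count = 0
        self.state = "Q"

def apply_system_hold(job):
    job.state = "H"        # stand-in for setting a system hold (Hold_Types = s)

def requeue(job):
    job.state = "Q"        # stand-in for the server re-running the subjob

def on_subjob_rerun(subjob, parent_array):
    subjob.run_count += 1                  # proposed: increment per subjob
    if subjob.run_count > RETRY_LIMIT:
        apply_system_hold(subjob)          # keep the failing subjob around
        apply_system_hold(parent_array)    # stop spawning further subjobs
    else:
        requeue(subjob)
```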
Because that is currently not allowed. Currently qrls can be used on jobs and job arrays, but not on subjobs or ranges of subjobs.
So this first point is to enable exactly what you are asking for.
Firstly, I'm guessing you might have missed the context here. All the points under Interface 3 propose changes to the qrls command, which is used to release holds. Either that, or I'm missing the context of your query. The second point here states that when qrls is invoked on a system-held job array id with the new "-t" option, prior to releasing the hold on the array, all of its system-held subjobs are released.
Anyway, assuming we are talking about releasing the hold on a subjob, I feel that your concern about having more special-case code doesn't hold, as the subjobs in question here have already been instantiated and broken away from the indices remaining.
If we still don't agree on the proposal of the "-t" option for qrls, I don't mind dropping it. It made more sense when I initially proposed changes to qhold to enable it to hold multiple subjobs in a subjob range.
I will wait a few days before dropping this, in case anybody else backs this proposal. Without this option, the admin will have to issue multiple qrls commands: once for the parent array in 'H' and again once for each subjob that is in 'H'.
I'm sorry I wasn't clear. I am agreeing with you. I think we should hold the parent job array if any one of the subjobs failed to run 20 times.
I'd rather you allow qrls to act on a subjob than add an option to allow it to act on a subjob. The new option doesn't seem necessary.
I don't understand why you can't just act on 1234[1] or 1234 and not add the new option. Like I said before, qstat has the -t option for cleanliness. If you had many 10000-way job arrays, the qstat output would be very messy if you always printed the subjobs. This is why we only print the job array itself unless the subjobs are asked for. I don't see the same use case for qrls.
Maybe I'm misunderstanding how array_indices_remaining will be used going forward. Right now it contains all subjobs that are not running or done. Are you saying that when a subjob is instantiated, it'll be removed from the list? The attribute loses some of its meaning if that is the case. Now it'll be the queued subjobs that are still part of the array. This isn't all of the queued subjobs though.
In any case, in most commands when you type 1234, you mean the entire job array and all of the subjobs. I'd rather this continue to be the case: do a qrls on 1234 and the parent job and all the held subjobs get released.
Of course I still don't see the use case for holding a specific subjob. They are identical jobs, so why would you want to treat one differently?
The subjob and the parent job array, i.e., we will see both the subjob and its parent array job in the H state in the qstat -t output.
I believe you are agreeing that qrls can be supplied with a subjob id as an argument to release a system hold on it, but not with the point introducing the "-t" option. No worries, I have deleted it now; anyway, it's not necessary for my bug fix. I had proposed it for the logical completeness I had in my mind. Based on the current behavior of qhold and qrls on a job array, I logically associated them with stopping and re-enabling the spawning of queued subjobs, hence I wanted to preserve this meaning.
We already have the latter, which is now being modified, and the former is being introduced in the line "qrls can be supplied with a subjob id as argument to release system Hold on it".
As stated before, I have done away with the new "-t" option.
Please feel free to skip this paragraph if you don't want to know my rationale behind the "-t" option. I proposed it to preserve the logical meaning of the current hold and release operations on an array job. I also envisioned the possibility of enabling checkpointing on subjobs in the future; then qhold of an array job with -t could have meant hold the running subjobs as well, whereas without -t (the existing behavior) it would mean don't spawn further subjobs but let running subjobs continue. My now-deleted proposal for the -t option was analogous to this design: qrls without the -t option (kind of the existing behavior) would re-enable spawning of new subjobs while not touching the held subjobs, whereas qrls with the -t option would also re-enable spawning and release all of its held subjobs.
array_indices_remaining contains all the subjobs in the Q state, including the instantiated subjobs that are still in Q. Since I'm proposing here that instantiated subjobs go to the 'H' state, they won't be included in array_indices_remaining.
I agree, I have modified the interface to reflect this.
That's because that specific instantiated subjob encountered the problem. Holding it will preserve the job object that failed to launch on the MoM for further perusal by the admin or user.
Thanks for explaining, @Shrini-h. I have one last minor comment and a question. If we're keeping the held subjob around so the admin can look at it, how will they? Via qstat? Will it look different from other subjobs? I'm not opposed to keeping it around, but I just want to make sure we're doing it for a valid reason.
The question has to do with what happens to the non-instantiated queued subjobs of a held job array. Since the queued subjobs are basically a copy of the job array, will they show up as held? I think this is perfectly fine; I'm just wondering.
@bhroam
The valid reason is to persist the run_count value of the subjob. As part of the implementation of interface 1, I will be removing many of the job purge operations specific to subjobs. With this, a subjob in the H state will be an independent job object with its own set of attributes, so yes, qstat can be used on this held subjob. The other subjobs (that are not yet spawned) will stay in the Q state, and a qstat on those will fake their attributes from the parent, which is the current behavior and I plan to keep it as it is.
Thank you, @bhroam, for helping me shape the design.
I will start with the dev work needed.
Meanwhile, I'm thinking it would be prudent to set a new comment on the job array when it gets held because a subjob has been retried more than 20 times.
Also, I'm wondering whether it would be better to also hold the job array when a subjob fails to run due to provisioning failures or bad passwords, i.e., JOB_EXEC_FAILUID, JOB_EXEC_FAIL_PASSWORD, JOB_EXEC_FAIL_SECURITY.