Hi Bhroam,
Thank you for reviewing the EDD. I have already added the new message in the second-to-last bullet point.
The last bullet point is corrected to say 45 seconds.
Thanks,
Prakash
Thanks for making the change. I totally missed the message.
I have one question. What happens if you give multiple job IDs to be requeued? Like
qrerun 12 13 14
The command should send each job individually. What does qrerun do when one or more of them hit the timeout?
Bhroam
The current behaviour will continue: the message will be displayed for each job.
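For illustration, a minimal sketch of what that per-job behaviour might look like (the job IDs are examples and the message text is shown as a placeholder, not the wording from the EDD):

qrerun 12 13 14
qrerun: <requeue timeout message> 12
qrerun: <requeue timeout message> 14

In this sketch jobs 12 and 14 are assumed to hit the timeout while job 13 requeues in time, so only those two produce the message.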
Log of the offline discussion -
@smgoosen's email
Hi Prakash,
So just to be sure I understand why the timeout is useful…
For a normal qrerun request the timeout serves no purpose other than to return a spurious message and cause tests to fail.
In the case where the scheduler issues the request (e.g. for preemption), the scheduler waits (right?) for low priority jobs to be preempted before telling the server to run the high priority job. If the low priority jobs get requeued quickly, the scheduler somehow knows this and goes on to run the high priority job. If the job doesn't get requeued quickly enough, the timeout expires and the scheduler runs the high priority job, causing oversubscription.
Assuming a) we don't want to just wait for the requeue(s) to finish and b) "some" oversubscription is OK, then the purpose of the timeout is to allow a delay before the scheduler oversubscribes the node, right? If so, then perhaps the name of the "timeout" should be something like job_requeue_delay, and the log message would be something like "qrerun: requeue delay for job . has expired, requeue still in progress".
In the case where a hook issues a rerun command…?
Does the hook infrastructure interpret the timeout as a failure and return, or…?
Would it be possible for the "delay" to only apply if the request is coming from the scheduler, or is it useful at other times that I haven't seen mentioned yet?
Thanks,
Sam Goosen
@arungrover 's reply -
One more thing I'd like to mention here…
If our purpose is for tests to succeed, then the qrerun -Wforce option can also be used. This way no timer is registered and the reply is given to the user (in this case the test script) instantly. What this also means is that no output files will be copied. If the test case is making use of an output file generated from the job's previous run, then it's not beneficial to use the -Wforce option.
You are right about its usage in the scheduler. It provides enough time for the server to requeue the job before replying back to the scheduler. Oversubscription might happen even after the timeout expires, but the timeout acts as a cushion to handle the requeue impact in most cases.
Regards,
Arun Grover
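For reference, a minimal usage sketch of the force option mentioned above (the job ID is an example):

qrerun -Wforce 1234

With -Wforce the reply comes back immediately and no output files from the job's previous run are copied, so it is only suitable when the test does not depend on those files.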
I'm not sure delay is a good term, but I like it better than timeout. Prakash and Sam are correct that timeout makes it sound like we're going to give up after a certain time. We have precedent in PBS for the term delay: the kill_delay attribute is the period of time between sending a job a SIGTERM and then a SIGKILL. If we take that to mean a delay is the amount of time between two actions, is it the correct term for what we're doing here?
Unfortunately I have no better term, so I say we use Sam's name.
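As a side note, a hedged sketch of how the kill_delay precedent is set via qmgr (the queue name and value are illustrative):

qmgr -c "set queue workq kill_delay = 10"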
The oversubscription we are talking about isn't true oversubscription. True oversubscription is to start two jobs on the same node at the same time. They'd fight for cpus and memory and one would probably die (or thrash). In our case, the job is dead. The cpus and memory are free. The network is just in use. If we oversubscribe in this way once in a while, I don't believe it's the end of the world.
Sam, your opinion is what matters here. How will the customers view this? Will they be heading our way with torches and pitchforks if we occasionally oversubscribe in this way?
To answer your other question about if we can do something special just for the scheduler, the answer is yes. The server knows if a request is coming from the main scheduler (not multi-scheds). I personally like the idea of the delay for users as long as we tell them what is going on. Like Arun said, if they want something fast, they can always use -Wforce.
Bhroam
@smgoosen
Even though it is possible to do something special for the scheduler, I would not like to have a different behaviour for the same API depending upon from where the call was made. It also adds unnecessary complexity.
Thanks,
Prakash
I have changed the name of our attribute.
Thanks,
Prakash
@Bhroam: I believe the type of "oversubscription" we're talking about here is acceptable
@prakashcv13: Thanks for changing the name of the "timeout", the rest of the design looks good!
Re: "Renaming" an existing server attribute…
Please remember that we should strive to maintain 100% backward compatibility. Generally, this means we strive to ensure existing scripts (and configurations) work with newer versions of PBS Pro (without change). This is really important for users of PBS Pro, as many users have tools (sometimes tools that they did not write and even tools they cannot modify) that interact with PBS Pro. In cases where a PBS Pro parameter is wrong (or wrongly named), it is preferable to deprecate the old one and introduce a new one (and to continue to support the old one for several more versions of PBS Pro).
In very, very rare cases, it may be OK to remove an attribute name completely (not sure if this is such a case).
Thanks!
- bill
This is a good point.
So if we choose to deprecate the old option, then it's not only the attribute name that we retain; we retain the whole functionality (which includes printing the old message and exiting qrerun with an error number), because existing scripts/tools could depend on the exit value/error message of qrerun?
How about not doing any code changes, but changing the documentation to say that this server attribute is deprecated? We would continue to have this "delay", but non-configurable.
The documentation change will be that the job will be re-queued and the timeout message is an indication that the rerun process is taking time.
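A hedged sketch of what that would mean in practice: with no code change, the old setting keeps working in qmgr even though the documentation marks it deprecated (the attribute name below is a placeholder, not the actual name from the EDD; 45 is the default mentioned earlier):

qmgr -c "set server <requeue_timeout_attribute> = 45"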
Of all the suggestions so far, I like this one the best.
To me, log messages are mostly not part of the officially supported interface, so changing log messages is generally OK, especially if we improve them :-). So, I am also OK changing the log message to be more meaningful (if it needs that).
Thanks!
Thank you @billnitzberg. By message I meant the message that is displayed on the command line, and we should not be changing that. So, I believe the documentation change alone is enough.
I will update the EDD accordingly.
Thanks,
Prakash
EDD is updated accordingly.
I believe we can deprecate the configurable attribute, not change the error return, but still update the message to be less confusing.
OK, Sam. I do understand that the error message is spurious and we need not be prisoners to a wrong message. If there are customer scripts that depend on this exact message (which might be very few), they will need to be updated. I have updated the EDD.
EDD update looks good!
@smgoosen, @prakashcv13 - I have 2 comments:
Thinking about this error message in particular: it seems very rare and very unlikely to be used inside tooling that users have built around PBS Pro. So, although it does have the potential to break backward compatibility, I believe it's OK to change this message.
Deprecate means "continue to support", but inform users that we will likely drop support in some future release. That means we can remove it from the docs, but the setting should continue to work (in the code).
In the PR for this ticket, I also suggest we change the message to:
Response timed out. Job rerun request still in progress.
IMHO, this is more clear and concise.