PP-465: qrerun timeouts when big job files are being copied from MoM to server

Hi All,

Recently, a bug was filed against qrerun which says that if a job with large job files is rerun, qrerun times out.

However, in the background the server continues the process of rerunning the job.

There is a server attribute, job_requeue_timeout, which determines the timeout period for rerunning the job. If this attribute is not set, we default to 45 seconds.
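
For reference, a rough sketch of how the attribute can be inspected and changed via qmgr (the two-minute value is purely illustrative, not a recommendation):

    # Show the current setting (unset means the 45-second default applies)
    qmgr -c "list server job_requeue_timeout"
    # Example only: allow up to 2 minutes for the requeue
    qmgr -c "set server job_requeue_timeout = 00:02:00"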

On issuing a rerun, the execution host sends the job files back to the server. If these files are huge, it can take a while for them to be copied from the MoM to the server.

This is where the timeout comes into the picture. The description of this attribute in the guide is limited to the explanation that it is the time allowed for a job to be requeued.

What would happen if the timeout is hit is not mentioned in the guides.

So, all the timeout does for now is throw a spurious error that causes problems for the client without anything having really gone wrong.

In the event of a network failure, the job will get requeued by node_fail_requeue.
Does anyone have any thoughts on when the job_requeue_timeout is necessary?
Would anyone have any objections if we removed the job_requeue_timeout functionality completely?
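
For context, node_fail_requeue is likewise a server attribute settable via qmgr; a rough sketch, with the value (in seconds) purely illustrative:

    # Time the server waits after losing contact with a job's execution host
    # before requeueing the job; example value only
    qmgr -c "set server node_fail_requeue = 310"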

A detailed analysis can be found here.

Thanks,
Prakash

Okay, I’m new to how job requeue works, so please forgive me if I ask silly questions :slight_smile:

From the description above, it seems that job_requeue_timeout is there to make sure that qrerun returns the prompt instead of making the user wait forever (or at least for a very long time, until the job files are copied).

I’m not very sure why we are bringing network failure into this. If a network failure happens, then the MoM will never be able to copy anything back to the server (including the job files mentioned in step 8), right?

If the spurious error message is the problem, then that can be fixed by throwing an appropriate message instead.

Returning to the user in short order seems like a good thing; perhaps what’s missing is an occasional reassuring log message to tell the user that the copy is still in progress.

Network failure was mentioned because it is another reason (apart from the copying of big job files) that a rerun could get delayed or time out.

So, the timeout really doesn’t serve any purpose. Even reporting that the rerun process timed out is wrong and misleading, because that never actually happens.
In the event of a big job file transfer, the copying continues.
In the event of a network failure, the job is requeued/rerun via node_fail_requeue.

The description of job_requeue_timeout in the guides is also incomplete, so we concluded that we can remove it.

I too believe that having the user wait forever until the job reruns may not be correct. The rationale was that the timeout was not serving any purpose, so it can be removed, and having big job files may not be a frequent event.

How about the suggestion above, that we have the server record in its logs periodically that the rerun is happening, or that we have qrerun itself do that on the console?
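
If the server does log periodically, the user could follow the progress with tracejob; a rough sketch (the job ID is a placeholder, and the exact messages would depend on what we decide to emit):

    # Show the server/MoM log entries recorded for the job so far
    tracejob 1234.server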

Okay, I understand that the error message thrown when qrerun times out might be misleading, but isn’t it this timeout that causes qrerun to exit? If so, then it does have a purpose.

We can probably improve the error message, and the server can continue to log the right thing about what is happening with the rerun request.
I also feel that the server does not have to log anything periodically until the job gets requeued again. If the job being rerun has a state/substate other than “R/42”, then that in itself can be considered a message to the user; alternatively, we could just have the server set the job comment to state that the job is being requeued.
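
To illustrate, the state/substate is already visible to the user via qstat; a rough sketch (the job ID is a placeholder):

    # A running job shows job_state = R and substate = 42; anything else
    # during a rerun hints that the requeue is still in progress
    qstat -f 1234.server | grep -E "job_state|substate"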

Arun,

The suggestions you have given make the timeout even more redundant. The server can simply add a comment to the job, start the rerun process, and return a “success” acknowledgement back to the client.

Monitoring whether or not the job is successfully rerun can be done similarly to how we do it after altering a reservation.

Thanks,
Prakash

Prakash,
The scheduler owns the job comment. If you set it, the scheduler will change it on the following cycle. A cycle will be triggered immediately because a job has ended, which means your comment will be changed right away. A log message is a better choice.

or a new substate?

-Prakash

I wouldn’t say that the timeout is redundant. If the server is able to requeue the job within a few seconds (which is mostly going to be the case), then waiting a few seconds isn’t a bad thing.

What would be bad is if qrerun returned with a message that the rerun is in progress, and the user then kept running qstat on the job just to make sure it is queued again.

Who knows, someone could have written a script that does qrerun and counts on the fact that it blocks until the job gets requeued.

I presume we have agreed that the current behaviour of the timeout is not serving its purpose. However, it could be used either to return only after the job is successfully requeued, or to display a message that the rerun process is in progress. My views on that are -

If the job doesn’t get requeued within a few seconds, we leave only one option for the user: to monitor using qstat (a minimal polling sketch is included below).

The job eventually gets requeued anyway, after the files are transferred OR on a network failure.

And as you say, the job will mostly be requeued within a few seconds, so the user won’t be busy doing qstat on the job for long.

With regard to scripts, the amount of time a job takes to get requeued is not fixed even now, so such a script can never be expected to produce consistent behaviour. If we remove the timeout, we are at least consistent: the return of the prompt after qrerun is not a guarantee that the job has already been requeued.
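
Here is the polling sketch mentioned above, the user’s only fallback if the timeout is removed (the job ID, state test, and sleep interval are all illustrative):

    # Rerun a job, then poll qstat until it shows up as queued again
    JOBID=1234.server
    qrerun "$JOBID"
    while [ "$(qstat -f "$JOBID" | awk '/job_state = / {print $3}')" != "Q" ]; do
        sleep 5
    done
    echo "$JOBID has been requeued"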

No, I’ll say it again: I don’t think the timeout serves no purpose. It makes qrerun return only once the job is requeued, which is what happens most of the time.
If we return immediately, the only option left to the user is to do qstat; we save one timer in the server but leave it to serve multiple qstats of the job, which is a costly operation. So, instead of returning immediately, it is better to give the server enough time to requeue the job.

Well, that is what we expect NOW: that such a script would always produce inconsistent behavior. But in 90% of cases it is just going to work fine, and it has probably been working fine in the past. If you remove the timeout, those already-running scripts would just assume that qrerun succeeded (even though it may not have) and perform subsequent operations on the job (like movejob, holdjob, etc.), which will start to fail because the requeue is still in progress.

I am of the opinion that we cannot define what is “enough” time for a job to get requeued/rerun. It depends on the size of the files being transferred, the network, etc.

The current scripts are already going wrong by assuming that when a timeout happens, the rerun has failed, which is not the case: the server continues the rerun process. Those scripts need to be corrected anyway.

You are right that we cannot define what is “enough”, but surely the timeout that is currently there is better than having nothing, because it makes things work for most of the cases.

Now you are talking about the 10% of cases where it takes time to requeue the job and qrerun times out. qrerun has been around for years. How many instances have we seen in the past where a user reported that qrerun was not doing the right thing and causing them trouble?

I will put it this way… suppose we decide to keep the timeout and set it to a default of n seconds. What does it guarantee?
This issue (PP-465) was seen while running a script that tried to rerun a job with an output file of 3MB, with the timeout set to its default of 45 seconds.
My point of view is that whatever value we set the timeout to, we are not assured that the job will be rerun within that time period.

We will have two behaviors if we keep the timeout -

  1. If the job is rerun within the time period, qrerun returns without any message being displayed.
  2. If not, we display that the job rerun is in progress and then return.

I consider this inconsistent behaviour. Either we should guarantee that the return means the job was successfully rerun, or we should say that the rerun is in progress.

Also, in theory, we cannot arrive at figures like 90% or 10% for any given value of the timeout (do we still need to call it a timeout?).

IMO, even if the problem is not being reported, we now know that it exists. It may well have happened that existing scripts/users saw the timeout message, didn’t bother to qstat the job, and concluded that the qrerun had failed.

Our guides do not define what a “failed” qrerun is.
Our code says there is no failed qrerun; a job always gets requeued.

Well, I do not understand why having a timeout is such a bad idea, or why making the user wait for some time and then letting them know that the operation is still in progress is a problem in your terms.

Thinking further on this: it is not only commands that will be affected if the timeout is removed. The scheduler also issues a requeue request to the server when preemption happens.
Now, if the rerun request just returns success without waiting at all, then in most cases there is a good chance that two things will happen:

  • The scheduler will assume the job is preempted and then run the high-priority job, which means that until the copy succeeds the node shows up as oversubscribed.
  • After running the high-priority job, the scheduler will try to run the preempted job, and that will fail because the requeue is still going on in the background.

I’m not saying that this does not happen today, but because of the timeout the chances of it happening are minimal. If the timeout is removed, this is more likely to happen whenever preemption with requeue is done by the scheduler.

Because it is not really a timeout by the definition of one: it does not abort anything when the time period expires.
In my terms, if we introduce an attribute and call it a timeout, then once the time period expires we should abort whatever the timeout was introduced for.

This becomes a temporary use-case for the timeout until the above race condition is resolved.

@smgoosen, @jon - we would need your inputs on deciding the behaviour.

@arungrover and I had an offline chat regarding the behaviour. We agreed to disagree on the functionality/definition of a timeout.
However, in our discussion I came to understand that when pbs_rerun returns success, the scheduler assumes the job has been requeued. I also thought that the scheduler could requery the server to find the actual state of the job if we removed the timeout, but Arun clarified that this is not possible, as the scheduler would start making wrong decisions.
So, the decision is as follows -
The timeout will be retained, though with a different name, and the functionality will be that the client is informed that the qrerun is still in progress.
Arun, please add/correct anything that I have missed out or misunderstood.

Thanks,
Prakash


Arun is correct that if pbs_rerunjob() returns success, the scheduler will consider the preemption a success. It will move forward and run the high-priority job on the newly freed nodes. Those nodes will be oversubscribed. This is already true today when the timeout is exceeded.

I think you two came up with the right solution. The scheduler is between a rock and a hard place; there is no really good answer. I see 4 different ways we can handle this case.

  1. Remove the timeout and wait for the operation to complete. This ensures no oversubscription will happen, but it will stall scheduling.
  2. Remove the timeout and do not wait. This will always cause oversubscription unless we poll the server and wait until the operation is done, at which point we’re back to #1 above with stalled scheduling.
  3. When the timeout is exceeded, return failure instead of success (even though the requeue will still happen). The scheduler will mark that job as a bad job and go find another. No oversubscription will happen, and we will have a better chance of finding a good set of jobs to preempt. The downside is that we’ve now preempted too many jobs. Those jobs will likely restart right away, but that isn’t guaranteed.
  4. Do what you two are suggesting. If the timeout is exceeded, return success. This will cause oversubscription when the timeout is exceeded, but hopefully there will only be a handful of such cases.

I think #4 is the best of all of the bad options.

Should we file a doc bug to inform the user about this possibility? My first instinct says yes, but there are a lot of corner cases we don’t put in our docs because they would overload the users. Is this corner case important enough to inform the users about?

Now, there are a lot of meanings to oversubscription. In this case, the original job’s processes will be dead, the compute resources will be available, and the network will be in use copying files, but the server will report resources_assigned higher than resources_available.
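
As a rough way to observe that last symptom (the node name is a placeholder), one could compare the two counters on the affected node:

    # While the requeue is still copying files, resources_assigned.ncpus
    # may exceed resources_available.ncpus on the oversubscribed node
    pbsnodes node01 | grep -E "resources_available.ncpus|resources_assigned.ncpus"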

Bhroam

@bhroam, thanks for confirming the best of the options available to us.
I will go ahead and write the EDD.
Regarding the doc bug, I would be in favour of letting the user know about this corner case.

Thanks,
Prakash

Hey Prakash,
Thanks for writing up an EDD.
Could you add the message the qrerun command will receive when the timeout is hit?
Also, instead of just saying the value is 45, could you say “45 seconds” to be clear?

Bhroam