Preemption optimization - phase 1

Thanks for explaining, Bhroam. I think it might be worth exploring how often Step 1 fails (i.e., preempting the jobs on the first try). If Step 1 passes in a single try in most cases, then it might be worth optimizing for the common case.

If we want to maintain the robustness, here’s a bad idea that you will hate: how about when Step 1 fails, the server sets an invisible attribute on the highp job to the list of jobs that it could not preempt, so that in the next cycle when the scheduler will try to preempt jobs for the highp job again, it will ignore the jobs set on that attribute.

We could also try to just preempt the job again without any information from the past (except maybe a max_preempt_attempts); maybe the world changes enough that we can find the right set for it next time, or maybe it’ll just run the next time. So, we could do a POC to see how the wait time of highp jobs gets affected if we simplify things for better performance.

Has there been any analysis into how often Step 1 fails? Was the POC Ravi mentions ever tried?

@Bhroam I’m not sure I understand how this is less “robust” than running regular jobs. In either case if there is a failure, whether it be a node down for a regular job or a job that fails to be preempted, the job that’s trying to run goes back in the queue until the next cycle.

In the case of running jobs, the scheduler does not find out until the next cycle that a job did not run; the job just gets returned to the queue on the server. For running jobs, this is perfectly fine. All that matters is that the scheduler assumes the resources used by those jobs are not available. We have fewer resources available to run jobs, and no over-subscription happens in this case.

In the case of preemption, we are going in the opposite direction. We are freeing up resources. If a preemption fails, then resources are still being used. If the scheduler just assumes all jobs were preempted correctly, it will run the high priority job on those still-used resources. If any of the low priority jobs failed to be preempted, we now have over-subscription.

As for analysis, I don’t believe this has made any further progress in the last month. @prakashcv13 can correct me if I am incorrect.

Bhroam

Hi @bhroam, @smgoosen - I am working on the POC to compare the before-and-after performance. The IFL that I am currently implementing is as per the design that is already posted. I will be able to post the findings by end of next week.

Coming to the question of analyzing how often step 1 fails, I have not worked on that because I feel that it is better to transfer only the logic of preempting the jobs to the server.

I take it back, @arungrover is working on making the scheduler choose its set of preemption candidates faster. We occasionally work on scheduler scalability, and this time we’re working on preemption. It should be checked in soon.

Hi All,

I finally have some information to share. After shifting the logic of preempting the jobs from the scheduler to the server, and implementing the new batch request as per the proposal, I see an improvement in the performance.

The implementation that I have done so far only suspends the jobs (yet to implement checkpointing and re-queueing).

The test that I performed submits a set of normal jobs that get preempted by one express job which uses all the ncpus in the complex. The test records the time taken to preempt the normal jobs. The number of normal jobs increases from 1 to 149.

I have attached the test script and the output to the Open Confluence page.

As the results show, the time taken to preempt the jobs has been reduced by a good percentage.

Thanks,
Prakash

Hey @prakashcv13

Thanks for doing the testing. I’m somewhat surprised we get a 3x speedup when preempting 150 jobs. I would have thought it would be smaller.

My suggestions are the following:

  1. Don’t put preempt_order in the IFL call itself. Just move it from the sched_config file to the scheduler object. This way the server and the scheduler both have access to it.
  2. The IFL calls don’t return a string with data embedded in it. They return a structure. Either create a new structure, or return a batch_status. It would be a kind of hokey version of the batch_status, since you’d have to return whether the preemption failed or succeeded as an “attribute”.

Hi @bhroam,

Thank you for going through the test and the results, and for the feedback. Below is my understanding -

As far as I understand how preemption works, the scheduler finds a preempt order for each job individually, based on the preempt_order value in sched_config. The server would not need the configuration setting itself, but rather the order in which to try preempting each job, which the scheduler dynamically “calculated” at the time of running a high priority job.

The implementation that I have done so far is using a new structure in the union of the batch_reply structure.

There is only one global preempt_order. It just affects a job differently depending on how much time is left in the job. My suggestion was to move preempt_order to the server. It would most likely work best in the sched object. Once the server had the preempt_order, it could calculate how much time is left and find the correct preempt order for a job.
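To make the “one global preempt_order” point concrete, here is roughly what such a line looks like in sched_config (percentages and methods chosen for illustration; see the PBS Professional Administrator’s Guide for the exact semantics):

```
# S = suspend, C = checkpoint, R = requeue.
# Roughly: while a job has more than 80% of its walltime remaining,
# try suspend, then checkpoint, then requeue; between 80% and 50%,
# suspend then checkpoint; at 50% or less, suspend only.
preempt_order: "SCR 80 SC 50 S"
```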

Can you update the document to how you plan to implement this?

Thanks,
Bhroam

Hi @bhroam

I am not sure we would gain any value-add by doing this. In my view, preempt_order is more of a scheduler parameter and should stay as-is. If we just move the logic of determining the preemption order to the server, it will no longer be of any use as a scheduling configuration parameter. So, I am of the view that we should keep the logic in the scheduler itself.

Thanks,
Prakash

The only purpose of preempt_order is to determine which preemption method is used for jobs to be preempted. If we go with your suggested method, the only thing the scheduler will use it for is to pass it to the server. Why have this pass-through? Why not move it to the server where it is needed? The scheduler won’t care any more once it isn’t determining what preemption method is used.

Bhroam

Hi @bhroam,

I agree that the scheduler need not send the preempt_order in the request; however, the scheduler uses the preempt method to determine the state of the job in the “_end” functions, so the response does need to have the preempt_method.

Also, I propose that instead of making preempt_order a part of the scheduler object, why not move it completely to the server and make it something that can be set through qmgr?

Am I proposing something that would affect a lot of test scripts?

Thanks,
Prakash

@prakashcv13
You are correct, the scheduler does need to know how the job was preempted. If it was suspended and restrict_resources_to_release_on_suspend is set, it needs to know not to release all the resources. If the job is checkpointed, it needs to know not to release its exec_vnode.

So yes, the return batch_status will need to tell the scheduler how the jobs were preempted.

I totally agree with moving preempt_order completely to the sched object. You will need to remember upgrades. There is code in the scheduler which basically does a qmgr (the internal IFL call of qmgr) of certain attributes during the first scheduling cycle. If preempt_order is in the sched_config file, you’ll want to set it in the sched object.

Bhroam

Hi @bhroam,

I have updated the design as per our discussion, request you to review it.

Thanks,
Prakash

Hey @prakashcv13
Thanks for updating the design document. I have a few comments:

  1. We no longer add log messages to design documents. Please remove them.
  2. There is no real reason to pass the number of jobs being preempted in the new IFL call. This can easily be calculated.
  3. None of the other IFL calls take a special structure, although most of the other IFL calls only work on one object. The only other IFL call that works on multiple objects is pbs_statjob(). It takes a comma-separated list of names. Consider doing that. If you choose not to do that, at least NULL-terminate your input list so you can count it.
  4. Consider using the batch_status structure as a return structure. All of the other IFL calls use it. I understand why you didn’t use it, though. It doesn’t fit 100%. It returns an object with a list of attributes. A preemption method is not an attribute, but we could make it a “dummy” attribute for this purpose.
  5. The new IFL call signature doesn’t have a return value. I think it is supposed to be your new structure.
  6. I didn’t see how you return preemption failures.

Thanks,
Bhroam

Hi @bhroam,

Thank you for the review. I have updated the design as per the comments.

Removed.

Agree, removed.

I chose the latter option.

I would prefer to do it in the way I have done it. I do not see any merit in working around the logic to make preempt_method look like an attribute for the sake of being generic. What do you think?

Updated.

Added the detail.

Regards,
Prakash

Thanks Prakash for updating the design document. I have a few queries:

1. If the admin does not set the preemption attributes through qmgr, what default values are set, and are they visible in the qmgr -c “print sched” output? Also, if they are set by the admin, will that change be reflected in the qmgr -c “print sched” output?
2. How would this change affect the PBS upgrade behavior if we are upgrading from 18.x to 19.x? Do we need to set the preemption attribute values in sched_config or through qmgr for a successful upgrade? Does any extra step need to be added to the upgrade procedure to reflect this change?
3. Would the log message informing the user that setting a <preemption_parameter> through sched_config is deprecated be logged at the default scheduler log level?
4. I assume that for this release we support backward compatibility for the preemption attributes in the sched_config file?

Hi @Klovely,

Thank you for reviewing the document, below is my response.

For all the parameters that are moved, if the admin does not set them explicitly, PBS will set them to their default values. The default values for all these parameters are already updated in the document.
Yes, in both instances, the values will be displayed in the print sched output.

I have updated the document with this information.

Yes.

Yes, for now we are just deprecating them, to maintain backward compatibility.

Thanks,
Prakash

Hey @prakashcv13
Thanks for updating the document. I only have a few minor comments.

I’m a little confused about how the new IFL call will work. If it only returns an int, where is the per-job return value coming from? Is it in the input parameter? I don’t like this because there is no way for the IFL call to know what is freeable and what is not. I’d return the new structure. I don’t find the int very useful: you need to iterate through the jobs anyway, so why does it matter whether we know if the call as a whole was successful or not?

If you stick with the int return value, what does success and failure mean? Does success mean all jobs were successfully preempted? Does it mean at least one job was successfully preempted? Does failure mean all jobs failed to be preempted? Does it mean at least one job failed to be preempted?

For the moved attributes, docs will want to know who can set them and who can read them.

Speaking of the new attributes, do you want to mention that there is a change in who can set them? Before, root/administrator was required to edit the sched_config. Now a manager(?) can modify them. I like the new behavior, but I don’t know if we should point it out, or whether just saying in each interface that a manager can set them is enough.

Bhroam

Hi @bhroam,

Thanks again for the review.

The API need not free anything; that will be handled by the caller.

I have updated the design with this detail.

Added the detail to the design, separately for each parameter.

Thanks,
Prakash