Preemption optimization - phase 1

Hi All,

As a step towards optimizing preemption performance in PBS, I am proposing a few changes on this page.

I request the community to provide feedback on it.

Thanks,
Prakash

Thanks for thinking about this Prakash. A couple of comments:

  • Isn’t ‘preempt_order’ a scheduler-wide configuration (as opposed to a job-wide configuration)? If yes, then how about we make it an attribute of the server’s ‘sched’ object instead of sending it over the network for each job being preempted?
  • I don't know much about the scheduler, so this may be a bad idea, but why doesn't the scheduler issue the preempt request asynchronously (assuming success)? I get that it has to issue a runjob request once preemption is successful, but can't that somehow be included in the preemption request (preempt these N jobs; if all goes well, run job X)?

Hi Ravi,

Thank you for looking into the design, below are my replies.

The scheduler determines the preempt_order for each job based on how much of its requested walltime the job has used. I also hear from the field that parameters other than the currently supported walltime may eventually factor into the preempt_order.
So it becomes necessary to pass the per-job preempt_order in the batch request.
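
To make that concrete, here is an illustrative sched_config snippet. The thresholds are made up and I am paraphrasing the semantics from memory; the admin guide is authoritative:

```
# Illustrative only -- made-up thresholds.
preempt_order: "SCR 80 SC 50 S"
# Roughly: a job early in its walltime may be suspended, checkpointed,
# or requeued (SCR); a job nearing completion is only suspended (S).
# The scheduler therefore computes a per-job order from each job's
# walltime usage, which is why the order has to travel with each job.
```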

I believe this cannot be done, for two reasons:

  1. The scheduler needs to know if the preemption was successful or not. If not, it will try to find another set of jobs to preempt.
  2. It also needs to update its universe based on the response so that if there are other jobs apart from “this” job, they can be scheduled.

@prakashcv13 Re: Ravi’s second point - isn’t the way jobs get sent to the server now to run already asynchronous? The scheduler doesn’t wait for the server to try to run the job. Why wouldn’t the same process work for preemption jobs?

Sam,

In the current implementation, the scheduler waits for the server to preempt each job, and if preemption fails for any job, it tries to find another job to preempt.

Thanks,
Prakash

@prakashcv13 Understood, but why couldn’t the scheduler give the list of jobs to the server, then go on to schedule the next job in its list. If the server has a problem preempting one of the jobs then in the next sched cycle the scheduler would put together a new list of jobs, pass those to the server, and so on. Similar to the way a non-preemption job gets run. How often does a job the scheduler selects for preemption fail to get preempted?

@prakashcv13 I am trying to understand your proposal. Please correct me if I'm wrong, but I believe the driving factor behind the changes you are proposing is the performance impact the scheduler sees while preempting a number of jobs.

As of today, preemption in the scheduler is slow because it takes a lot of time to decide which jobs are actually preemption candidates. We can probably try to optimize that algorithm to cull the list of preemption candidates faster.
What you are proposing comes into effect after the decision on preemption candidates has been made. This proposal will make part of the scheduler logic (honoring preempt_order) reside in the server. I have a few questions about the proposal:

  • You mentioned that the scheduler can use a new batch request (IFL call?) to send a list of jobs with their preempt order to the server and have the server do the job of preempting them. But what will the scheduler do after issuing this preemption request to the server? As of today it sends a run-job request to the server (if everything goes fine), but since in this case the server is doing the work of making sure the jobs are preempted, the scheduler won't know whether to run the preemptor or not unless it waits for a response back from the server.
  • What will happen if the server fails to preempt one or some of the jobs, and how will it respond back to the scheduler? Will it respond only with the job ID(s) that it failed to preempt, after which the scheduler will try another set of jobs?
  • What will happen in case only a limited amount of resources can be released during suspension? Since with this new IFL call the scheduler will send the jobs and their preempt order to the server, it will not know which preemption method the server applied, which means it will also not know how many resources it should consider released in order to make a correct scheduling decision.

We should also check, in the example you gave about Cray, where exactly the time is spent. Is the bulk of the time spent in making IFL calls back and forth with the server, or is it that the MoMs are taking a lot of time to suspend a job? Maybe running MoMs natively on the compute nodes might help make preemption faster.

Hi Arun,

Thank you for looking into the design.
You are correct that the proposed changes come into effect only after the scheduler has a list of jobs to preempt.
The approach is two-pronged: this phase addresses reducing the time spent in making IFL calls back and forth with the server.
In the next phase, the server will send parallel requests to the MoMs to preempt jobs.

Having said that, below is my response to your questions -

Yes, it will be a new IFL call and the scheduler will wait for the response from the server.

For the scheduler to look for another set of jobs, it would need a “fail” list containing the jobs that the server could not preempt; the failure response will list them. The design has been updated with this information.

For the success case, the server will send ‘S’ or ‘Q’ for each job in the response.
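
To make the request/response shape concrete, here is a minimal C sketch. Everything in it is hypothetical: the pbs_preempt_jobs() name, the struct layout, and the handling in handle_preempt_reply() only illustrate the design described above, not a final interface.

```c
/* Hypothetical sketch only: pbs_preempt_jobs(), the struct layout, and the
 * "S"/"Q"/fail-list encoding illustrate the design above, not a final
 * interface. */

#define MAXJOBID   64
#define MAXMETHODS 8

struct preempt_job_info {
    char job_id[MAXJOBID];   /* job the scheduler wants preempted           */
    char order[MAXMETHODS];  /* in:  per-job preempt_order, e.g. "SCR"      */
                             /* out: "S" (suspended) or "Q" (requeued) on   */
                             /* success; failures come back in the fail list */
};

/* Proposed batch request: one round trip for the whole candidate list. */
struct preempt_job_info *pbs_preempt_jobs(int connect,
                                          struct preempt_job_info *jobs,
                                          int njobs);

/* Hypothetical scheduler-side handling of the single reply. */
static void handle_preempt_reply(struct preempt_job_info *reply, int njobs)
{
    for (int i = 0; i < njobs; i++) {
        if (reply[i].order[0] == 'S' || reply[i].order[0] == 'Q') {
            /* success: account the job's resources as freed, according
             * to the preemption method the server actually used */
        } else {
            /* on the "fail" list: feed this job back into the search
             * for another set of preemption candidates */
        }
    }
}
```

The key point is the single round trip: one blocking call for the whole candidate list instead of one call per job.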

Sam,

As of now, the scheduler cannot make a scheduling decision for the next job unless it knows which resources have been freed, which depends on whether preemption was successful or not. It is possible that some of the jobs were preempted and some failed, and that even after retries (5) the high-priority job could not run. The amount of resources freed also depends on the preemption method used.

Based on all the above, it is not possible for the scheduler to go ahead and schedule other jobs immediately after sending the preemption request, either currently or after the proposed changes.

If we are modifying the IFL interface, we will need to bump the version number on libpbs.so.

If this is the case, it is very important to know where exactly the slowness in preemption lies. Have you run tests that show IPC between the scheduler and server being the cause of hundreds of seconds of delay? If not, we should probably run such tests and measure the time spent in IFL calls.
I believe that in the case of preemption, IFL takes a minuscule amount of time communicating between server and scheduler, and most of the time is consumed in the MoM while processing the request and in the server while trying to reconcile the responses from the MoM.
While creating an IFL call which can take multiple preempt requests sounds like a fine idea, my worry is that if we create this new IFL call we will be adding a lot of machinery to the server and scheduler and will have very little to gain from it.

I don't think we are adding a lot of machinery, and to go on to the next level (MoMs executing preemption in parallel), it is necessary that the server has the list of jobs to preempt.

@prakashcv13 Perhaps I misunderstood the way the scheduler handles non-preemption jobs. Does it not just assume the resources it told the server to allocate to the job are assigned to that job and then move on as if the job were run successfully? It seems like the big gain for the scheduler with this change would be a one shot delivery to the server of a list of all jobs to be preempted and then not waiting. If there is a problem preempting some job in the list the high priority job will be considered again in the next sched cycle, which should start as soon as the current cycle ends. Is this not possible in the case of preemption jobs?

The primary issue here is that the scheduler serializes the preemption once it has the list of jobs to preempt. If we send the list to the server, it can do this in parallel (since the communication from the server to the MoMs is asynchronous). Granted, the actual preemption work at the MoM can be slow on certain platforms, but when the jobs are distributed across MoMs it would be executing in parallel, giving us a net benefit.

So, I do believe that sending a list of jobs to the server would speed up the overall performance of preemption (at least in the case where more than a few jobs are involved).
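
As a rough sketch of where the parallelism would come from (every identifier below is a placeholder, not existing server code):

```c
/* Hypothetical sketch of the server-side fan-out; send_preempt_to_mom()
 * and collect_mom_replies() are placeholders, not existing server code. */

struct preempt_req {
    const char *job_id;   /* job to preempt               */
    int mom_stream;       /* connection to that job's MoM */
};

void send_preempt_to_mom(int mom_stream, const char *job_id);  /* placeholder */
void collect_mom_replies(struct preempt_req *reqs, int n);     /* placeholder */

void preempt_in_parallel(struct preempt_req *reqs, int n)
{
    /* Phase 1: issue every request without waiting.  The server-to-MoM
     * protocol is already asynchronous, so this loop is cheap. */
    for (int i = 0; i < n; i++)
        send_preempt_to_mom(reqs[i].mom_stream, reqs[i].job_id);

    /* Phase 2: reconcile the replies as they arrive.  The elapsed time is
     * roughly that of the slowest MoM, instead of the sum over all jobs
     * that the serialized per-job round trips cost today. */
    collect_mom_replies(reqs, n);
}
```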

Sam - below is my understanding of why things are as they are -

It does, because the scheduler has concrete information about the resources available, and when running a job the scheduler is “reducing” the amount of resources.
During preemption, the scheduler is in the process of “freeing” some resources, so it depends on the result of the preemption to be sure of the amount of resources it has available to schedule other jobs. If preempting any job fails, it will try to find some other set of jobs to preempt. So it makes sense for the scheduler to wait for the response.

@smgoosen
What we did with non-preempting jobs will not work here. The problem isn’t preempting the jobs. The problem is where do we run the high priority job. If we just assume the preemption of the jobs worked, we’d run the high priority job on those resources. If any job failed to be preempted, we’d oversubscribe those nodes. We really do need to know if preemption has been successful before we move on.

Bhroam

Hi – I’m coming to this party late, but… If the motivation for this change is performance, we should start with real performance data. How fast is it now? Where is the time spent? What is the performance goal?

(If it’s not performance we’re after here… Nevermind.)

Yes, I believe performance is the goal here.
Agreed, we should do a performance test; some data is available, and more can be gathered. The challenge, of course, is going to be creating the right test setup (if we do a regular simple test, we might miss the issue).

@bhroam So assuming we do what Ravi suggests above:

Why would the server run job X if it knows one or more of the preemptions failed?

I was talking about the async preemption idea. In the case of running jobs, we can move on. Running a job is not a two-step process: we run the job and assume all of its nodes are busy for the cycle. If the job failed to run, the scheduler will look at it again the following cycle.

In the preemption case, there is a two-step process. Step 1 is to preempt jobs. Step 2 is to run the high-priority job. My point is that we can't move on to step 2 until step 1 has finished. If we did, we'd oversubscribe the nodes while the preemption was happening.

The only way to make this async for the scheduler would be to create an IFL call that would preempt the jobs and then, if that was successful, run the high-priority job.

This is doable, but we’d lose the robustness our current algorithm has. If we fail to preempt jobs, we will go back and search for more jobs to preempt. If we implemented the above IFL call, we’d only search once and then move on.
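
To spell the tradeoff out, here is rough pseudocode in C. Every type and function is a placeholder, not actual scheduler code; the retry count of 5 is the one Prakash mentioned above.

```c
/* Rough sketch of the tradeoff; every type and function here is a
 * placeholder, not actual scheduler code. */

#define MAX_RETRIES 5    /* the retry count mentioned earlier in the thread */

typedef struct job job;              /* opaque placeholders */
typedef struct job_list job_list;

job_list *find_jobs_to_preempt(job *hp_job);
int  preempt_and_wait(job_list *candidates);       /* blocking, as today    */
int  run_job(job *hp_job);
void send_preempt_then_run(job_list *candidates, job *hp_job);

/* Today: synchronous but robust -- a failed set triggers a new search. */
int preempt_then_run_today(job *hp_job)
{
    for (int attempt = 0; attempt < MAX_RETRIES; attempt++) {
        job_list *cands = find_jobs_to_preempt(hp_job);
        if (cands == NULL)
            break;                      /* nothing left worth preempting   */
        if (preempt_and_wait(cands) == 0)
            return run_job(hp_job);     /* step 2 only after step 1 works  */
    }
    return -1;                          /* give up for this cycle          */
}

/* Async "preempt these N jobs, then run X" call: one search, one shot.
 * Any preemption failure is only discovered in the next cycle. */
void preempt_then_run_async(job *hp_job)
{
    job_list *cands = find_jobs_to_preempt(hp_job);
    if (cands != NULL)
        send_preempt_then_run(cands, hp_job);
}
```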

Is the loss of robustness worth the speedup?

Bhroam