PP-829: Preemption via deletion

I’ve added a design document for PP-829. The scheduler will now have the ability to preempt jobs by deleting them. The preempt_order sched_config option will be extended with the ‘D’ letter to signify deletion.

@bhroam I do not have any comments on the design proposal. I have a small question though: does this mean that when the scheduler issues the delete command to PBS, it will use the “-W force” option? Without that option there might be some delay in deleting the job, and the scheduler will have to wait.

Thanks for your comment @arungrover
You bring up a very interesting point. If we do a normal qdel, the job might take some time to be deleted. The scheduler might be trying to start the high priority job before the preempted jobs are deleted. Using qdel -Wforce sounds like a good solution to this issue, but I think it just sweeps the same issue under the rug. Before the qdel -Wforce returns to the scheduler, the job will be purged from the server’s database. The server immediately shows the resources as free, but the mom is still doing end of job processing.

I see three issues:
First is cleanup hooks. If we start a new job before the cleanup hooks are finished, the new job might be cleaned up.
Second is begin hooks. If we have a begin hook that makes sure the node is prepped for the job, it might clean up the old job. This might not be too bad. It would be worse if the cleanup and the begin hooks clash.
Last is Cray or cpuset machines, or machines running cgroups. We previously told the operating system to carve out part of the machine for a job. Part of end of job processing is to release those resources back to the machine. If the scheduler runs the high priority job before the resources are returned, the new OS request will be rejected (e.g., ALPS reservation). This by itself is bad, but it gets worse. We’ve just deleted jobs to run our high priority job. The runjob of the high priority job fails. The newly freed resources will likely be filled by new jobs. On subsequent cycles, the whole process will start again.

The only way I can see for us to avoid all of these traps is to wait for the deletes to finish before running the high priority job. Unfortunately, this slows the scheduler down.
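
To make the ordering concrete, here is a minimal Python sketch of "wait for the deletes, then run". It is not scheduler code: `delete_job`, `job_gone`, and `run_job` are hypothetical callbacks standing in for whatever the scheduler would use to issue a normal delete, check that end of job processing has finished, and run the high priority job. None of them are real PBS or IFL calls.

```python
import time
from typing import Callable, Iterable


def run_after_deletes(
    victims: Iterable[str],
    delete_job: Callable[[str], None],  # hypothetical: issue a normal (non-forced) delete
    job_gone: Callable[[str], bool],    # hypothetical: True once end of job processing is done
    run_job: Callable[[], bool],        # hypothetical: try to run the high priority job
    poll_interval: float = 0.0,
) -> bool:
    """Delete each victim, block until it is really gone, and only then run the new job."""
    for job_id in victims:
        delete_job(job_id)
        # Block here: hooks and OS-level resource release must finish first.
        while not job_gone(job_id):
            time.sleep(poll_interval)
    # Every victim has finished end of job processing, so the run is safe to attempt.
    return run_job()
```

The inner loop is where the scheduler slowdown comes from: each delete is waited on one at a time, and the sketch has no timeout, so a job that never goes away would block the cycle.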

Bhroam

Hi Bhroam, thanks for putting this together!

I had always thought that qsub -c n only had an impact on periodic checkpointing, but I see that it does indeed prevent checkpoint_abort as well. Extending that line of thinking from qsub -r n and qsub -c n, what are your thoughts on introducing a new interface (possibly as part of a separate project) to mark jobs as non-suspendable? Once this work is finished, all of the other methods can be allowed to fail acceptably, since the admin can always have “preempt by delete” as a backup.

Hey @scc
Extending PBS to not allow jobs to be suspended is doable. I view it as a separate (but related) RFE. Another separate but related RFE is to parallelize preemption. If PBS is going to have to wait for jobs to be deleted, it will be best to submit all the delete requests at once and then poll the whole lot. It would require some server and IFL work, but the same idea could be applied to checkpointing and requeuing jobs. Of course, waiting for a preemption method to complete like this will cause problems with our error detection. Do we wait for a job to requeue before we delete it? When do we give up and move on to our next preemption method?

Any or all of these could be clubbed together and implemented at the same time.
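
The "submit everything, then poll the whole lot" idea can be sketched in Python as below. This is only an illustration under assumptions: `submit` and `is_done` are hypothetical callbacks (not existing server or IFL calls), and the `timeout` is one possible answer to "when do we give up", not a proposed design.

```python
import time
from typing import Callable, Iterable, Set, Tuple


def preempt_batch(
    victims: Iterable[str],
    submit: Callable[[str], None],   # hypothetical: send one preemption request, do not wait
    is_done: Callable[[str], bool],  # hypothetical: True once the job has fully left its nodes
    timeout: float,
    poll_interval: float = 0.1,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[Set[str], Set[str]]:
    """Submit every request up front, then poll the whole set until done or timed out.

    Returns (finished, still_pending).
    """
    pending = set(victims)
    for job_id in pending:
        submit(job_id)

    finished: Set[str] = set()
    deadline = clock() + timeout
    while pending:
        # One polling pass over every outstanding job.
        done_now = {job_id for job_id in pending if is_done(job_id)}
        finished |= done_now
        pending -= done_now
        if not pending or clock() >= deadline:
            break
        sleep(poll_interval)
    return finished, pending
```

The total wait is bounded by the slowest job rather than the sum over all jobs. Whatever is left in `still_pending` at the deadline is where the error detection question shows up: the caller has to decide whether to keep waiting or to try the next preemption method on those jobs.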

Bhroam