PP-480: Job Equivalence Class Optimization

Please update the EDD to be explicit regarding what happens to other jobs in an equivalence class if a job is scheduled to run but the job run event is rejected by the server (for example, a runjob hook rejects the action).

I’ve made a small update to the EDD. This is due to a bug in the feature that was fixed post-checkin (PP-828). A suspended job could cause other jobs to not run even though they could. This was due to the fact that a suspended job has a select statement generated for it based on the original select statement and the vnodes the job is running on. This will very likely result in each suspended job being placed into its own equivalence class. The only case where multiple suspended jobs would be placed in the same class is if two equivalent jobs are sharing the same vnodes.

Bhroam

@bhroam, I am not clear on what you mean when you say “This will likely result in a suspended job running in its own equivalence class.” Is it not always the case that suspended jobs will go to their own job equivalence class?
Also, I am still not clear whether there will be only one equivalence class for all suspended jobs, irrespective of their original job equivalence class, or multiple classes, one for each original job equivalence class.

@anamika
Think of the case of two identical jobs sharing the same host. The special select spec is generated from the original select spec and the vnodes the job is running on. Since these two jobs are identical and both running on the same vnodes, they’ll have the same special select statement (and therefore the same equiv class). This isn’t something I foresee happening very often. I was just trying to be thorough when I mentioned it.
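As a rough illustration of the case described above (not the scheduler's actual code), the sketch below builds a class key for a suspended job from its original select spec plus the vnodes it occupies. The `Job` type, the `class_key` function, and the sample select strings are all hypothetical names made up for this example.

```python
from typing import NamedTuple, Tuple


class Job(NamedTuple):
    name: str
    select: str                    # original select spec
    vnodes: Tuple[str, ...] = ()   # vnodes the job is running on
    suspended: bool = False


def class_key(job: Job) -> tuple:
    """Hypothetical equivalence-class key, for illustration only."""
    if job.suspended:
        # Assumption: the special select is a function of the original
        # select spec and the set of vnodes the job is running on.
        return (job.select, tuple(sorted(job.vnodes)))
    return (job.select,)


a = Job("a", "1:ncpus=4", ("vn1",), suspended=True)
b = Job("b", "1:ncpus=4", ("vn1",), suspended=True)
c = Job("c", "1:ncpus=4", ("vn2",), suspended=True)

same_class = class_key(a) == class_key(b)   # identical jobs, same vnodes
diff_class = class_key(a) != class_key(c)   # identical jobs, different vnodes
```

Under these assumptions, `a` and `b` land in the same class because both the select spec and the vnodes match, while `c` gets a class of its own.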

Bhroam

I was contemplating this change the other day and I was curious… Is this option compatible with “round robin” scheduling of queues? That is, is there a set of equivalence classes maintained “per queue”, or is there one list that it’s working from after sorting and merging the queues (which would cause an equivalence “class” to effectively span across the queues)? Nothing in the EDD appears to speak to this, and a simple search of the comments above didn’t appear to mention it either (I apologize in advance if I missed something).

@arwild01
This feature is orthogonal to algorithms that determine the order in which jobs are considered (e.g., round robin). Jobs are still considered in the same order they were before. The difference is whether we actually do the work to determine if a job can run. After the first job in a class can’t run, we mark the class as can’t run. This means the next time we see a job in this class, we know the job can’t run. There is no reason to go searching through the nodes to find the same answer again.
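The idea can be sketched as a loop that remembers which classes have already failed. This is a minimal sketch, not the scheduler's implementation: `run_cycle`, `class_of`, and `can_run` are hypothetical names, and the expensive node search is stood in for by a simple callback.

```python
def run_cycle(jobs, class_of, can_run):
    """Walk jobs in their existing order, skipping classes known not to run.

    class_of and can_run are hypothetical callbacks supplied by the caller.
    """
    blocked = set()   # classes already marked "can't run" this cycle
    started = []
    checks = 0        # how many times the expensive check was performed

    for job in jobs:
        cls = class_of(job)
        if cls in blocked:
            continue  # same answer as before, so skip the node search
        checks += 1
        if can_run(job):
            started.append(job)
        else:
            blocked.add(cls)
    return started, checks


# Jobs are (name, class) pairs here; only the "small" class fits.
jobs = [("j1", "big"), ("j2", "small"), ("j3", "big"), ("j4", "big")]
started, checks = run_cycle(
    jobs,
    class_of=lambda job: job[1],
    can_run=lambda job: job[1] == "small",
)
```

The job order is untouched, so whatever ordering policy produced `jobs` (round robin or otherwise) is preserved. In this example, four jobs need only two full checks because `j3` and `j4` share a class with `j1`.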

To touch on the rest of your question, there are only a handful of cases where the queue a job is in matters to the equivalence class. When one of those cases comes up, the queue is used to differentiate classes. The EDD does try to explain the logic used to create the equivalence classes. If you look starting at the second paragraph, it lists the attributes/resources used and at what times they are used. For example, queues are used if the queue is a primetime queue (there are other queue cases as well).
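One way to picture the queue rule is a class key that includes the queue only when the queue matters. The sketch below covers just the primetime-queue case mentioned above; `queue_aware_key`, the queue names, and the select string are hypothetical, and the EDD remains the reference for the full list of cases.

```python
def queue_aware_key(select, queue, primetime_queues):
    """Hypothetical class key: the queue is part of the key only when it matters."""
    # Assumption: a primetime queue is the only queue case modeled here.
    queue_part = queue if queue in primetime_queues else None
    return (select, queue_part)


primetime = {"ptq"}
k_work = queue_aware_key("1:ncpus=2", "workq", primetime)
k_other = queue_aware_key("1:ncpus=2", "otherq", primetime)
k_prime = queue_aware_key("1:ncpus=2", "ptq", primetime)
```

With this key, otherwise identical jobs in `workq` and `otherq` share one class that spans both queues, while the same request in the primetime queue gets a separate class.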

If you have any additional questions, I’m happy to answer them.

Bhroam