As this cluster has evolved, we have ended up with a mix of hardware and operating systems on the compute nodes. These are split into separate execution queues. What I am missing is how to route jobs to one queue and overflow to the other.
Using the max_queued limit, e.g. “max_queued_res.ncpus = [o:PBS_ALL=260]”, gets close to a solution, except that it counts held jobs against the limit as if they could run. Jobs in the routing queue that could be running are stuck waiting for held jobs to finish.
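For illustration, a limit like that would be set with qmgr along these lines (the queue name here is just a placeholder; the same attribute can also be set at the server level):

    qmgr -c 'set queue gen9r65 max_queued_res.ncpus = [o:PBS_ALL=260]'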
I would expect that someone has a clean solution to this configuration. If this requires a queuejob hook, that’s fine. (It would be even better if someone has said hook available.)
The Big Book shows nice ways to split out the queues by core count, walltime, or memory size. What I did not find was how to feed one queue and, when it is full, move on to another queue.
I think the limit you want is queued_jobs_threshold instead of max_queued. The queued_jobs_threshold limit applies only to jobs in the queued state, whereas max_queued counts both queued and running jobs. There are both queued_jobs_threshold and queued_jobs_threshold_res forms.
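Something like this, for example (the queue name and numbers are just placeholders):

    qmgr -c 'set queue gen9r65 queued_jobs_threshold = [o:PBS_ALL=50]'
    qmgr -c 'set queue gen9r65 queued_jobs_threshold_res.ncpus = [o:PBS_ALL=260]'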
I tried using the suggested parameter of “queued_jobs_threshold_res.ncpus = [o:PBS_ALL=8]” and found that held jobs were still getting in the way of other jobs that could and should run.
It sounded like a good idea and was certainly worth a try. The queued jobs here seem to be blocked by the held jobs.
The demonstration by scc suggests that “queued_jobs_threshold” has the same problem with held jobs.
This whole problem makes me think I am missing something fundamental about how to configure the routing, or more precisely, how to configure the execution queues so that routing behaves as desired.
I THOUGHT I had a good handle on how this works but that is clearly not the case.
Can you expand a bit on what your end goal is? There is something we call “internal peering” that might help you, depending on what exactly you are trying to achieve by having these jobs occupy different queues. In the “internal peering” scenario you could set up the queues to only take jobs from the routing (or execution, really) queue when they are able to run.
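If “internal peering” is configured similarly to the peer scheduling feature (an assumption on my part; the queue and server names below are placeholders), the mapping would live in the scheduler’s sched_config, with the pulling execution queue listed first and the furnishing queue second:

    peer_queue: "gen9r65 ansys_route@pbsserver.example.com"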
-Hardware-
There is a mix of two hardware families (DL380 gen8 & gen9) and two operating systems (RHEL6.4 & RHEL6.5), soon to be joined by RHEL7.3. The resulting permutations are mapped to execution queues gen8r64, gen8r65, and gen9r65, and there will be a gen9r73.
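For illustration, that mapping could be expressed with a custom host-level resource, roughly along these lines (the resource name “platform” and the node name are placeholders, not what is actually configured here):

    qmgr -c 'create resource platform type=string,flag=h'
    qmgr -c 'create queue gen9r65 queue_type=execution'
    qmgr -c 'set queue gen9r65 default_chunk.platform = gen9r65'
    qmgr -c 'set queue gen9r65 enabled = true'
    qmgr -c 'set queue gen9r65 started = true'
    qmgr -c 'set node node101 resources_available.platform = gen9r65'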
-Software-
The applications are a mix of large parallel ANSYS jobs and a large quantity of single-thread (single-core) jobs, across evolving versions of ANSYS. The applications are only allowed to run on certified platforms, so each application and version has a routing queue that feeds only the approved execution destinations. In the case of a single destination, the outcome is easy. In the case of multiple destinations, there is a preferred destination followed by an acceptable alternate destination. That is the situation that drives my original question of how to route to a primary execution queue while spilling over to an alternate execution queue if resources are available.
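For illustration, one of those routing queues looks roughly like this (the queue name is a placeholder), with the preferred destination listed first:

    qmgr -c 'create queue ansys_v17 queue_type=route'
    qmgr -c 'set queue ansys_v17 route_destinations = gen9r65'
    qmgr -c 'set queue ansys_v17 route_destinations += gen8r65'
    qmgr -c 'set queue ansys_v17 enabled = true'
    qmgr -c 'set queue ansys_v17 started = true'

PBS tries the destinations in the order listed, so a job should only fall through to the alternate when the primary will not accept it, which is exactly where the held-job counting gets in the way.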
The large parallel jobs and the small singular jobs need to run in a shared environment and play nice.
CPU (or core) allocation seems like the right way to spread the mix of jobs in the execution queues and that was my original intent with this arrangement. The built-in configuration parameters seem to get really close to a working solution except for the held jobs getting in the way.
Simple isolation of destinations would work fine except for the folks who would then have a problem with non-shared access, “Hey, how come my jobs are queued when there is room to run on those other nodes?”
This question remains unresolved. I needed to get the system back to a stable, unattended mode, so I configured routing to single execution queues with no limits. It works, but it also misses the goal of packing the available execution queues.
If someone has a fix for this idea, please let me know.