How to route to multiple execution queues?

Hi,

As this cluster has evolved, we have ended up with a mix of hardware and operating systems on the compute nodes, split into separate execution queues. What I am missing is how to route jobs to one queue by preference and overflow to another when the first fills up.

Using the max_queued limit ("max_queued_res.ncpus = [o:PBS_ALL=260]") gets close to a solution, except that held jobs count against the limit just as running jobs do. Queued jobs in the routing queue that could be running end up stuck waiting for held jobs to finish.
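For reference, that limit is set on the execution queue like so (values straight from my config):

qmgr -c "set queue gen8r64 max_queued_res.ncpus = [o:PBS_ALL=260]"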

I would expect someone has a clean solution for this configuration. If it requires a queuejob hook, that's fine. (It would be even better if someone already has such a hook available.)
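Roughly the kind of hook I have in mind is sketched below. This is completely untested, and I am going from memory of the Hooks guide for the pbs.server().queue(), queue.jobs(), job.queue.name, and pbs.JOB_STATE_* interfaces, so please verify them against the 13.0 documentation; the queue names and ncpus caps are from my own config.

# route_overflow.py -- UNTESTED sketch of a queuejob hook
import pbs

# Map each routing queue to its (execution queue, ncpus cap) list,
# preferred destination first.  Names and caps are from my config.
ROUTES = {
    "ANSYS160": [("gen8r64", 260), ("gen8r65", 40), ("gen9r65", 180)],
    "ANSYS171": [("gen9r65", 180), ("gen8r65", 40)],
}

def busy_ncpus(q):
    # Sum ncpus over queued + running jobs only; held jobs are ignored,
    # which is exactly the behavior max_queued_res does not give me.
    total = 0
    for j in q.jobs():
        if j.job_state in (pbs.JOB_STATE_QUEUED, pbs.JOB_STATE_RUNNING):
            ncpus = j.Resource_List["ncpus"]
            if ncpus is not None:
                total += int(ncpus)
    return total

e = pbs.event()
job = e.job

# job.queue is the queue the job was submitted to (may be unset if the
# submitter relied on the default queue).
qname = job.queue.name if job.queue else None
if qname not in ROUTES:
    e.accept()  # not one of my routing queues; nothing to do

need = int(job.Resource_List["ncpus"] or 1)
for dest, cap in ROUTES[qname]:
    destq = pbs.server().queue(dest)
    if busy_ncpus(destq) + need <= cap:
        job.queue = destq  # bypass routing, go straight to this queue
        e.accept()

e.accept()  # every destination full; leave the job in the routing queue

If something along these lines works, it would be installed with roughly:

qmgr -c "create hook route_overflow event=queuejob"
qmgr -c "import hook route_overflow application/x-python default route_overflow.py"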

The Big Book shows nice ways to split out queues by core count, walltime, or memory size. What I did not find was how to feed one queue and, when it is full, move on to another.

The PBS Pro version is 13.0.2.153173.

Many thanks,
-Jeff

Hello Jeff,

Could you paste your route queue and destination queue definitions from qmgr here?

Feel free to anonymize the data (e.g. machine names, etc.) if necessary.

Hi Dave,

Here’s most of the queue config info:
[root@<cluster_head_node> ~]# qstat -Qf

Queue: gen8r65
queue_type = Execution
total_jobs = 2
state_count = Transit:0 Queued:0 Held:1 Waiting:0 Running:1 Exiting:0 Begun:0
max_queued_res.ncpus = [o:PBS_ALL=40]
resources_assigned.mpiprocs = 20
resources_assigned.ncpus = 20
resources_assigned.nodect = 1
hasnodes = True
enabled = True
started = True

Queue: gen9r65
queue_type = Execution
total_jobs = 19
state_count = Transit:0 Queued:0 Held:12 Waiting:0 Running:7 Exiting:0 Begun:0
max_queued_res.ncpus = [o:PBS_ALL=180]
resources_assigned.mem = 3gb
resources_assigned.mpiprocs = 74
resources_assigned.ncpus = 77
resources_assigned.nodect = 7
hasnodes = True
enabled = True
started = True

Queue: gen8r64
queue_type = Execution
total_jobs = 26
state_count = Transit:0 Queued:0 Held:3 Waiting:0 Running:23 Exiting:0 Begun:0
max_queued_res.ncpus = [o:PBS_ALL=260]
resources_assigned.mem = 11gb
resources_assigned.mpiprocs = 220
resources_assigned.ncpus = 241
resources_assigned.nodect = 32
hasnodes = True
enabled = True
started = True

Queue: ANSYS160
queue_type = Route
total_jobs = 1
state_count = Transit:0 Queued:0 Held:1 Waiting:0 Running:0 Exiting:0 Begun:0
route_destinations = gen8r64,gen8r65,gen9r65
enabled = True
started = True

Queue: ANSYS171
queue_type = Route
total_jobs = 0
state_count = Transit:0 Queued:0 Held:0 Waiting:0 Running:0 Exiting:0 Begun:0
route_destinations = gen9r65,gen8r65
enabled = True
started = True

Queue: app_N
queue_type = Route
total_jobs = 0
state_count = Transit:0 Queued:0 Held:0 Waiting:0 Running:0 Exiting:0 Begun:0
route_destinations = gen8r64,gen9r65
enabled = True
started = True

Queue: app_V
queue_type = Route
total_jobs = 0
state_count = Transit:0 Queued:0 Held:0 Waiting:0 Running:0 Exiting:0 Begun:0
route_destinations = gen8r64
enabled = True
started = True

Queue: default
queue_type = Route
total_jobs = 0
state_count = Transit:0 Queued:0 Held:0 Waiting:0 Running:0 Exiting:0 Begun:0
route_destinations = gen8r64
enabled = True
started = True

Queue: gen9
queue_type = Route
total_jobs = 0
state_count = Transit:0 Queued:0 Held:0 Waiting:0 Running:0 Exiting:0 Begun:0
route_destinations = gen9r65
enabled = True
started = True

I think the limit you want is max_queued_threshold instead of max_queued. The max_queued_threshold limit counts only jobs in the queued state, whereas max_queued counts the total number of jobs. There are both a max_queued_threshold and a max_queued_threshold_res.

Bhroam

Hi Bhroam, I think you are thinking of queued_jobs_threshold, right? If so, it does count held jobs:

[root@centos7 tmp]# qmgr -c "p q r"
#
# Create queues and set their attributes.
#
#
# Create and define queue r
#
create queue r
set queue r queue_type = Route
set queue r route_destinations = workq
set queue r route_destinations += workq2
set queue r enabled = True
set queue r started = True
[root@centos7 tmp]# qmgr -c "p q workq"
#
# Create queues and set their attributes.
#
#
# Create and define queue workq
#
create queue workq
set queue workq queue_type = Execution
set queue workq from_route_only = True
set queue workq kill_delay = 20
set queue workq enabled = True
set queue workq started = True
set queue workq queued_jobs_threshold = [o:PBS_ALL=4]
[root@centos7 tmp]# qmgr -c "p q workq2"
#
# Create queues and set their attributes.
#
#
# Create and define queue workq2
#
create queue workq2
set queue workq2 queue_type = Execution
set queue workq2 from_route_only = True
set queue workq2 enabled = True
set queue workq2 started = True


[user1@centos7 ~]$ echo "sleep 1000000" | qsub
7303.centos7
[user1@centos7 ~]$ qhold 7303.centos7
[user1@centos7 ~]$ echo "sleep 1000000" | qsub
7304.centos7
[user1@centos7 ~]$ echo "sleep 1000000" | qsub
7305.centos7
[user1@centos7 ~]$ echo "sleep 1000000" | qsub
7306.centos7
[user1@centos7 ~]$ echo "sleep 1000000" | qsub
7307.centos7
[user1@centos7 ~]$ qstat
Job id            Name             User              Time Use S Queue
----------------  ---------------- ----------------  -------- - -----
7303.centos7      STDIN            user1                    0 H workq
7304.centos7      STDIN            user1                    0 Q workq
7305.centos7      STDIN            user1                    0 Q workq
7306.centos7      STDIN            user1                    0 Q workq
7307.centos7      STDIN            user1                    0 Q workq2

I tried the suggested parameter, "queued_jobs_threshold_res.ncpus = [o:PBS_ALL=8]", and found that held jobs were still getting in the way of other jobs that could and should run.

test_head:
                                                                                                   Req'd  Req'd   Elap
Job ID                         Username        Queue           Jobname         SessID   NDS  TSK   Memory Time  S Time
------------------------------ --------------- --------------- --------------- -------- ---- ----- ------ ----- - -----
2373.test_head                 some_user       ANSYS171        ANSYS                 --    1     4     --    -- Q    --
2374.test_head                 some_user       ANSYS171        ANSYS                 --    1     4     --    -- Q    --
2375.test_head                 some_user       ANSYS171        ANSYS                 --    1     4     --    -- Q    --
2371.test_head                 some_user       gen9r65         ANSYS                 --    1     4     --    -- H    --
2372.test_head                 some_user       gen9r65         ANSYS                 --    1     4     --    -- H    --

It sounded like a good idea and was certainly worth a try, but the queued jobs here are clearly blocked by the held jobs.

scc's demonstration shows that queued_jobs_threshold has the same problem with held jobs.

This whole problem makes me suspect I am missing something fundamental about how to configure the routing, or more precisely, how to configure the execution queues so that routing behaves as desired.

I THOUGHT I had a good handle on how this works but that is clearly not the case.

Thanks!

Can you expand a bit on what your end goal is? There is something we call "internal peering" that might help you, depending on exactly what you are trying to achieve by having these jobs occupy different queues. In the internal-peering scenario you can set up the queues to take jobs from the routing (or execution, really) queue only when they are able to run.
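Roughly, peering is configured in the scheduler's sched_config file rather than through qmgr. An internal-peering sketch (untested as written here, with <your_server> standing in for your own server name, since the "peer" is the same complex) might look like:

peer_queue: "gen9r65 ANSYS171@<your_server>"
peer_queue: "gen8r65 ANSYS171@<your_server>"

The scheduler needs to reread its config (e.g. a HUP) after the edit. In this arrangement ANSYS171 would hold the jobs itself rather than being a route queue, and the scheduler would move a job into gen9r65 or gen8r65 only when it can actually run there.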

Certainly!

This use case here involves the following:

-Hardware-
There is a mix of two hardware families (DL380 gen8 & gen9) and two operating systems (RHEL 6.4 & RHEL 6.5), soon to be joined by RHEL 7.3. The resulting permutations are mapped to the execution queues gen8r64, gen8r65, and gen9r65, and there will eventually be a gen9r73.

-Software-
The applications are a mix of large parallel ANSYS jobs and a large quantity of single-thread (single-core) jobs, across evolving versions of ANSYS. The applications are only allowed to run on certified platforms, so each application and version has a routing queue that feeds only the approved execution destinations. With a single destination, the outcome is easy. With multiple destinations, there is a preferred destination followed by an acceptable alternate. That is the situation behind my original question: how to route to a primary execution queue while spilling over to an alternate execution queue when resources are available there.
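For what it's worth, the preference ordering itself is easy, since PBS tries route_destinations in the order they are listed; the preferred queue simply goes first:

qmgr -c "set queue ANSYS171 route_destinations = gen9r65"
qmgr -c "set queue ANSYS171 route_destinations += gen8r65"

The missing piece is making the first destination refuse jobs only when its nodes are genuinely busy, rather than when held jobs have inflated the count.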

The large parallel jobs and the small single-core jobs need to run in a shared environment and play nice.

CPU (or core) allocation seems like the right way to spread the mix of jobs across the execution queues, and that was my original intent with this arrangement. The built-in configuration parameters get really close to a working solution, except for the held jobs getting in the way.

Simple isolation of destinations would work fine, except the folks affected would then have a problem with the non-shared access: "Hey, how come my jobs are queued when there is room to run on those other nodes?"

Thanks!

This question remains unresolved. I needed to get the system back to a stable, unattended state, so I configured routing to single execution queues with no limits. It works, but it gives up on packing the available execution queues.
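Concretely, the fallback was just pointing each routing queue at a single destination and dropping the caps, something like:

qmgr -c "set queue ANSYS171 route_destinations = gen9r65"
qmgr -c "unset queue gen9r65 max_queued_res.ncpus"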

If someone has a fix for this, please let me know.

Thanks,