Round_robin not working across 2 execution queues

Hello,

We have a PBS cluster with a routing queue and 2 execution queues.
round_robin: true ALL is set in /var/spool/pbs/sched_priv/sched_config, and both execution queues have the same priority:

 set queue xeon1700s Priority = 100
 set queue xeon1800s Priority = 100

/var/spool/pbs/sched_priv/sched_config
backfill_prime:	false	ALL
by_queue: True		non_prime
by_queue: True		prime
dedicated_prefix: ded
fairshare_decay_factor: 0.5
fairshare_decay_time: 24:00:00
fairshare_entity: euser
fair_share: false	ALL
fairshare_usage_res: cput
help_starving_jobs:	true	ALL
max_starve: 24:00:00
node_sort_key: "sort_priority HIGH"	ALL
nonprimetime_prefix: np_
preemptive_sched: true	ALL
prime_exempt_anytime_queues:	false
primetime_prefix: p_
provision_policy: "aggressive_provision"
resources: "ncpus, mem, arch, host, vnode, aoe, eoe, Qlist, acfd_fluent_solver_lic, acfd_cfx_solver_lic, acfd_par_proc_lic"
round_robin: True	all
smp_cluster_dist: pack

but jobs only run in the xeon1800s execution queue. PBS puts jobs in the Q state when the xeon1800s queue is full; no jobs are being routed to the xeon1700s execution queue. Please help us find the configuration issue. We have OpenPBS 20 installed in the environment.

The goal is to round-robin jobs across xeon1700s and xeon1800s.

$ qstat -Q
Queue              Max   Tot Ena Str   Que   Run   Hld   Wat   Trn   Ext Type
---------------- ----- ----- --- --- ----- ----- ----- ----- ----- ----- ----
workq                0     0 yes yes     0     0     0     0     0     0 Exe*
xeon1800s            0     1 yes yes     0     1     0     0     0     0 Exe*
xeon1800w            0     0 yes yes     0     0     0     0     0     0 Exe*
xeon1700w            0     0 yes yes     0     0     0     0     0     0 Exe*
xeon1700s            0     0 yes yes     0     0     0     0     0     0 Exe*
xeon1600             0     0 yes yes     0     0     0     0     0     0 Rou*

$ qstat
Job id            Name             User              Time Use S Queue
----------------  ---------------- ----------------  -------- - -----
216.lssd530-hs05  mpi-test.sh      andrei            00:00:00 R xeon1800s       
217.lssd530-hs05  mpi-test.sh      andrei            00:00:00 R xeon1800s       
218.lssd530-hs05  mpi-test.sh      andrei            00:00:00 R xeon1800s       
219.lssd530-hs05  mpi-test.sh      andrei            00:00:00 R xeon1800s       
220.lssd530-hs05  mpi-test.sh      andrei            00:00:00 R xeon1800s       
221.lssd530-hs05  mpi-test.sh      andrei                   0 Q xeon1800s       
222.lssd530-hs05  mpi-test.sh      andrei                   0 Q xeon1800s       
223.lssd530-hs05  mpi-test.sh      andrei                   0 Q xeon1800s       
224.lssd530-hs05  mpi-test.sh      andrei                   0 Q xeon1800s       
225.lssd530-hs05  mpi-test.sh      andrei                   0 Q xeon1800s       
226.lssd530-hs05  mpi-test.sh      andrei                   0 Q xeon1800s       
227.lssd530-hs05  mpi-test.sh      andrei                   0 Q xeon1800s       
228.lssd530-hs05  mpi-test.sh      andrei                   0 Q xeon1800s

create queue xeon1600
set queue xeon1600 queue_type = Route
set queue xeon1600 route_destinations = xeon1800s
set queue xeon1600 route_destinations += xeon1700s
set queue xeon1600 enabled = True
set queue xeon1600 started = True

create queue xeon1700s
set queue xeon1700s queue_type = Execution
set queue xeon1700s Priority = 100
set queue xeon1700s resources_max.ncpus = 80
set queue xeon1700s resources_max.Qlist = xeon1700s
set queue xeon1700s resources_min.ncpus = 1
set queue xeon1700s resources_min.Qlist = xeon1700s
set queue xeon1700s resources_default.Qlist = xeon1700s
set queue xeon1700s default_chunk.Qlist = xeon1700s
set queue xeon1700s max_user_run = 9999
set queue xeon1700s enabled = True
set queue xeon1700s started = True

create queue xeon1800s
set queue xeon1800s queue_type = Execution
set queue xeon1800s Priority = 100
set queue xeon1800s resources_max.ncpus = 80
set queue xeon1800s resources_max.Qlist = xeon1800s
set queue xeon1800s resources_min.ncpus = 1
set queue xeon1800s resources_min.Qlist = xeon1800s
set queue xeon1800s resources_default.Qlist = xeon1800s
set queue xeon1800s default_chunk.Qlist = xeon1800s
set queue xeon1800s max_user_run = 9999
set queue xeon1800s enabled = True
set queue xeon1800s started = True


# Set server attributes.
#
set server scheduling = True
set server default_queue = xeon1600
set server log_events = 511
set server mail_from = adm
set server query_other_jobs = True
set server resources_default.ncpus = 1
set server default_chunk.ncpus = 1
set server scheduler_iteration = 600
set server resv_enable = True
set server node_fail_requeue = 310
set server max_array_size = 10000
set server pbs_license_min = 0
set server pbs_license_max = 2147483647
set server pbs_license_linger_time = 31536000
set server eligible_time_enable = False
set server max_concurrent_provision = 5
set server max_job_sequence_id = 9999999


$ pbsnodes  -a
lssd530-cs10
     Mom = lssd530-cs10
     ntype = PBS
     state = free
     pcpus = 104
     resources_available.arch = linux
     resources_available.host = lssd530-cs10
     resources_available.mem = 395571708kb
     resources_available.ncpus = 104
     resources_available.Qlist = xeon1700w,xeon1700s
     resources_available.vnode = lssd530-cs10
     resources_assigned.accelerator_memory = 0kb
     resources_assigned.hbmem = 0kb
     resources_assigned.mem = 0kb
     resources_assigned.naccelerators = 0
     resources_assigned.ncpus = 0
     resources_assigned.vmem = 0kb
     queue = xeon1700s
     resv_enable = True
     sharing = default_shared
     last_state_change_time = Thu Feb  4 02:52:08 2021
     last_used_time = Thu Feb  4 02:50:14 2021

lssd530-cs12
     Mom = lssd530-cs12
     ntype = PBS
     state = free
     pcpus = 104
     jobs = 207.lssd530-hs05/0, 207.lssd530-hs05/1, 207.lssd530-hs05/2, 207.lssd530-hs05/3, 207.lssd530-hs05/4, 208.lssd530-hs05/5, 208.lssd530-hs05/6, 208.lssd530-hs05/7, 208.lssd530-hs05/8, 208.lssd530-hs05/9, 209.lssd530-hs05/10, 209.lssd530-hs05/11, 209.lssd530-hs05/12, 209.lssd530-hs05/13, 209.lssd530-hs05/14, 210.lssd530-hs05/15, 210.lssd530-hs05/16, 210.lssd530-hs05/17, 210.lssd530-hs05/18, 210.lssd530-hs05/19, 211.lssd530-hs05/20, 211.lssd530-hs05/21, 211.lssd530-hs05/22, 211.lssd530-hs05/23, 211.lssd530-hs05/24, 212.lssd530-hs05/25, 212.lssd530-hs05/26, 212.lssd530-hs05/27, 212.lssd530-hs05/28, 212.lssd530-hs05/29, 213.lssd530-hs05/30, 213.lssd530-hs05/31, 213.lssd530-hs05/32, 213.lssd530-hs05/33, 213.lssd530-hs05/34, 214.lssd530-hs05/35, 214.lssd530-hs05/36, 214.lssd530-hs05/37, 214.lssd530-hs05/38, 214.lssd530-hs05/39, 215.lssd530-hs05/40, 215.lssd530-hs05/41, 215.lssd530-hs05/42, 215.lssd530-hs05/43, 215.lssd530-hs05/44
     resources_available.arch = linux
     resources_available.host = lssd530-cs12
     resources_available.mem = 395571708kb
     resources_available.ncpus = 104
     resources_available.Qlist = xeon1800w,xeon1800s
     resources_available.vnode = lssd530-cs12
     resources_assigned.accelerator_memory = 0kb
     resources_assigned.hbmem = 0kb
     resources_assigned.mem = 0kb
     resources_assigned.naccelerators = 0
     resources_assigned.ncpus = 45
     resources_assigned.vmem = 0kb
     queue = xeon1800s
     resv_enable = True
     sharing = default_shared
     last_state_change_time = Fri Feb  5 12:29:23 2021
     last_used_time = Fri Feb  5 13:18:56 2021

lssd530-cs13
     Mom = lssd530-cs13
     ntype = PBS
     state = free
     pcpus = 104
     resources_available.arch = linux
     resources_available.host = lssd530-cs13
     resources_available.mem = 395571708kb
     resources_available.ncpus = 104
     resources_available.Qlist = xeon1800w,xeon1800s
     resources_available.vnode = lssd530-cs13
     resources_assigned.accelerator_memory = 0kb
     resources_assigned.hbmem = 0kb
     resources_assigned.mem = 0kb
     resources_assigned.naccelerators = 0
     resources_assigned.ncpus = 0
     resources_assigned.vmem = 0kb
     queue = xeon1800s
     resv_enable = True
     sharing = default_shared
     last_state_change_time = Fri Feb  5 12:29:38 2021
     last_used_time = Fri Feb  5 13:12:49 2021

lssd530-cs09
     Mom = lssd530-cs09
     ntype = PBS
     state = free
     pcpus = 52
     resources_available.arch = linux
     resources_available.host = lssd530-cs09
     resources_available.mem = 65744424kb
     resources_available.ncpus = 104
     resources_available.Qlist = xeon1700w,xeon1700s
     resources_available.vnode = lssd530-cs09
     resources_assigned.accelerator_memory = 0kb
     resources_assigned.hbmem = 0kb
     resources_assigned.mem = 0kb
     resources_assigned.naccelerators = 0
     resources_assigned.ncpus = 0
     resources_assigned.vmem = 0kb
     queue = xeon1700s
     resv_enable = True
     sharing = default_shared
     last_state_change_time = Thu Feb  4 02:52:08 2021
     last_used_time = Thu Feb  4 02:50:14 2021

Could you please try the following:

  1. take a backup of your queue setup
  2. delete the route queue xeon1600
  3. In sched_config, set the below and then kill -HUP < PID of the scheduler >
    by_queue: False non_prime
    by_queue: False prime
  4. Delete all the jobs from the queue
  5. qmgr : s s scheduling = false
  6. Submit jobs to the queues
  7. qmgr : s s scheduling = True
  8. check the order of execution of jobs from the queues

I have tested this; here is how the sched_config looks:
cat sched_config | grep -v "#" | grep ^[a-zA-Z]
round_robin: True all
by_queue: False prime
by_queue: False non_prime
strict_ordering: false ALL
help_starving_jobs: true ALL
max_starve: 24:00:00
backfill_prime: false ALL
prime_exempt_anytime_queues: false
primetime_prefix: p_
nonprimetime_prefix: np_
node_sort_key: "sort_priority HIGH" ALL
provision_policy: "aggressive_provision"
sort_queues: true ALL
resources: "ncpus, mem, arch, host, vnode, aoe, eoe, ngpus, hwu_lic, run_this_job, jobtag, serverjobtagresource"
load_balancing: false ALL
smp_cluster_dist: pack
fair_share: false ALL
fairshare_usage_res: cput
fairshare_entity: euser
fairshare_decay_time: 24:00:00
fairshare_decay_factor: 0.5
preemptive_sched: true ALL
preempt_queue_prio: 150
preempt_prio: "express_queue, normal_jobs"
preempt_order: "SCR"
preempt_sort: min_time_since_start
dedicated_prefix: ded
log_filter: 3328

hope this works

Hi @adarsh
are you suggesting not using the default route queue xeon1600?

$ echo "sleep 1000000" | qsub
qsub: No default queue specified

# Set server attributes.
#
set server scheduling = False
set server default_queue = xeon1600
set server log_events = 511
set server mail_from = adm
set server query_other_jobs = True
set server resources_default.ncpus = 1
set server default_chunk.ncpus = 1
set server scheduler_iteration = 600
set server resv_enable = True
set server node_fail_requeue = 310
set server max_array_size = 10000
set server pbs_license_min = 0
set server pbs_license_max = 2147483647
set server pbs_license_linger_time = 31536000
set server eligible_time_enable = False
set server max_concurrent_provision = 5
set server max_job_sequence_id = 9999999
$ qstat -Q
Queue              Max   Tot Ena Str   Que   Run   Hld   Wat   Trn   Ext Type
---------------- ----- ----- --- --- ----- ----- ----- ----- ----- ----- ----
workq                0     0 yes yes     0     0     0     0     0     0 Exe*
xeon1800s            0     0 yes yes     0     0     0     0     0     0 Exe*
xeon1800w            0     0 yes yes     0     0     0     0     0     0 Exe*
xeon1700w            0     0 yes yes     0     0     0     0     0     0 Exe*
xeon1700s            0    10 yes yes    10     0     0     0     0     0 Exe*
xeon1600             0     0 yes yes     0     0     0     0     0     0 Rou*

$ qstat -f 357
Job Id: 357.lssd530-hs05
Job_Name = STDIN
Job_Owner = andrei@lssd530-hs05
job_state = Q
queue = xeon1700s
server = lssd530-hs05
Checkpoint = u
ctime = Sat Feb 6 03:40:03 2021
Error_Path = lssd530-hs05:/home/andrei/STDIN.e357
Hold_Types = n
Join_Path = n
Keep_Files = n
Mail_Points = a
mtime = Sat Feb 6 03:40:03 2021
Output_Path = lssd530-hs05:/home/andrei/STDIN.o357
Priority = 0
qtime = Sat Feb 6 03:40:03 2021
Rerunable = True
Resource_List.ncpus = 1
Resource_List.nodect = 1
Resource_List.place = pack
Resource_List.Qlist = xeon1700s
Resource_List.select = 1:ncpus=1:Qlist=xeon1700s
substate = 10
Variable_List = PBS_O_HOME=/home/andrei,PBS_O_LANG=en_US.UTF-8,
PBS_O_LOGNAME=andrei,
PBS_O_PATH=/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/opt
/pbs/bin:/home/andrei/.local/bin:/home/andrei/bin,
PBS_O_MAIL=/var/spool/mail/andrei,PBS_O_SHELL=/bin/bash,
PBS_O_WORKDIR=/home/andrei,PBS_O_SYSTEM=Linux,PBS_O_QUEUE=xeon1600,
PBS_O_HOST=lssd530-hs05
comment = Can Never Run: Invalid Job/Resv invalid chunk in select
etime = Sat Feb 6 03:40:03 2021
project = _pbs_project_default
Submit_Host = lssd530-hs05

Thank you for sharing the details. Could you please try the steps below and then let us know.
Please do not copy and paste; pasting these instructions directly into a terminal/command window can sometimes cause issues, so it is better to type them.

qmgr -c "print queue @default" > print_queue.txt
qmgr -c "print server" > print_server.txt
source /etc/pbs.conf;cp  $PBS_HOME/sched_priv/sched_config $PBS_HOME/sched_priv/sched_config_backup

qmgr -c "delete queue @default"

qmgr -c "set server default_queue=workq"

qmgr -c "create queue workq queue_type=e, started=t,enabled=t"
qmgr -c "set queue workq default_chunk.Qlist=xeon1700s"

qmgr -c "create queue testq queue_type=e, started=t,enabled=t"
qmgr -c "set queue testq default_chunk.Qlist=xeon1800s"

qmgr -c "create resource Qlist type=string_array,flag=h"

replace the contents of your $PBS_HOME/sched_priv/sched_config with the below

round_robin: True all
by_queue: False prime
by_queue: False non_prime
strict_ordering: false ALL
help_starving_jobs: true ALL
max_starve: 24:00:00
backfill_prime: false ALL
prime_exempt_anytime_queues: false
primetime_prefix: p_
nonprimetime_prefix: np_
node_sort_key: "sort_priority HIGH" ALL
provision_policy: "aggressive_provision"
sort_queues: true ALL
resources: "ncpus, mem, arch, host, vnode, aoe, eoe, ngpus, Qlist"
load_balancing: false ALL
smp_cluster_dist: pack
fair_share: false ALL
fairshare_usage_res: cput
fairshare_entity: euser
fairshare_decay_time: 24:00:00
fairshare_decay_factor: 0.5
preemptive_sched: true ALL
preempt_queue_prio: 150
preempt_prio: "express_queue, normal_jobs"
preempt_order: "SCR"
preempt_sort: min_time_since_start
dedicated_prefix: ded
log_filter: 3328

kill -HUP <PID of the scheduler> 

qmgr -c "set server scheduling = false"

Submit jobs to workq and testq

Open two terminals:
Terminal 1: watch qstat -answ1  # it's a bad idea to run watch on any of the PBS commands, so do not use this beyond testing
Terminal 2: qmgr -c "set server scheduling = true"

Hi @adarsh
I tried your suggestion but am still getting "Can Never Run: Invalid Job/Resv invalid chunk in select".
To clarify, we want to use a routing queue and round-robin jobs across 4 execution queues.

$ qstat -answ1
lssd530-hs05:
Job ID Username Queue Jobname SessID NDS TSK Memory Time S Time
------------------------------ --------------- --------------- --------------- -------- ---- ----- ------ ----- - -----
1380.lssd530-hs05 andrei workq STDIN -- 1 1 -- -- Q -- --
Can Never Run: Invalid Job/Resv invalid chunk in select
1381.lssd530-hs05 andrei workq STDIN -- 1 1 -- -- Q -- --
Can Never Run: Invalid Job/Resv invalid chunk in select
1382.lssd530-hs05 andrei testq STDIN -- 1 1 -- -- Q -- --
Can Never Run: Invalid Job/Resv invalid chunk in select
1383.lssd530-hs05 andrei testq STDIN -- 1 1 -- -- Q
Can Never Run: Invalid Job/Resv invalid chunk in select

$ qstat -f 1380
Job Id: 1380.lssd530-hs05
    Job_Name = STDIN
    Job_Owner = andrei@lssd530-hs05
    job_state = Q
    queue = workq
    server = lssd530-hs05
    Checkpoint = u
    ctime = Sun Feb  7 02:43:51 2021
    Error_Path = lssd530-hs05:/home/andrei/STDIN.e1380
    Hold_Types = n
    Join_Path = n
    Keep_Files = n
    Mail_Points = a
    mtime = Sun Feb  7 02:43:51 2021
    Output_Path = lssd530-hs05:/home/andrei/STDIN.o1380
    Priority = 0
    qtime = Sun Feb  7 02:43:51 2021
    Rerunable = True
    Resource_List.ncpus = 1
    Resource_List.nodect = 1
    Resource_List.place = pack
    Resource_List.select = 1:ncpus=1
    substate = 10
    Variable_List = PBS_O_HOME=/home/andrei,PBS_O_LANG=en_US.UTF-8,
	PBS_O_LOGNAME=andrei,
	PBS_O_PATH=/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/opt/pbs/
	bin:/home/andrei/.local/bin:/home/andrei/bin,
	PBS_O_MAIL=/var/spool/mail/andrei,PBS_O_SHELL=/bin/bash,
	PBS_O_WORKDIR=/home/andrei,PBS_O_SYSTEM=Linux,PBS_O_QUEUE=workq,
	PBS_O_HOST=lssd530-hs05
    comment = Can Never Run: Invalid Job/Resv invalid chunk in select
    etime = Sun Feb  7 02:43:51 2021
    Submit_arguments = -q workq
    project = _pbs_project_default
    Submit_Host = lssd530-hs05
  1. Sure, we (I) wanted to make sure round robin works without routing queues (the basics) first, and then try with the routing queue.

  2. Create another host-level string_array resource called machine_type

  • qmgr : create resource machine_type type=string_array,flag=h
  • Add it to the $PBS_HOME/sched_priv/sched_config's resources line:
    resources: "ncpus, mem, arch, host, vnode, aoe, eoe, ngpus, Qlist, machine_type"
  • kill -HUP < PID of the scheduler >

  3. Set the respective machine_type on each node with the custom resource:
    qmgr -c "set node NODENAME resources_available.machine_type=xeon1700s"
    qmgr -c "set node NODENAME resources_available.machine_type=xeon1800s"

  4. Sanitize your execution queues as below:

create queue xeon1700s
set queue xeon1700s queue_type = Execution
set queue xeon1700s Priority = 100
set queue xeon1700s resources_max.ncpus = 40
set queue xeon1700s resources_min.ncpus = 1
set queue xeon1700s default_chunk.Qlist = xeon1700s
set queue xeon1700s enabled = True
set queue xeon1700s started = True

create queue xeon1800s
set queue xeon1800s queue_type = Execution
set queue xeon1800s Priority = 100
set queue xeon1800s resources_max.ncpus = 80
set queue xeon1800s resources_min.ncpus = 40
set queue xeon1800s default_chunk.Qlist = xeon1800s
set queue xeon1800s enabled = True
set queue xeon1800s started = True

And then submit jobs to routing queue and check whether it works
qsub -q xeon1600 -l select=1:ncpus=20 -- /bin/sleep 1000
qsub -q xeon1600 -l select=1:ncpus=20 -- /bin/sleep 1000
qsub -q xeon1600 -l select=1:ncpus=20 -- /bin/sleep 1000
qsub -q xeon1600 -l select=1:ncpus=20 -- /bin/sleep 1000

qsub -q xeon1600 -l select=2:ncpus=20 -- /bin/sleep 1000
qsub -q xeon1600 -l select=2:ncpus=20 -- /bin/sleep 1000
qsub -q xeon1600 -l select=2:ncpus=20 -- /bin/sleep 1000
qsub -q xeon1600 -l select=2:ncpus=20 -- /bin/sleep 1000

If you would rather not go the above way, then please share the below:

  1. Increase the server and scheduler log levels.
  2. Submit a couple of jobs to the routing queue, and share the qsub command lines used for these submissions.
  3. Share the qstat -fx output of the jobs and the server/scheduler logs for the respective days.
  4. It seems to me that the resources set on your execution queues conflict with the request in the qsub chunk statement.

Hi @adarsh

I tried this but am still not seeing round-robin work across the 2 queues.

create node lssd530-cs09
set node lssd530-cs09 state = free
set node lssd530-cs09 resources_available.arch = linux
set node lssd530-cs09 resources_available.host = lssd530-cs09
set node lssd530-cs09 resources_available.machine_type = xeon1700s
set node lssd530-cs09 resources_available.mem = 65744424kb
set node lssd530-cs09 resources_available.ncpus = 104
set node lssd530-cs09 resources_available.Qlist = xeon1700s
set node lssd530-cs09 resources_available.vnode = lssd530-cs09
set node lssd530-cs09 queue = xeon1700s
set node lssd530-cs09 resv_enable = True

create node lssd530-cs10
set node lssd530-cs10 state = free
set node lssd530-cs10 resources_available.arch = linux
set node lssd530-cs10 resources_available.host = lssd530-cs10
set node lssd530-cs10 resources_available.machine_type = xeon1700s
set node lssd530-cs10 resources_available.mem = 395571708kb
set node lssd530-cs10 resources_available.ncpus = 104
set node lssd530-cs10 resources_available.Qlist = xeon1700s
set node lssd530-cs10 resources_available.vnode = lssd530-cs10
set node lssd530-cs10 queue = xeon1700s
set node lssd530-cs10 resv_enable = True

create node lssd530-cs12
set node lssd530-cs12 state = free
set node lssd530-cs12 resources_available.arch = linux
set node lssd530-cs12 resources_available.host = lssd530-cs12
set node lssd530-cs12 resources_available.machine_type = xeon1800s
set node lssd530-cs12 resources_available.mem = 395571708kb
set node lssd530-cs12 resources_available.ncpus = 104
set node lssd530-cs12 resources_available.Qlist = xeon1800s
set node lssd530-cs12 resources_available.vnode = lssd530-cs12
set node lssd530-cs12 queue = xeon1800s
set node lssd530-cs12 resv_enable = True

create node lssd530-cs13
set node lssd530-cs13 state = free
set node lssd530-cs13 resources_available.arch = linux
set node lssd530-cs13 resources_available.host = lssd530-cs13
set node lssd530-cs13 resources_available.machine_type = xeon1800s
set node lssd530-cs13 resources_available.mem = 395571708kb
set node lssd530-cs13 resources_available.ncpus = 104
set node lssd530-cs13 resources_available.Qlist = xeon1800s
set node lssd530-cs13 resources_available.vnode = lssd530-cs13
set node lssd530-cs13 queue = xeon1800s
set node lssd530-cs13 resv_enable = True

qmgr p s output

create resource Qlist
set resource Qlist type = string_array
set resource Qlist flag = h

create resource machine_type
set resource machine_type type = string_array
set resource machine_type flag = h

create queue xeon1700s
set queue xeon1700s queue_type = Execution
set queue xeon1700s Priority = 100
set queue xeon1700s resources_max.ncpus = 40
set queue xeon1700s resources_min.ncpus = 1
set queue xeon1700s default_chunk.Qlist = xeon1700s
set queue xeon1700s enabled = True
set queue xeon1700s started = True

create queue xeon1800s
set queue xeon1800s queue_type = Execution
set queue xeon1800s Priority = 100
set queue xeon1800s resources_max.ncpus = 80
set queue xeon1800s resources_min.ncpus = 40
set queue xeon1800s default_chunk.Qlist = xeon1800s
set queue xeon1800s enabled = True
set queue xeon1800s started = True

create queue xeon1600
set queue xeon1600 queue_type = Route
set queue xeon1600 route_destinations = xeon1700s
set queue xeon1600 route_destinations += xeon1800s
set queue xeon1600 enabled = True
set queue xeon1600 started = True

# Set server attributes.
#
set server scheduling = True
set server default_queue = workq
set server log_events = 511
set server mail_from = adm
set server query_other_jobs = True
set server resources_default.ncpus = 1
set server default_chunk.ncpus = 1
set server scheduler_iteration = 600
set server resv_enable = True
set server node_fail_requeue = 310
set server max_array_size = 10000
set server pbs_license_min = 0
set server pbs_license_max = 2147483647
set server pbs_license_linger_time = 31536000
set server eligible_time_enable = False
set server max_concurrent_provision = 5
set server max_job_sequence_id = 9999999
Qmgr: 

12 jobs submitted to the xeon1600 routing queue:

$ echo "sleep 100" | qsub -q xeon1600 -l select=1:ncpus=20

$ qstat
    Job id            Name             User              Time Use S Queue
    ----------------  ---------------- ----------------  -------- - -----
    1429.lssd530-hs05 STDIN            andrei            00:00:00 R xeon1700s       
    1430.lssd530-hs05 STDIN            andrei            00:00:00 R xeon1700s       
    1431.lssd530-hs05 STDIN            andrei            00:00:00 R xeon1700s       
    1432.lssd530-hs05 STDIN            andrei            00:00:00 R xeon1700s       
    1433.lssd530-hs05 STDIN            andrei            00:00:00 R xeon1700s       
    1434.lssd530-hs05 STDIN            andrei            00:00:00 R xeon1700s       
    1435.lssd530-hs05 STDIN            andrei            00:00:00 R xeon1700s       
    1436.lssd530-hs05 STDIN            andrei            00:00:00 R xeon1700s       
    1437.lssd530-hs05 STDIN            andrei            00:00:00 R xeon1700s       
    1438.lssd530-hs05 STDIN            andrei            00:00:00 R xeon1700s       
    1439.lssd530-hs05 STDIN            andrei                   0 Q xeon1700s       
    1440.lssd530-hs05 STDIN            andrei                   0 Q xeon1700s       

$ qstat -answ1
lssd530-hs05: 
                                                                                               Req'd  Req'd   Elap
Job ID                         Username        Queue           Jobname         SessID   NDS  TSK   Memory Time  S Time
------------------------------ --------------- --------------- --------------- -------- ---- ----- ------ ----- - -----
1429.lssd530-hs05              andrei          xeon1700s       STDIN             183418    1    20    --    --  R 00:01 lssd530-cs10/0*20
   Job run at Sun Feb 07 at 06:33 on (lssd530-cs10:ncpus=20)
1430.lssd530-hs05              andrei          xeon1700s       STDIN             183443    1    20    --    --  R 00:01 lssd530-cs10/1*20
   Job run at Sun Feb 07 at 06:33 on (lssd530-cs10:ncpus=20)
1431.lssd530-hs05              andrei          xeon1700s       STDIN             183472    1    20    --    --  R 00:01 lssd530-cs10/2*20
   Job run at Sun Feb 07 at 06:33 on (lssd530-cs10:ncpus=20)
1432.lssd530-hs05              andrei          xeon1700s       STDIN             183497    1    20    --    --  R 00:01 lssd530-cs10/3*20
   Job run at Sun Feb 07 at 06:33 on (lssd530-cs10:ncpus=20)
1433.lssd530-hs05              andrei          xeon1700s       STDIN             183521    1    20    --    --  R 00:01 lssd530-cs10/4*20
   Job run at Sun Feb 07 at 06:33 on (lssd530-cs10:ncpus=20)
1434.lssd530-hs05              andrei          xeon1700s       STDIN             157909    1    20    --    --  R 00:01 lssd530-cs09/0*20
   Job run at Sun Feb 07 at 06:33 on (lssd530-cs09:ncpus=20)
1435.lssd530-hs05              andrei          xeon1700s       STDIN             158000    1    20    --    --  R 00:00 lssd530-cs09/1*20
   Job run at Sun Feb 07 at 06:34 on (lssd530-cs09:ncpus=20)
1436.lssd530-hs05              andrei          xeon1700s       STDIN             158025    1    20    --    --  R 00:00 lssd530-cs09/2*20
   Job run at Sun Feb 07 at 06:34 on (lssd530-cs09:ncpus=20)
1437.lssd530-hs05              andrei          xeon1700s       STDIN             158050    1    20    --    --  R 00:00 lssd530-cs09/3*20
   Job run at Sun Feb 07 at 06:34 on (lssd530-cs09:ncpus=20)
1438.lssd530-hs05              andrei          xeon1700s       STDIN             158074    1    20    --    --  R 00:00 lssd530-cs09/4*20
   Job run at Sun Feb 07 at 06:34 on (lssd530-cs09:ncpus=20)
1439.lssd530-hs05              andrei          xeon1700s       STDIN                --     1    20    --    --  Q   --   -- 
   Not Running: Insufficient amount of resource: ncpus (R: 20 A: 8 T: 208)
1440.lssd530-hs05              andrei          xeon1700s       STDIN                --     1    20    --    --  Q   --   -- 
   Not Running: Insufficient amount of resource: ncpus (R: 20 A: 8 T: 208)

    $ qstat -f 1451
    Job Id: 1451.lssd530-hs05
        Job_Name = STDIN
        Job_Owner = andrei@lssd530-hs05
        job_state = Q
        queue = xeon1700s
        server = lssd530-hs05
        Checkpoint = u
        ctime = Sun Feb  7 06:38:41 2021
        Error_Path = lssd530-hs05:/home/andrei/STDIN.e1451
        Hold_Types = n
        Join_Path = n
        Keep_Files = n
        Mail_Points = a
        mtime = Sun Feb  7 06:38:41 2021
        Output_Path = lssd530-hs05:/home/andrei/STDIN.o1451
        Priority = 0
        qtime = Sun Feb  7 06:38:41 2021
        Rerunable = True
        Resource_List.ncpus = 20
        Resource_List.nodect = 1
        Resource_List.place = free
        Resource_List.select = 1:ncpus=20
        substate = 10
        Variable_List = PBS_O_HOME=/home/andrei,PBS_O_LANG=en_US.UTF-8,
    	PBS_O_LOGNAME=andrei,
    	PBS_O_PATH=/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/opt/pbs/
    	bin:/home/andrei/.local/bin:/home/andrei/bin,
    	PBS_O_MAIL=/var/spool/mail/andrei,PBS_O_SHELL=/bin/bash,
    	PBS_O_WORKDIR=/home/andrei,PBS_O_SYSTEM=Linux,PBS_O_QUEUE=xeon1600,
    	PBS_O_HOST=lssd530-hs05
        comment = Not Running: Insufficient amount of resource: ncpus (R: 20 A: 8 T
    	: 208)
        etime = Sun Feb  7 06:38:41 2021
        Submit_arguments = -q xeon1600 -l select=1:ncpus=20
        project = _pbs_project_default
        Submit_Host = lssd530-hs05

Submitted a 50-ncpus job, and it was routed to the xeon1800s queue:

$ echo "sleep 100" | qsub -q xeon1600 -l select=1:ncpus=50
1441.lssd530-hs05

$ qstat
Job id            Name             User              Time Use S Queue
----------------  ---------------- ----------------  -------- - -----     
1435.lssd530-hs05 STDIN            andrei            00:00:00 R xeon1700s       
1436.lssd530-hs05 STDIN            andrei            00:00:00 R xeon1700s       
1437.lssd530-hs05 STDIN            andrei            00:00:00 R xeon1700s       
1438.lssd530-hs05 STDIN            andrei            00:00:00 R xeon1700s       
1439.lssd530-hs05 STDIN            andrei                   0 Q xeon1700s       
1440.lssd530-hs05 STDIN            andrei                   0 Q xeon1700s       
1441.lssd530-hs05 STDIN            andrei            00:00:00 R xeon1800s

At a high level, what do you want to accomplish? You have four very similar compute nodes. It looks like you are trying to spread the jobs evenly across the nodes. Do you have some other criteria for where jobs should run? Is the difference between the 1700 and 1800 nodes significant?

If you just want to balance the load, consider removing all your queues except workq and setting the scheduler parameter node_sort_key to "ncpus HIGH unused". This causes the scheduler to run a job on the nodes with the most free CPUs. See section 4.9.50 of the admin guide.
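As a concrete illustration of that suggestion, the scheduler section could shrink to something like this (an untested sched_config sketch, showing only the lines relevant to spreading load; everything else stays at its defaults):

```
# single-queue load balancing: prefer the node with the most unused CPUs
round_robin: false	ALL
node_sort_key: "ncpus HIGH unused"	ALL
```

With one workq and this sort key, the queue-level round-robin machinery is no longer needed at all.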

Note that if you have many MPI jobs needing different numbers of CPUs, this might not be the best choice. It tends to starve jobs needing large numbers of CPUs. If large jobs are important, you should switch HIGH to LOW. Your load across nodes will be uneven, but wide jobs will run sooner.

The goal is to submit jobs to a default route queue, xeon1600, and load balance jobs across 4 exec queues (xeon1700s/w and xeon1800s/w) based on resources_max.ncpus and resources_min.ncpus.

Currently we are able to submit jobs to the default route queue, and jobs run from xeon1700s and xeon1700w, but not from the xeon1800s/xeon1800w queues. My hope was that having
"round_robin: True all" in sched_config would allow us to load balance jobs across the xeon1700s/1800s queues.

Qmgr: print queue xeon1600
    create queue xeon1600
    set queue xeon1600 queue_type = Route
    set queue xeon1600 route_destinations = xeon1700s
    set queue xeon1600 route_destinations += xeon1800s
    set queue xeon1600 route_destinations += xeon1700w
    set queue xeon1600 route_destinations += xeon1800w
    set queue xeon1600 enabled = True
    set queue xeon1600 started = True

In the lab, 1700 and 1800 nodes have same hardware specs, but on production system xeon1700 and xeon1800 are 2 different clusters and hardware specs (CPU/mem) different.

Correct, we are trying to spread the jobs evenly across the nodes in the 1700 and 1800 queues. The only criteria we have are resources_max.ncpus and resources_min.ncpus (see the queue definitions above).

thanks in advance

First, you cannot use a routing queue the way you want. Routing is done using static criteria: Does the job match the criteria for the first queue? If so, move it to that queue. If not, try the next queue, etc. This is why all your jobs end up in the first queue. Round-robin applies only after the jobs are in execution queues.
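To make the distinction concrete, here is a toy Python sketch (not PBS source code; the ncpus bounds are taken from the queue dump above). Routing is first-match over an ordered destination list, so two destinations with identical min/max will never split traffic; round-robin only interleaves jobs that already sit in separate execution queues:

```python
def route(job_ncpus, destinations):
    """Mimic PBS routing: the job moves to the FIRST destination whose
    resources_min/max.ncpus accept it; list order is everything."""
    for name, lo, hi in destinations:
        if lo <= job_ncpus <= hi:
            return name
    return None  # no match: the job stays in the routing queue

# Both destinations accept 1..80 ncpus, like the original xeon1700s/xeon1800s
dests = [("xeon1700s", 1, 80), ("xeon1800s", 1, 80)]
placed = [route(20, dests) for _ in range(4)]
print(placed)  # -> ['xeon1700s', 'xeon1700s', 'xeon1700s', 'xeon1700s']

def round_robin(queues):
    """Mimic round_robin: true ALL - take one job from each execution
    queue in turn until all queues are drained."""
    order = []
    while any(queues.values()):
        for q, jobs in queues.items():
            if jobs:
                order.append((q, jobs.pop(0)))
    return order

# Only when jobs are already spread across execution queues do they interleave
print(round_robin({"xeon1700s": [1, 2], "xeon1800s": [3, 4]}))
```

This is why every 20-cpu job landed in xeon1700s: the router never got past the first destination whose limits matched.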

Going back to a high level, it seems to me you have two significant node selection criteria: the hardware type and the local network. You don’t want MPI jobs split across hardware types or across networks. If I understand correctly, all of the 1700 nodes are on one network and the 1800 nodes are on the other network. So, there are really just two classes of nodes.

I would solve it using placement sets. See section 4.9.32.4 of the admin guide. Create a node-level string resource “cluster” that has the value of “xeon1700” or “xeon1800”. Set this resource appropriately for all your nodes. Add “cluster” to your scheduler resources list.
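A sketch of that setup using standard qmgr syntax (node names node01/node02 are placeholders; placement sets also need node grouping enabled on the server, per the admin guide):

```shell
# Create a host-level string resource named "cluster":
qmgr -c "create resource cluster type=string, flag=h"

# Tag each node with its cluster (repeat for all nodes):
qmgr -c "set node node01 resources_available.cluster = xeon1700"
qmgr -c "set node node02 resources_available.cluster = xeon1800"

# Enable placement sets keyed on the new resource:
qmgr -c "set server node_group_enable = True"
qmgr -c "set server node_group_key = cluster"

# Also append "cluster" to the resources: line in sched_config, e.g.:
#   resources: "ncpus, mem, arch, host, vnode, aoe, eoe, Qlist, ..., cluster"
```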

Now, to get all the nodes from a job to be in the same cluster, add “place=group=cluster” as a default.
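For example, as a server-wide default so every job inherits it unless it asks for something else:

```shell
# Jobs will be placed entirely within one placement set (one cluster):
qmgr -c "set server resources_default.place = group=cluster"
```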

As with my previous comment, you don’t need separate queues for this. You can get by with one queue. To spread jobs more evenly, use the node_sort_key as mentioned above. However, given that you will have large MPI jobs, I would set node_sort_key="ncpus LOW unused" so that your 1 CPU jobs get shared onto the fewest nodes, leaving more intact nodes available for wide jobs using all their cores.

You might want to use multiple queues to assign different priorities to small versus wide jobs (your “s” and “w” queues). So, replace your two xeon1X00s queues with a single queue “small” with the same min and max ncpus. Add default_chunk=cluster=xeon1700 to the xeon1700w queue. Add default_chunk=cluster=xeon1800 to the xeon1800w queue.
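Roughly (the ncpus limits below are placeholders; copy whatever your existing “s” queues use):

```shell
# Single small-job queue replacing xeon1700s and xeon1800s:
qmgr -c "create queue small queue_type=execution"
qmgr -c "set queue small resources_min.ncpus = 1"    # placeholder value
qmgr -c "set queue small resources_max.ncpus = 16"   # placeholder value
qmgr -c "set queue small enabled = True"
qmgr -c "set queue small started = True"

# Pin each wide-job queue to its cluster via default_chunk:
qmgr -c "set queue xeon1700w default_chunk.cluster = xeon1700"
qmgr -c "set queue xeon1800w default_chunk.cluster = xeon1800"
```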

Now, change your route queue to list the route_destinations in the order “xeon1800w,xeon1700w,small”. (If you prefer to use the 1700s before the 1800s, swap their order.)
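That is, reset the destination list in the desired order:

```shell
qmgr -c "set queue xeon1600 route_destinations = xeon1800w"
qmgr -c "set queue xeon1600 route_destinations += xeon1700w"
qmgr -c "set queue xeon1600 route_destinations += small"
```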

In any case, you don’t want any nodes assigned to specific queues, so qmgr -c 'unset node @default queue'.

I haven’t tested this, but it’s not too different from what we did at my previous job. You might end up needing to write a qsub hook to make sure the jobs have enough resources specified so the scheduler can place them correctly.


Hi @dtalcott

thank you for your feedback. Even though placement sets seem like a good option to explore, I am wondering why the native round-robin feature described in section 4.9.38 won’t work for us. We have priority set to the same value for all queues.

4.9.38 Round Robin Queue Selection

PBS can select jobs from queues by examining the queues in round-robin fashion. The behavior is round-robin only when you have groups of queues where all queues in each group have the same priority.

The order in which queues are selected is determined by each queue’s priority. You can set each queue’s priority; see section 2.3.5.3, “Prioritizing Execution Queues”, on page 23. If you do not prioritize the queues, their order is undefined.

When you have multiple queues with the same priority, a scheduler round-robins through all of the queues with the same priority as a group. So if you have Q1, Q2, and Q3 at a priority of 100, Q4 and Q5 at a priority of 50, and Q6 at a priority of 10, a scheduler will round-robin through Q1, Q2, and Q3 until all of those jobs are out of the way, then the scheduler will round-robin through Q4 and Q5 until there are no more jobs in them, and finally the scheduler will go through Q6.

When using the round-robin method with queues that have unique priorities, a scheduler runs all jobs from the first queue, then runs all the jobs in the next queue, and so on.

To specify that PBS should use the round-robin method to select jobs, set the value of the round_robin scheduler parameter to True.

The round_robin parameter is a primetime option, meaning that you can configure it separately for primetime and non-primetime, or you can specify it for all of the time.

You can use the round-robin method as a resource allocation tool. For example, if you need to run the same number of jobs from each group, you can put each group’s jobs in a different queue, and then use round-robin to run jobs, one from each queue.

Round-robin applies only to scheduling jobs already in execution queues. It does not apply to moving jobs from a routing queue to execution queues. You cannot use a routing queue the way you want.