Preemption issues in OpenPBS 20.0.1

Following an upgrade from 18 to 20.0.1, preemption is not taking place as expected. I suspect something odd is happening due to custom resources. Jobs in the express queues are not preempting jobs in the lower-priority queues, and jobs in the higher-priority of the two low-priority queues are not preempting jobs in the lowest-priority queue.

There are at least 20 nodes that meet the requirements of the qsub but are currently running jobs from the lower-priority queues when I submit:

qsub -I -lselect=9 -q user_a

but the job submitted to queue user_a does not trigger any preemption of those lower-priority jobs.

From a tracejob:

10/15/2021 13:27:56  L    Insufficient amount of resource: xeon6148_sockets 
10/15/2021 13:29:56  L    Considering job to run
10/15/2021 13:29:56  L    Failed to satisfy subchunk: 9:mpiprocs=40:ncpus=40:xeon6148_sockets=2
10/15/2021 13:29:56  L    Employing preemption to try and run high priority job.
10/15/2021 13:29:56  L    Allocated one subchunk: mpiprocs=40:ncpus=40:xeon6148_sockets=2
10/15/2021 13:29:56  L    Evaluating subchunk: mpiprocs=40:ncpus=40:xeon6148_sockets=2
10/15/2021 13:29:56  L    Failed to satisfy subchunk: 9:mpiprocs=40:ncpus=40:xeon6148_sockets=2
10/15/2021 13:29:56  L    Limited running jobs used for preemption from 56 to 17
10/15/2021 13:29:56  L    Found no preemptable candidates
10/15/2021 13:29:56  L    Insufficient amount of resource: xeon6148_sockets 

To run through the setup:

We have two types of nodes, defined following this pattern:

create node compute001
set node compute001 state = job-busy
set node compute001 resources_available.arch = linux
set node compute001 resources_available.host = compute001
set node compute001 resources_available.mem = 196498720kb
set node compute001 resources_available.ncpus = 40
set node compute001 resources_available.vnode = compute001
set node compute001 resources_available.xeon6148_sockets = 2
set node compute001 resv_enable = True

and:

create node other001
set node other001 state = job-busy
set node other001 resources_available.arch = linux
set node other001 resources_available.host = other001
set node other001 resources_available.mem = 196498720kb
set node other001 resources_available.ncpus = 40
set node other001 resources_available.vnode = other001
set node other001 resources_available.xeon8268_sockets = 2
set node other001 resv_enable = True

We have a collection of express queues, which we’ll call user_a, user_b, and user_c, defined following this pattern:

create queue user_a
set queue user_a Priority = 1000
set queue user_a resources_default.mpiprocs = 40
set queue user_a resources_default.ncpus = 40
set queue user_a default_chunk.mpiprocs = 40
set queue user_a default_chunk.ncpus = 40
set queue user_a default_chunk.xeon6148_sockets = 2
set queue user_a resources_available.ncpus = 360
set queue user_a max_user_res.ncpus = 360
set queue user_a enabled = True
set queue user_a started = True

and then two lower priority queues, s1 and s2, are defined:

create queue s1
set queue s1 queue_type = Execution
set queue s1 Priority = 100
set queue s1 acl_host_enable = False
set queue s1 acl_user_enable = True
set queue s1 resources_max.walltime = 72:00:00
set queue s1 resources_min.walltime = 00:00:00
set queue s1 resources_default.mpiprocs = 40
set queue s1 resources_default.ncpus = 40
set queue s1 resources_default.preempt_targets = QUEUE=s2
set queue s1 default_chunk.mpiprocs = 40
set queue s1 default_chunk.ncpus = 40
set queue s1 resources_available.ncpus = 6040
set queue s1 max_user_res.ncpus = 720
set queue s1 max_user_run_soft = 0
set queue s1 enabled = True
set queue s1 started = True

and:

create queue s2
set queue s2 queue_type = Execution
set queue s2 Priority = -1000
set queue s2 acl_host_enable = False
set queue s2 acl_user_enable = True
set queue s2 resources_max.walltime = 72:00:00
set queue s2 resources_min.walltime = 00:00:00
set queue s2 resources_default.mpiprocs = 40
set queue s2 resources_default.ncpus = 40
set queue s2 resources_default.preempt_targets = NONE
set queue s2 default_chunk.mpiprocs = 40
set queue s2 default_chunk.ncpus = 40
set queue s2 resources_available.ncpus = 8040
set queue s2 max_user_res.ncpus = 8040
set queue s2 max_user_run_soft = 0
set queue s2 enabled = True
set queue s2 started = True

the server is set:

set server scheduling = True
set server default_queue = s2
set server log_events = 2047
set server mail_from = adm
set server query_other_jobs = True
set server resources_default.ncpus = 1
set server resources_default.preempt_targets = QUEUE=s1
set server resources_default.preempt_targets += QUEUE=s2
set server default_chunk.ncpus = 1
set server scheduler_iteration = 120
set server resv_enable = True
set server node_fail_requeue = 310
set server max_array_size = 10000
set server default_qsub_arguments = -keod
set server rpp_highwater = 16000
set server pbs_license_min = 0
set server pbs_license_max = 2147483647
set server pbs_license_linger_time = 31536000
set server eligible_time_enable = False
set server job_history_enable = True
set server job_history_duration = 00:30:00
set server max_concurrent_provision = 5
set server max_job_sequence_id = 9999999

and the scheduler is set:

create sched default
set sched sched_host = deluge-hn1.cm.cl7.hpc.lle.rochester.edu
set sched sched_cycle_length = 00:20:00
set sched sched_preempt_enforce_resumption = False
set sched preempt_targets_enable = True
set sched sched_port = 15004
set sched sched_priv = /var/spool/pbs/sched_priv
set sched sched_log = /var/spool/pbs/sched_logs
set sched scheduling = True
set sched scheduler_iteration = 120
set sched state = idle
set sched preempt_queue_prio = 150
set sched preempt_prio = "express_queue, normal_jobs"
set sched preempt_order = R
set sched preempt_sort = min_time_since_start
set sched log_events = 4095
set sched server_dyn_res_alarm = 30

I should also have included sched_config:

backfill_prime: false   ALL
by_queue: True          non_prime
by_queue: True          prime
dedicated_prefix: ded
fairshare_decay_factor: 0.5
fairshare_decay_time: 24:00:00
fairshare_entity: euser
fair_share: false       ALL
fairshare_usage_res: cput
help_starving_jobs:     false   ALL
load_balancing: false   ALL
max_starve: 24:00:00
node_sort_key: "sort_priority HIGH"     ALL
nonprimetime_prefix: np_
preemptive_sched: true  ALL
preempt_order: "R"
prime_exempt_anytime_queues:    false
primetime_prefix: p_
provision_policy: "aggressive_provision"
resources: "ncpus, mem, arch, host, vnode, aoe, ngpus, location, xeon8268_sockets, xeon6148_sockets"
round_robin: False      all
sched_preempt_enforce_resumption: False
smp_cluster_dist: pack
sort_queues:    true    ALL
strict_ordering: false ALL

There was a documented change between 18 and 20: most of the preemption options were moved from sched_config into qmgr. Do a qmgr -c 'list sched' and move all of the sched_config preemption options there.
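For example, using the values from the sched_config you posted (a sketch, not verified against your site):

qmgr -c 'set sched preempt_queue_prio = 150'
qmgr -c 'set sched preempt_prio = "express_queue, normal_jobs"'
qmgr -c 'set sched preempt_order = R'
qmgr -c 'set sched preempt_sort = min_time_since_start'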

Bhroam

Already set there - do we need to eliminate them from the config file as well?

On the advice of another site, we set resources_default.preempt_targets = QUEUE=s1,QUEUE=s2 on all of the express queues.
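That is, per express queue, roughly:

qmgr -c 'set queue user_a resources_default.preempt_targets = "QUEUE=s1,QUEUE=s2"'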

qmgr -c 'list sched'
Sched default
    sched_host = deluge-hn1
    pbs_version = 20.0.1
    sched_cycle_length = 00:20:00
    sched_preempt_enforce_resumption = False
    preempt_targets_enable = True
    sched_port = 15004
    sched_priv = /cm/shared/apps/openpbs/var/spool/sched_priv
    sched_log = /cm/shared/apps/openpbs/var/spool/sched_logs
    scheduling = True
    scheduler_iteration = 120
    state = idle
    preempt_queue_prio = 150
    preempt_prio = express_queue, normal_jobs
    preempt_order = R
    preempt_sort = min_time_since_start
    log_events = 3328
    server_dyn_res_alarm = 30

The curious bit remains that we see "Insufficient amount of resource" messages in the logs even though there are a large number of preemptable jobs that provide the resource:

tail -n 10 -f ../sched_logs/20211021  | grep -e'Simulation\|preempt\|145985'
10/21/2021 12:52:00;0800;pbs_sched;Job;check_max_user_res;145985.deluge-hn1 user wscu max_*user_res.ncpus (-2.0, 360.0), used 0.0
10/21/2021 12:52:00;0400;pbs_sched;Node;145985.deluge-hn1;Evaluating subchunk: xeon6148_sockets=2:mpiprocs=40:ncpus=40
10/21/2021 12:52:00;0400;pbs_sched;Node;145985.deluge-hn1;Failed to satisfy subchunk: 6:xeon6148_sockets=2:mpiprocs=40:ncpus=40
10/21/2021 12:52:00;0100;pbs_sched;Job;145985.deluge-hn1;Employing preemption to try and run high priority job.
10/21/2021 12:52:00;0800;pbs_sched;Job;check_max_user_res;145985.deluge-hn1 user wscu max_*user_res.ncpus (-2.0, 360.0), used 0.0
10/21/2021 12:52:00;0400;pbs_sched;Node;145985.deluge-hn1;Evaluating subchunk: xeon6148_sockets=2:mpiprocs=40:ncpus=40
10/21/2021 12:52:00;0400;pbs_sched;Node;145985.deluge-hn1;Failed to satisfy subchunk: 6:xeon6148_sockets=2:mpiprocs=40:ncpus=40
10/21/2021 12:52:00;0100;pbs_sched;Job;145985.deluge-hn1;Limited running jobs used for preemption from 61 to 22
10/21/2021 12:52:00;0100;pbs_sched;Job;145985.deluge-hn1;Found no preemptable candidates
10/21/2021 12:52:00;0100;pbs_sched;Job;145803.deluge-hn1;Employing preemption to try and run high priority job.
10/21/2021 12:52:00;0100;pbs_sched;Job;145803.deluge-hn1;Limited running jobs used for preemption from 61 to 1
10/21/2021 12:52:00;0100;pbs_sched;Job;145656.deluge-hn1;Simulation: preempting job
10/21/2021 12:52:00;0100;pbs_sched;Job;145803.deluge-hn1;Simulation: not enough work preempted: Insufficient amount of resource: ncpus 
10/21/2021 12:52:00;0100;pbs_sched;Job;145830.deluge-hn1;Employing preemption to try and run high priority job.
10/21/2021 12:52:00;0100;pbs_sched;Job;145830.deluge-hn1;Limited running jobs used for preemption from 61 to 1
10/21/2021 12:52:00;0100;pbs_sched;Job;145656.deluge-hn1;Simulation: preempting job
10/21/2021 12:52:00;0100;pbs_sched;Job;145830.deluge-hn1;Simulation: not enough work preempted: Insufficient amount of resource: ncpus 
10/21/2021 12:52:00;0100;pbs_sched;Job;145925.deluge-hn1;Employing preemption to try and run high priority job.
10/21/2021 12:52:00;0100;pbs_sched;Job;145925.deluge-hn1;Limited running jobs used for preemption from 61 to 1
10/21/2021 12:52:00;0100;pbs_sched;Job;145656.deluge-hn1;Simulation: preempting job
10/21/2021 12:52:00;0100;pbs_sched;Job;145925.deluge-hn1;Simulation: not enough work preempted: Insufficient amount of resource: ncpus 
10/21/2021 12:52:00;0100;pbs_sched;Job;145787.deluge-hn1;Employing preemption to try and run high priority job.
10/21/2021 12:52:00;0100;pbs_sched;Job;145787.deluge-hn1;No preemption set specified for the job: Job will not preempt

I suspect it is the preempt_targets. It is meant to limit the set of jobs that can be preempted.
Notice:

10/21/2021 12:52:00;0100;pbs_sched;Job;145830.deluge-hn1;Limited running jobs used for preemption from 61 to 1

This is saying there are 61 jobs that are in the valid preemption level, but you have limited that list down to just one job.
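You can sanity-check how many running jobs actually sit inside the targeted queues with something like:

qselect -q s1 -s R | wc -l
qselect -q s2 -s R | wc -l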

I’d try removing the preempt_targets and see if the problem persists.
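In this setup that means unsetting it everywhere it appears in the config you posted, e.g.:

qmgr -c 'unset server resources_default.preempt_targets'
qmgr -c 'unset queue s1 resources_default.preempt_targets'
qmgr -c 'unset queue s2 resources_default.preempt_targets'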

Bhroam

We tried removing

set server resources_default.preempt_targets = QUEUE=s1
set server resources_default.preempt_targets += QUEUE=s2

and setting it per user queue (user jobs should always preempt s1 and s2):

set queue user_a resources_default.preempt_targets = QUEUE=s1
set queue user_a resources_default.preempt_targets += QUEUE=s2

which didn’t lead to the expected preemption behavior. We then unset preempt_targets on the user queues and have seen no change in behavior:

Job: 147847.deluge-hn1

11/11/2021 12:47:11  L    Employing preemption to try and run high priority job.
11/11/2021 12:47:11  L    Evaluating subchunk: mpiprocs=40:ncpus=40:xeon6148_sockets=2:xeon8268_sockets=0
11/11/2021 12:47:11  L    Failed to satisfy subchunk: 9:mpiprocs=40:ncpus=40:xeon6148_sockets=2:xeon8268_sockets=0
11/11/2021 12:47:11  L    Evaluating subchunk: mpiprocs=40:ncpus=40:xeon6148_sockets=2:xeon8268_sockets=0
11/11/2021 12:47:11  L    Allocated one subchunk: mpiprocs=40:ncpus=40:xeon6148_sockets=2:xeon8268_sockets=0
11/11/2021 12:47:11  L    Evaluating subchunk: mpiprocs=40:ncpus=40:xeon6148_sockets=2:xeon8268_sockets=0
11/11/2021 12:47:11  L    Failed to satisfy subchunk: 9:mpiprocs=40:ncpus=40:xeon6148_sockets=2:xeon8268_sockets=0
11/11/2021 12:47:11  L    Found no preemptable candidates
11/11/2021 12:47:11  S    enqueuing into user_a, state 1 hop 1
11/11/2021 12:47:11  S    Job Queued at request of user_a@headnode, owner = user_a@headnode, job name = STDIN, queue = user_a
11/11/2021 12:47:11  S    Job Modified at request of Scheduler@headnode
11/11/2021 12:52:11  L    Employing preemption to try and run high priority job.
11/11/2021 12:52:11  L    Evaluating subchunk: mpiprocs=40:ncpus=40:xeon6148_sockets=2:xeon8268_sockets=0
11/11/2021 12:52:11  L    Failed to satisfy subchunk: 9:mpiprocs=40:ncpus=40:xeon6148_sockets=2:xeon8268_sockets=0
11/11/2021 12:52:11  L    Evaluating subchunk: mpiprocs=40:ncpus=40:xeon6148_sockets=2:xeon8268_sockets=0
11/11/2021 12:52:11  L    Allocated one subchunk: mpiprocs=40:ncpus=40:xeon6148_sockets=2:xeon8268_sockets=0
11/11/2021 12:52:11  L    Evaluating subchunk: mpiprocs=40:ncpus=40:xeon6148_sockets=2:xeon8268_sockets=0
11/11/2021 12:52:11  L    Failed to satisfy subchunk: 9:mpiprocs=40:ncpus=40:xeon6148_sockets=2:xeon8268_sockets=0
11/11/2021 12:52:11  L    Found no preemptable candidates
11/11/2021 12:57:10  L    Employing preemption to try and run high priority job.
11/11/2021 12:57:10  L    Evaluating subchunk: mpiprocs=40:ncpus=40:xeon6148_sockets=2:xeon8268_sockets=0
11/11/2021 12:57:10  L    Failed to satisfy subchunk: 9:mpiprocs=40:ncpus=40:xeon6148_sockets=2:xeon8268_sockets=0
11/11/2021 12:57:10  L    Evaluating subchunk: mpiprocs=40:ncpus=40:xeon6148_sockets=2:xeon8268_sockets=0
11/11/2021 12:57:10  L    Allocated one subchunk: mpiprocs=40:ncpus=40:xeon6148_sockets=2:xeon8268_sockets=0
11/11/2021 12:57:10  L    Evaluating subchunk: mpiprocs=40:ncpus=40:xeon6148_sockets=2:xeon8268_sockets=0
11/11/2021 12:57:10  L    Failed to satisfy subchunk: 9:mpiprocs=40:ncpus=40:xeon6148_sockets=2:xeon8268_sockets=0
11/11/2021 12:57:10  L    Found no preemptable candidates
11/11/2021 12:57:10  S    Job Modified at request of Scheduler@headnode
[wscu@deluge-hn1 ~]$ qstat -f 147847
Job Id: 147847.deluge-hn1
    Job_Name = STDIN
    Job_Owner = user_a@headnode
    job_state = Q
    queue = user_a
    server = headnode
    Checkpoint = u
    ctime = Thu Nov 11 12:47:11 2021
    Error_Path = headnode:/home/user_a/STDIN.e
        147847
    Hold_Types = n
    interactive = True
    Join_Path = n
    Keep_Files = eod
    Mail_Points = a
    mtime = Thu Nov 11 12:47:11 2021
    Output_Path = headnode:/home/user_a/STDIN.
        o147847
    Priority = 0
    qtime = Thu Nov 11 12:47:11 2021
    Rerunable = False
    Resource_List.mpiprocs = 360
    Resource_List.ncpus = 360
    Resource_List.nodect = 9
    Resource_List.place = free
    Resource_List.select = 9
    Resource_List.xeon6148_sockets = 18
    Resource_List.xeon8268_sockets = 0
    substate = 10
    Variable_List = PBS_O_HOME=/home/user_a,PBS_O_LANG=en_US.UTF-8,
        PBS_O_LOGNAME=user_a,
        PBS_O_PATH=/home/user_a
        PBS_O_MAIL=/var/spool/mail/user_a,PBS_O_SHELL=/bin/bash,
        PBS_O_WORKDIR=/home/user_a,PBS_O_SYSTEM=Linux,PBS_O_QUEUE=user_a,
        PBS_O_HOST=headnode
    comment = Not Running: Insufficient amount of resource: ncpus (R: 360 A: 33
        6 T: 8672)
    etime = Thu Nov 11 12:47:11 2021
    Submit_arguments = -I -q user_a -lselect=9
    project = _pbs_project_default
    Submit_Host = headnode

There are more than enough nodes in the low-priority preemptable queues to satisfy the request of the user job:

[root@headnode ~]# qstat -n s1 | sed 's/+/\n/g' | sort | grep compute | wc -l
54
[root@headnode ~]# qstat -n s2 | sed 's/+/\n/g' | sort | grep compute | wc -l
6

We’ve tried just about every combination and permutation of the configuration here without finding one that leads to working preemption. At this point, an external script is being used to qhold and qrerun jobs that should have been preempted by the scheduler when higher-priority jobs come along.
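(For reference, the workaround amounts to something like the sketch below; the job selection is simplified here, and the real script picks its candidates based on the waiting express job’s requirements.)

#!/bin/bash
# Rough sketch of the manual preemption workaround: requeue running
# low-priority jobs so their nodes free up for the waiting express job.
for jobid in $(qselect -q s2 -s R); do
    qhold "$jobid"     # hold it so it is not immediately rescheduled
    qrerun "$jobid"    # requeue the running job, releasing its nodes
done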

[user_d@headnode ~]$ tracejob 148184 

Job: 148184.deluge-hn1

11/17/2021 09:56:51  S    enqueuing into user_b, state 1 hop 1
11/17/2021 09:56:51  S    Job Queued at request of user_b@headnode, owner = user_b@headnode, job name = 0.5_125kK, queue = user_b
11/17/2021 09:56:51  S    Job Modified at request of Scheduler@headnode
11/17/2021 10:24:53  S    Job Modified at request of Scheduler@headnode
11/17/2021 10:25:24  S    Job Modified at request of Scheduler@headnode
11/17/2021 10:35:13  S    Job Modified at request of Scheduler@headnode
11/17/2021 10:41:35  S    Job Modified at request of Scheduler@headnode
11/17/2021 11:16:00  S    Job Modified at request of Scheduler@headnode
11/17/2021 11:55:08  S    Job Modified at request of Scheduler@headnode
11/17/2021 12:05:08  S    Job Modified at request of Scheduler@headnode
11/17/2021 12:08:52  L    Employing preemption to try and run high priority job.
11/17/2021 12:08:52  L    Failed to satisfy subchunk: 5:ncpus=40:mpiprocs=40:ompthreads=1:xeon6148_sockets=2:xeon8268_sockets=0
11/17/2021 12:08:52  L    Found no preemptable candidates
11/17/2021 12:08:52  S    Job Modified at request of Scheduler@headnode
11/17/2021 12:08:52  S    Job Modified at request of Scheduler@headnode
11/17/2021 12:08:54  L    Evaluating subchunk: ncpus=40:mpiprocs=40:ompthreads=1:xeon6148_sockets=2:xeon8268_sockets=0
11/17/2021 12:08:54  L    Allocated one subchunk: ncpus=40:mpiprocs=40:ompthreads=1:xeon6148_sockets=2:xeon8268_sockets=0
11/17/2021 12:08:54  S    Job Run at request of Scheduler@headnode on exec_vnode (compute037:ncpus=40:xeon6148_sockets=2:xeon8268_sockets=0)+(compute035:ncpus=40:xeon6148_sockets=2:xeon8268_sockets=0)+(compute034:ncpus=40:xeon6148_sockets=2:xeon8268_sockets=0)+(compute033:ncpus=40:xeon6148_sockets=2:xeon8268_sockets=0)+(compute063:ncpus=40:xeon6148_sockets=2:xeon8268_sockets=0)
11/17/2021 12:08:55  S    update from Mom without session id

I see that the user_a queue has a limit of max_user_res.ncpus=360.
I think this limit will restrict the preemptable candidates to jobs that match the high-priority job’s user. Because the limit is set on the queue, it also restricts the candidates to jobs in the same queue as the high-priority job, so the scheduler will only try to target jobs from that queue.

Can you please try the suspension scenario after unsetting queue user_a’s limit?
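That is:

qmgr -c 'unset queue user_a max_user_res.ncpus'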

Thanks

I think I’ve stumbled onto a partial solution. The limiting factor is the nodes with the custom resources xeon6148_sockets and xeon8268_sockets defined:

#
# Create and define resource xeon8268_sockets
#
create resource xeon8268_sockets
set resource xeon8268_sockets type = long
set resource xeon8268_sockets flag = hn
#
# Create and define resource xeon6148_sockets
#
create resource xeon6148_sockets
set resource xeon6148_sockets type = long
set resource xeon6148_sockets flag = hn

Since the s1 and s2 queues don’t set a default_chunk or resources_default for xeon6148_sockets or xeon8268_sockets (which allows their jobs to land on either node type), jobs submitted to those queues don’t request the socket resources unless the user asks for them explicitly. As a result, the socket resources don’t show up in the job’s resources_used or in its Resource_List, and resources_assigned on the node stays at 0 while resources_available remains at 2.
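For example, on a node running only s1/s2 work, pbsnodes shows something like:

pbsnodes compute001 | grep -E 'xeon(6148|8268)_sockets'
     resources_available.xeon6148_sockets = 2
     resources_assigned.xeon6148_sockets = 0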

The high-priority queues, by contrast, set default_chunk.xeon6148_sockets = 2 and default_chunk.xeon8268_sockets = 0 (or the reverse, depending on the queue). The total number of sockets requested shows up in Resource_List and is tracked in resources_assigned for the node in pbsnodes.

I believe that, to help the scheduler figure out that the jobs in s1 and s2 are holding the needed resources, I’ll need a runjob hook that looks at the assigned node list and, at minimum, updates the job’s Resource_List. I’m not clear on whether I’ll also have to do something at execjob_begin or execjob_prologue to set the job’s resources_used and/or the status of the vnodes.
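If I go that route, registering the hook itself would be along these lines (the hook name and script file are placeholders; the Python body still needs to be written):

qmgr -c 'create hook fix_socket_request'
qmgr -c 'set hook fix_socket_request event = runjob'
qmgr -c 'set hook fix_socket_request enabled = true'
qmgr -c 'import hook fix_socket_request application/x-python default fix_socket_request.py'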