Qmove'd job is overtaken?

Hello,

I’m using PBS Pro version 19.1.2 on CentOS 7.6.
My test cluster has a head node (server) and one compute node (execution).
There are two queues, ‘workq’ and ‘que2’, to test the behavior of qmove.

I found strange behavior with a ‘qmove’d job.

  1. A normal user moved his job from ‘workq’ to ‘que2’.
    The command is:

    ‘qmove que2 13.share0A’

Then the moved job (13.share0A) is shown at the top of the ‘qstat’ output.

Job ID          Username Queue    Jobname    SessID NDS TSK Memory Time  S Time
--------------- -------- -------- ---------- ------ --- --- ------ ----- - -----
13.share0A      user009  que2     OPENSSB       --    1  12  8192m 02:00 Q   --
12.share0A      user009  workq    OPENSSA    149275   1  12  8192m 02:00 R 00:00
14.share0A      user009  workq    OPENSSC       --    1  12  8192m 02:00 Q   --
15.share0A      user009  que2     OPENS2A       --    1  12  8192m 02:00 Q   --
16.share0A      user009  que2     OPENS2A       --    1  12  8192m 02:00 Q   --
  • Is this normal behavior?
  • Why is the moved job 13.share0A printed at the top of the ‘qstat’ list?
  2. The moved job never starts (it is overtaken).
    Only after all the queued (‘Q’) jobs in ‘que2’ had finished did job 13.share0A start.
    There is a job in ‘que2’ whose queued time (qtime) is later than the qtime of job 13.
    So job 13.share0A was overtaken by jobs that came later.

    Here is my observation:

(1) The moved Job 13 is waiting in the que2 queue.
Job 19 was running.
Job 21 was submitted later; its qtime is later than that of Job 13.

Job ID          Username Queue    Jobname    SessID NDS TSK Memory Time  S Time
--------------- -------- -------- ---------- ------ --- --- ------ ----- - -----
13.share0A      user009  que2     OPENSSB       --    1  12  8192m 02:00 Q   --
19.share0A      user009  que2     OPENS2D    151511   1  12  8192m 02:00 R 00:02
21.share0A      user009  que2     OPENS2E       --    1  12  8192m 02:00 Q   --

   Job ID=13 : moved job  (qtime = Mon Sep  9 15:15:37 2019)
   Job ID=19 : 'R' (qtime = Mon Sep  9 15:15:04 2019)
   Job ID=21 : 'Q' (qtime = Mon Sep  9 15:17:15 2019)

(2) Job 21 overtakes Job 13
Job 21 starts before Job 13.

Job ID          Username Queue    Jobname    SessID NDS TSK Memory Time  S Time
--------------- -------- -------- ---------- ------ --- --- ------ ----- - -----
13.share0A      user009  que2     OPENSSB       --    1  12  8192m 02:00 Q   --
21.share0A      user009  que2     OPENS2E    151787   1  12  8192m 02:00 R 00:00
  • Again, is this normal behavior for a moved job?
  • Why is the moved job 13.share0A overtaken by later jobs?

I would appreciate it if someone could answer the questions above.
If this is a bug, is there any quick workaround?

Best Regards,
–Sudo

Please note: if you could provide your $PBS_HOME/sched_priv/sched_config, the output of pbsnodes -av and qstat -Bf, and the qsub statements of these jobs, that would be helpful.

Generally (unless overridden by what is set in sched_config):

  • Queues are sorted in order of their priority or creation time:
    • jobs from the first sorted queue are run first
    • jobs from the next sorted queue are run next
  • The behaviour you have seen is normal

Please check by_queue, strict_ordering, and job_sort_formula in the PBS Professional Admin Guide.

Please note: if you want jobs to run strictly in the order of their submission (or job ID), you need to configure strict FIFO. This can lead to under-utilization of the cluster; for example, if the first job cannot run, the next job that could run now is not given the chance to backfill and run instead. So one stuck job can leave the cluster idle until an admin notices and intervenes.
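
For reference, a strict-FIFO style setup looks roughly like the following in $PBS_HOME/sched_priv/sched_config, plus disabling backfill on the server. This is only a sketch; please confirm the exact recipe in the Admin Guide:

   round_robin: False all
   by_queue: False prime
   by_queue: False non_prime
   strict_ordering: True ALL
   help_starving_jobs: False ALL
   fair_share: False ALL

   qmgr -c "set server backfill_depth=0"

Leave job_sort_key unset, and restart PBS (or HUP the scheduler) after editing sched_config so the change is picked up.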

If you could tell us what kind of job scheduling policy you would like to achieve, that would be helpful.

Hello, adarsh,

Thank you for the reply.
I’m sending sched_config first, but it is the default; no modifications were made.
The pbsnodes -av and qstat -Bf output will be available later.

As for the qsub parameters, they are:

#PBS -A TESTJOB
#PBS -q workq
#PBS -N OPENSS
#PBS -l select=1:ncpus=12:mem=4096MB:mpiprocs=12
#PBS -l walltime=02:00:00
#PBS -j eo

I haven’t set any special parameters or configuration for my PBS Pro;
it is the default from installation.
The one change I made was to create the ‘que2’ queue.
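
For reference, creating que2 was just the usual qmgr sequence, roughly:

   qmgr -c "create queue que2"
   qmgr -c "set queue que2 queue_type = Execution"
   qmgr -c "set queue que2 enabled = True"
   qmgr -c "set queue que2 started = True"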

I want to know why job 13 (moved from the workq queue to que2) was overtaken
by job 21.
Clearly, the qtime of job 21 is later than that of job 13.
Is this the default behavior of PBS Pro?

Best Regards,
–Sudo

Hello,
I can’t figure out how to attach a text file (sched_config) to my post, so
here is my sched_config file (comments and blank lines removed).

rsudo@rsudo-X8SAX:/tmp$ grep -v "^#" sched_config | sed '/^$/d'
round_robin: False all
by_queue: True prime
by_queue: True non_prime
strict_ordering: false ALL
help_starving_jobs: true ALL
max_starve: 24:00:00
backfill_prime: false ALL
prime_exempt_anytime_queues: false
primetime_prefix: p_
nonprimetime_prefix: np_
node_sort_key: “sort_priority HIGH” ALL
provision_policy: “aggressive_provision”
sort_queues: true ALL
resources: “ncpus, mem, arch, host, vnode, aoe, eoe”
load_balancing: false ALL
smp_cluster_dist: pack
fair_share: false ALL
fairshare_usage_res: cput
fairshare_entity: euser
fairshare_decay_time: 24:00:00
fairshare_decay_factor: 0.5
preemptive_sched: true ALL
preempt_queue_prio: 150
preempt_prio: “express_queue, normal_jobs”
preempt_order: “SCR”
preempt_sort: min_time_since_start
dedicated_prefix: ded
log_filter: 3328

Thanks,
–Sudo

Addendum:

Here are the ‘pbsnodes -av’ and ‘qstat -Bf’ outputs.

[root@share0A ~]# pbsnodes -av
o610f001
     Mom = o610f001
     Port = 15002
     pbs_version = 19.1.2
     ntype = PBS
     state = free
     pcpus = 12
     resources_available.arch = linux
     resources_available.host = o610f001
     resources_available.mem = 263542952kb
     resources_available.ncpus = 12
     resources_available.vnode = o610f001
     resources_assigned.accelerator_memory = 0kb
     resources_assigned.hbmem = 0kb
     resources_assigned.mem = 0kb
     resources_assigned.naccelerators = 0
     resources_assigned.ncpus = 0
     resources_assigned.vmem = 0kb
     resv_enable = True
     sharing = default_shared
     last_state_change_time = Tue Sep 10 13:30:10 2019
     last_used_time = Mon Sep  9 15:45:03 2019


[root@share0A ~]# qstat -Bf
Server: share0A
    server_state = Active
    server_host = share0a
    scheduling = True
    total_jobs = 0
    state_count = Transit:0 Queued:0 Held:0 Waiting:0 Running:0 Exiting:0 Begun
	:0 
    default_queue = workq
    log_events = 511
    mail_from = adm
    query_other_jobs = True
    resources_default.ncpus = 1
    default_chunk.ncpus = 1
    scheduler_iteration = 600
    FLicenses = 20000000
    resv_enable = True
    node_fail_requeue = 310
    max_array_size = 10000
    pbs_license_min = 0
    pbs_license_max = 2147483647
    pbs_license_linger_time = 31536000
    license_count = Avail_Global:10000000 Avail_Local:10000000 Used:0 High_Use:
	0
    pbs_version = 19.1.2
    eligible_time_enable = False
    max_concurrent_provision = 5
    power_provisioning = False
    max_job_sequence_id = 9999999

Thank you,
–Sudo

qmgr -c "p s" shows this output.

[root@share0A ~]# qmgr -c "p s"
#
# Create queues and set their attributes.
#
#
# Create and define queue workq
#
create queue workq
set queue workq queue_type = Execution
set queue workq enabled = True
set queue workq started = True
#
# Create and define queue que2
#
create queue que2
set queue que2 queue_type = Execution
set queue que2 enabled = True
set queue que2 started = True
#
# Set server attributes.
#
set server scheduling = True
set server default_queue = workq
set server log_events = 511
set server mail_from = adm
set server query_other_jobs = True
set server resources_default.ncpus = 1
set server default_chunk.ncpus = 1
set server scheduler_iteration = 600
set server resv_enable = True
set server node_fail_requeue = 310
set server max_array_size = 10000
set server pbs_license_min = 0
set server pbs_license_max = 2147483647
set server pbs_license_linger_time = 31536000
set server eligible_time_enable = False
set server max_concurrent_provision = 5
set server max_job_sequence_id = 9999999

Thanks,
–Sudo

Hello,

I set “strict_ordering: True ALL” in sched_config and restarted PBS.
I then tried the same job-mix test with a ‘qmove’d job.
The result is the same as before: the qmoved job was overtaken by a later job.


“strict_ordering: True ALL”

Note: the jobs here are all identical, with:

   #PBS -l select=1:ncpus=6:mem=8192MB:mpiprocs=6
   #PBS -l walltime=02:00:00

(1) Before the qmove of JID 33

Job ID          Username Queue    Jobname    SessID NDS TSK Memory Time  S Time
--------------- -------- -------- ---------- ------ --- --- ------ ----- - -----
31.share0A      user009  workq    JOBW1       22350   1   6  8192m 02:00 R 00:00
32.share0A      user009  que2     JOBQ1       22370   1   6  8192m 02:00 R 00:00
33.share0A      user009  workq    JOBW2         --    1   6  8192m 02:00 Q   -- 
34.share0A      user009  que2     JOBQ2         --    1   6  8192m 02:00 Q   -- 
35.share0A      user009  workq    JOBW3         --    1   6  8192m 02:00 Q   -- 
36.share0A      user009  que2     JOBQ3         --    1   6  8192m 02:00 Q   -- 

Six identical jobs are submitted across the workq and que2 queues:
‘workq’ : JID 31, 33, 35
‘que2’ : JID 32, 34, 36

(2) After qmove of JID 33 from workq to que2

[user009@share0A ~]$ qmove que2 33.share0A ; qstat -a

Job ID          Username Queue    Jobname    SessID NDS TSK Memory Time  S Time
--------------- -------- -------- ---------- ------ --- --- ------ ----- - -----
33.share0A      user009  que2     JOBW2         --    1   6  8192m 02:00 Q   -- 
31.share0A      user009  workq    JOBW1       22350   1   6  8192m 02:00 R 00:00
32.share0A      user009  que2     JOBQ1       22370   1   6  8192m 02:00 R 00:00
34.share0A      user009  que2     JOBQ2         --    1   6  8192m 02:00 Q   -- 
35.share0A      user009  workq    JOBW3         --    1   6  8192m 02:00 Q   -- 
36.share0A      user009  que2     JOBQ3         --    1   6  8192m 02:00 Q   -- 
  • JID 33 has moved from the workq queue to que2.
  • qstat -a lists JID 33 at the top of the list.

(3) Submit another job (JID 37) into que2

[user009@share0A ~]$ qsub jobq4 ; qstat -a
37.share0A

Job ID          Username Queue    Jobname    SessID NDS TSK Memory Time  S Time
--------------- -------- -------- ---------- ------ --- --- ------ ----- - -----
33.share0A      user009  que2     JOBW2         --    1   6  8192m 02:00 Q   -- 
31.share0A      user009  workq    JOBW1       22350   1   6  8192m 02:00 R 00:01
32.share0A      user009  que2     JOBQ1       22370   1   6  8192m 02:00 R 00:01
34.share0A      user009  que2     JOBQ2         --    1   6  8192m 02:00 Q   -- 
35.share0A      user009  workq    JOBW3         --    1   6  8192m 02:00 Q   -- 
36.share0A      user009  que2     JOBQ3         --    1   6  8192m 02:00 Q   -- 
37.share0A      user009  que2     JOBQ4         --    1   6  8192m 02:00 Q   --  << NEW JOB>>

Note: JID 37 is identical to the other jobs.

(4) The jobs’ qtimes

Clearly, the qtime of JID 37 is later than that of JID 33.

[user009@share0A ~]$ qstat -f | egrep "^Job|qtime"
Job Id: 33.share0A
    qtime = Tue Sep 10 15:21:50 2019
Job Id: 31.share0A
    qtime = Tue Sep 10 15:20:45 2019
Job Id: 32.share0A
    qtime = Tue Sep 10 15:20:45 2019
Job Id: 34.share0A
    qtime = Tue Sep 10 15:20:55 2019
Job Id: 35.share0A
    qtime = Tue Sep 10 15:21:05 2019
Job Id: 36.share0A
    qtime = Tue Sep 10 15:21:05 2019
Job Id: 37.share0A
    qtime = Tue Sep 10 15:22:44 2019

The start order of the jobs should therefore be:
31, 32, 34, 35, 36, 33, 37.

(5) Some time later, the last job (JID 37) started before JID 33

JID 37 overtakes JID 33 (!)

Job ID          Username Queue    Jobname    SessID NDS TSK Memory Time  S Time
--------------- -------- -------- ---------- ------ --- --- ------ ----- - -----
33.share0A      user009  que2     JOBW2         --    1   6  8192m 02:00 Q   -- 
36.share0A      user009  que2     JOBQ3       22890   1   6  8192m 02:00 R 00:00
37.share0A      user009  que2     JOBQ4       22910   1   6  8192m 02:00 R 00:00

(6) Output files

JID 33 finished last.

-rw-------  1 user009     1572 Sep 10 15:23 JOBW1.e31
-rw-------  1 user009     1691 Sep 10 15:23 JOBQ1.e32
-rw-------  1 user009     1694 Sep 10 15:26 JOBW3.e35
-rw-------  1 user009     1691 Sep 10 15:26 JOBQ2.e34
-rw-------  1 user009     1691 Sep 10 15:29 JOBQ4.e37
-rw-------  1 user009     1691 Sep 10 15:29 JOBQ3.e36
-rw-------  1 user009     1570 Sep 10 15:32 JOBW2.e33

sched_config:

[root@share0A ~]# egrep "strict_ordering|round_robin|job_sort_key|fair_share|backfill_depth|job_sort_formula" /var/spool/pbs/sched_priv/sched_config
# round_robin
round_robin: False	all
#	Run jobs by queues.  If both round_robin and by_queue are not set,
# strict_ordering
#	while adhering to site policy.  Note that strict_ordering can result
strict_ordering: True 	ALL
#	      Use job_sort_formula specifying eligible_time to establish equivalent behavior.
#	to the backfill_depth server parameter.
# job_sort_key 
#	job_sort_key allows jobs to be sorted by any resource.  This 
#	Usage: job_sort_key: "resource_name HIGH | LOW"
#		fair_share_perc, preempt_priority, job_priority
#	job_sort_key: "ncpus HIGH"
#	job_sort_key: "mem LOW"
#job_sort_key: "cput LOW"	ALL
#	node_sort_key is similar to job_sort_key but works for nodes.  
#	round_robin - run one job on each node in a cycle
# fair_share 
fair_share: false	ALL
#	      Uncomment this line (and turn on fair_share above)
#				SEE: fair_share

[root@share0A ~]# pbsnodes -av
o610f001
     Mom = o610f001
     Port = 15002
     pbs_version = 19.1.2
     ntype = PBS
     state = free
     pcpus = 12
     resources_available.arch = linux
     resources_available.host = o610f001
     resources_available.mem = 263542952kb
     resources_available.ncpus = 12
     resources_available.vnode = o610f001
     resources_assigned.accelerator_memory = 0kb
     resources_assigned.hbmem = 0kb
     resources_assigned.mem = 0kb
     resources_assigned.naccelerators = 0
     resources_assigned.ncpus = 0
     resources_assigned.vmem = 0kb
     resv_enable = True
     sharing = default_shared
     last_state_change_time = Tue Sep 10 15:29:47 2019
     last_used_time = Tue Sep 10 15:32:48 2019

[root@share0A ~]# qstat -Bf
Server: share0A
    server_state = Active
    server_host = share0a
    scheduling = True
    total_jobs = 0
    state_count = Transit:0 Queued:0 Held:0 Waiting:0 Running:0 Exiting:0 Begun
	:0 
    default_queue = workq
    log_events = 511
    mail_from = adm
    query_other_jobs = True
    resources_default.ncpus = 1
    default_chunk.ncpus = 1
    resources_assigned.mem = 0mb
    resources_assigned.mpiprocs = 0
    resources_assigned.ncpus = 0
    resources_assigned.nodect = 0
    scheduler_iteration = 600
    FLicenses = 20000000
    resv_enable = True
    node_fail_requeue = 310
    max_array_size = 10000
    pbs_license_min = 0
    pbs_license_max = 2147483647
    pbs_license_linger_time = 31536000
    license_count = Avail_Global:10000000 Avail_Local:10000000 Used:0 High_Use:
	0
    pbs_version = 19.1.2
    eligible_time_enable = False
    max_concurrent_provision = 5
    power_provisioning = False
    max_job_sequence_id = 9999999

Hello,

I’m seeing the following ‘estimated.*’ attributes in ‘qstat -f 33’ (the qmoved job).
I don’t see these lines for the other jobs with ‘qstat -f’.

estimated.exec_vnode = (o610f001:ncpus=6:mem=8388608kb)   
estimated.start_time = Tue Sep 10 17:20:45 2019	      
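
For reference, a quick check over all jobs, similar to the qtime one-liner above, shows these attributes only on the qmoved JID 33:

   qstat -f | egrep "^Job Id|estimated"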

The estimated start time here is much later than the time JID 33 actually started.
JID 33 started and ended as follows:
Start: Tue Sep 10 15:29:46 JST 2019
End: Tue Sep 10 15:32:46 JST 2019

I would appreciate it if someone could explain these ‘estimated.*’ attributes given to the qmoved job.

Is this the reason why JID 33 was overtaken by the jobs that followed?

Best Regards,
–Sudo

Thank you @rsudo for the above information and experiments. I can easily replicate your use case.
For some reason the qmoved job is only considered at the end, unless there is no other job
in that queue that can run now. I will have to turn to @mkaro and @scc for any clues or an explanation.

I was also able to easily reproduce this behavior. I performed a tracejob and saw the following lines:

09/10/2019 11:36:50  S    dequeuing from workq, state 1
09/10/2019 11:36:50  S    enqueuing into que2, state 1 hop 1
09/10/2019 11:36:50  S    Job moved to que2 at request of myuserid@host.domain
09/10/2019 11:37:07  L    Considering job to run
09/10/2019 11:37:07  S    Job Modified at request of Scheduler@host.domain
09/10/2019 11:37:07  L    Not enough free nodes available
09/10/2019 11:37:08  L    Considering job to run

The last two lines were repeated for each scheduling cycle until the job finally ran. I’m not sure what the scheduler modified at 11:37:07, but presumably it changed something. This may or may not be the source of the problem. What do you think @bhroam and @arungrover?
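
If you want to check your own reproduction, the same lines can be pulled with something along these lines, using the ID of the moved job:

   tracejob <jobid> | egrep "enqueuing|Considering|Modified|Not enough"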

The other issue here is the ordering of the job IDs in the qstat output. They should be in numeric order, but they are not after the qmove. I don’t believe that has anything to do with the scheduler; it is a problem with the server, because it appears immediately after the qmove and before a scheduling cycle has run.

At this point I think the best thing would be to file an issue in GitHub so we can schedule time for someone to take a closer look at this.

I think this issue is a side effect of a ‘qorder’ issue that we fixed a long time back. What happens is that qorder dequeues the jobs in question and enqueues them again. Every time a job is enqueued it is given a new qrank number (which is essentially the time in milliseconds). This attribute is then used by the scheduler to sort the jobs in the order they were submitted.
In the case of qmove, the qrank attribute is not being set properly. Instead of setting qrank to the epoch time in milliseconds, it is set in seconds, which means that when sorting, the value is much smaller than the qrank set on other jobs, and that results in the job taking the top slot in the qstat list.
To fix the problem we need to change svr_movejob.c to set “JOB_ATR_qrank” to the time in milliseconds when the job is moved.
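
Just to illustrate the scale difference (this is not the actual PBS Pro code; the real change in svr_movejob.c is in the job attribute handling, and the helper below is only illustrative), here is a tiny standalone C sketch:

/* Sketch only: compares a millisecond-resolution qrank value with a
 * seconds-resolution one.  In the real fix, the millisecond value is what
 * svr_movejob.c would assign to the job's JOB_ATR_qrank attribute when the
 * job is re-enqueued after a qmove. */
#include <stdio.h>
#include <sys/time.h>

/* Current epoch time in milliseconds. */
static long long epoch_ms(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return (long long)tv.tv_sec * 1000LL + tv.tv_usec / 1000LL;
}

int main(void)
{
    long long ms = epoch_ms();     /* what normally enqueued jobs get */
    long long s  = ms / 1000LL;    /* what the qmove'd job effectively gets */

    /* The seconds value is three orders of magnitude smaller, so a sort on
     * qrank places the moved job ahead of everything else, which matches it
     * showing up at the top of the qstat list. */
    printf("qrank in milliseconds: %lld\n", ms);
    printf("qrank in seconds     : %lld\n", s);
    return 0;
}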

Thank you @mkaro, @arungrover, and @scc.
Thank you @rsudo for reporting this case.