Is qdel for a job array with 5k sub-jobs supposed to stop the scheduler?

We have seen that running qdel <job_id>[].<pbs_server> (e.g. qdel 101[].pbs) on a job array with 5,000 sub-jobs causes the scheduler to halt, and the scheduling cycle reaches its timeout after 20 minutes.

Is that expected, or is it a bug?


Could you please time the 5K array job deletion as below:

qdel 101[]
qdel 102[].pbs
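
One way to capture the timing requested above is a small wrapper that measures wall-clock time around each command. This is only a minimal sketch, not a PBS tool: time_command is a hypothetical helper name, the job IDs are the ones from the example above, and it assumes qdel is on the PATH.

```python
import subprocess
import time


def time_command(argv):
    """Run argv and return (exit_code, elapsed_seconds) in wall-clock time."""
    start = time.perf_counter()
    result = subprocess.run(argv, check=False)
    elapsed = time.perf_counter() - start
    return result.returncode, elapsed


if __name__ == "__main__":
    # Job IDs copied from the example above; replace them with real array IDs.
    for job_id in ("101[]", "102[].pbs"):
        try:
            code, seconds = time_command(["qdel", job_id])
        except FileNotFoundError:
            print("qdel not found on PATH")
            break
        print(f"qdel {job_id}: exit={code} elapsed={seconds:.1f}s")
```

Comparing the two elapsed times would show whether the short form and the fully qualified form of the job ID behave differently.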

Thank you

No, this is not supposed to happen. I tried submitting a job array with 1M sub-jobs and still didn’t see the scheduler halt or time out:

[ragrawal@blrentperf04 ~]$ qsub -J 1-1000000 -- /bin/sleep 100
[ragrawal@blrentperf04 ~]$ qstat
Job id            Name             User              Time Use S Queue
----------------  ---------------- ----------------  -------- - -----
1000002[].blrent* STDIN            ragrawal                 0 B workq           
[ragrawal@blrentperf04 ~]$ qstat -t | wc -l
[ragrawal@blrentperf04 ~]$ qdel 1000002[]
[ragrawal@blrentperf04 ~]$ qsub -- /bin/sleep 100
[ragrawal@blrentperf04 ~]$ qsub -- /bin/sleep 100
[ragrawal@blrentperf04 ~]$ qsub -- /bin/sleep 100
[ragrawal@blrentperf04 ~]$ qstat
Job id            Name             User              Time Use S Queue
----------------  ---------------- ----------------  -------- - -----
1000003.blrentpe* STDIN            ragrawal          00:00:00 R workq           
1000004.blrentpe* STDIN            ragrawal                 0 R workq           
1000005.blrentpe* STDIN            ragrawal                 0 R workq           
[ragrawal@blrentperf04 ~]$

The qdel did take a few minutes, but the scheduler didn’t time out.

What do you see happening in those 20 minutes in the sched logs? Do you see “Considering job to run” lines or something else? Also, do you see the server getting hung as well? Does it respond to other requests while qdel is happening?

Thank you for the details. Can you please tell me which version of PBS you are using?
We are using v19.3.1; perhaps it’s a known issue with the version we are using?

We don’t get “Considering job to run”. Here is a snippet of our log file (I’ve obfuscated some details):

06/23/2021 18:05:39;0040;pbs_sched;Job;xxxxxx.pbs;Insufficient amount of resource: ncpus
06/23/2021 18:15:39;0040;pbs_sched;Sched;xxxxxxx.pbs;Failed to update attr ‘comment’ = Not Running: Insufficient amount of resource: ncpus : Premature end of message (15031)
06/23/2021 18:15:40;0040;pbs_sched;Job;xxxxxx.pbs;Insufficient amount of resource: ncpus
06/23/2021 18:25:40;0040;pbs_sched;Sched;xxxxxxx.pbs;Failed to update attr ‘comment’ = Not Running: Insufficient amount of resource: ncpus : Premature end of message (15031)
06/23/2021 18:25:43;0040;pbs_sched;Job;xxxxxxx.pbs;Insufficient amount of resource: ncpus
06/23/2021 18:25:43;0040;pbs_sched;Sched;toolong;Leaving the scheduling cycle: Cycle duration of 1229 seconds has exceeded sched_cycle_length of 1200 seconds
06/23/2021 18:30:55;0040;pbs_sched;Job;xxxxxx[424].pbs;Job preempted by suspension
06/23/2021 18:30:56;0040;pbs_sched;Job;xxxxxx.pbs;Job run
06/23/2021 18:31:03;0040;pbs_sched;Job;xxxxxx[299].pbs;Job preempted by suspension
06/23/2021 18:31:04;0040;pbs_sched;Job;xxxxxx[22].pbs;Job preempted by suspension
06/23/2021 18:31:04;0040;pbs_sched;Job;xxxxxx.pbs;Job run
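
The stalls in the log above (for example 18:05:39 to 18:15:39) can be located mechanically by looking for large gaps between consecutive entries. The following is a rough sketch that assumes every line starts with an MM/DD/YYYY HH:MM:SS timestamp followed by a semicolon, as in the snippet. parse_timestamp and find_gaps are hypothetical helper names, and the sample lines are copied from the snippet.

```python
from datetime import datetime, timedelta


def parse_timestamp(line):
    """Return the leading 'MM/DD/YYYY HH:MM:SS' timestamp of a log line."""
    stamp = line.split(";", 1)[0]
    return datetime.strptime(stamp, "%m/%d/%Y %H:%M:%S")


def find_gaps(lines, threshold=timedelta(minutes=1)):
    """Return (gap, previous_line, next_line) tuples for consecutive
    entries that are more than `threshold` apart."""
    gaps = []
    previous = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        current = parse_timestamp(line)
        if previous is not None and current - previous[0] > threshold:
            gaps.append((current - previous[0], previous[1], line))
        previous = (current, line)
    return gaps


# Lines copied from the scheduler log snippet above.
SAMPLE = """\
06/23/2021 18:05:39;0040;pbs_sched;Job;xxxxxx.pbs;Insufficient amount of resource: ncpus
06/23/2021 18:15:39;0040;pbs_sched;Sched;xxxxxxx.pbs;Failed to update attr ‘comment’ = Not Running: Insufficient amount of resource: ncpus : Premature end of message (15031)
06/23/2021 18:15:40;0040;pbs_sched;Job;xxxxxx.pbs;Insufficient amount of resource: ncpus
06/23/2021 18:25:40;0040;pbs_sched;Sched;xxxxxxx.pbs;Failed to update attr ‘comment’ = Not Running: Insufficient amount of resource: ncpus : Premature end of message (15031)
06/23/2021 18:25:43;0040;pbs_sched;Job;xxxxxxx.pbs;Insufficient amount of resource: ncpus
06/23/2021 18:25:43;0040;pbs_sched;Sched;toolong;Leaving the scheduling cycle: Cycle duration of 1229 seconds has exceeded sched_cycle_length of 1200 seconds
"""

if __name__ == "__main__":
    for gap, before, after in find_gaps(SAMPLE.splitlines()):
        print(f"{gap} stall before: {after[:60]}")
```

On these sample lines, each stall ends at a “Failed to update attr” entry, which points at the individual requests from the scheduler to the server as the place where the time is spent.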


I tested using the latest master code.

It seems like the scheduler was not able to send messages to the server and the request timed out. Can you look into the server logs to see what it was doing between 18:05:39 and 18:15:39?
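
To pull out just that window from a server log, a filter along these lines may help. It is only a sketch under the assumption that the server log uses the same leading MM/DD/YYYY HH:MM:SS timestamp as the scheduler log. lines_in_window is a hypothetical helper name, and the log file path is passed on the command line because it depends on the installation.

```python
import sys
from datetime import datetime

FORMAT = "%m/%d/%Y %H:%M:%S"


def lines_in_window(lines, start, end):
    """Return the log lines whose leading timestamp lies in [start, end]."""
    begin = datetime.strptime(start, FORMAT)
    finish = datetime.strptime(end, FORMAT)
    selected = []
    for line in lines:
        stamp = line.split(";", 1)[0]
        try:
            when = datetime.strptime(stamp, FORMAT)
        except ValueError:
            continue  # skip lines that do not start with a timestamp
        if begin <= when <= finish:
            selected.append(line.rstrip("\n"))
    return selected


if __name__ == "__main__":
    if len(sys.argv) != 2:
        print("usage: python window.py <server_log_file>")
    else:
        with open(sys.argv[1], encoding="utf-8", errors="replace") as handle:
            # Window taken from the scheduler log gap discussed above.
            matches = lines_in_window(
                handle, "06/23/2021 18:05:39", "06/23/2021 18:15:39"
            )
        for line in matches:
            print(line)
```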


06/23/2021 18:05:19;0010;Server@pbs1;Job;xxxxxxx.pbs1;Exit_status=0 resources_used.cpupercent=97 resources_used.cput=01:00:16 resources_used.mem=647844kb resources_used.ncpus=1 resources_used.vmem=2507768kb resources_used.walltime=01:00:11
06/23/2021 18:25:06;0008;Server@pbs1;Job;xxxxxxx[3852].pbs1;Job sent signal SIGKILL on delete
06/23/2021 18:25:19;0008;Server@pbs1;Job;xxxxxxx[3894].pbs1;Job sent signal SIGKILL on delete