Is qdel for a job array with 5k supposed to stop the scheduler?

roy · June 28, 2021, 11:52pm

Hi,
We have seen that running qdel <array_id>[].<pbs_server>, e.g. qdel 101[].pbs, on a job array with 5,000 sub-jobs causes the scheduler to halt, and the scheduler cycle then hits its timeout after 20 minutes.

Is that expected, or is it a bug?

Thanks,
Roy

adarsh · June 29, 2021, 8:25am

Could you please time the 5K array job deletion in both forms, as below:

qdel 101[]
qdel 102[].pbs
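
One way to capture the timing is sketched below. The helper name time_cmd is hypothetical (it is not a PBS command), and it only measures wall-clock seconds around whatever command it is given, so the shell's built-in time would work just as well.

```shell
# time_cmd: run a command and report elapsed wall-clock seconds.
# Hypothetical helper for illustration; not part of PBS.
time_cmd() {
    tc_start=$(date +%s)
    tc_status=0
    "$@" || tc_status=$?          # keep the exit status even if the command fails
    tc_end=$(date +%s)
    echo "elapsed: $((tc_end - tc_start))s (exit $tc_status): $*"
    return "$tc_status"
}

# Intended use (quoting keeps the shell from ever treating [] as a glob):
#   time_cmd qdel '101[]'
#   time_cmd qdel '102[].pbs'
```

Comparing the two elapsed times should show whether the short form and the form with the server name behave differently.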

Thank you

agrawalravi90 · June 29, 2021, 4:57pm

No, this is not supposed to happen. I tried submitting a job array with 1M sub-jobs and still didn’t see the scheduler halt or time out:

[ragrawal@blrentperf04 ~]$ qsub -J 1-1000000 -- /bin/sleep 100
1000002[].blrentperf04
[ragrawal@blrentperf04 ~]$ qstat
Job id            Name             User              Time Use S Queue
----------------  ---------------- ----------------  -------- - -----
1000002[].blrent* STDIN            ragrawal                 0 B workq           
[ragrawal@blrentperf04 ~]$ qstat -t | wc -l
1000003
[ragrawal@blrentperf04 ~]$ qdel 1000002[]
[ragrawal@blrentperf04 ~]$ qsub -- /bin/sleep 100
1000003.blrentperf04
[ragrawal@blrentperf04 ~]$ qsub -- /bin/sleep 100
1000004.blrentperf04
[ragrawal@blrentperf04 ~]$ qsub -- /bin/sleep 100
1000005.blrentperf04
[ragrawal@blrentperf04 ~]$ qstat
Job id            Name             User              Time Use S Queue
----------------  ---------------- ----------------  -------- - -----
1000003.blrentpe* STDIN            ragrawal          00:00:00 R workq           
1000004.blrentpe* STDIN            ragrawal                 0 R workq           
1000005.blrentpe* STDIN            ragrawal                 0 R workq           
[ragrawal@blrentperf04 ~]$

The qdel did take a few minutes, but the scheduler didn’t time out.

What do you see happening in the sched logs during those 20 minutes? Do you see “Considering job to run” lines or something else? Also, do you see the server hanging as well? Does it respond to other requests while the qdel is in progress?
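
To summarise what the scheduler logged in that window, something like the following sketch may help. It assumes the usual semicolon-separated log layout (timestamp;event;daemon;type;id;message), that the window falls within a single day, and that the log file is under $PBS_HOME/sched_logs; the function name sched_window is hypothetical.

```shell
# sched_window: count log messages by message text inside a time window.
# usage: sched_window LOGFILE FROM TO    (FROM/TO as HH:MM:SS, same day)
# Hypothetical helper; assumes fields are separated by ';' and that the
# message is the sixth field (a message containing ';' is cut short).
sched_window() {
    awk -F';' -v from="$2" -v to="$3" '
        {
            split($1, stamp, " ")            # $1 is "MM/DD/YYYY HH:MM:SS"
            if (stamp[2] >= from && stamp[2] <= to)
                count[$6]++                  # tally identical message texts
        }
        END { for (msg in count) printf "%d\t%s\n", count[msg], msg }
    ' "$1" | sort -rn
}

# Intended use (path and file name are assumptions about the installation):
#   sched_window "$PBS_HOME/sched_logs/20210623" 18:05:00 18:26:00
```

If “Considering job to run” never shows up in the output, the scheduler was probably not evaluating jobs during that window.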

roy · June 30, 2021, 2:02am

Thank you for the details. Can you please tell me which version of PBS you are using?
We are using v19.3.1; perhaps it’s a known issue with that version?

We don’t get “Considering job to run”. Here is a snippet of our log file (I’ve obfuscated some details):

06/23/2021 18:05:39;0040;pbs_sched;Job;xxxxxx.pbs;Insufficient amount of resource: ncpus
06/23/2021 18:15:39;0040;pbs_sched;Sched;xxxxxxx.pbs;Failed to update attr ‘comment’ = Not Running: Insufficient amount of resource: ncpus : Premature end of message (15031)
06/23/2021 18:15:40;0040;pbs_sched;Job;xxxxxx.pbs;Insufficient amount of resource: ncpus
06/23/2021 18:25:40;0040;pbs_sched;Sched;xxxxxxx.pbs;Failed to update attr ‘comment’ = Not Running: Insufficient amount of resource: ncpus : Premature end of message (15031)
06/23/2021 18:25:43;0040;pbs_sched;Job;xxxxxxx.pbs;Insufficient amount of resource: ncpus
06/23/2021 18:25:43;0040;pbs_sched;Sched;toolong;Leaving the scheduling cycle: Cycle duration of 1229 seconds has exceeded sched_cycle_length of 1200 seconds
06/23/2021 18:30:55;0040;pbs_sched;Job;xxxxxx[424].pbs;Job preempted by suspension
06/23/2021 18:30:56;0040;pbs_sched;Job;xxxxxx.pbs;Job run
06/23/2021 18:31:03;0040;pbs_sched;Job;xxxxxx[299].pbs;Job preempted by suspension
06/23/2021 18:31:04;0040;pbs_sched;Job;xxxxxx[22].pbs;Job preempted by suspension
06/23/2021 18:31:04;0040;pbs_sched;Job;xxxxxx.pbs;Job run

Thanks,
Roy

agrawalravi90 · June 30, 2021, 7:11pm

I tested using the latest master code.

It seems like the scheduler was not able to send messages to the server and the requests timed out. Can you look into the server logs to see what the server was doing between 18:05:39 and 18:15:39?
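
A quick way to spot silent periods in the server log is sketched below. It assumes each log line starts with a "MM/DD/YYYY HH:MM:SS" timestamp followed by a semicolon, that all lines are from the same day, and that the file is under $PBS_HOME/server_logs; the function name log_gaps is hypothetical.

```shell
# log_gaps: report gaps longer than MIN_GAP seconds between consecutive
# timestamped log lines. Hypothetical helper; same-day timestamps assumed.
# usage: log_gaps LOGFILE [MIN_GAP]      (MIN_GAP defaults to 60)
log_gaps() {
    awk -F';' -v min="${2:-60}" '
        # Skip anything that does not start with a full timestamp field.
        $1 !~ /^[0-9]+\/[0-9]+\/[0-9]+ [0-9]+:[0-9]+:[0-9]+$/ { next }
        {
            split($1, stamp, " ")
            split(stamp[2], hms, ":")
            now = hms[1] * 3600 + hms[2] * 60 + hms[3]   # seconds since midnight
            if (seen && now - prev > min)
                printf "%ds gap: %s -> %s\n", now - prev, prev_time, stamp[2]
            prev = now; prev_time = stamp[2]; seen = 1
        }
    ' "$1"
}

# Intended use (path and file name are assumptions about the installation):
#   log_gaps "$PBS_HOME/server_logs/20210623" 300
```

A long gap that lines up with the scheduler's "Premature end of message" errors would suggest the server was busy or blocked rather than idle.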

roy · July 2, 2021, 12:03am

Nothing!

06/23/2021 18:05:19;0010;Server@pbs1;Job;xxxxxxx.pbs1;Exit_status=0 resources_used.cpupercent=97 resources_used.cput=01:00:16 resources_used.mem=647844kb resources_used.ncpus=1 resources_used.vmem=2507768kb resources_used.walltime=01:00:11
06/23/2021 18:25:06;0008;Server@pbs1;Job;xxxxxxx[3852].pbs1;Job sent signal SIGKILL on delete
06/23/2021 18:25:19;0008;Server@pbs1;Job;xxxxxxx[3894].pbs1;Job sent signal SIGKILL on delete
…
