Hi,
We have seen that executing qdel [].<pbs_server> (e.g. qdel 101[].pbs) on a job array
with 5,000 sub-jobs causes the scheduler to halt, and the scheduler cycle reaches its timeout after 20 minutes.
Is that expected, or is it a bug?
Thanks,
Roy
Could you please time the deletion of the 5K job array, as below:
qdel 101[]
qdel 102[].pbs
Thank you
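At an interactive shell, prefixing the command with `time` is usually enough for this. If the measurement needs to be scripted, a minimal Python sketch could look like the one below. The helper name `time_command` and the script name are hypothetical, and the sketch assumes only that `qdel` is on the PATH; it measures how long the client command takes to return, not how long the server spends deleting sub-jobs afterwards.

```python
import subprocess
import sys
import time


def time_command(argv):
    """Run argv as a subprocess and return (exit_code, elapsed_seconds)."""
    start = time.perf_counter()
    completed = subprocess.run(argv)
    elapsed = time.perf_counter() - start
    return completed.returncode, elapsed


if __name__ == "__main__" and len(sys.argv) > 1:
    # Hypothetical invocation: python time_cmd.py qdel "101[]"
    # (quote the job ID so the shell does not treat [] as a glob pattern)
    code, seconds = time_command(sys.argv[1:])
    print(f"{' '.join(sys.argv[1:])}: exit code {code}, {seconds:.1f} s")
```

Running it once with `qdel 101[]` and once with `qdel 102[].pbs` would give the two wall-clock figures asked for above.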
No, this is not supposed to happen. I tried submitting a job array with 1M sub-jobs and still didn’t see the scheduler halt or time out:
[ragrawal@blrentperf04 ~]$ qsub -J 1-1000000 -- /bin/sleep 100
1000002[].blrentperf04
[ragrawal@blrentperf04 ~]$ qstat
Job id Name User Time Use S Queue
---------------- ---------------- ---------------- -------- - -----
1000002[].blrent* STDIN ragrawal 0 B workq
[ragrawal@blrentperf04 ~]$ qstat -t | wc -l
1000003
[ragrawal@blrentperf04 ~]$ qdel 1000002[]
[ragrawal@blrentperf04 ~]$ qsub -- /bin/sleep 100
1000003.blrentperf04
[ragrawal@blrentperf04 ~]$ qsub -- /bin/sleep 100
1000004.blrentperf04
[ragrawal@blrentperf04 ~]$ qsub -- /bin/sleep 100
1000005.blrentperf04
[ragrawal@blrentperf04 ~]$ qstat
Job id Name User Time Use S Queue
---------------- ---------------- ---------------- -------- - -----
1000003.blrentpe* STDIN ragrawal 00:00:00 R workq
1000004.blrentpe* STDIN ragrawal 0 R workq
1000005.blrentpe* STDIN ragrawal 0 R workq
[ragrawal@blrentperf04 ~]$
The qdel did take a few minutes, but the scheduler didn’t time out.
What do you see happening in those 20 minutes in the sched logs? Do you see “Considering job to run” lines or something else? Also, do you see the server hanging as well? Does it respond to other requests while qdel is running?
Thank you for the details. Can you please tell me which version of PBS you are using?
We are using v19.3.1. Perhaps it’s a known issue with the version we are using?
We don’t get “Considering job to run”. Here is a snippet of our log file (I’ve obfuscated some details):
06/23/2021 18:05:39;0040;pbs_sched;Job;xxxxxx.pbs;Insufficient amount of resource: ncpus
06/23/2021 18:15:39;0040;pbs_sched;Sched;xxxxxxx.pbs;Failed to update attr ‘comment’ = Not Running: Insufficient amount of resource: ncpus : Premature end of message (15031)
06/23/2021 18:15:40;0040;pbs_sched;Job;xxxxxx.pbs;Insufficient amount of resource: ncpus
06/23/2021 18:25:40;0040;pbs_sched;Sched;xxxxxxx.pbs;Failed to update attr ‘comment’ = Not Running: Insufficient amount of resource: ncpus : Premature end of message (15031)
06/23/2021 18:25:43;0040;pbs_sched;Job;xxxxxxx.pbs;Insufficient amount of resource: ncpus
06/23/2021 18:25:43;0040;pbs_sched;Sched;toolong;Leaving the scheduling cycle: Cycle duration of 1229 seconds has exceeded sched_cycle_length of 1200 seconds
06/23/2021 18:30:55;0040;pbs_sched;Job;xxxxxx[424].pbs;Job preempted by suspension
06/23/2021 18:30:56;0040;pbs_sched;Job;xxxxxx.pbs;Job run
06/23/2021 18:31:03;0040;pbs_sched;Job;xxxxxx[299].pbs;Job preempted by suspension
06/23/2021 18:31:04;0040;pbs_sched;Job;xxxxxx[22].pbs;Job preempted by suspension
06/23/2021 18:31:04;0040;pbs_sched;Job;xxxxxx.pbs;Job run
Thanks,
Roy
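The stalls in the scheduler log excerpt above can be picked out mechanically by comparing consecutive timestamps. The following is a rough Python sketch, not a PBS tool: the function names and the 5-minute threshold are assumptions chosen for illustration, and the parser assumes only the semicolon-separated layout visible in the excerpt (timestamp first, message last).

```python
from datetime import datetime, timedelta


def parse_line(line):
    """Return (timestamp, message) for one log line, or None if it does not parse."""
    fields = line.rstrip("\n").split(";", 5)
    if len(fields) < 6:
        return None
    try:
        stamp = datetime.strptime(fields[0], "%m/%d/%Y %H:%M:%S")
    except ValueError:
        return None
    return stamp, fields[5]


def find_gaps(lines, threshold=timedelta(minutes=5)):
    """Return (previous, current, gap) for consecutive entries at least `threshold` apart."""
    gaps = []
    previous = None
    for line in lines:
        parsed = parse_line(line)
        if parsed is None:
            continue  # skip lines that are not log entries
        stamp, _ = parsed
        if previous is not None and stamp - previous >= threshold:
            gaps.append((previous, stamp, stamp - previous))
        previous = stamp
    return gaps


# Lines copied from the obfuscated excerpt in the post above.
SAMPLE = [
    "06/23/2021 18:05:39;0040;pbs_sched;Job;xxxxxx.pbs;Insufficient amount of resource: ncpus",
    "06/23/2021 18:15:39;0040;pbs_sched;Sched;xxxxxxx.pbs;Failed to update attr ‘comment’ = Not Running: Insufficient amount of resource: ncpus : Premature end of message (15031)",
    "06/23/2021 18:15:40;0040;pbs_sched;Job;xxxxxx.pbs;Insufficient amount of resource: ncpus",
    "06/23/2021 18:25:40;0040;pbs_sched;Sched;xxxxxxx.pbs;Failed to update attr ‘comment’ = Not Running: Insufficient amount of resource: ncpus : Premature end of message (15031)",
    "06/23/2021 18:25:43;0040;pbs_sched;Job;xxxxxxx.pbs;Insufficient amount of resource: ncpus",
    "06/23/2021 18:25:43;0040;pbs_sched;Sched;toolong;Leaving the scheduling cycle: Cycle duration of 1229 seconds has exceeded sched_cycle_length of 1200 seconds",
]

if __name__ == "__main__":
    for before, after, gap in find_gaps(SAMPLE):
        print(f"{before:%H:%M:%S} -> {after:%H:%M:%S}: silent for {gap}")
```

On these six lines the sketch reports two gaps of exactly ten minutes, each ending in a "Failed to update attr" entry, which together account for most of the 1229-second cycle. The same function can be pointed at the server log for the same time window.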
I tested using the latest master code.
It seems like the scheduler was not able to send messages to the server and the request timed out. Can you look into the server logs to see what it was doing between 18:05:39 and 18:15:39?
Nothing!
06/23/2021 18:05:19;0010;Server@pbs1;Job;xxxxxxx.pbs1;Exit_status=0 resources_used.cpupercent=97 resources_used.cput=01:00:16 resources_used.mem=647844kb resources_used.ncpus=1 resources_used.vmem=2507768kb resources_used.walltime=01:00:11
06/23/2021 18:25:06;0008;Server@pbs1;Job;xxxxxxx[3852].pbs1;Job sent signal SIGKILL on delete
06/23/2021 18:25:19;0008;Server@pbs1;Job;xxxxxxx[3894].pbs1;Job sent signal SIGKILL on delete
…