Hi Agra,
below is the output:
…
comment = Job run at Fri Aug 06 at 11:16 on (node16:ncpus=1:mem=10240kb)
comment = Job run at Fri Aug 06 at 11:16 on (node16:ncpus=1:mem=10240kb)
comment = Job run at Fri Aug 06 at 11:16 on (node16:ncpus=1:mem=10240kb)
comment = Job run at Fri Aug 06 at 11:16 on (node16:ncpus=1:mem=10240kb)
comment = Job run at Fri Aug 06 at 11:16 on (node16:ncpus=1:mem=10240kb)
comment = Job run at Fri Aug 06 at 11:16 on (node16:ncpus=1:mem=10240kb)
comment = Job Array Began at Fri Aug 06 at 11:16
comment = Job Array Began at Fri Aug 06 at 11:16
comment = Job Array Began at Fri Aug 06 at 11:16
comment = Job Array Began at Fri Aug 06 at 11:16
comment = Job Array Began at Fri Aug 06 at 11:16
comment = Job Array Began at Fri Aug 06 at 11:16
comment = Job Array Began at Fri Aug 06 at 11:16
comment = Job Array Began at Fri Aug 06 at 11:16
comment = Job Array Began at Fri Aug 06 at 11:16
…
That looks like either all the jobs ran, or the scheduler didn't see some of the jobs at all. Can you try triggering another scheduling cycle (qmgr -c 's s scheduling=t') and check the scheduler logs to see what happens to the jobs that didn't run? The scheduler logs are in $PBS_HOME/sched_logs/.
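If it helps, here is a rough Python sketch for pulling one job's records out of a scheduler log. It assumes the log files under $PBS_HOME/sched_logs/ are named by date (YYYYMMDD) and that each record has six semicolon-separated fields with the job ID in the fifth, as in the scheduler log lines posted in this thread. The helper names `sched_log_path` and `lines_for_job` are made up for illustration and are not part of PBS.

```python
import datetime
import os


def sched_log_path(pbs_home, day):
    """Build the assumed scheduler log path for one day."""
    # Assumption: one log file per day, named YYYYMMDD.
    return os.path.join(pbs_home, "sched_logs", day.strftime("%Y%m%d"))


def lines_for_job(lines, job_id):
    """Keep only the records whose job-ID field equals job_id exactly."""
    matches = []
    for line in lines:
        text = line.rstrip("\n")
        # Assumed layout: date;event;daemon;type;id;message
        fields = text.split(";", 5)
        if len(fields) == 6 and fields[4] == job_id:
            matches.append(text)
    return matches


if __name__ == "__main__":
    # Reads today's log if PBS_HOME is set and the file exists.
    pbs_home = os.environ.get("PBS_HOME")
    if pbs_home:
        path = sched_log_path(pbs_home, datetime.date.today())
        if os.path.exists(path):
            with open(path) as handle:
                for record in lines_for_job(handle, "4057[].nodemgr01"):
                    print(record)
```

Matching on the array ID (4057[].nodemgr01) shows what the scheduler decided for the array as a whole; pass a subjob ID such as 4057[180].nodemgr01 to look at a single subjob.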
Qmgr: p s
#
# Create resources and set their properties.
#
#
# Create and define resource ngpus
#
create resource ngpus
set resource ngpus type = long
set resource ngpus flag = hn
#
# Create and define resource gpu_id
#
create resource gpu_id
set resource gpu_id type = string
set resource gpu_id flag = h
#
# Create queues and set their attributes.
#
#
# Create and define queue workq
#
create queue workq
set queue workq queue_type = Execution
set queue workq max_user_run = 1000
set queue workq enabled = True
set queue workq started = True
#
# Create and define queue cpu
#
create queue cpu
set queue cpu queue_type = Execution
set queue cpu enabled = True
set queue cpu started = True
#
# Create and define queue gpu
#
create queue gpu
set queue gpu queue_type = Execution
set queue gpu enabled = True
set queue gpu started = True
#
# Create and define queue testq
#
create queue testq
set queue testq queue_type = Execution
set queue testq enabled = True
set queue testq started = True
#
# Set server attributes.
#
set server scheduling = True
set server managers = root@node01
set server managers += root@*
set server default_queue = workq
set server log_events = 511
set server mail_from = adm
set server query_other_jobs = True
set server resources_default.ncpus = 1
set server resources_default.place = pack
set server default_chunk.ncpus = 1
set server scheduler_iteration = 600
set server flatuid = True
set server resv_enable = True
set server node_fail_requeue = 310
set server max_array_size = 500000
set server default_qsub_arguments = -V
set server pbs_license_min = 0
set server pbs_license_max = 2147483647
set server pbs_license_linger_time = 31536000
set server eligible_time_enable = False
set server job_history_enable = True
set server max_concurrent_provision = 5
set server max_job_sequence_id = 9999999
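To pick individual settings out of a dump like the one above (for example resources_default.place on the server, or max_user_run on workq), a small parser can help. This is only a sketch: it assumes every setting is on one line of the form "set <object> <attribute> = <value>", and `parse_qmgr_settings` is a hypothetical helper name, not a PBS tool.

```python
def parse_qmgr_settings(lines):
    """Map (object, attribute) -> value for simple 'set ... = ...' lines."""
    settings = {}
    for line in lines:
        line = line.strip()
        # Comment lines, 'create' lines and '+=' appends are skipped in this sketch.
        if not line.startswith("set ") or " = " not in line:
            continue
        left, value = line[4:].split(" = ", 1)
        parts = left.split()
        if len(parts) < 2:
            continue
        target = " ".join(parts[:-1])   # e.g. "server" or "queue workq"
        attribute = parts[-1]           # e.g. "resources_default.place"
        settings[(target, attribute)] = value
    return settings


if __name__ == "__main__":
    dump = [
        "set queue workq max_user_run = 1000",
        "set server resources_default.place = pack",
        "set server scheduler_iteration = 600",
    ]
    parsed = parse_qmgr_settings(dump)
    print(parsed[("server", "resources_default.place")])
    print(parsed[("queue workq", "max_user_run")])
```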
Job run at Sun Aug 08 at 18:54 on (node15:ncpus=1:mem=10240kb)
4057[171].nodemgr01 testhpc workq Test -- 1 1 10mb -- R -- node15/2
Job run at Sun Aug 08 at 18:54 on (node15:ncpus=1:mem=10240kb)
4057[172].nodemgr01 testhpc workq Test -- 1 1 10mb -- R -- node15/3
Job was sent for execution at Sun Aug 08 at 18:54 on (node15:ncpus=1:mem=10240kb)
4057[173].nodemgr01 testhpc workq Test -- 1 1 10mb -- R -- node15/4
Job was sent for execution at Sun Aug 08 at 18:54 on (node15:ncpus=1:mem=10240kb)
4057[174].nodemgr01 testhpc workq Test -- 1 1 10mb -- R -- node15/5
Job was sent for execution at Sun Aug 08 at 18:54 on (node15:ncpus=1:mem=10240kb)
4057[175].nodemgr01 testhpc workq Test -- 1 1 10mb -- R -- node15/6
Job was sent for execution at Sun Aug 08 at 18:54 on (node15:ncpus=1:mem=10240kb)
4057[176].nodemgr01 testhpc workq Test -- 1 1 10mb -- R -- node15/7
Job was sent for execution at Sun Aug 08 at 18:54 on (node15:ncpus=1:mem=10240kb)
4057[177].nodemgr01 testhpc workq Test -- 1 1 10mb -- R -- node15/8
Job was sent for execution at Sun Aug 08 at 18:54 on (node15:ncpus=1:mem=10240kb)
4057[178].nodemgr01 testhpc workq Test -- 1 1 10mb -- R -- node15/9
Job was sent for execution at Sun Aug 08 at 18:54 on (node15:ncpus=1:mem=10240kb)
4057[179].nodemgr01 testhpc workq Test -- 1 1 10mb -- R -- node15/10
Job was sent for execution at Sun Aug 08 at 18:54 on (node15:ncpus=1:mem=10240kb)
4057[180].nodemgr01 testhpc workq Test -- 1 1 10mb -- Q -- --
Job Array Began at Sun Aug 08 at 18:54
4057[181].nodemgr01 testhpc workq Test -- 1 1 10mb -- Q -- --
Job Array Began at Sun Aug 08 at 18:54
4057[182].nodemgr01 testhpc workq Test -- 1 1 10mb -- Q -- --
Job Array Began at Sun Aug 08 at 18:54
4057[183].nodemgr01 testhpc workq Test -- 1 1 10mb -- Q -- --
Job Array Began at Sun Aug 08 at 18:54
4057[184].nodemgr01 testhpc workq Test -- 1 1 10mb -- Q -- --
Job Array Began at Sun Aug 08 at 18:54
4057[185].nodemgr01 testhpc workq Test -- 1 1 10mb -- Q -- --
Job Array Began at Sun Aug 08 at 18:54
4057[186].nodemgr01 testhpc workq Test -- 1 1 10mb -- Q -- --
Job Array Began at Sun Aug 08 at 18:54
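With hundreds of subjobs it is easier to count states than to read the listing. The sketch below tallies the state column from a listing shaped like the one above. It assumes the state is the tenth whitespace-separated field on each job line, that job names contain no spaces, and that comment lines do not start with a job ID. `state_counts` is a hypothetical helper name.

```python
import re
from collections import Counter

# A job line starts with an ID such as 4057[180].nodemgr01 or 4057.nodemgr01.
JOB_LINE_RE = re.compile(r"^\d+(\[\d*\])?\.\S+\s")


def state_counts(lines, state_col=9):
    """Count job states (R, Q, ...) in a qstat-style listing."""
    counts = Counter()
    for line in lines:
        if not JOB_LINE_RE.match(line):
            continue  # comment line or header, not a job line
        fields = line.split()
        if len(fields) > state_col:
            counts[fields[state_col]] += 1
    return counts


if __name__ == "__main__":
    listing = [
        "4057[179].nodemgr01 testhpc workq Test -- 1 1 10mb -- R -- node15/10",
        "Job was sent for execution at Sun Aug 08 at 18:54 on (node15:ncpus=1:mem=10240kb)",
        "4057[180].nodemgr01 testhpc workq Test -- 1 1 10mb -- Q -- --",
        "Job Array Began at Sun Aug 08 at 18:54",
    ]
    for state, count in sorted(state_counts(listing).items()):
        print(state, count)
```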
08/08/2021 18:54:29;0040;pbs_sched;Job;4057[252].nodemgr01;Job run
08/08/2021 18:54:29;0080;pbs_sched;Req;;Leaving Scheduling Cycle
08/08/2021 18:54:29;0080;pbs_sched;Req;;Starting Scheduling Cycle
08/08/2021 18:54:29;0004;pbs_sched;Fil;holidays;The holiday file is out of date; please update it.
08/08/2021 18:54:29;0080;pbs_sched;Job;4057[].nodemgr01;Considering job to run
08/08/2021 18:54:29;0040;pbs_sched;Job;4057[225].nodemgr01;Job run
08/08/2021 18:54:29;0080;pbs_sched;Job;4057[].nodemgr01;Considering job to run
08/08/2021 18:54:29;0040;pbs_sched;Job;4057[226].nodemgr01;Job run
08/08/2021 18:54:29;0080;pbs_sched;Job;4057[].nodemgr01;Considering job to run
08/08/2021 18:54:29;0040;pbs_sched;Job;4057[227].nodemgr01;Job run
08/08/2021 18:54:29;0080;pbs_sched;Job;4057[].nodemgr01;Considering job to run
08/08/2021 18:54:29;0040;pbs_sched;Job;4057[228].nodemgr01;Job run
08/08/2021 18:54:29;0080;pbs_sched;Job;4057[].nodemgr01;Considering job to run
08/08/2021 18:54:29;0040;pbs_sched;Job;4057[229].nodemgr01;Job run
08/08/2021 18:54:29;0080;pbs_sched;Job;4057[].nodemgr01;Considering job to run
08/08/2021 18:54:29;0040;pbs_sched;Job;4057[230].nodemgr01;Job run
08/08/2021 18:54:29;0080;pbs_sched;Job;4057[].nodemgr01;Considering job to run
08/08/2021 18:54:29;0040;pbs_sched;Job;4057[231].nodemgr01;Job run
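To see which subjobs the scheduler never started, the "Job run" records in a log like the one above can be compared against the expected index range. A minimal sketch, assuming the same six-field semicolon layout as the lines above and subjob IDs of the form 4057[225].nodemgr01. `run_indices` and `missing_indices` are hypothetical names, and the range passed in is whatever the array was submitted with.

```python
import re

# Captures the array sequence number and the subjob index.
SUBJOB_RE = re.compile(r"^(\d+)\[(\d+)\]\.")


def run_indices(lines, array_seq):
    """Subjob indices of array `array_seq` that have a 'Job run' record."""
    seen = set()
    for line in lines:
        fields = line.strip().split(";", 5)
        if len(fields) != 6 or fields[5] != "Job run":
            continue
        match = SUBJOB_RE.match(fields[4])
        if match and match.group(1) == array_seq:
            seen.add(int(match.group(2)))
    return seen


def missing_indices(lines, array_seq, first, last):
    """Indices in [first, last] with no 'Job run' record in `lines`."""
    expected = set(range(first, last + 1))
    return sorted(expected - run_indices(lines, array_seq))


if __name__ == "__main__":
    log = [
        "08/08/2021 18:54:29;0080;pbs_sched;Job;4057[].nodemgr01;Considering job to run",
        "08/08/2021 18:54:29;0040;pbs_sched;Job;4057[225].nodemgr01;Job run",
        "08/08/2021 18:54:29;0080;pbs_sched;Job;4057[].nodemgr01;Considering job to run",
        "08/08/2021 18:54:29;0040;pbs_sched;Job;4057[226].nodemgr01;Job run",
    ]
    print(missing_indices(log, "4057", 225, 228))
```

A subjob that shows up as missing here has no "Job run" record in the lines that were scanned; the scheduler's reason, if it logged one, would be in the records for that subjob or for the array ID.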
Thank you for sharing these details @jxdn, much appreciated.
Please check this node: Mom = nodegraph01.hpcc.local. It is mapped to a queue called cpu. The admin might have mapped nodes to queues, so the queue you submitted to might not have enough nodes assigned to it.
You submitted your jobs to workq, so they cannot run on the nodes that are assigned to the cpu queue. Check this command and find out the list of nodes
That's correct, the jobs go to workq, and workq has more than 600 slots. But not all of the 600 array subjobs were executed: 100+ are still queued even though the cluster is empty (no jobs running).
Please unset these attributes:
qmgr: unset server resources_default.place # or else request in your job -l place=free
qmgr: unset queue workq max_user_run