Some jobs stay queued for extended periods of time despite availability of hosts

Hi, We’re using the 23.06 release of OpenPBS and facing an issue where most jobs run on-time successfully but a small number of jobs stay queued for extended periods of time despite nodes being available.

Investigation:
Lets take PBS Job ID 1094 as an example. Here, the job gets submitted at 02/19/2025 21:00:08 and doesn’t actually run until 02/19/2025 21:53:24 despite nodes being available. This job was submitted to the custom “medium” queue.

The logs for this look like:


Line    3:   Line 48308: 02/19/2025 21:00:08;0080;pbs_sched;Job;1094.ip-10-0-6-20;Considering job to run
Line    4:   Line 48309: 02/19/2025 21:00:08;0040;pbs_sched;Job;1094.ip-10-0-6-20;Not enough total nodes available
Line    5:   Line 48310: 02/19/2025 21:00:08;0040;pbs_sched;Job;1094.ip-10-0-6-20;Job will never run with the resources currently configured in the complex
...
...
...
...
Line 2986:   Line 71548: 02/19/2025 21:53:19;0040;pbs_sched;Job;1094.ip-10-0-6-20;Not enough total nodes available
Line 2987:   Line 71549: 02/19/2025 21:53:19;0040;pbs_sched;Job;1094.ip-10-0-6-20;Job will never run with the resources currently configured in the complex
Line 2988:   Line 71575: 02/19/2025 21:53:24;0080;pbs_sched;Job;1094.ip-10-0-6-20;Considering job to run
Line 2989:   Line 71576: 02/19/2025 21:53:24;0400;pbs_sched;Node;1094.ip-10-0-6-20;Evaluating subchunk: ncpus=16:compute_node=medium-ondemand-gfefa6bd1
Line 2990:   Line 71578: 02/19/2025 21:53:24;0400;pbs_sched;Node;1094.ip-10-0-6-20;Allocated one subchunk: ncpus=16:compute_node=medium-ondemand-gfefa6bd1
Line 2991:   Line 71579: 02/19/2025 21:53:24;0040;pbs_sched;Job;1094.ip-10-0-6-20;Job run
Line 2992:   Line 71799: 02/19/2025 21:54:04;0400;pbs_sched;Job;1094.ip-10-0-6-20;Simulation: job end point [Mon Feb 18 21:53:25 2030]

In the PBS server logs, I see these messages an hour before the job submission:


Line 33120: 02/19/2025 20:01:47;0400;pbs_sched;Node;;Marking nodes with signature compute_node=medium-ondemand-gfefa6bd1:ncpus=16 ineligible
Line 33132: 02/19/2025 20:01:47;0400;pbs_sched;Node;;Marking nodes with signature compute_node=medium-ondemand-gfefa6bd1:ncpus=16 ineligible
Line 33144: 02/19/2025 20:01:47;0400;pbs_sched;Node;;Marking nodes with signature compute_node=medium-ondemand-gfefa6bd1:ncpus=16 ineligible
Line 33156: 02/19/2025 20:01:47;0400;pbs_sched;Node;;Marking nodes with signature compute_node=medium-ondemand-gfefa6bd1:ncpus=16 ineligible

The nodes in this compute stack remain in use with jobs getting executed on them until there are large gaps where see lots of these “Create node bucket” messages for long periods of time with no chunks getting allocated e.g:

Line 45459: 02/19/2025 20:47:55;0400;pbs_sched;Node;create_node_buckets;Created node bucket compute_node=medium-ondemand-gfefa6bd1:ncpus=16:accelerator=False:queue=medium

You can see an example of this visually in the screenshot below. The highlighted lines are where “Allocated one subchunk” lines were present indicating node usage for job execution, but then from the bar on the right side you’ll see an extended period of time where we just got the “Created node bucket messages.”

The PBS job ID 1094 was submitted during this period 21:00:08 and remained queued.


Line 48265: 02/19/2025 20:59:55;0400;pbs_sched;Node;create_node_buckets;Created node bucket compute_node=medium-ondemand-gfefa6bd1:ncpus=16:accelerator=False:queue=medium
Line 48277: 02/19/2025 21:00:00;0400;pbs_sched;Node;create_node_buckets;Created node bucket compute_node=medium-ondemand-gfefa6bd1:ncpus=16:accelerator=False:queue=medium
Line 48289: 02/19/2025 21:00:05;0400;pbs_sched;Node;create_node_buckets;Created node bucket compute_node=medium-ondemand-gfefa6bd1:ncpus=16:accelerator=False:queue=medium
Line 48301: 02/19/2025 21:00:08;0400;pbs_sched;Node;create_node_buckets;Created node bucket compute_node=medium-ondemand-gfefa6bd1:ncpus=16:accelerator=False:queue=medium
Line 48316: 02/19/2025 21:00:09;0400;pbs_sched;Node;create_node_buckets;Created node bucket compute_node=medium-ondemand-gfefa6bd1:ncpus=16:accelerator=False:queue=medium

This lasts until 21:53:24 when the node becomes available again somehow and Job 1094 gets allocated a chunk on it.


Line 71229: 02/19/2025 21:52:50;0400;pbs_sched;Node;create_node_buckets;Created node bucket compute_node=medium-ondemand-gfefa6bd1:ncpus=16:accelerator=False:queue=medium
Line 71259: 02/19/2025 21:52:50;0400;pbs_sched;Node;create_node_buckets;Created node bucket compute_node=medium-ondemand-gfefa6bd1:ncpus=16:accelerator=False:queue=medium
Line 71292: 02/19/2025 21:52:50;0400;pbs_sched;Node;create_node_buckets;Created node bucket compute_node=medium-ondemand-gfefa6bd1:ncpus=16:accelerator=False:queue=medium
Line 71328: 02/19/2025 21:52:55;0400;pbs_sched;Node;create_node_buckets;Created node bucket compute_node=medium-ondemand-gfefa6bd1:ncpus=16:accelerator=False:queue=medium
Line 71372: 02/19/2025 21:53:00;0400;pbs_sched;Node;create_node_buckets;Created node bucket compute_node=medium-ondemand-gfefa6bd1:ncpus=16:accelerator=False:queue=medium
Line 71396: 02/19/2025 21:53:05;0400;pbs_sched;Node;create_node_buckets;Created node bucket compute_node=medium-ondemand-gfefa6bd1:ncpus=16:accelerator=False:queue=medium
Line 71420: 02/19/2025 21:53:10;0400;pbs_sched;Node;create_node_buckets;Created node bucket compute_node=medium-ondemand-gfefa6bd1:ncpus=16:accelerator=False:queue=medium
Line 71444: 02/19/2025 21:53:14;0400;pbs_sched;Node;create_node_buckets;Created node bucket compute_node=medium-ondemand-gfefa6bd1:ncpus=16:accelerator=False:queue=medium
Line 71468: 02/19/2025 21:53:14;0400;pbs_sched;Node;create_node_buckets;Created node bucket compute_node=medium-ondemand-gfefa6bd1:ncpus=16:accelerator=False:queue=medium
Line 71492: 02/19/2025 21:53:14;0400;pbs_sched;Node;create_node_buckets;Created node bucket compute_node=medium-ondemand-gfefa6bd1:ncpus=16:accelerator=False:queue=medium
Line 71516: 02/19/2025 21:53:14;0400;pbs_sched;Node;create_node_buckets;Created node bucket compute_node=medium-ondemand-gfefa6bd1:ncpus=16:accelerator=False:queue=medium
Line 71540: 02/19/2025 21:53:19;0400;pbs_sched;Node;create_node_buckets;Created node bucket compute_node=medium-ondemand-gfefa6bd1:ncpus=16:accelerator=False:queue=medium
Line 71564: 02/19/2025 21:53:24;0400;pbs_sched;Node;create_node_buckets;Created node bucket compute_node=medium-ondemand-gfefa6bd1:ncpus=16:accelerator=False:queue=medium
Line 71566: 02/19/2025 21:53:24;0400;pbs_sched;Node;1036.ip-10-0-6-20;Evaluating subchunk: ncpus=16:compute_node=medium-ondemand-gfefa6bd1
Line 71568: 02/19/2025 21:53:24;0400;pbs_sched;Node;1036.ip-10-0-6-20;Allocated one subchunk: ncpus=16:compute_node=medium-ondemand-gfefa6bd1
Line 71571: 02/19/2025 21:53:24;0400;pbs_sched;Node;1037.ip-10-0-6-20;Evaluating subchunk: ncpus=16:compute_node=medium-ondemand-gfefa6bd1
Line 71573: 02/19/2025 21:53:24;0400;pbs_sched;Node;1037.ip-10-0-6-20;Allocated one subchunk: ncpus=16:compute_node=medium-ondemand-gfefa6bd1
Line 71576: 02/19/2025 21:53:24;0400;pbs_sched;Node;1094.ip-10-0-6-20;Evaluating subchunk: ncpus=16:compute_node=medium-ondemand-gfefa6bd1
Line 71578: 02/19/2025 21:53:24;0400;pbs_sched;Node;1094.ip-10-0-6-20;Allocated one subchunk: ncpus=16:compute_node=medium-ondemand-gfefa6bd1

Some more detail of the job’s lifecycle from start to end:


Line  638789: 02/19/2025 21:00:08;0800;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;ET_LIM_DBG: check_entity_ct_limit_max: entered for medium
Line  638790: 02/19/2025 21:00:08;0800;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;ET_LIM_DBG: check_entity_ct_limit_max: exiting, ret 0 [max_queued limit not set for medium]
Line  638791: 02/19/2025 21:00:08;0800;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;ET_LIM_DBG: check_entity_ct_limit_queued: entered for medium
Line  638792: 02/19/2025 21:00:08;0800;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;ET_LIM_DBG: check_entity_ct_limit_queued: exiting, ret 0 [queued_jobs_threshold limit not set for medium]
Line  638793: 02/19/2025 21:00:08;0800;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;ET_LIM_DBG: check_entity_ct_limit_max: entered for server
Line  638794: 02/19/2025 21:00:08;0800;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;ET_LIM_DBG: check_entity_ct_limit_max: exiting, ret 0 [max_queued limit not set for server]
Line  638795: 02/19/2025 21:00:08;0800;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;ET_LIM_DBG: check_entity_ct_limit_queued: entered for server
Line  638796: 02/19/2025 21:00:08;0800;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;ET_LIM_DBG: check_entity_ct_limit_queued: exiting, ret 0 [queued_jobs_threshold limit not set for server]
Line  638797: 02/19/2025 21:00:08;0800;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;ET_LIM_DBG: check_entity_resc_limit_max: entered for medium, alt_res (nil)
Line  638798: 02/19/2025 21:00:08;0800;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;ET_LIM_DBG: check_entity_resc_limit_max: exiting, ret 0 [max_queued_res limit not set for medium]
Line  638799: 02/19/2025 21:00:08;0800;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;ET_LIM_DBG: check_entity_resc_limit_queued: entered for medium, alt_res (nil)
Line  638800: 02/19/2025 21:00:08;0800;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;ET_LIM_DBG: check_entity_resc_limit_queued: exiting, ret 0 [queued_jobs_threshold_res limit not set for medium]
Line  638801: 02/19/2025 21:00:08;0800;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;ET_LIM_DBG: check_entity_resc_limit_max: entered for server, alt_res (nil)
Line  638802: 02/19/2025 21:00:08;0800;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;ET_LIM_DBG: check_entity_resc_limit_max: exiting, ret 0 [max_queued_res limit not set for server]
Line  638803: 02/19/2025 21:00:08;0800;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;ET_LIM_DBG: check_entity_resc_limit_queued: entered for server, alt_res (nil)
Line  638804: 02/19/2025 21:00:08;0800;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;ET_LIM_DBG: check_entity_resc_limit_queued: exiting, ret 0 [queued_jobs_threshold_res limit not set for server]
Line  638807: 02/19/2025 21:00:08;0800;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;ET_LIM_DBG: account_entity_limit_usages: entered, INCR on server ip-10-0-6-20, op_flag f, alt_res_ptr (nil)
Line  638808: 02/19/2025 21:00:08;0800;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;ET_LIM_DBG: set_entity_ct_sum_max: exiting, ret 0 [max_queued limit not set for server]
Line  638809: 02/19/2025 21:00:08;0800;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;ET_LIM_DBG: set_entity_ct_sum_queued: exiting, ret 0 [queued_jobs_threshold limit not set for server]
Line  638810: 02/19/2025 21:00:08;0800;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;ET_LIM_DBG: set_entity_resc_sum_max: entered [alt_res (nil)]
Line  638811: 02/19/2025 21:00:08;0800;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;ET_LIM_DBG: set_entity_resc_sum_max: exiting, ret 0 [max_queued_res limit not set for server]
Line  638812: 02/19/2025 21:00:08;0800;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;ET_LIM_DBG: set_entity_resc_sum_queued: entered [alt_res (nil)]
Line  638813: 02/19/2025 21:00:08;0800;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;ET_LIM_DBG: set_entity_resc_sum_queued: exiting, ret 0 [queued_jobs_threshold_res limit not set for server]
Line  638814: 02/19/2025 21:00:08;0800;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;ET_LIM_DBG: account_entity_limit_usages: exiting, ret_error 0
Line  638815: 02/19/2025 21:00:08;0400;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;Updated job state to 81 and substate to 10
Line  638816: 02/19/2025 21:00:08;0100;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;enqueuing into medium, state Q hop 1
Line  638817: 02/19/2025 21:00:08;0800;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;ET_LIM_DBG: account_entity_limit_usages: entered, INCR on queue medium, op_flag f, alt_res_ptr (nil)
Line  638818: 02/19/2025 21:00:08;0800;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;ET_LIM_DBG: set_entity_ct_sum_max: exiting, ret 0 [max_queued limit not set for medium]
Line  638819: 02/19/2025 21:00:08;0800;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;ET_LIM_DBG: set_entity_ct_sum_queued: exiting, ret 0 [queued_jobs_threshold limit not set for medium]
Line  638820: 02/19/2025 21:00:08;0800;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;ET_LIM_DBG: set_entity_resc_sum_max: entered [alt_res (nil)]
Line  638821: 02/19/2025 21:00:08;0800;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;ET_LIM_DBG: set_entity_resc_sum_max: exiting, ret 0 [max_queued_res limit not set for medium]
Line  638822: 02/19/2025 21:00:08;0800;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;ET_LIM_DBG: set_entity_resc_sum_queued: entered [alt_res (nil)]
Line  638823: 02/19/2025 21:00:08;0800;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;ET_LIM_DBG: set_entity_resc_sum_queued: exiting, ret 0 [queued_jobs_threshold_res limit not set for medium]
Line  638824: 02/19/2025 21:00:08;0800;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;ET_LIM_DBG: account_entity_limit_usages: exiting, ret_error 0
Line  638825: 02/19/2025 21:00:08;0008;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;Job Queued at request of jobsubmit@ip-10-0-6-20.us-west-2.compute.internal, owner = jobsubmit@ip-10-0-6-20.us-west-2.compute.internal, job name = 5af17c33-ddfd-4c14-8fd6-fef115006a15, queue = medium
Line  638854: 02/19/2025 21:00:08;0008;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;Job Modified at request of Scheduler@ip-10-0-6-20.us-west-2.compute.internal
Line  639138: 02/19/2025 21:00:11;0800;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;ET_LIM_DBG: account_entity_limit_usages: entered, INCR on server ip-10-0-6-20, op_flag e, alt_res_ptr 0x1619a860
Line  639139: 02/19/2025 21:00:11;0800;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;ET_LIM_DBG: set_entity_resc_sum_max: entered [alt_res 0x1619a860]
Line  639140: 02/19/2025 21:00:11;0800;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;ET_LIM_DBG: set_entity_resc_sum_max: exiting, ret 0 [max_queued_res limit not set for server]
Line  639141: 02/19/2025 21:00:11;0800;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;ET_LIM_DBG: set_entity_resc_sum_queued: entered [alt_res 0x1619a860]
Line  639142: 02/19/2025 21:00:11;0800;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;ET_LIM_DBG: set_entity_resc_sum_queued: exiting, ret 0 [queued_jobs_threshold_res limit not set for server]
Line  639143: 02/19/2025 21:00:11;0800;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;ET_LIM_DBG: account_entity_limit_usages: exiting, ret_error 0
Line  639144: 02/19/2025 21:00:11;0800;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;ET_LIM_DBG: account_entity_limit_usages: entered, INCR on queue medium, op_flag e, alt_res_ptr 0x1619a860
Line  639145: 02/19/2025 21:00:11;0800;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;ET_LIM_DBG: set_entity_resc_sum_max: entered [alt_res 0x1619a860]
Line  639146: 02/19/2025 21:00:11;0800;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;ET_LIM_DBG: set_entity_resc_sum_max: exiting, ret 0 [max_queued_res limit not set for medium]
Line  639147: 02/19/2025 21:00:11;0800;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;ET_LIM_DBG: set_entity_resc_sum_queued: entered [alt_res 0x1619a860]
Line  639148: 02/19/2025 21:00:11;0800;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;ET_LIM_DBG: set_entity_resc_sum_queued: exiting, ret 0 [queued_jobs_threshold_res limit not set for medium]
Line  639149: 02/19/2025 21:00:11;0800;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;ET_LIM_DBG: account_entity_limit_usages: exiting, ret_error 0
Line  639150: 02/19/2025 21:00:11;0008;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;Job Modified at request of root@ip-10-0-6-20.us-west-2.compute.internal
Line  770201: 02/19/2025 21:53:24;0008;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;Job Run at request of Scheduler@ip-10-0-6-20.us-west-2.compute.internal on exec_vnode (ip-10-0-5-174:ncpus=16)
Line  770202: 02/19/2025 21:53:24;0400;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;Updated job state to 82 and substate to 41
Line  770203: 02/19/2025 21:53:24;0800;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;ET_LIM_DBG: account_entity_limit_usages: entered, DECR on server ip-10-0-6-20, op_flag 7, alt_res_ptr (nil)
Line  770204: 02/19/2025 21:53:24;0800;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;ET_LIM_DBG: set_entity_ct_sum_queued: exiting, ret 0 [queued_jobs_threshold limit not set for server]
Line  770205: 02/19/2025 21:53:24;0800;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;ET_LIM_DBG: set_entity_resc_sum_queued: entered [alt_res (nil)]
Line  770206: 02/19/2025 21:53:24;0800;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;ET_LIM_DBG: set_entity_resc_sum_queued: exiting, ret 0 [queued_jobs_threshold_res limit not set for server]
Line  770207: 02/19/2025 21:53:24;0800;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;ET_LIM_DBG: account_entity_limit_usages: exiting, ret_error 0
Line  770208: 02/19/2025 21:53:24;0800;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;ET_LIM_DBG: account_entity_limit_usages: entered, DECR on queue medium, op_flag 7, alt_res_ptr (nil)
Line  770209: 02/19/2025 21:53:24;0800;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;ET_LIM_DBG: set_entity_ct_sum_queued: exiting, ret 0 [queued_jobs_threshold limit not set for medium]
Line  770210: 02/19/2025 21:53:24;0800;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;ET_LIM_DBG: set_entity_resc_sum_queued: entered [alt_res (nil)]
Line  770211: 02/19/2025 21:53:24;0800;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;ET_LIM_DBG: set_entity_resc_sum_queued: exiting, ret 0 [queued_jobs_threshold_res limit not set for medium]
Line  770212: 02/19/2025 21:53:24;0800;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;ET_LIM_DBG: account_entity_limit_usages: exiting, ret_error 0
Line  770751: 02/19/2025 21:53:27;0400;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;Received session ID for job: 388959
Line  770752: 02/19/2025 21:53:27;0400;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;Updated job state to 82 and substate to 42
Line  770893: 02/19/2025 21:53:37;0400;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;Received the same SID as before: 388959
Line  773173: 02/19/2025 21:53:53;0400;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;Received the same SID as before: 388959
Line  774499: 02/19/2025 21:54:15;0400;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;Received the same SID as before: 388959
Line  777516: 02/19/2025 21:54:44;0400;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;Received the same SID as before: 388959
Line  779663: 02/19/2025 21:55:18;0400;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;Received the same SID as before: 388959
Line  784453: 02/19/2025 21:55:58;0400;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;Received the same SID as before: 388959
Line  787699: 02/19/2025 21:56:44;0400;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;Received the same SID as before: 388959
Line  791797: 02/19/2025 21:57:37;0400;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;Received the same SID as before: 388959
Line  794856: 02/19/2025 21:58:29;0080;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;Obit received momhop:1 serverhop:1 state:R substate:42
Line  794857: 02/19/2025 21:58:29;0400;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;Updated job state to 69 and substate to 50
Line  794863: 02/19/2025 21:58:29;0400;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;Updated job state to 69 and substate to 51
Line  794870: 02/19/2025 21:58:29;0400;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;Updated job state to 69 and substate to 52
Line  794871: 02/19/2025 21:58:29;0400;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;Updated job state to 69 and substate to 53
Line  795266: 02/19/2025 21:58:30;0010;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;Exit_status=1 resources_used.cpupercent=0 resources_used.cput=00:00:00 resources_used.mem=38944kb resources_used.ncpus=16 resources_used.vmem=3293276kb resources_used.walltime=00:05:03
Line  795267: 02/19/2025 21:58:30;0800;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;ET_LIM_DBG: account_entity_limit_usages: entered, DECR on server ip-10-0-6-20, op_flag b, alt_res_ptr (nil)
Line  795268: 02/19/2025 21:58:30;0800;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;ET_LIM_DBG: set_entity_ct_sum_max: exiting, ret 0 [max_queued limit not set for server]
Line  795269: 02/19/2025 21:58:30;0800;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;ET_LIM_DBG: set_entity_resc_sum_max: entered [alt_res (nil)]
Line  795270: 02/19/2025 21:58:30;0800;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;ET_LIM_DBG: set_entity_resc_sum_max: exiting, ret 0 [max_queued_res limit not set for server]
Line  795271: 02/19/2025 21:58:30;0800;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;ET_LIM_DBG: account_entity_limit_usages: exiting, ret_error 0
Line  795272: 02/19/2025 21:58:30;0800;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;ET_LIM_DBG: account_entity_limit_usages: entered, DECR on queue medium, op_flag b, alt_res_ptr (nil)
Line  795273: 02/19/2025 21:58:30;0800;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;ET_LIM_DBG: set_entity_ct_sum_max: exiting, ret 0 [max_queued limit not set for medium]
Line  795274: 02/19/2025 21:58:30;0800;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;ET_LIM_DBG: set_entity_resc_sum_max: entered [alt_res (nil)]
Line  795275: 02/19/2025 21:58:30;0800;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;ET_LIM_DBG: set_entity_resc_sum_max: exiting, ret 0 [max_queued_res limit not set for medium]
Line  795276: 02/19/2025 21:58:30;0800;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;ET_LIM_DBG: account_entity_limit_usages: exiting, ret_error 0

Outcome:
Because of this issue, we see spikes in job start times causing failures in other parts of the system waiting for the job to run on time. There were compute nodes available during this time period

The questions we have are:

  • Why does PBS mark certain compute stacks as ineligible ?
  • Why do some nodes go into this create_node_buckets loop where no jobs get run on them for extended periods of time ?

Some help in understanding this issue would be greatly appreciated.

Thanks

Could you please try this configuration:

On the pbs server host as root user:

  • source /etc/pbs.conf
  • edit $PBS_HOME/sched_priv/sched_config
  • find the below line

provision_policy: “aggressive_provision”

and update it to

provision_policy: “avoid_provision”

  • save and exit the file
  • kill -HUP < PID of the pbs_sched process>
  • qmgr -c “set server scheduling = t”

Hope this resolves your issue.

Thanks for the suggestion, this is an intermittent issue and we’ll try to reproduce it again with this config in place. I’ll report back if this resolved the issue.

1 Like