Some jobs stay queued for extended periods of time despite availability of hosts

Hi, we’re using the 23.06 release of OpenPBS and facing an issue where most jobs run on time successfully, but a small number of jobs stay queued for extended periods despite nodes being available.

Investigation:
Let’s take PBS job ID 1094 as an example. The job gets submitted at 02/19/2025 21:00:08 but doesn’t actually run until 02/19/2025 21:53:24, despite nodes being available. This job was submitted to the custom "medium" queue.

The logs for this look like:


Line 48308: 02/19/2025 21:00:08;0080;pbs_sched;Job;1094.ip-10-0-6-20;Considering job to run
Line 48309: 02/19/2025 21:00:08;0040;pbs_sched;Job;1094.ip-10-0-6-20;Not enough total nodes available
Line 48310: 02/19/2025 21:00:08;0040;pbs_sched;Job;1094.ip-10-0-6-20;Job will never run with the resources currently configured in the complex
...
Line 71548: 02/19/2025 21:53:19;0040;pbs_sched;Job;1094.ip-10-0-6-20;Not enough total nodes available
Line 71549: 02/19/2025 21:53:19;0040;pbs_sched;Job;1094.ip-10-0-6-20;Job will never run with the resources currently configured in the complex
Line 71575: 02/19/2025 21:53:24;0080;pbs_sched;Job;1094.ip-10-0-6-20;Considering job to run
Line 71576: 02/19/2025 21:53:24;0400;pbs_sched;Node;1094.ip-10-0-6-20;Evaluating subchunk: ncpus=16:compute_node=medium-ondemand-gfefa6bd1
Line 71578: 02/19/2025 21:53:24;0400;pbs_sched;Node;1094.ip-10-0-6-20;Allocated one subchunk: ncpus=16:compute_node=medium-ondemand-gfefa6bd1
Line 71579: 02/19/2025 21:53:24;0040;pbs_sched;Job;1094.ip-10-0-6-20;Job run
Line 71799: 02/19/2025 21:54:04;0400;pbs_sched;Job;1094.ip-10-0-6-20;Simulation: job end point [Mon Feb 18 21:53:25 2030]

In the scheduler logs, I see these messages about an hour before the job submission:


Line 33120: 02/19/2025 20:01:47;0400;pbs_sched;Node;;Marking nodes with signature compute_node=medium-ondemand-gfefa6bd1:ncpus=16 ineligible
Line 33132: 02/19/2025 20:01:47;0400;pbs_sched;Node;;Marking nodes with signature compute_node=medium-ondemand-gfefa6bd1:ncpus=16 ineligible
Line 33144: 02/19/2025 20:01:47;0400;pbs_sched;Node;;Marking nodes with signature compute_node=medium-ondemand-gfefa6bd1:ncpus=16 ineligible
Line 33156: 02/19/2025 20:01:47;0400;pbs_sched;Node;;Marking nodes with signature compute_node=medium-ondemand-gfefa6bd1:ncpus=16 ineligible

The nodes in this compute stack remain in use, with jobs executing on them, until large gaps appear where we see many of these "Created node bucket" messages for long periods with no chunks being allocated, e.g.:

Line 45459: 02/19/2025 20:47:55;0400;pbs_sched;Node;create_node_buckets;Created node bucket compute_node=medium-ondemand-gfefa6bd1:ncpus=16:accelerator=False:queue=medium

You can see an example of this visually in the screenshot below. The highlighted lines are where "Allocated one subchunk" entries were present, indicating nodes being used for job execution; the bar on the right side then shows an extended period where we only got "Created node bucket" messages.

PBS job 1094 was submitted during this period (at 21:00:08) and remained queued.


Line 48265: 02/19/2025 20:59:55;0400;pbs_sched;Node;create_node_buckets;Created node bucket compute_node=medium-ondemand-gfefa6bd1:ncpus=16:accelerator=False:queue=medium
Line 48277: 02/19/2025 21:00:00;0400;pbs_sched;Node;create_node_buckets;Created node bucket compute_node=medium-ondemand-gfefa6bd1:ncpus=16:accelerator=False:queue=medium
Line 48289: 02/19/2025 21:00:05;0400;pbs_sched;Node;create_node_buckets;Created node bucket compute_node=medium-ondemand-gfefa6bd1:ncpus=16:accelerator=False:queue=medium
Line 48301: 02/19/2025 21:00:08;0400;pbs_sched;Node;create_node_buckets;Created node bucket compute_node=medium-ondemand-gfefa6bd1:ncpus=16:accelerator=False:queue=medium
Line 48316: 02/19/2025 21:00:09;0400;pbs_sched;Node;create_node_buckets;Created node bucket compute_node=medium-ondemand-gfefa6bd1:ncpus=16:accelerator=False:queue=medium

This lasts until 21:53:24, when the node somehow becomes available again and job 1094 gets allocated a chunk on it.


Line 71229: 02/19/2025 21:52:50;0400;pbs_sched;Node;create_node_buckets;Created node bucket compute_node=medium-ondemand-gfefa6bd1:ncpus=16:accelerator=False:queue=medium
Line 71259: 02/19/2025 21:52:50;0400;pbs_sched;Node;create_node_buckets;Created node bucket compute_node=medium-ondemand-gfefa6bd1:ncpus=16:accelerator=False:queue=medium
Line 71292: 02/19/2025 21:52:50;0400;pbs_sched;Node;create_node_buckets;Created node bucket compute_node=medium-ondemand-gfefa6bd1:ncpus=16:accelerator=False:queue=medium
Line 71328: 02/19/2025 21:52:55;0400;pbs_sched;Node;create_node_buckets;Created node bucket compute_node=medium-ondemand-gfefa6bd1:ncpus=16:accelerator=False:queue=medium
Line 71372: 02/19/2025 21:53:00;0400;pbs_sched;Node;create_node_buckets;Created node bucket compute_node=medium-ondemand-gfefa6bd1:ncpus=16:accelerator=False:queue=medium
Line 71396: 02/19/2025 21:53:05;0400;pbs_sched;Node;create_node_buckets;Created node bucket compute_node=medium-ondemand-gfefa6bd1:ncpus=16:accelerator=False:queue=medium
Line 71420: 02/19/2025 21:53:10;0400;pbs_sched;Node;create_node_buckets;Created node bucket compute_node=medium-ondemand-gfefa6bd1:ncpus=16:accelerator=False:queue=medium
Line 71444: 02/19/2025 21:53:14;0400;pbs_sched;Node;create_node_buckets;Created node bucket compute_node=medium-ondemand-gfefa6bd1:ncpus=16:accelerator=False:queue=medium
Line 71468: 02/19/2025 21:53:14;0400;pbs_sched;Node;create_node_buckets;Created node bucket compute_node=medium-ondemand-gfefa6bd1:ncpus=16:accelerator=False:queue=medium
Line 71492: 02/19/2025 21:53:14;0400;pbs_sched;Node;create_node_buckets;Created node bucket compute_node=medium-ondemand-gfefa6bd1:ncpus=16:accelerator=False:queue=medium
Line 71516: 02/19/2025 21:53:14;0400;pbs_sched;Node;create_node_buckets;Created node bucket compute_node=medium-ondemand-gfefa6bd1:ncpus=16:accelerator=False:queue=medium
Line 71540: 02/19/2025 21:53:19;0400;pbs_sched;Node;create_node_buckets;Created node bucket compute_node=medium-ondemand-gfefa6bd1:ncpus=16:accelerator=False:queue=medium
Line 71564: 02/19/2025 21:53:24;0400;pbs_sched;Node;create_node_buckets;Created node bucket compute_node=medium-ondemand-gfefa6bd1:ncpus=16:accelerator=False:queue=medium
Line 71566: 02/19/2025 21:53:24;0400;pbs_sched;Node;1036.ip-10-0-6-20;Evaluating subchunk: ncpus=16:compute_node=medium-ondemand-gfefa6bd1
Line 71568: 02/19/2025 21:53:24;0400;pbs_sched;Node;1036.ip-10-0-6-20;Allocated one subchunk: ncpus=16:compute_node=medium-ondemand-gfefa6bd1
Line 71571: 02/19/2025 21:53:24;0400;pbs_sched;Node;1037.ip-10-0-6-20;Evaluating subchunk: ncpus=16:compute_node=medium-ondemand-gfefa6bd1
Line 71573: 02/19/2025 21:53:24;0400;pbs_sched;Node;1037.ip-10-0-6-20;Allocated one subchunk: ncpus=16:compute_node=medium-ondemand-gfefa6bd1
Line 71576: 02/19/2025 21:53:24;0400;pbs_sched;Node;1094.ip-10-0-6-20;Evaluating subchunk: ncpus=16:compute_node=medium-ondemand-gfefa6bd1
Line 71578: 02/19/2025 21:53:24;0400;pbs_sched;Node;1094.ip-10-0-6-20;Allocated one subchunk: ncpus=16:compute_node=medium-ondemand-gfefa6bd1
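
To quantify these idle windows, here is a rough sketch that scans a scheduler log for long gaps between consecutive "Allocated one subchunk" entries. The timestamp format is taken from the excerpts above; the log path and gap threshold are illustrative assumptions:

```python
import re
from datetime import datetime

TS = re.compile(r"\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2}")

def alloc_gaps(lines, threshold_s=600):
    """Return (start, end) pairs where more than threshold_s seconds
    elapsed between consecutive 'Allocated one subchunk' entries."""
    times = []
    for line in lines:
        if "Allocated one subchunk" not in line:
            continue
        m = TS.search(line)
        if m:
            times.append(datetime.strptime(m.group(), "%m/%d/%Y %H:%M:%S"))
    return [(a, b) for a, b in zip(times, times[1:])
            if (b - a).total_seconds() > threshold_s]

# Usage sketch (hypothetical log path):
# with open("/var/spool/pbs/sched_logs/20250219") as f:
#     for start, end in alloc_gaps(f):
#         print(f"no chunk allocations between {start} and {end}")
```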

More detail on the job’s lifecycle from submission to completion:


Line  638789: 02/19/2025 21:00:08;0800;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;ET_LIM_DBG: check_entity_ct_limit_max: entered for medium
Line  638790: 02/19/2025 21:00:08;0800;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;ET_LIM_DBG: check_entity_ct_limit_max: exiting, ret 0 [max_queued limit not set for medium]
Line  638791: 02/19/2025 21:00:08;0800;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;ET_LIM_DBG: check_entity_ct_limit_queued: entered for medium
Line  638792: 02/19/2025 21:00:08;0800;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;ET_LIM_DBG: check_entity_ct_limit_queued: exiting, ret 0 [queued_jobs_threshold limit not set for medium]
Line  638793: 02/19/2025 21:00:08;0800;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;ET_LIM_DBG: check_entity_ct_limit_max: entered for server
Line  638794: 02/19/2025 21:00:08;0800;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;ET_LIM_DBG: check_entity_ct_limit_max: exiting, ret 0 [max_queued limit not set for server]
Line  638795: 02/19/2025 21:00:08;0800;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;ET_LIM_DBG: check_entity_ct_limit_queued: entered for server
Line  638796: 02/19/2025 21:00:08;0800;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;ET_LIM_DBG: check_entity_ct_limit_queued: exiting, ret 0 [queued_jobs_threshold limit not set for server]
Line  638797: 02/19/2025 21:00:08;0800;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;ET_LIM_DBG: check_entity_resc_limit_max: entered for medium, alt_res (nil)
Line  638798: 02/19/2025 21:00:08;0800;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;ET_LIM_DBG: check_entity_resc_limit_max: exiting, ret 0 [max_queued_res limit not set for medium]
Line  638799: 02/19/2025 21:00:08;0800;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;ET_LIM_DBG: check_entity_resc_limit_queued: entered for medium, alt_res (nil)
Line  638800: 02/19/2025 21:00:08;0800;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;ET_LIM_DBG: check_entity_resc_limit_queued: exiting, ret 0 [queued_jobs_threshold_res limit not set for medium]
Line  638801: 02/19/2025 21:00:08;0800;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;ET_LIM_DBG: check_entity_resc_limit_max: entered for server, alt_res (nil)
Line  638802: 02/19/2025 21:00:08;0800;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;ET_LIM_DBG: check_entity_resc_limit_max: exiting, ret 0 [max_queued_res limit not set for server]
Line  638803: 02/19/2025 21:00:08;0800;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;ET_LIM_DBG: check_entity_resc_limit_queued: entered for server, alt_res (nil)
Line  638804: 02/19/2025 21:00:08;0800;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;ET_LIM_DBG: check_entity_resc_limit_queued: exiting, ret 0 [queued_jobs_threshold_res limit not set for server]
Line  638807: 02/19/2025 21:00:08;0800;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;ET_LIM_DBG: account_entity_limit_usages: entered, INCR on server ip-10-0-6-20, op_flag f, alt_res_ptr (nil)
Line  638808: 02/19/2025 21:00:08;0800;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;ET_LIM_DBG: set_entity_ct_sum_max: exiting, ret 0 [max_queued limit not set for server]
Line  638809: 02/19/2025 21:00:08;0800;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;ET_LIM_DBG: set_entity_ct_sum_queued: exiting, ret 0 [queued_jobs_threshold limit not set for server]
Line  638810: 02/19/2025 21:00:08;0800;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;ET_LIM_DBG: set_entity_resc_sum_max: entered [alt_res (nil)]
Line  638811: 02/19/2025 21:00:08;0800;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;ET_LIM_DBG: set_entity_resc_sum_max: exiting, ret 0 [max_queued_res limit not set for server]
Line  638812: 02/19/2025 21:00:08;0800;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;ET_LIM_DBG: set_entity_resc_sum_queued: entered [alt_res (nil)]
Line  638813: 02/19/2025 21:00:08;0800;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;ET_LIM_DBG: set_entity_resc_sum_queued: exiting, ret 0 [queued_jobs_threshold_res limit not set for server]
Line  638814: 02/19/2025 21:00:08;0800;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;ET_LIM_DBG: account_entity_limit_usages: exiting, ret_error 0
Line  638815: 02/19/2025 21:00:08;0400;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;Updated job state to 81 and substate to 10
Line  638816: 02/19/2025 21:00:08;0100;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;enqueuing into medium, state Q hop 1
Line  638817: 02/19/2025 21:00:08;0800;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;ET_LIM_DBG: account_entity_limit_usages: entered, INCR on queue medium, op_flag f, alt_res_ptr (nil)
Line  638818: 02/19/2025 21:00:08;0800;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;ET_LIM_DBG: set_entity_ct_sum_max: exiting, ret 0 [max_queued limit not set for medium]
Line  638819: 02/19/2025 21:00:08;0800;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;ET_LIM_DBG: set_entity_ct_sum_queued: exiting, ret 0 [queued_jobs_threshold limit not set for medium]
Line  638820: 02/19/2025 21:00:08;0800;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;ET_LIM_DBG: set_entity_resc_sum_max: entered [alt_res (nil)]
Line  638821: 02/19/2025 21:00:08;0800;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;ET_LIM_DBG: set_entity_resc_sum_max: exiting, ret 0 [max_queued_res limit not set for medium]
Line  638822: 02/19/2025 21:00:08;0800;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;ET_LIM_DBG: set_entity_resc_sum_queued: entered [alt_res (nil)]
Line  638823: 02/19/2025 21:00:08;0800;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;ET_LIM_DBG: set_entity_resc_sum_queued: exiting, ret 0 [queued_jobs_threshold_res limit not set for medium]
Line  638824: 02/19/2025 21:00:08;0800;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;ET_LIM_DBG: account_entity_limit_usages: exiting, ret_error 0
Line  638825: 02/19/2025 21:00:08;0008;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;Job Queued at request of jobsubmit@ip-10-0-6-20.us-west-2.compute.internal, owner = jobsubmit@ip-10-0-6-20.us-west-2.compute.internal, job name = 5af17c33-ddfd-4c14-8fd6-fef115006a15, queue = medium
Line  638854: 02/19/2025 21:00:08;0008;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;Job Modified at request of Scheduler@ip-10-0-6-20.us-west-2.compute.internal
Line  639138: 02/19/2025 21:00:11;0800;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;ET_LIM_DBG: account_entity_limit_usages: entered, INCR on server ip-10-0-6-20, op_flag e, alt_res_ptr 0x1619a860
Line  639139: 02/19/2025 21:00:11;0800;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;ET_LIM_DBG: set_entity_resc_sum_max: entered [alt_res 0x1619a860]
Line  639140: 02/19/2025 21:00:11;0800;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;ET_LIM_DBG: set_entity_resc_sum_max: exiting, ret 0 [max_queued_res limit not set for server]
Line  639141: 02/19/2025 21:00:11;0800;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;ET_LIM_DBG: set_entity_resc_sum_queued: entered [alt_res 0x1619a860]
Line  639142: 02/19/2025 21:00:11;0800;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;ET_LIM_DBG: set_entity_resc_sum_queued: exiting, ret 0 [queued_jobs_threshold_res limit not set for server]
Line  639143: 02/19/2025 21:00:11;0800;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;ET_LIM_DBG: account_entity_limit_usages: exiting, ret_error 0
Line  639144: 02/19/2025 21:00:11;0800;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;ET_LIM_DBG: account_entity_limit_usages: entered, INCR on queue medium, op_flag e, alt_res_ptr 0x1619a860
Line  639145: 02/19/2025 21:00:11;0800;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;ET_LIM_DBG: set_entity_resc_sum_max: entered [alt_res 0x1619a860]
Line  639146: 02/19/2025 21:00:11;0800;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;ET_LIM_DBG: set_entity_resc_sum_max: exiting, ret 0 [max_queued_res limit not set for medium]
Line  639147: 02/19/2025 21:00:11;0800;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;ET_LIM_DBG: set_entity_resc_sum_queued: entered [alt_res 0x1619a860]
Line  639148: 02/19/2025 21:00:11;0800;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;ET_LIM_DBG: set_entity_resc_sum_queued: exiting, ret 0 [queued_jobs_threshold_res limit not set for medium]
Line  639149: 02/19/2025 21:00:11;0800;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;ET_LIM_DBG: account_entity_limit_usages: exiting, ret_error 0
Line  639150: 02/19/2025 21:00:11;0008;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;Job Modified at request of root@ip-10-0-6-20.us-west-2.compute.internal
Line  770201: 02/19/2025 21:53:24;0008;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;Job Run at request of Scheduler@ip-10-0-6-20.us-west-2.compute.internal on exec_vnode (ip-10-0-5-174:ncpus=16)
Line  770202: 02/19/2025 21:53:24;0400;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;Updated job state to 82 and substate to 41
Line  770203: 02/19/2025 21:53:24;0800;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;ET_LIM_DBG: account_entity_limit_usages: entered, DECR on server ip-10-0-6-20, op_flag 7, alt_res_ptr (nil)
Line  770204: 02/19/2025 21:53:24;0800;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;ET_LIM_DBG: set_entity_ct_sum_queued: exiting, ret 0 [queued_jobs_threshold limit not set for server]
Line  770205: 02/19/2025 21:53:24;0800;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;ET_LIM_DBG: set_entity_resc_sum_queued: entered [alt_res (nil)]
Line  770206: 02/19/2025 21:53:24;0800;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;ET_LIM_DBG: set_entity_resc_sum_queued: exiting, ret 0 [queued_jobs_threshold_res limit not set for server]
Line  770207: 02/19/2025 21:53:24;0800;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;ET_LIM_DBG: account_entity_limit_usages: exiting, ret_error 0
Line  770208: 02/19/2025 21:53:24;0800;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;ET_LIM_DBG: account_entity_limit_usages: entered, DECR on queue medium, op_flag 7, alt_res_ptr (nil)
Line  770209: 02/19/2025 21:53:24;0800;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;ET_LIM_DBG: set_entity_ct_sum_queued: exiting, ret 0 [queued_jobs_threshold limit not set for medium]
Line  770210: 02/19/2025 21:53:24;0800;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;ET_LIM_DBG: set_entity_resc_sum_queued: entered [alt_res (nil)]
Line  770211: 02/19/2025 21:53:24;0800;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;ET_LIM_DBG: set_entity_resc_sum_queued: exiting, ret 0 [queued_jobs_threshold_res limit not set for medium]
Line  770212: 02/19/2025 21:53:24;0800;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;ET_LIM_DBG: account_entity_limit_usages: exiting, ret_error 0
Line  770751: 02/19/2025 21:53:27;0400;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;Received session ID for job: 388959
Line  770752: 02/19/2025 21:53:27;0400;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;Updated job state to 82 and substate to 42
Line  770893: 02/19/2025 21:53:37;0400;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;Received the same SID as before: 388959
Line  773173: 02/19/2025 21:53:53;0400;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;Received the same SID as before: 388959
Line  774499: 02/19/2025 21:54:15;0400;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;Received the same SID as before: 388959
Line  777516: 02/19/2025 21:54:44;0400;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;Received the same SID as before: 388959
Line  779663: 02/19/2025 21:55:18;0400;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;Received the same SID as before: 388959
Line  784453: 02/19/2025 21:55:58;0400;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;Received the same SID as before: 388959
Line  787699: 02/19/2025 21:56:44;0400;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;Received the same SID as before: 388959
Line  791797: 02/19/2025 21:57:37;0400;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;Received the same SID as before: 388959
Line  794856: 02/19/2025 21:58:29;0080;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;Obit received momhop:1 serverhop:1 state:R substate:42
Line  794857: 02/19/2025 21:58:29;0400;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;Updated job state to 69 and substate to 50
Line  794863: 02/19/2025 21:58:29;0400;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;Updated job state to 69 and substate to 51
Line  794870: 02/19/2025 21:58:29;0400;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;Updated job state to 69 and substate to 52
Line  794871: 02/19/2025 21:58:29;0400;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;Updated job state to 69 and substate to 53
Line  795266: 02/19/2025 21:58:30;0010;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;Exit_status=1 resources_used.cpupercent=0 resources_used.cput=00:00:00 resources_used.mem=38944kb resources_used.ncpus=16 resources_used.vmem=3293276kb resources_used.walltime=00:05:03
Line  795267: 02/19/2025 21:58:30;0800;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;ET_LIM_DBG: account_entity_limit_usages: entered, DECR on server ip-10-0-6-20, op_flag b, alt_res_ptr (nil)
Line  795268: 02/19/2025 21:58:30;0800;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;ET_LIM_DBG: set_entity_ct_sum_max: exiting, ret 0 [max_queued limit not set for server]
Line  795269: 02/19/2025 21:58:30;0800;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;ET_LIM_DBG: set_entity_resc_sum_max: entered [alt_res (nil)]
Line  795270: 02/19/2025 21:58:30;0800;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;ET_LIM_DBG: set_entity_resc_sum_max: exiting, ret 0 [max_queued_res limit not set for server]
Line  795271: 02/19/2025 21:58:30;0800;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;ET_LIM_DBG: account_entity_limit_usages: exiting, ret_error 0
Line  795272: 02/19/2025 21:58:30;0800;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;ET_LIM_DBG: account_entity_limit_usages: entered, DECR on queue medium, op_flag b, alt_res_ptr (nil)
Line  795273: 02/19/2025 21:58:30;0800;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;ET_LIM_DBG: set_entity_ct_sum_max: exiting, ret 0 [max_queued limit not set for medium]
Line  795274: 02/19/2025 21:58:30;0800;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;ET_LIM_DBG: set_entity_resc_sum_max: entered [alt_res (nil)]
Line  795275: 02/19/2025 21:58:30;0800;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;ET_LIM_DBG: set_entity_resc_sum_max: exiting, ret 0 [max_queued_res limit not set for medium]
Line  795276: 02/19/2025 21:58:30;0800;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;ET_LIM_DBG: account_entity_limit_usages: exiting, ret_error 0

Outcome:
Because of this issue, we see spikes in job start times, which causes failures in other parts of the system that are waiting for jobs to run on time. Compute nodes were available during this time period.

The questions we have are:

  • Why does PBS mark certain compute stacks as ineligible?
  • Why do some nodes go into this create_node_buckets loop where no jobs get run on them for extended periods of time?

Some help in understanding this issue would be greatly appreciated.

Thanks

Could you please try this configuration:

On the PBS server host, as the root user:

  • source /etc/pbs.conf
  • edit $PBS_HOME/sched_priv/sched_config
  • find the line below

provision_policy: "aggressive_provision"

and update it to

provision_policy: "avoid_provision"

  • save and exit the file
  • kill -HUP <PID of the pbs_sched process>
  • qmgr -c "set server scheduling = t"

Hope this resolves your issue.

Thanks for the suggestion, this is an intermittent issue and we’ll try to reproduce it again with this config in place. I’ll report back if this resolved the issue.


Using avoid_provision didn’t help much. But I’ve noticed that our jobs have really high queue_rank values. Is this normal, or does it signify that the scheduler is unable to effectively schedule/rank the jobs?

From qstat output:
“queue_rank”:1742444171934033

queue_rank: A number indicating the job’s position within its queue. Only used internally by PBS.
As per RG-338, the PBS Professional 2024.1 Reference Guide.

My understanding is that queue_rank represents a value at that specific time for that job; it might change based on policy/priority/sorting/jobs etc.
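
For what it’s worth, the magnitude of that queue_rank looks like a microsecond-resolution Unix timestamp rather than a small ordinal; that’s my inference from the digits, not something the Reference Guide states. If that’s right, very large values are expected and not a sign of a scheduling problem:

```python
from datetime import datetime, timezone

queue_rank = 1742444171934033  # value from the qstat output above

# Interpreting the value as microseconds since the Unix epoch:
ts = datetime.fromtimestamp(queue_rank / 1_000_000, tz=timezone.utc)
print(ts.isoformat())  # a date in March 2025, plausibly the submit time
```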

Could you share the output of the commands below:

source /etc/pbs.conf;cat $PBS_HOME/sched_priv/sched_config | grep -v "#" | grep ^[a-zA-Z]
qstat -Bf
qmgr -c 'p s'
pbsnodes -av
qstat -fx <job that you think can run now, but not running>

It turns out not to be a scheduling issue; all the nodes belonging to a queue seem to be going in and out of the INUSE_DOWN state together, which results in no jobs being able to run.

I have a case of a node (ip-11-0-4-212) exhibiting the following behavior. Each bullet is an offline window; the parenthesized lines are the OK gaps between windows:

• 03:14:41 - 03:47:51 (offline for 33 minutes)
• 04:03:44 - 04:36:53 (offline for 33 minutes)
• 06:37:11 - 07:10:31 (offline for 33 minutes)
(OK for 15 minutes 52 seconds)
• 07:26:23 - 07:59:33
(OK for 16 minutes 32 seconds)
• 08:16:05 - 08:49:06
(OK for 16 minutes 14 seconds)
• 09:05:20 - 09:38:37
• 09:55:49 - 10:28:48
• 10:44:51 - 11:17:59
• 11:34:34 - 12:07:42
(OK for 16 minutes 2 seconds)
• 12:23:44 - 12:57:19
(OK for 16 minutes 21 seconds)
• 13:13:40 - 13:46:51
• …
• …
• 16:32:02 - 17:05:12
(OK for 15 minutes 53 seconds)
• 17:21:05 - 17:56:01
State changes for the last window are shown below. Notice the state change: no jobs could run on this node between 17:21:05 and 17:56:01, and a job finally starts running on it at 17:56:04.

03/27/2025 17:19:57;0100;Server@ip-11-0-6-34;Node;ip-11-0-4-212;set_vnode_state;vnode.state=0x0 vnode_o.state=0x10 vnode.last_state_change_time=1743095997 vnode_o.last_state_change_time=1743095981 state_bits=0xffffffffffffffaf state_bit_op_type_str=Nd_State_And state_bit_op_type_enum=2
03/27/2025 17:20:48;0100;Server@ip-11-0-6-34;Node;ip-11-0-4-212;set_vnode_state;vnode.state=0x10 vnode_o.state=0x0 vnode.last_state_change_time=1743096048 vnode_o.last_state_change_time=1743095997 state_bits=0x10 state_bit_op_type_str=Nd_State_Or state_bit_op_type_enum=1
03/27/2025 17:21:05;0100;Server@ip-11-0-6-34;Node;ip-11-0-4-212;set_vnode_state;vnode.state=0x0 vnode_o.state=0x10 vnode.last_state_change_time=1743096065 vnode_o.last_state_change_time=1743096048 state_bits=0xffffffffffffffaf state_bit_op_type_str=Nd_State_And state_bit_op_type_enum=2
03/27/2025 17:56:01;0100;Server@ip-11-0-6-34;Node;ip-11-0-4-212;set_vnode_state;vnode.state=0x2 vnode_o.state=0x0 vnode.last_state_change_time=1743098161 vnode_o.last_state_change_time=1743096065 state_bits=0x2 state_bit_op_type_str=Nd_State_Or state_bit_op_type_enum=1
03/27/2025 17:56:01;0100;Server@ip-11-0-6-34;Node;ip-11-0-4-212;set_vnode_state;vnode.state=0x2 vnode_o.state=0x2 vnode.last_state_change_time=1743098161 vnode_o.last_state_change_time=1743098161 state_bits=0x2 state_bit_op_type_str=Nd_State_Or state_bit_op_type_enum=1
03/27/2025 17:56:01;0100;Server@ip-11-0-6-34;Node;ip-11-0-4-212;set_vnode_state;vnode.state=0x2 vnode_o.state=0x2 vnode.last_state_change_time=1743098161 vnode_o.last_state_change_time=1743098161 state_bits=0x2 state_bit_op_type_str=Nd_State_Or state_bit_op_type_enum=1
03/27/2025 17:56:01;0100;Server@ip-11-0-6-34;Node;ip-11-0-4-212;set_vnode_state;vnode.state=0x2 vnode_o.state=0x2 vnode.last_state_change_time=1743098161 vnode_o.last_state_change_time=1743098161 state_bits=0x2 state_bit_op_type_str=Nd_State_Or state_bit_op_type_enum=1
03/27/2025 17:56:03;0100;Server@ip-11-0-6-34;Node;ip-11-0-4-212;set_vnode_state;vnode.state=0x0 vnode_o.state=0x2 vnode.last_state_change_time=1743098163 vnode_o.last_state_change_time=1743098161 state_bits=0xfffffffffffefffd state_bit_op_type_str=Nd_State_And state_bit_op_type_enum=2
03/27/2025 17:56:04;0100;Server@ip-11-0-6-34;Node;ip-11-0-4-212;set_vnode_state;vnode.state=0x10 vnode_o.state=0x0 vnode.last_state_change_time=1743098164 vnode_o.last_state_change_time=1743098163 state_bits=0x10 state_bit_op_type_str=Nd_State_Or state_bit_op_type_enum=1
03/27/2025 18:01:10;0100;Server@ip-11-0-6-34;Node;ip-11-0-4-212;set_vnode_state;vnode.state=0x0 vnode_o.state=0x10 vnode.last_state_change_time=1743098470 vnode_o.last_state_change_time=1743098164 state_bits=0xffffffffffffffaf state_bit_op_type_str=Nd_State_And state_bit_op_type_enum=2
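
If it helps to read those hex values, the vnode.state masks can be decoded bit by bit. The bit meanings below are from my reading of OpenPBS’s pbs_nodes.h and may differ between versions, so treat the mapping as an assumption to verify against your source tree. Decoded this way, the sequence above reads free → down (for ~35 minutes) → free → job:

```python
# Bit meanings assumed from OpenPBS's pbs_nodes.h; verify against
# the headers of your installed version before relying on them.
STATE_BITS = {
    0x01: "INUSE_OFFLINE",
    0x02: "INUSE_DOWN",
    0x04: "INUSE_DELETED",
    0x08: "INUSE_UNKNOWN",
    0x10: "INUSE_JOB",
    0x20: "INUSE_STALE",
    0x40: "INUSE_JOBEXCL",
    0x80: "INUSE_BUSY",
}

def decode_state(state):
    """Return the names of all set state bits, or ['free'] for 0x0."""
    names = [name for bit, name in sorted(STATE_BITS.items()) if state & bit]
    return names or ["free"]

# The values seen in the log excerpt above:
for s in (0x0, 0x2, 0x10):
    print(hex(s), decode_state(s))
```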

This pattern keeps repeating, with roughly 16-minute OK windows between the offline periods.

In the mom logs, notice the jump between 17:21:05 and 17:56:03. For some reason, as soon as the new IP addresses are authorized, the node becomes responsive again.

03/27/2025 17:21:03;0006;pbs_python;Hook;pbs_python;pbs_event: type = <class 'str'>
03/27/2025 17:21:03;0006;pbs_python;Hook;pbs_python;pbs_event: hook_name = <class 'str'>
03/27/2025 17:21:03;0006;pbs_python;Hook;pbs_python;pbs_event: requestor = <class 'str'>
03/27/2025 17:21:03;0006;pbs_python;Hook;pbs_python;pbs_event: requestor_host = <class 'str'>
03/27/2025 17:21:03;0006;pbs_python;Hook;pbs_python;pbs_event: hook_type = <class 'str'>
03/27/2025 17:21:03;0006;pbs_python;Hook;pbs_python;pbs_event: vnode_list = <class 'dict'>
03/27/2025 17:21:03;0006;pbs_python;Hook;pbs_python;pbs_event: job = <class 'dict'>
03/27/2025 17:21:04;0006;pbs_python;Hook;pbs_python;HookExecution TotalTime: 579 ms
03/27/2025 17:21:04;0008;pbs_mom;Job;12964.ip-11-0-6-34;no active tasks
03/27/2025 17:21:05;0008;pbs_mom;Job;12964.ip-11-0-6-34;kill_job
03/27/2025 17:21:05;0c06;pbs_mom;TPP;pbs_mom(Thread 0);tpp_mbox_read;Unable to read from msg box
03/27/2025 17:21:05;0c06;pbs_mom;TPP;pbs_mom(Thread 0);tpp_mbox_read;Unable to read from msg box
03/27/2025 17:56:03;0c06;pbs_mom;TPP;pbs_mom(Thread 0);tpp_mbox_read;Unable to read from msg box
03/27/2025 17:56:03;0c06;pbs_mom;TPP;pbs_mom(Thread 0);tpp_mbox_read;Unable to read from msg box
03/27/2025 17:56:03;0c06;pbs_mom;TPP;pbs_mom(Thread 0);tpp_mbox_read;Unable to read from msg box
03/27/2025 17:56:03;0c06;pbs_mom;TPP;pbs_mom(Thread 0);tpp_mbox_read;Unable to read from msg box
03/27/2025 17:56:03;0002;pbs_mom;Svr;pbs_mom;Adding IP address 11.0.2.64 as authorized
03/27/2025 17:56:03;0002;pbs_mom;Svr;pbs_mom;Adding IP address 11.0.3.3 as authorized
03/27/2025 17:56:03;0002;pbs_mom;Svr;pbs_mom;Adding IP address 11.0.5.4 as authorized
03/27/2025 17:56:03;0002;pbs_mom;Svr;pbs_mom;Adding IP address 11.0.5.13 as authorized

I have 4 other nodes belonging to the same queue with the same issue:
ip-11-0-2-196
ip-11-0-2-55
ip-11-0-4-136

It could be coincidental, but the following 4 nodes went offline just before the nodes above started exhibiting the issue:
ip-11-0-3-249
ip-11-0-5-172
ip-11-0-2-148
ip-11-0-5-152

I realize these IP address lists aren’t actionable on their own; they’re just to show the different groups of nodes involved.

Is there any way to have PBS refresh its internal node cache on a regular basis?

This is unlikely, but I wonder if you’re having a problem with the IP-to-MAC (ARP) cache in your switch timing out. You could test this by running a continuous ping from your server host to one of the affected nodes. If the node stays healthy as long as the ping is active, this might be the cause.
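
A minimal sketch of such a monitor, assuming `ping` is available on the server host; the hostname is a placeholder taken from the nodes discussed above. It timestamps every reachability flip so the flips can be correlated with the node state changes in the server log:

```python
import subprocess
from datetime import datetime

def is_reachable(host, timeout_s=2):
    """One ICMP probe; True if the host answered."""
    return subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), host],
        stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL,
    ).returncode == 0

def transitions(samples):
    """Given [(timestamp, up), ...], keep only the entries where
    reachability changed from the previous sample."""
    changes, prev = [], None
    for ts, up in samples:
        if up != prev:
            changes.append((ts, up))
            prev = up
    return changes

# Usage sketch (runs until interrupted):
# import time
# host, prev = "ip-11-0-4-212", None   # placeholder hostname
# while True:
#     up = is_reachable(host)
#     if up != prev:
#         print(datetime.now().isoformat(), "UP" if up else "DOWN")
#         prev = up
#     time.sleep(5)
```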