Hi, We’re using the 23.06 release of OpenPBS and facing an issue where most jobs run on-time successfully but a small number of jobs stay queued for extended periods of time despite nodes being available.
Investigation:
Lets take PBS Job ID 1094 as an example. Here, the job gets submitted at 02/19/2025 21:00:08 and doesn’t actually run until 02/19/2025 21:53:24 despite nodes being available. This job was submitted to the custom “medium” queue.
The logs for this look like:
Line 3: Line 48308: 02/19/2025 21:00:08;0080;pbs_sched;Job;1094.ip-10-0-6-20;Considering job to run
Line 4: Line 48309: 02/19/2025 21:00:08;0040;pbs_sched;Job;1094.ip-10-0-6-20;Not enough total nodes available
Line 5: Line 48310: 02/19/2025 21:00:08;0040;pbs_sched;Job;1094.ip-10-0-6-20;Job will never run with the resources currently configured in the complex
...
...
...
...
Line 2986: Line 71548: 02/19/2025 21:53:19;0040;pbs_sched;Job;1094.ip-10-0-6-20;Not enough total nodes available
Line 2987: Line 71549: 02/19/2025 21:53:19;0040;pbs_sched;Job;1094.ip-10-0-6-20;Job will never run with the resources currently configured in the complex
Line 2988: Line 71575: 02/19/2025 21:53:24;0080;pbs_sched;Job;1094.ip-10-0-6-20;Considering job to run
Line 2989: Line 71576: 02/19/2025 21:53:24;0400;pbs_sched;Node;1094.ip-10-0-6-20;Evaluating subchunk: ncpus=16:compute_node=medium-ondemand-gfefa6bd1
Line 2990: Line 71578: 02/19/2025 21:53:24;0400;pbs_sched;Node;1094.ip-10-0-6-20;Allocated one subchunk: ncpus=16:compute_node=medium-ondemand-gfefa6bd1
Line 2991: Line 71579: 02/19/2025 21:53:24;0040;pbs_sched;Job;1094.ip-10-0-6-20;Job run
Line 2992: Line 71799: 02/19/2025 21:54:04;0400;pbs_sched;Job;1094.ip-10-0-6-20;Simulation: job end point [Mon Feb 18 21:53:25 2030]
In the PBS server logs, I see these messages an hour before the job submission:
Line 33120: 02/19/2025 20:01:47;0400;pbs_sched;Node;;Marking nodes with signature compute_node=medium-ondemand-gfefa6bd1:ncpus=16 ineligible
Line 33132: 02/19/2025 20:01:47;0400;pbs_sched;Node;;Marking nodes with signature compute_node=medium-ondemand-gfefa6bd1:ncpus=16 ineligible
Line 33144: 02/19/2025 20:01:47;0400;pbs_sched;Node;;Marking nodes with signature compute_node=medium-ondemand-gfefa6bd1:ncpus=16 ineligible
Line 33156: 02/19/2025 20:01:47;0400;pbs_sched;Node;;Marking nodes with signature compute_node=medium-ondemand-gfefa6bd1:ncpus=16 ineligible
The nodes in this compute stack remain in use with jobs getting executed on them until there are large gaps where see lots of these “Create node bucket” messages for long periods of time with no chunks getting allocated e.g:
Line 45459: 02/19/2025 20:47:55;0400;pbs_sched;Node;create_node_buckets;Created node bucket compute_node=medium-ondemand-gfefa6bd1:ncpus=16:accelerator=False:queue=medium
You can see an example of this visually in the screenshot below. The highlighted lines are where “Allocated one subchunk” lines were present indicating node usage for job execution, but then from the bar on the right side you’ll see an extended period of time where we just got the “Created node bucket messages.”
The PBS job ID 1094 was submitted during this period 21:00:08 and remained queued.
Line 48265: 02/19/2025 20:59:55;0400;pbs_sched;Node;create_node_buckets;Created node bucket compute_node=medium-ondemand-gfefa6bd1:ncpus=16:accelerator=False:queue=medium
Line 48277: 02/19/2025 21:00:00;0400;pbs_sched;Node;create_node_buckets;Created node bucket compute_node=medium-ondemand-gfefa6bd1:ncpus=16:accelerator=False:queue=medium
Line 48289: 02/19/2025 21:00:05;0400;pbs_sched;Node;create_node_buckets;Created node bucket compute_node=medium-ondemand-gfefa6bd1:ncpus=16:accelerator=False:queue=medium
Line 48301: 02/19/2025 21:00:08;0400;pbs_sched;Node;create_node_buckets;Created node bucket compute_node=medium-ondemand-gfefa6bd1:ncpus=16:accelerator=False:queue=medium
Line 48316: 02/19/2025 21:00:09;0400;pbs_sched;Node;create_node_buckets;Created node bucket compute_node=medium-ondemand-gfefa6bd1:ncpus=16:accelerator=False:queue=medium
This lasts until 21:53:24 when the node becomes available again somehow and Job 1094 gets allocated a chunk on it.
Line 71229: 02/19/2025 21:52:50;0400;pbs_sched;Node;create_node_buckets;Created node bucket compute_node=medium-ondemand-gfefa6bd1:ncpus=16:accelerator=False:queue=medium
Line 71259: 02/19/2025 21:52:50;0400;pbs_sched;Node;create_node_buckets;Created node bucket compute_node=medium-ondemand-gfefa6bd1:ncpus=16:accelerator=False:queue=medium
Line 71292: 02/19/2025 21:52:50;0400;pbs_sched;Node;create_node_buckets;Created node bucket compute_node=medium-ondemand-gfefa6bd1:ncpus=16:accelerator=False:queue=medium
Line 71328: 02/19/2025 21:52:55;0400;pbs_sched;Node;create_node_buckets;Created node bucket compute_node=medium-ondemand-gfefa6bd1:ncpus=16:accelerator=False:queue=medium
Line 71372: 02/19/2025 21:53:00;0400;pbs_sched;Node;create_node_buckets;Created node bucket compute_node=medium-ondemand-gfefa6bd1:ncpus=16:accelerator=False:queue=medium
Line 71396: 02/19/2025 21:53:05;0400;pbs_sched;Node;create_node_buckets;Created node bucket compute_node=medium-ondemand-gfefa6bd1:ncpus=16:accelerator=False:queue=medium
Line 71420: 02/19/2025 21:53:10;0400;pbs_sched;Node;create_node_buckets;Created node bucket compute_node=medium-ondemand-gfefa6bd1:ncpus=16:accelerator=False:queue=medium
Line 71444: 02/19/2025 21:53:14;0400;pbs_sched;Node;create_node_buckets;Created node bucket compute_node=medium-ondemand-gfefa6bd1:ncpus=16:accelerator=False:queue=medium
Line 71468: 02/19/2025 21:53:14;0400;pbs_sched;Node;create_node_buckets;Created node bucket compute_node=medium-ondemand-gfefa6bd1:ncpus=16:accelerator=False:queue=medium
Line 71492: 02/19/2025 21:53:14;0400;pbs_sched;Node;create_node_buckets;Created node bucket compute_node=medium-ondemand-gfefa6bd1:ncpus=16:accelerator=False:queue=medium
Line 71516: 02/19/2025 21:53:14;0400;pbs_sched;Node;create_node_buckets;Created node bucket compute_node=medium-ondemand-gfefa6bd1:ncpus=16:accelerator=False:queue=medium
Line 71540: 02/19/2025 21:53:19;0400;pbs_sched;Node;create_node_buckets;Created node bucket compute_node=medium-ondemand-gfefa6bd1:ncpus=16:accelerator=False:queue=medium
Line 71564: 02/19/2025 21:53:24;0400;pbs_sched;Node;create_node_buckets;Created node bucket compute_node=medium-ondemand-gfefa6bd1:ncpus=16:accelerator=False:queue=medium
Line 71566: 02/19/2025 21:53:24;0400;pbs_sched;Node;1036.ip-10-0-6-20;Evaluating subchunk: ncpus=16:compute_node=medium-ondemand-gfefa6bd1
Line 71568: 02/19/2025 21:53:24;0400;pbs_sched;Node;1036.ip-10-0-6-20;Allocated one subchunk: ncpus=16:compute_node=medium-ondemand-gfefa6bd1
Line 71571: 02/19/2025 21:53:24;0400;pbs_sched;Node;1037.ip-10-0-6-20;Evaluating subchunk: ncpus=16:compute_node=medium-ondemand-gfefa6bd1
Line 71573: 02/19/2025 21:53:24;0400;pbs_sched;Node;1037.ip-10-0-6-20;Allocated one subchunk: ncpus=16:compute_node=medium-ondemand-gfefa6bd1
Line 71576: 02/19/2025 21:53:24;0400;pbs_sched;Node;1094.ip-10-0-6-20;Evaluating subchunk: ncpus=16:compute_node=medium-ondemand-gfefa6bd1
Line 71578: 02/19/2025 21:53:24;0400;pbs_sched;Node;1094.ip-10-0-6-20;Allocated one subchunk: ncpus=16:compute_node=medium-ondemand-gfefa6bd1
Some more detail of the job’s lifecycle from start to end:
Line 638789: 02/19/2025 21:00:08;0800;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;ET_LIM_DBG: check_entity_ct_limit_max: entered for medium
Line 638790: 02/19/2025 21:00:08;0800;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;ET_LIM_DBG: check_entity_ct_limit_max: exiting, ret 0 [max_queued limit not set for medium]
Line 638791: 02/19/2025 21:00:08;0800;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;ET_LIM_DBG: check_entity_ct_limit_queued: entered for medium
Line 638792: 02/19/2025 21:00:08;0800;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;ET_LIM_DBG: check_entity_ct_limit_queued: exiting, ret 0 [queued_jobs_threshold limit not set for medium]
Line 638793: 02/19/2025 21:00:08;0800;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;ET_LIM_DBG: check_entity_ct_limit_max: entered for server
Line 638794: 02/19/2025 21:00:08;0800;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;ET_LIM_DBG: check_entity_ct_limit_max: exiting, ret 0 [max_queued limit not set for server]
Line 638795: 02/19/2025 21:00:08;0800;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;ET_LIM_DBG: check_entity_ct_limit_queued: entered for server
Line 638796: 02/19/2025 21:00:08;0800;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;ET_LIM_DBG: check_entity_ct_limit_queued: exiting, ret 0 [queued_jobs_threshold limit not set for server]
Line 638797: 02/19/2025 21:00:08;0800;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;ET_LIM_DBG: check_entity_resc_limit_max: entered for medium, alt_res (nil)
Line 638798: 02/19/2025 21:00:08;0800;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;ET_LIM_DBG: check_entity_resc_limit_max: exiting, ret 0 [max_queued_res limit not set for medium]
Line 638799: 02/19/2025 21:00:08;0800;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;ET_LIM_DBG: check_entity_resc_limit_queued: entered for medium, alt_res (nil)
Line 638800: 02/19/2025 21:00:08;0800;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;ET_LIM_DBG: check_entity_resc_limit_queued: exiting, ret 0 [queued_jobs_threshold_res limit not set for medium]
Line 638801: 02/19/2025 21:00:08;0800;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;ET_LIM_DBG: check_entity_resc_limit_max: entered for server, alt_res (nil)
Line 638802: 02/19/2025 21:00:08;0800;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;ET_LIM_DBG: check_entity_resc_limit_max: exiting, ret 0 [max_queued_res limit not set for server]
Line 638803: 02/19/2025 21:00:08;0800;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;ET_LIM_DBG: check_entity_resc_limit_queued: entered for server, alt_res (nil)
Line 638804: 02/19/2025 21:00:08;0800;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;ET_LIM_DBG: check_entity_resc_limit_queued: exiting, ret 0 [queued_jobs_threshold_res limit not set for server]
Line 638807: 02/19/2025 21:00:08;0800;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;ET_LIM_DBG: account_entity_limit_usages: entered, INCR on server ip-10-0-6-20, op_flag f, alt_res_ptr (nil)
Line 638808: 02/19/2025 21:00:08;0800;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;ET_LIM_DBG: set_entity_ct_sum_max: exiting, ret 0 [max_queued limit not set for server]
Line 638809: 02/19/2025 21:00:08;0800;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;ET_LIM_DBG: set_entity_ct_sum_queued: exiting, ret 0 [queued_jobs_threshold limit not set for server]
Line 638810: 02/19/2025 21:00:08;0800;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;ET_LIM_DBG: set_entity_resc_sum_max: entered [alt_res (nil)]
Line 638811: 02/19/2025 21:00:08;0800;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;ET_LIM_DBG: set_entity_resc_sum_max: exiting, ret 0 [max_queued_res limit not set for server]
Line 638812: 02/19/2025 21:00:08;0800;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;ET_LIM_DBG: set_entity_resc_sum_queued: entered [alt_res (nil)]
Line 638813: 02/19/2025 21:00:08;0800;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;ET_LIM_DBG: set_entity_resc_sum_queued: exiting, ret 0 [queued_jobs_threshold_res limit not set for server]
Line 638814: 02/19/2025 21:00:08;0800;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;ET_LIM_DBG: account_entity_limit_usages: exiting, ret_error 0
Line 638815: 02/19/2025 21:00:08;0400;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;Updated job state to 81 and substate to 10
Line 638816: 02/19/2025 21:00:08;0100;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;enqueuing into medium, state Q hop 1
Line 638817: 02/19/2025 21:00:08;0800;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;ET_LIM_DBG: account_entity_limit_usages: entered, INCR on queue medium, op_flag f, alt_res_ptr (nil)
Line 638818: 02/19/2025 21:00:08;0800;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;ET_LIM_DBG: set_entity_ct_sum_max: exiting, ret 0 [max_queued limit not set for medium]
Line 638819: 02/19/2025 21:00:08;0800;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;ET_LIM_DBG: set_entity_ct_sum_queued: exiting, ret 0 [queued_jobs_threshold limit not set for medium]
Line 638820: 02/19/2025 21:00:08;0800;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;ET_LIM_DBG: set_entity_resc_sum_max: entered [alt_res (nil)]
Line 638821: 02/19/2025 21:00:08;0800;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;ET_LIM_DBG: set_entity_resc_sum_max: exiting, ret 0 [max_queued_res limit not set for medium]
Line 638822: 02/19/2025 21:00:08;0800;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;ET_LIM_DBG: set_entity_resc_sum_queued: entered [alt_res (nil)]
Line 638823: 02/19/2025 21:00:08;0800;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;ET_LIM_DBG: set_entity_resc_sum_queued: exiting, ret 0 [queued_jobs_threshold_res limit not set for medium]
Line 638824: 02/19/2025 21:00:08;0800;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;ET_LIM_DBG: account_entity_limit_usages: exiting, ret_error 0
Line 638825: 02/19/2025 21:00:08;0008;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;Job Queued at request of jobsubmit@ip-10-0-6-20.us-west-2.compute.internal, owner = jobsubmit@ip-10-0-6-20.us-west-2.compute.internal, job name = 5af17c33-ddfd-4c14-8fd6-fef115006a15, queue = medium
Line 638854: 02/19/2025 21:00:08;0008;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;Job Modified at request of Scheduler@ip-10-0-6-20.us-west-2.compute.internal
Line 639138: 02/19/2025 21:00:11;0800;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;ET_LIM_DBG: account_entity_limit_usages: entered, INCR on server ip-10-0-6-20, op_flag e, alt_res_ptr 0x1619a860
Line 639139: 02/19/2025 21:00:11;0800;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;ET_LIM_DBG: set_entity_resc_sum_max: entered [alt_res 0x1619a860]
Line 639140: 02/19/2025 21:00:11;0800;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;ET_LIM_DBG: set_entity_resc_sum_max: exiting, ret 0 [max_queued_res limit not set for server]
Line 639141: 02/19/2025 21:00:11;0800;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;ET_LIM_DBG: set_entity_resc_sum_queued: entered [alt_res 0x1619a860]
Line 639142: 02/19/2025 21:00:11;0800;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;ET_LIM_DBG: set_entity_resc_sum_queued: exiting, ret 0 [queued_jobs_threshold_res limit not set for server]
Line 639143: 02/19/2025 21:00:11;0800;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;ET_LIM_DBG: account_entity_limit_usages: exiting, ret_error 0
Line 639144: 02/19/2025 21:00:11;0800;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;ET_LIM_DBG: account_entity_limit_usages: entered, INCR on queue medium, op_flag e, alt_res_ptr 0x1619a860
Line 639145: 02/19/2025 21:00:11;0800;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;ET_LIM_DBG: set_entity_resc_sum_max: entered [alt_res 0x1619a860]
Line 639146: 02/19/2025 21:00:11;0800;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;ET_LIM_DBG: set_entity_resc_sum_max: exiting, ret 0 [max_queued_res limit not set for medium]
Line 639147: 02/19/2025 21:00:11;0800;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;ET_LIM_DBG: set_entity_resc_sum_queued: entered [alt_res 0x1619a860]
Line 639148: 02/19/2025 21:00:11;0800;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;ET_LIM_DBG: set_entity_resc_sum_queued: exiting, ret 0 [queued_jobs_threshold_res limit not set for medium]
Line 639149: 02/19/2025 21:00:11;0800;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;ET_LIM_DBG: account_entity_limit_usages: exiting, ret_error 0
Line 639150: 02/19/2025 21:00:11;0008;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;Job Modified at request of root@ip-10-0-6-20.us-west-2.compute.internal
Line 770201: 02/19/2025 21:53:24;0008;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;Job Run at request of Scheduler@ip-10-0-6-20.us-west-2.compute.internal on exec_vnode (ip-10-0-5-174:ncpus=16)
Line 770202: 02/19/2025 21:53:24;0400;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;Updated job state to 82 and substate to 41
Line 770203: 02/19/2025 21:53:24;0800;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;ET_LIM_DBG: account_entity_limit_usages: entered, DECR on server ip-10-0-6-20, op_flag 7, alt_res_ptr (nil)
Line 770204: 02/19/2025 21:53:24;0800;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;ET_LIM_DBG: set_entity_ct_sum_queued: exiting, ret 0 [queued_jobs_threshold limit not set for server]
Line 770205: 02/19/2025 21:53:24;0800;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;ET_LIM_DBG: set_entity_resc_sum_queued: entered [alt_res (nil)]
Line 770206: 02/19/2025 21:53:24;0800;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;ET_LIM_DBG: set_entity_resc_sum_queued: exiting, ret 0 [queued_jobs_threshold_res limit not set for server]
Line 770207: 02/19/2025 21:53:24;0800;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;ET_LIM_DBG: account_entity_limit_usages: exiting, ret_error 0
Line 770208: 02/19/2025 21:53:24;0800;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;ET_LIM_DBG: account_entity_limit_usages: entered, DECR on queue medium, op_flag 7, alt_res_ptr (nil)
Line 770209: 02/19/2025 21:53:24;0800;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;ET_LIM_DBG: set_entity_ct_sum_queued: exiting, ret 0 [queued_jobs_threshold limit not set for medium]
Line 770210: 02/19/2025 21:53:24;0800;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;ET_LIM_DBG: set_entity_resc_sum_queued: entered [alt_res (nil)]
Line 770211: 02/19/2025 21:53:24;0800;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;ET_LIM_DBG: set_entity_resc_sum_queued: exiting, ret 0 [queued_jobs_threshold_res limit not set for medium]
Line 770212: 02/19/2025 21:53:24;0800;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;ET_LIM_DBG: account_entity_limit_usages: exiting, ret_error 0
Line 770751: 02/19/2025 21:53:27;0400;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;Received session ID for job: 388959
Line 770752: 02/19/2025 21:53:27;0400;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;Updated job state to 82 and substate to 42
Line 770893: 02/19/2025 21:53:37;0400;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;Received the same SID as before: 388959
Line 773173: 02/19/2025 21:53:53;0400;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;Received the same SID as before: 388959
Line 774499: 02/19/2025 21:54:15;0400;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;Received the same SID as before: 388959
Line 777516: 02/19/2025 21:54:44;0400;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;Received the same SID as before: 388959
Line 779663: 02/19/2025 21:55:18;0400;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;Received the same SID as before: 388959
Line 784453: 02/19/2025 21:55:58;0400;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;Received the same SID as before: 388959
Line 787699: 02/19/2025 21:56:44;0400;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;Received the same SID as before: 388959
Line 791797: 02/19/2025 21:57:37;0400;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;Received the same SID as before: 388959
Line 794856: 02/19/2025 21:58:29;0080;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;Obit received momhop:1 serverhop:1 state:R substate:42
Line 794857: 02/19/2025 21:58:29;0400;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;Updated job state to 69 and substate to 50
Line 794863: 02/19/2025 21:58:29;0400;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;Updated job state to 69 and substate to 51
Line 794870: 02/19/2025 21:58:29;0400;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;Updated job state to 69 and substate to 52
Line 794871: 02/19/2025 21:58:29;0400;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;Updated job state to 69 and substate to 53
Line 795266: 02/19/2025 21:58:30;0010;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;Exit_status=1 resources_used.cpupercent=0 resources_used.cput=00:00:00 resources_used.mem=38944kb resources_used.ncpus=16 resources_used.vmem=3293276kb resources_used.walltime=00:05:03
Line 795267: 02/19/2025 21:58:30;0800;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;ET_LIM_DBG: account_entity_limit_usages: entered, DECR on server ip-10-0-6-20, op_flag b, alt_res_ptr (nil)
Line 795268: 02/19/2025 21:58:30;0800;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;ET_LIM_DBG: set_entity_ct_sum_max: exiting, ret 0 [max_queued limit not set for server]
Line 795269: 02/19/2025 21:58:30;0800;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;ET_LIM_DBG: set_entity_resc_sum_max: entered [alt_res (nil)]
Line 795270: 02/19/2025 21:58:30;0800;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;ET_LIM_DBG: set_entity_resc_sum_max: exiting, ret 0 [max_queued_res limit not set for server]
Line 795271: 02/19/2025 21:58:30;0800;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;ET_LIM_DBG: account_entity_limit_usages: exiting, ret_error 0
Line 795272: 02/19/2025 21:58:30;0800;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;ET_LIM_DBG: account_entity_limit_usages: entered, DECR on queue medium, op_flag b, alt_res_ptr (nil)
Line 795273: 02/19/2025 21:58:30;0800;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;ET_LIM_DBG: set_entity_ct_sum_max: exiting, ret 0 [max_queued limit not set for medium]
Line 795274: 02/19/2025 21:58:30;0800;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;ET_LIM_DBG: set_entity_resc_sum_max: entered [alt_res (nil)]
Line 795275: 02/19/2025 21:58:30;0800;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;ET_LIM_DBG: set_entity_resc_sum_max: exiting, ret 0 [max_queued_res limit not set for medium]
Line 795276: 02/19/2025 21:58:30;0800;Server@ip-10-0-6-20;Job;1094.ip-10-0-6-20;ET_LIM_DBG: account_entity_limit_usages: exiting, ret_error 0
Outcome:
Because of this issue, we see spikes in job start times causing failures in other parts of the system waiting for the job to run on time. There were compute nodes available during this time period
The questions we have are:
- Why does PBS mark certain compute stacks as ineligible ?
- Why do some nodes go into this create_node_buckets loop where no jobs get run on them for extended periods of time ?
Some help in understanding this issue would be greatly appreciated.
Thanks