Waiting time and error handling

wakaka · January 21, 2026, 5:47am

I have encountered two problems:

1. There is only one available node in my queue. I submitted two jobs: one ran immediately and the other was queued waiting for resources. I then added a new node to the queue, but the queued job did not start right away. It waited about 8 minutes before being scheduled onto the new node. Why is the wait so long?
2. A node is unable to run the jobs PBS assigns to it because of a problem on the node. Each job assigned to it reports an error and terminates after a few seconds. However, PBS does not disable the node, so subsequent queued jobs are still assigned to it and they all fail the same way. I have to deal with the problem node manually and resubmit those jobs, which is very inconvenient.

adarsh · January 21, 2026, 7:56pm

Adding a new node does not trigger a scheduling cycle.
Try this:
qmgr : create node nodename
qmgr : set server scheduling = true
Your queued job will run if the node can accommodate it.
You can write a periodic mom hook (a kind of node health check script) that monitors the node. If any of the monitoring checks fail (say, /home directory not found, disk full, or network issues), it can offline the node and update the comment on the node, so that the administrator can investigate and no further jobs land on it.

Reference: Integrating NHC with PBS Pro
Also, you can check the examples in the administrator guide.
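
For illustration, the checks such a periodic hook might run could look roughly like the sketch below. It is plain Python with no PBS calls: the function name check_node, the 95% disk threshold, and the default paths are assumptions made for this example, not anything defined by PBS. The hook itself would call something like this and, on a non-empty result, offline the node and set its comment as described in the administrator guide.

```python
import os
import shutil


def check_node(required_dirs=("/home",), disk_path="/", max_used_fraction=0.95):
    """Return a list of failure reasons; an empty list means the node looks healthy.

    All names and thresholds here are hypothetical and chosen for the example.
    """
    failures = []

    # Check 1: every required directory (for example /home) must exist.
    for directory in required_dirs:
        if not os.path.isdir(directory):
            failures.append(f"directory not found: {directory}")

    # Check 2: the filesystem holding disk_path must not be (nearly) full.
    try:
        usage = shutil.disk_usage(disk_path)
    except OSError as exc:
        failures.append(f"cannot stat {disk_path}: {exc}")
    else:
        if usage.total > 0 and usage.used / usage.total >= max_used_fraction:
            failures.append(f"disk nearly full on {disk_path}")

    return failures


if __name__ == "__main__":
    problems = check_node()
    if problems:
        # In a real hook this is the point where the node would be offlined
        # and the reasons written to the node comment.
        print("UNHEALTHY: " + "; ".join(problems))
    else:
        print("OK")
```

An empty list means the node passed. Anything else is a reason string that could go into the node comment for the administrator.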

wakaka · January 22, 2026, 9:07am

Hi, adarsh, thanks for your reply.

The first issue is that I have already created the node, and I want it associated with my queue (which already has one node associated with it), but the newly added node is not immediately used for the queued job. The server's scheduling attribute is already True.

qmgr -c 'set node n1 resources_available.QList = q1'

qmgr -c 'set queue q1 default_chunk.QList = q1'

For the second question, a periodic hook that checks mom is already set up, since it is a prerequisite for operating and maintaining the system. If mom encounters an exception after the periodic check, the error I mentioned still occurs when the job executes, so the hook cannot prevent the failures in time.

adarsh · January 23, 2026, 8:09pm

The suggestion was to run the command qmgr -c "set server scheduling = true" again to explicitly trigger a scheduling cycle. Although scheduling is already set to true, re-running this command forces a new cycle, allowing the scheduler to detect newly added or reconfigured nodes and consider them for job placement if they meet the job’s requirements.
We should also consider the corner cases around the periodic hook. In this scenario, the periodic hook encountered an exception, and it is unclear whether the node was successfully offlined as a result. However, PBS continues to schedule jobs onto this node, which suggests that it may still be considered available by the scheduler.
As a safeguard, it may be useful to add logic that updates a timestamp file on successful hook execution. A cron job could then monitor this timestamp.txt file and, if it has not been updated for a defined period, proactively offline the node. This would help ensure that nodes in an uncertain or unhealthy state are not selected for scheduling.
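
A rough sketch of that timestamp safeguard, assuming Python for both sides. The function names, the file path, and the 15-minute limit in the usage note are made up for illustration. The only PBS command used is pbsnodes -o, which marks a node offline. The run parameter exists so the command can be swapped out when testing.

```python
import os
import subprocess
import time


def touch_heartbeat(path):
    """Called at the end of a successful hook run: create or refresh the file."""
    with open(path, "a"):
        os.utime(path, None)  # set atime and mtime to the current time


def is_stale(path, max_age_seconds, now=None):
    """True if the heartbeat file is missing or older than max_age_seconds."""
    now = time.time() if now is None else now
    try:
        mtime = os.path.getmtime(path)
    except OSError:
        return True  # a missing or unreadable file counts as stale
    return (now - mtime) > max_age_seconds


def offline_if_stale(path, node, max_age_seconds, run=subprocess.run):
    """Offline the node with `pbsnodes -o` when the heartbeat is stale.

    Returns True if the offline command was issued, False otherwise.
    """
    if not is_stale(path, max_age_seconds):
        return False
    run(["pbsnodes", "-o", node], check=True)
    return True
```

The cron side would then be a small script that calls, for example, offline_if_stale("/var/tmp/timestamp.txt", "n1", 900), run by a user with PBS manager or operator privileges. Treating a missing file as stale is a deliberate choice here: a hook that has never completed is handled the same way as one that has stopped.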

wakaka · January 26, 2026, 6:50am

Thank you very much for your suggestion, Mr. adarsh. It is very useful. I learned that starting a queue can also force a new cycle, but when I set started = True or enabled = True on my queue, it did not trigger one the way qmgr -c "set server scheduling = true" does.
