There is only one node available in my queue. I submitted two jobs: one started running immediately and the other was queued, waiting for resources. I then added a new node to the queue, but the queued job did not start immediately. Instead, it waited about 8 minutes before being scheduled onto the new node. Why do I need to wait so long?
A node is unable to execute the jobs PBS assigns to it because of a problem on the node. When a job is assigned to the node, the job reports an error and terminates after a few seconds. However, PBS does not disable the node, so subsequent queued jobs are still assigned to it and they all fail the same way. I have to handle the problem node manually and resubmit those jobs, which is very bad.
Adding a new node does not trigger a scheduling cycle.
Try this:
qmgr -c "create node nodename"
qmgr -c "set server scheduling = true"
Your queued job will run if the node can accommodate it.
You can write a periodic MoM hook (a kind of node health check script) that monitors the node. If any of the monitoring checks fail (say, the /home directory is not found, the disk is full, or there are network issues), the hook can offline the node and update the comment on the node, so that the administrator can investigate it and no other jobs land on it.
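A minimal sketch of such a health check hook is shown below, assuming it is installed as an exechost_periodic hook and that the hook interpreter is Python 3. The paths, the free-space threshold, and the function name find_problems are hypothetical and only illustrate the idea; the checks would need to match your own site.

```python
import os
import shutil

try:
    import pbs  # only importable inside the PBS hook interpreter
except ImportError:
    pbs = None  # lets the check logic be exercised outside PBS

MIN_FREE_FRACTION = 0.05  # hypothetical threshold: at least 5% free space


def find_problems(home="/home", scratch="/tmp", min_free=MIN_FREE_FRACTION):
    """Return a list of human-readable problems found on this host."""
    problems = []
    if not os.path.isdir(home):
        problems.append("%s not found" % home)
    try:
        usage = shutil.disk_usage(scratch)
        if usage.total and usage.free / usage.total < min_free:
            problems.append("%s is nearly full" % scratch)
    except OSError as err:
        problems.append("cannot check %s: %s" % (scratch, err))
    return problems


if pbs is not None:
    event = pbs.event()
    problems = find_problems()
    if problems:
        # Assumption: the node is registered under the local node name.
        vnode = event.vnode_list[pbs.get_local_nodename()]
        vnode.state = pbs.ND_OFFLINE
        vnode.comment = "health check failed: " + "; ".join(problems)
    event.accept()
```

The hook would be created and imported with qmgr, with its event set to exechost_periodic and its freq set to the desired interval. A node that the hook offlines stays offline until an administrator clears the state.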
The first issue is that I have already created a node and I want it to be associated with my queue (which already has a node associated with it), but the newly added node is not used for the queued job immediately. The server's scheduling attribute is already True.
For the second question, a periodic hook has already been set up to check MoM; that is a prerequisite for system operation and maintenance. If MoM encounters an exception after the periodic check has run, the error I mentioned will still occur when a job executes, so the hook cannot prevent bad things from happening in a timely manner.
The suggestion was to run the command qmgr -c "set server scheduling = true" again to explicitly trigger a scheduling cycle. Although scheduling is already set to true, re-running this command forces a new cycle, allowing the scheduler to detect newly added or reconfigured nodes and consider them for job placement if they meet the job’s requirements.
We should also consider the corner cases around the periodic hook. In this scenario, the periodic hook encountered an exception, and it is unclear whether the node was successfully offlined as a result. However, PBS continues to schedule jobs onto this node, which suggests that it may still be considered available by the scheduler.
As a safeguard, it may be useful to add logic that updates a timestamp file on successful hook execution. A cron job could then monitor this timestamp.txt file and, if it has not been updated for a defined period, proactively offline the node. This would help ensure that nodes in an uncertain or unhealthy state are not selected for scheduling.
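The watchdog side of this idea could be sketched as follows. It assumes the hook touches a timestamp file on every successful run, and that the cron job runs as a user with PBS manager or operator privilege so that pbsnodes -o is allowed. The file path, the age limit, and the function names are hypothetical.

```python
import os
import subprocess
import sys
import time

STAMP_FILE = "/var/spool/pbs/mom_priv/timestamp.txt"  # hypothetical location
MAX_AGE_SECONDS = 600  # hypothetical limit: a few missed hook runs


def is_stale(path, max_age, now=None):
    """True if the file is missing or was last modified more than max_age seconds ago."""
    now = time.time() if now is None else now
    try:
        return now - os.path.getmtime(path) > max_age
    except OSError:
        return True  # a missing or unreadable stamp counts as unhealthy


def offline_node(node, reason):
    """Mark the node offline and record the reason in its comment."""
    subprocess.run(["pbsnodes", "-o", "-C", reason, node], check=True)


# Run from cron as: watchdog.py <pbs-node-name>
# Without a node name the script does nothing.
if __name__ == "__main__" and len(sys.argv) == 2:
    if is_stale(STAMP_FILE, MAX_AGE_SECONDS):
        offline_node(sys.argv[1], "health check hook has not reported recently")
```

The node name is passed explicitly because the name PBS knows the node by may differ from the local hostname. Treating a missing timestamp file as stale is a deliberate choice here: a node whose state cannot be confirmed is taken out of scheduling rather than left available.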
Thank you very much for your suggestion, Mr. adarsh. It is very useful. I learned that starting a queue can also force a new cycle, but when I set started = True or enabled = True on my queue, it did not work the way qmgr -c "set server scheduling = true" did.