Revalidating nodes

We have a complex that creates/uses/destroys nodes very frequently.
This runs smoothly aside from an occasional issue where jobs won't start on the nodes.
pbsnodes shows the node as up and free, and the resources associated with the node(s) nicely match the resources requested by the job, but it's always “job cannot run”.
Restarting the PBS server process gets everything back in sync and the jobs will
start running again.
The thought occurred to me that maybe the server is getting its in-memory view of what the nodes look like messed up, and it takes a restart to re-validate everything.
That got me wondering if there is an easier way to get the server to re-validate besides a server restart, as that takes a long(ish) time and qstat/qsub/etc. commands fail while it's being restarted.

thanks
s

Please try to initiate a scheduling cycle manually after creating the nodes, to see if it helps avoid a service restart: qmgr -c "set server scheduling=t"

Same here: it's not the server that is out of sync (by definition, pbsnodes reflects the server's view of the world) but the scheduler; if your scheduling cycles are long, the scheduler will try to place jobs using the now-stale state of the world that it fetched at the start of the cycle.

If you create nodes and want jobs to use them, set server scheduling to false using qmgr, wait for the scheduler to end its cycle (or kill and restart it, i.e. pbs_sched and not pbs_server; it's stateless, so this does no harm), and then set scheduling to true again.
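
Roughly, that sequence could be scripted along these lines (just a sketch using plain qmgr calls from Python; how you actually restart pbs_sched depends on how your site manages the daemons, so that step is left as a placeholder, and the timing is arbitrary):

# Sketch of the drain / recycle / re-enable sequence described above.
# Assumes qmgr is on PATH on the server host.
import subprocess
import time

def set_scheduling(enabled):
    # Toggle the server attribute that qmgr -c "set server scheduling=..." sets
    value = "True" if enabled else "False"
    subprocess.run(["qmgr", "-c", "set server scheduling=" + value], check=True)

set_scheduling(False)    # no new scheduling cycles will start
time.sleep(30)           # crude: give a running cycle time to finish
# ... restart pbs_sched here if needed (site-specific: init script, systemctl, etc.) ...
set_scheduling(True)     # scheduling resumes with a fresh view of the nodes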

This makes a lot of sense, yes.
I will try these suggestions and see what happens the next time it randomly gets confused.

thanks

Finally got around to being able to test this (blame the lack of a real government budget).

For whatever reason, restarting the scheduler does not unwedge things;
it takes a restart of the server.

I had hopes for the scheduler-restart approach; it made sense.

Could it be MoM-related? When the server restarts and re-connects to the MoMs on the nodes,
it might be re-validating something that finally allows life to go on.

thanks
s

You said you create and destroy the nodes frequently? Could something on that side be off (bad IP, communication, a mis-started service), something that didn't quite tick over right until the server was restarted?

I'm thinking the same solution from

might apply to the issue I'm seeing here.
Testing to know for sure.

thanks
s

Just following up to ask if you ever found a root cause for this issue; it seems we're both running into the same or a similar problem.

Still having the issue.
But in re-reading the topic just now, I realized I had not fully understood adarsh's/alexis' responses and probably did the wrong thing to fix it, or at least got the sequence of steps wrong.
Will try again.

Still needing to restart the server.
Here is the basic process, in case something leaps out at someone:

loop forever:

  • new job shows up:
    • launch new machines in the cloud to match the resources selected in job
    • create vnodes for new machines
  • if jobs exist:
    • look for any that are still queued:
      • is the ‘state’ of the node(s) matched to this job marked as ‘free’:
        • has the job been waiting to start for long (> 2 minutes / 6 scheduler cycles):
          • option 1: restart server process
          • option 2 (sketched in code below):
            • set scheduling to False in qmgr
            • stop scheduler (ptl_testlib.Scheduler().stop())
            • wait a couple seconds
            • initialize scheduler (ptl_testlib.Scheduler().initialise())
            • start scheduler (ptl_testlib.Scheduler().start())
            • set scheduling to True in qmgr

Option 1 always gets the scheduler to run the job on the nodes at the start of the next scheduling cycle.
Option 2 doesn't, even after many scheduling cycles.
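
In case something in the code itself is the problem, here is roughly what option 2 looks like on my end (a sketch; it just strings together the ptl calls listed above, the sleep is arbitrary, and scheduling is toggled via plain qmgr):

# Sketch of option 2 above: toggle scheduling off, recycle pbs_sched via ptl, toggle back on.
import subprocess
import time
import ptl.lib.pbs_testlib as ptl_testlib

def qmgr(cmd):
    subprocess.run(["qmgr", "-c", cmd], check=True)

qmgr("set server scheduling=False")
sched = ptl_testlib.Scheduler()
sched.stop()          # stop pbs_sched
time.sleep(2)         # wait a couple of seconds
sched.initialise()    # re-initialise the Scheduler object, as listed in option 2
sched.start()         # start pbs_sched again
qmgr("set server scheduling=True")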

Thanks for the detailed workaround! We went another way since our workload cannot accept a scheduler restart. One of our findings was that this issue is triggered when a cloud machine leaves the pool of compute nodes: the remaining ones get stuck for 33 minutes. But if you add a new machine to the pool, the remaining machines become unstuck.

So our workaround has been to monitor the state of queued jobs and, if any have been queued for more than 10 minutes, spin up a new machine, which makes the remaining nodes available again. While this has minor cost implications, it's better than jobs not getting run on time.
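
The monitoring side is nothing fancy; roughly the sketch below. launch_extra_node() is a hypothetical stand-in for whatever provisions a machine at your cloud provider, and instead of parsing job timestamps we just track how long we have seen each job sit in the queued state via qselect:

# Sketch of the workaround described above. launch_extra_node() is a
# hypothetical stand-in for provisioning a new cloud machine and creating
# its vnode; qselect -s Q lists the IDs of currently queued jobs.
import subprocess
import time

QUEUED_TOO_LONG = 10 * 60   # 10 minutes
first_seen = {}             # job id -> time we first saw it queued

def queued_job_ids():
    out = subprocess.run(["qselect", "-s", "Q"], capture_output=True, text=True, check=True)
    return set(out.stdout.split())

def launch_extra_node():
    pass    # hypothetical: spin up a cloud machine and register it with PBS

while True:
    now = time.time()
    queued = queued_job_ids()
    first_seen = {j: first_seen.get(j, now) for j in queued}    # forget jobs that started
    if any(now - t > QUEUED_TOO_LONG for t in first_seen.values()):
        launch_extra_node()    # adding a machine to the pool unsticks the others
        first_seen.clear()     # reset the timers after the kick
    time.sleep(60)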

We still don’t have a root cause identified for the issue, but our evaluation of PBS Pro showed that this issue is not reproducible in the latest release, so it definitely is a bug in OpenPBS, we just don’t know which bug it is.

Interesting solution; I think I will steal that idea. Our users complain when qstat fails, and my solution leaves it non-operational for a minute or so.
But I'm wondering if it has to be a whole new machine, or if an existing machine could have a new vnode added to it.
(Our PBS server has a few vnodes assigned to it to handle certain daily internal tasks that are done via batch jobs outside user control; maybe adding a useless vnode and removing it once jobs start flowing again might also work. No cost implications.)

thanks for the new solutions.
s

Update on the useless-vnode concept: in my case it's not going to be any better. The vnodes on our PBS server are set up via a config file, which looks to require a PBS service restart during the update process.
If our vnodes were set up dynamically on the PBS server, it would have a chance of working.
I have to look into going that route.
Or maybe just using your original idea of a new small/cheap instance starting up for a few minutes to kick scheduling is OK.

thanks
s

(But maybe the docs are a little sloppy, and when they say to restart PBS it really only means restarting the MoM on the node, as having vnodes on the PBS server really isn't normal…)

Update: for me, starting a new job that launched a new node for a short job let the “hung” jobs start. But then the next job also hung, which required another short job to kick things off, and so on.
A new vnode on an existing node didn't help.
So in the end I will be stuck with restarting the PBS server process occasionally,
and hope that the next version of OpenPBS happens to fix the issue, on purpose or accidentally.

thanks
s

An update on our own testing of the ‘add/remove nodes to get unstuck’ strategy: it also fails to work sometimes, so it's not a foolproof solution. Looks like we'll have to go with a restart-based solution like you.

From my reading of the docs, it seems like using qterm -t quick to shut down (ensuring running jobs don't get killed) and then pbs_server -t warm to restart is the way to go. Is that the process you ended up with for safely restarting the server?
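
i.e. roughly something like this (just a sketch of what I have in mind; assumes both commands are run on the server host with sufficient privileges and are on PATH):

# Sketch of the restart sequence from the docs.
import subprocess

subprocess.run(["qterm", "-t", "quick"], check=True)      # shut down pbs_server without killing running jobs
subprocess.run(["pbs_server", "-t", "warm"], check=True)  # bring the server back up with a warm start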

I'm doing everything inside Python, so I use the ‘ptl’ package to interact with the PBS services:
import ptl.lib.pbs_testlib as ptl_testlib
ptl_testlib.Server(server_name).restart()

This will just restart the pbs_server process, leaving the MoMs/scheduler/jobs/etc. alone to continue on as normal.