Revalidating nodes

We have a complex that creates/uses/destroys nodes very frequently.
This runs smoothly aside from an occasional issue where jobs won't start on the nodes.
pbsnodes shows the node as up and free, and the resources associated with the node(s) nicely match the resources requested in the job, but it's always "job cannot run".
Restarting the PBS server process gets everything back in sync and the jobs will start running again.
The thought occurred to me that maybe the server is getting its in-memory view of the nodes messed up, and it takes a restart to re-validate everything.
That got me wondering if there is an easier way to get the server to re-validate besides a server restart, as that takes a long(ish) time and qstat/qsub/etc. commands fail while it's being restarted.
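For anyone hitting the same thing: the scheduler's reason for skipping a job typically lands in the job's comment attribute, so a quick way to read it (JOBID being a placeholder) is:

    # the scheduler writes its reason for not starting the job into the
    # job's "comment" attribute; tracejob digs the same message out of the logs
    qstat -f JOBID | grep -i comment
    tracejob JOBID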

thanks
s

Please try initiating a scheduling cycle manually after creating the nodes, to see if it helps you avoid the service restart: qmgr -c "set server scheduling=t"
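A note on why that works: re-setting the attribute kicks off a fresh scheduling cycle even if it was already true, so it can be run after each batch of node creations:

    # force a new scheduling cycle; setting "scheduling" to true triggers
    # a cycle even when the attribute was already true
    qmgr -c "set server scheduling = true"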

Same here: it's not the server that is out of sync (by definition, pbsnodes reflects the server's view of the world) but the scheduler; if your scheduling cycles are long, the scheduler will try to place jobs using the now-stale state of the world that it fetched at the start of the cycle.
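If you want to check how long your cycles actually run, the scheduler's daily log marks each cycle's start and end (a sketch assuming a default /var/spool/pbs PBS_HOME; adjust to your install):

    # cycle boundaries in today's scheduler log; the timestamp gap between a
    # "Starting" line and the following "Leaving" line is the cycle length
    grep -E "Starting Scheduling Cycle|Leaving Scheduling Cycle" \
        /var/spool/pbs/sched_logs/$(date +%Y%m%d)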

If you create nodes and want jobs to use them, set server scheduling to false using qmgr, wait for the scheduler to end its cycle (or kill and restart it, i.e. pbs_sched, not pbs_server; it's stateless, so this does no harm), and then set scheduling to true again.
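As plain commands, that sequence is roughly the following (a sketch; the pbs_sched path assumes a default /opt/pbs install):

    # 1. stop the server from starting new scheduling cycles
    qmgr -c "set server scheduling = false"

    # 2. wait for the current cycle to end, or just kill the scheduler;
    #    pbs_sched is stateless, so restarting it does no harm
    pkill pbs_sched
    /opt/pbs/sbin/pbs_sched      # start it again (as root)

    # 3. the next cycle will see the new nodes
    qmgr -c "set server scheduling = true"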

This makes a lot of sense, yes.
I will try these solutions and see what happens the next time it randomly gets confused.

thanks

Finally got around to being able to test this (blame the lack of a real govt budget).

For whatever reason, restarting the scheduler does not unwedge things;
it takes a restart of the server.

I had hopes for the scheduler-restart approach; it made sense.

Could it be Mom-related? A restarted server, when it re-connects to the Moms on the nodes,
might be re-validating something that finally allows life to go on.

thanks
s

You said you create and destroy the nodes frequently? Could something on that side be off (a bad IP, comms, a mis-started service), something that didn't quite tick over right until the server was restarted?
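A couple of quick checks on a freshly created node that might show such a problem (NEWNODE is a placeholder for one of your vnode names):

    # state should be free, and there should be no "comment" attribute;
    # the server usually sets one when it has trouble talking to the Mom
    pbsnodes -av
    qmgr -c "list node NEWNODE"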

I'm thinking the same solution from

might be the issue I'm seeing here.
Testing to know for sure.

thanks
s

Just following up to ask if you ever found a root cause for this issue; it seems we're both running into the same or a similar issue.

Still having the issue,
but in re-reading the topic just now I realized I had not fully understood adarsh's/alexis' responses and probably did the wrong thing as a fix, or at least got the sequence of things wrong.
Will try again.

Still needing to restart the server.
Basic process, in case something leaps out at someone (plain-shell versions of both options are sketched after the list):

loop forever:

  • a new job shows up:
    • launch new machines in the cloud to match the resources requested in the job
    • create vnodes for the new machines
  • if jobs exist:
    • look for any that are still queued:
      • is the 'state' of the nodes matched to this job marked as 'free':
        • has the job been waiting long to start (> 2 minutes / 6 scheduler cycles):
          • option 1: restart the server process
          • option 2:
            • set scheduling to False in qmgr
            • stop the scheduler (ptl_testlib.Scheduler().stop())
            • wait a couple of seconds
            • initialize the scheduler (ptl_testlib.Scheduler().initialise())
            • start the scheduler (ptl_testlib.Scheduler().start())
            • set scheduling to True in qmgr
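For reference, option 1 and a non-PTL equivalent of option 2 as plain commands (a sketch; paths assume a default /opt/pbs install):

    # option 1: restart just the server; qsub/qstat/etc. fail while it's down
    qterm -t quick               # stop pbs_server, leaving running jobs alone
    /opt/pbs/sbin/pbs_server     # start it again

    # option 2: bounce only the scheduler, bracketed by scheduling false/true
    qmgr -c "set server scheduling = false"
    pkill pbs_sched
    sleep 2
    /opt/pbs/sbin/pbs_sched
    qmgr -c "set server scheduling = true"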

Option 1 always gets the scheduler to run the job on the nodes at the start of the next scheduling cycle.
Option 2 doesn't, even after many scheduling cycles.