We have a complex that creates, uses, and destroys nodes very frequently.
This runs smoothly aside from an occasional issue where jobs won't start on the nodes.
pbsnodes shows the node as up and free, and the resources associated with the node(s) nicely match the resources requested in the job, but it's always “job cannot run”.
Restarting the PBS server process gets everything back in sync, and the jobs start running again.
The thought occurred to me that maybe the server is getting its in-memory view of what the nodes look like messed up, and it takes a restart to re-validate everything.
This got me wondering whether there is an easier way to get the server to re-validate besides a server restart, as that takes a long(ish) time and qstat/qsub/etc. commands fail while it's being restarted.
Please try to initiate a scheduling cycle manually after creating a node, to see if it helps to avoid the service restart.
qmgr -c "set server scheduling=t"
Same here: it’s not the server that is not in sync (by definition pbsnodes reflects the server view of the world) but the scheduler; if your scheduling cycles are long the scheduler will try to place jobs using the now stale state of the world that it fetched at the start of the cycle.
If you create nodes and want jobs to use them, set server scheduling to false using qmgr, wait for the scheduler to end its cycle (or kill and restart it, i.e. pbs_sched and not pbs_server; it’s stateless, so this does no harm), and then set scheduling to true again.
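A rough sketch of that sequence as a shell function, in case it helps. The names refresh_scheduler, run, DRY_RUN and WAIT_SECONDS are made up for illustration, and the fixed sleep is only an assumed stand-in for "wait for the scheduler to end its cycle"; pick a value longer than your typical cycle, or restart pbs_sched at that point instead.

```shell
#!/bin/sh
# Sketch only: pause scheduling, give the current cycle time to finish,
# then resume so the next cycle starts from a fresh view of the nodes.
# DRY_RUN and WAIT_SECONDS are hypothetical knobs, not PBS settings.

run() {
    # With DRY_RUN=1, print the command instead of executing it.
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

refresh_scheduler() {
    # Stop new scheduling cycles from being started.
    run qmgr -c "set server scheduling = false"

    # Assumption: a fixed wait stands in for "the cycle has ended".
    run sleep "${WAIT_SECONDS:-60}"

    # Turn scheduling back on; this also kicks off a new cycle.
    run qmgr -c "set server scheduling = true"
}
```

Call refresh_scheduler after the new nodes have been created. Running it once with DRY_RUN=1 shows what would be executed without touching the server.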
This makes a lot of sense, yes.
I will try these solutions and see what happens the next time it randomly gets confused.
Finally got around to being able to test this (blame the lack of a real govt budget).
For whatever reason, restarting the scheduler does not unwedge things.
It takes a restart of the server.
I had hopes for the scheduler restart; it made sense.
Could it be MoM related? When the server restarts and re-connects to the MoMs on the nodes, it might be re-validating something that finally allows life to go on.
You said you create and destroy the nodes frequently? Could something on that side be off (bad IP, comm, mis-started service), something that didn’t quite tick off right until the server was rebooted?