The way we handle this is as follows (approximately):
We have a custom node resource called “reboot” (for historical reasons).
In the server, we have
default_chunk.reboot = free
Thus, nodes won’t be assigned to jobs unless reboot has the value “free”.
When a node boots, part of the PBS init script sets the node’s reboot value to “free”. So nodes start out assignable, as far as the reboot resource is concerned.
Next, in the epilogue, we check the locally created /PBS/flags/ directory for files called “reboot”, “reboot-once”, or “eoj-once”. If any of these files is present, the epilogue changes the reboot resource to “reboot”, so the node is no longer eligible for new work. The epilogue then spawns a separate background process that periodically checks whether the node has gone idle. Meanwhile, the epilogue continues with the usual job cleanup.
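The epilogue check described above could be sketched roughly as follows. This is only an illustration, not our actual script: the function names are made up, and it assumes the custom resource is set on the node with qmgr as resources_available.reboot. The command runner is passed in so the logic can be tried without a PBS server.

```python
import os
import subprocess

FLAG_DIR = "/PBS/flags"                          # directory named in the post
FLAG_NAMES = ("reboot", "reboot-once", "eoj-once")


def pending_flags(flag_dir=FLAG_DIR):
    """Return the names of the flag files present in flag_dir."""
    return [name for name in FLAG_NAMES
            if os.path.isfile(os.path.join(flag_dir, name))]


def set_reboot_cmd(node, value):
    """Build the qmgr argv that sets the node's custom 'reboot' resource.

    Assumption: the resource lives in resources_available on the node.
    """
    return ["qmgr", "-c",
            f"set node {node} resources_available.reboot = {value}"]


def epilogue_check(node, flag_dir=FLAG_DIR, run=subprocess.run):
    """If any flag file exists, mark the node ineligible for new work.

    Returns the list of flags found. A real epilogue would also start
    the background idle watcher at this point (not shown).
    """
    flags = pending_flags(flag_dir)
    if flags:
        run(set_reboot_cmd(node, "reboot"), check=True)
    return flags
```

With no flag files present, epilogue_check does nothing and returns an empty list, so normal jobs pay no cost beyond three file-existence checks.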
When the node goes idle, the background process performs the actions implied by the flag files: for “reboot”, reboot after each job; for “reboot-once”, remove the flag file and reboot; for “eoj-once”, read the contents of the eoj-once file as the path to a command, remove the flag file, and then execute that command.
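A rough sketch of that background process, under some assumptions: the function names are hypothetical, the idle test is left to the caller (for example, parsing pbsnodes output for running jobs), /sbin/reboot is an assumed reboot command, and running the eoj-once command before any reboot when several flags are present is my guess at a sensible order, not something stated above.

```python
import os
import time


def wait_until_idle(is_idle, interval=60, sleep=time.sleep):
    """Poll is_idle() until it returns True, sleeping between checks."""
    while not is_idle():
        sleep(interval)


def actions_when_idle(flag_dir):
    """Consume one-shot flags and return the commands to run (argv lists)."""
    commands = []

    # eoj-once: file contents are the path of a command to execute.
    eoj = os.path.join(flag_dir, "eoj-once")
    if os.path.isfile(eoj):
        with open(eoj) as fh:
            command_path = fh.read().strip()
        os.remove(eoj)                      # remove the flag before running
        if command_path:
            commands.append([command_path])

    # reboot-once: one-shot, so the flag is removed before rebooting.
    once = os.path.join(flag_dir, "reboot-once")
    reboot_once = os.path.isfile(once)
    if reboot_once:
        os.remove(once)

    # reboot: persistent flag, left in place so every job ends in a reboot.
    if reboot_once or os.path.isfile(os.path.join(flag_dir, "reboot")):
        commands.append(["/sbin/reboot"])   # assumed reboot command

    return commands
```

The watcher would call wait_until_idle first and then run each command returned by actions_when_idle in order.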
For your case, the eoj-once command could restart the MoM. We have used it to restart other daemons, to run diagnostics weekly, etc. The last step in the command should be to set the reboot resource back to “free” to indicate the node is ready for work.
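As an illustration only, an eoj-once command for the MoM-restart case might look something like this. The restart command is an assumption (sites differ: it may be "systemctl restart pbs", an init script, or something else), the function names are hypothetical, and the qmgr line again assumes the resource is set as resources_available.reboot.

```python
import socket
import subprocess


def eoj_steps(node, restart_cmd=("systemctl", "restart", "pbs")):
    """Return the commands to run, in order, as argv lists.

    restart_cmd is an assumed way of restarting the MoM; substitute
    whatever your site uses.
    """
    return [
        list(restart_cmd),
        # Last step: make the node eligible for work again.
        ["qmgr", "-c",
         f"set node {node} resources_available.reboot = free"],
    ]


def run_eoj(node=None, run=subprocess.run):
    """Run each step, stopping at the first failure (check=True raises)."""
    if node is None:
        node = socket.gethostname().split(".")[0]   # short hostname
    for argv in eoj_steps(node):
        run(argv, check=True)
```

Because a failed restart raises before the qmgr step runs, the node stays ineligible for jobs instead of being handed work with a broken MoM.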
We use the reboot-once flag to perform rolling updates into new images.
You could use the offline state, rather than a custom resource, but we believe the custom resource is cleaner.