PP-515: quit mom safely when multinode jobs are present

Hello, I want to open a discussion on how to safely restart pbs_mom when multinode jobs are present, for example when you need to restart pbs_mom after a new version of mom has been installed. I prepared a patch in which pbs_mom catches the SIGQUIT signal and quits only if no multinode job is present. If a multinode job is present, pbs_mom waits and quits as soon as no such job remains. I’m of course aware of the problem that the scheduler can keep starting new multinode jobs and the mom may never quit.

The first idea was that pbs_mom catches SIGQUIT and quits only if no multinode job is present; otherwise nothing happens. Maybe this simpler idea is more suitable.

Documentation here. Please review.

Hi @vchlum

As I understand it, the intent here is to finish the multi-node jobs before making the mom quit.
I think it would help if the mom restrained itself from accepting more jobs from the server after it gets this signal, e.g. by letting the server know that it is marking itself offline. This would also mean that no further jobs are scheduled on this mom.

What do you think?

Greetings @vchlum,

First of all, thank you for contributing to the forum. We welcome your input and suggestions.

It seems to me that the change you are proposing might be part of a larger task you are trying to accomplish. Is the intent to perform rolling upgrades throughout the cluster while preserving multinode jobs? If that is the case, then I think we should discuss the larger picture as opposed to one component of the solution.

One concern I have with your current proposal is the use of SIGQUIT. For one, it requires the admin to run a command on the mom node (kill -QUIT) as opposed to delivering the request from qmgr via the server. Second is the use of SIGQUIT itself, which is defined as:

SIGQUIT: The SIGQUIT signal is sent to a process by its controlling terminal when the user requests that the process quit and perform a core dump.

Please let us know your thoughts on this.

Thanks,

Mike

Hi @arungrover and @mkaro,

Indeed, I want to be able to perform a rolling update throughout the cluster without losing any jobs. I doubt it is easy to preserve multinode jobs across a pbs_mom restart, so I decided to wait until no multinode job is present. We use Puppet, so my use case for this feature would look like this: Step 1) Puppet installs the new version of mom. Step 2) Puppet calls a small script; this script sends the signal, waits for mom to quit, and starts mom again.
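
Roughly, the script in Step 2 could look like this (the lock-file path, install path, and signal name are assumptions to adjust for the final design):

```
#!/bin/sh
# Rough sketch of the small script from Step 2. The lock-file and install
# paths and the signal name are assumptions, not the final design.
PBS_HOME=${PBS_HOME:-/var/spool/pbs}
MOM_PID=$(cat "$PBS_HOME/mom_priv/mom.lock")

# Ask mom to quit once no multinode job is present
# (whatever signal the patch ends up using).
kill -QUIT "$MOM_PID"

# Wait for the old mom process to exit.
while kill -0 "$MOM_PID" 2>/dev/null; do
    sleep 10
done

# Start the newly installed mom; -p lets it pick up the jobs that kept
# running across the restart.
/opt/pbs/sbin/pbs_mom -p
```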

It is a good idea to restrain mom from accepting new jobs, but I’m not sure whether setting the node offline is a good idea. Will the remaining jobs stage out their files correctly?

I understand the concern about SIGQUIT now; I didn’t realize its dedicated purpose. Since the other termination signals are already taken, I think I can use SIGUSR1.

Hello @vchlum,

Setting a node offline will prevent any new jobs from being assigned to it while allowing running jobs to complete. Any files requiring stage-out will be handled properly, even when the node is offline. In terms of draining a node in preparation for upgrade, this would be the strategy I would recommend. Once a node is free, you may use puppet to upgrade it.

Regarding the selection of a signal, SIGUSR1 is a far better choice. Given the suggestion in the previous paragraph, do you still see a need for sending a signal to pbs_mom? If you send a SIGTERM, pbs_mom will perform a controlled shutdown. If you restart pbs_mom with the “-p” parameter, it will begin polling the PIDs of the jobs that had been running prior to shutdown. If you just want to wait for the multinode jobs to complete, this may expedite the upgrade process.
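
As a rough illustration of the drain sequence from the first paragraph (the host name and paths are just placeholders):

```
# Rough illustration of the drain-and-upgrade sequence; host name and
# paths are placeholders.
NODE=node042

# 1. Offline the node so the scheduler assigns it no new jobs.
pbsnodes -o "$NODE"

# 2. Wait until no jobs remain on the node.
while pbsnodes "$NODE" | grep -q '^ *jobs = '; do
    sleep 60
done

# 3. Controlled shutdown of mom, then upgrade and restart.
kill -TERM "$(cat /var/spool/pbs/mom_priv/mom.lock)"
# ... install the new pbs_mom here ...
/opt/pbs/sbin/pbs_mom -p

# 4. Put the node back in service.
pbsnodes -r "$NODE"
```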

Thanks,

Mike

Hello @mkaro,

If I set the node offline (using qmgr), I will not be able to distinguish between singlenode and multinode jobs on the node. Multinode jobs can’t be recovered after SIGTERM, so I don’t know when it is safe to use SIGTERM and pbs_mom -p. It is a waste of resources to wait until all jobs have finished; I can’t afford that. I only need to wait for the multinode jobs. I need a solution for upgrading the whole infrastructure, which is hundreds of nodes, so I still think the best solution is to send the signal, which lets me upgrade most of the infrastructure immediately.

I also tested setting the node to OFFLINE from mom, but it keeps getting set back to free. Setting the node to BUSY works well and is maybe more suitable, isn’t it?

What do you think about a simpler and more general feature like this: SIGUSR1 quits mom only if no multinode job is present; otherwise it does nothing. This can be used to detect multinode jobs, which is basically what I need.

Vasek

Yes, the -p parameter does not take care of multi-node jobs on mom restart, and the only code mom has now is to cull all multi-node jobs on startup. Preserving multi-node jobs across a mom restart would be a much bigger change; the mom would somehow need to re-establish the join-job with its sisters, etc.

Setting the node offline is by far the best means of making sure no new jobs land on that mom. One alternate solution (which you might be able to do via Chef or Puppet) would be to run a small script that marks the node offline and then queries PBS (pbsnodes) every few seconds to see whether the jobs that were running on it have ended. This way, the solution could be implemented outside of PBS. Since new jobs will not land on the node once it is set offline, we may not need to distinguish between single- and multi-node jobs. If we still need to differentiate, once you find a job running you can combine that data with qstat output to know whether it’s a single-node or multi-node job.
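
A minimal sketch of such an external script might look like this (the node name, paths, and the qstat parsing are assumptions to adapt for your site):

```
# Offline the node, then poll until no multi-node job is left on it.
NODE=node042
pbsnodes -o "$NODE"        # stop new jobs from landing on the node

has_multinode_job() {
    # Job IDs currently on the node, taken from the "jobs = ..." line.
    jobids=$(pbsnodes "$NODE" | sed -n 's/^ *jobs = //p' |
             grep -o '[0-9][0-9]*\.[^/, ]*' | sort -u)
    for jobid in $jobids; do
        # A job is multi-node if its exec_host lists more than one host.
        # (qstat -f wraps long values; a real script should join the
        # continuation lines before parsing.)
        nhosts=$(qstat -f "$jobid" | awk -F' = ' '/exec_host/ {print $2}' |
                 tr '+' '\n' | cut -d/ -f1 | sort -u | wc -l)
        [ "${nhosts:-0}" -gt 1 ] && return 0
    done
    return 1
}

while has_multinode_job; do
    sleep 60
done
echo "No multi-node jobs remain on $NODE; pbs_mom can be restarted safely."
```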

BTW, if offline does not work, it’s a PBS bug, so please file a new bug for it. Offline was specifically intended for such purposes.

Well, OK. I will try to find a solution outside PBS. I’m quite surprised there is no best practice for upgrading mom. Or is there?

How do you actually solve this upgrade problem? Do you always set the cluster offline and wait for all jobs to drain?

Your use case is very important to the community in general, and we need to work on adding that upgrade capability. However, instead of waiting to shut down mom, I feel the ideal solution would be to actually support restarting mom in the presence of multinode jobs. If we want to wait for mom to finish its multi-node jobs, there is currently a way to do it by offlining the node - of course it’s a bit clunky in that you have to check using an external script.

Currently, I believe that cluster admins use this offline method - they wait for jobs to drain off before shutting down.

One way to completely automate this could be to run an upgrade script as a job that takes the node exclusively. That way, such a job can run only when the node is completely free and no other job spans that node.
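
For example, something along these lines (the host name, resources, and script path are placeholders):

```
# Submit the upgrade itself as a job that requires the whole host
# exclusively, so it only starts once the node is otherwise free.
qsub -N mom-upgrade \
     -l select=1:ncpus=1:host=node042 \
     -l place=exclhost \
     -- /usr/local/sbin/upgrade_mom.sh
```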

The way we handle this is as follows (approximately):

We have a custom node resource called “reboot” (for historical reasons).

In the server, we have
default_chunk.reboot = free

Thus, nodes won’t be assigned to jobs unless reboot has the value “free”.

When a node boots, part of the pbs init script sets the node’s reboot value to “free”. So, nodes start out assignable, as far as the reboot resource is concerned.
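
Roughly, the server-side setup looks like this (a simplified sketch; the exact resource flags and names may differ from what we actually run):

```
# Simplified sketch of the custom "reboot" host-level string resource.
qmgr -c "create resource reboot type=string, flag=h"
qmgr -c "set server default_chunk.reboot = free"
# Depending on the PBS version, the resource may also need to be listed
# on the "resources:" line in sched_config before the scheduler matches it.

# At boot, the init script marks the node assignable again:
qmgr -c "set node $(hostname -s) resources_available.reboot = free"
```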

Next, in the epilogue, we check for files in the locally created /PBS/flags/ directory called “reboot”, “reboot-once”, or “eoj-once”. If any of these files is present, the epilogue changes the reboot resource to “reboot”. Thus, this node is no longer eligible for new work. Next, the epilogue spawns off a separate process that checks every so often for the node to go idle. Meanwhile, the epilogue continues and job cleanup happens, etc.

When the node goes idle, the background process performs the actions implied by the flag files: reboot - reboot after each job; reboot-once - remove the flag file and reboot; eoj-once - read the contents of the eoj-once file as the path to a command to execute, after removing the flag file.
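
In condensed form, the epilogue logic is roughly the following (this is a sketch, not our actual scripts; the flag paths, idle test, and permissions are simplified):

```
# Condensed sketch of the epilogue logic; not the site's actual scripts.
# (Assumes root on the execution host may modify node attributes via qmgr.)
FLAGDIR=/PBS/flags
NODE=$(hostname -s)

if [ -e "$FLAGDIR/reboot" ] || [ -e "$FLAGDIR/reboot-once" ] || [ -e "$FLAGDIR/eoj-once" ]; then
    # Take the node out of circulation for new work.
    qmgr -c "set node $NODE resources_available.reboot = reboot"

    # Background watcher: wait until no jobs remain on the node,
    # then act on whichever flag file is present.
    (
        while pbsnodes "$NODE" | grep -q '^ *jobs = '; do
            sleep 30
        done
        if [ -e "$FLAGDIR/reboot-once" ]; then
            rm -f "$FLAGDIR/reboot-once"
            reboot
        elif [ -e "$FLAGDIR/eoj-once" ]; then
            cmd=$(cat "$FLAGDIR/eoj-once")
            rm -f "$FLAGDIR/eoj-once"
            "$cmd"   # e.g. restart mom, then set reboot back to "free"
        elif [ -e "$FLAGDIR/reboot" ]; then
            reboot   # reboot after each job; this flag file stays in place
        fi
    ) &
fi
```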

For your case, the eoj-once command could restart the MoM. We have used it to restart other daemons, to run diagnostics weekly, etc. The last step in the command should be to set the reboot resource back to “free” to indicate the node is ready for work.

We use the reboot-once flag to perform rolling updates into new images.

You could use the offline state, rather than a custom resource, but we believe the custom resource is cleaner.

Thank you for sharing! Looks interesting.