PP-389: Allow the admin to suspend jobs for node maintenance

Summarizing a Skype interaction between @jon and me:

The per-job interface fits because…

Although the main use case is node based (the admin wants to perform maintenance on a subset of nodes, e.g., a rack, or on the entire system, that is, the set of all nodes), there is also a requirement to minimize the impact of unresolved issues. In particular, if multiple jobs were suspended for maintenance, the admin would like to resume one (or just a few) of them to ensure everything is working correctly before resuming all jobs. This progressive testing requires a per-job interface (at least at the time of resumption).

On another topic, now that the design has been changed to include the requirement that the admin must explicitly disable scheduling new jobs (either via offlining nodes, explicitly stopping the scheduler(s), or scheduling dedicated time), there may not be a need to invent new job and node states. It may be reasonable to simply allow an admin to resume a suspended job (either without invoking the scheduler or at least onto an offlined/dedicated-timed node). In other words, the simplified path for maintenance would be something like:
a. Offline the set of nodes on which to perform maintenance.
b. Suspend any jobs still running on those nodes.
   (Now no jobs are running and no jobs will be started/resumed on those nodes.)
c. Perform maintenance.
d. Resume one (or a few) jobs and ensure the system is healthy (requires a new ability for the admin to resume a job on an offlined node).
e. If the system is healthy, resume the rest of the jobs and then mark the nodes online.
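The steps above can be sketched as an ordered command plan. This is only an illustration, not part of the design: the function name `maintenance_plan` and its parameters are hypothetical, the signal names `admin-suspend`/`admin-resume` are the ones mentioned later in this thread, and `pbsnodes -o`/`pbsnodes -r` are assumed to be how the nodes are marked offline and brought back. Nothing is executed; the function only returns the command lines in order.

```python
from typing import List


def maintenance_plan(nodes: List[str], running_jobs: List[str],
                     canary_count: int = 1) -> List[str]:
    """Return the ordered command lines for steps a-e (hypothetical helper).

    Nothing is run; the caller decides what to do with the plan.
    """
    canaries = running_jobs[:canary_count]   # jobs resumed first as a health check
    rest = running_jobs[canary_count:]       # jobs resumed once the system looks healthy

    plan: List[str] = []
    # a. offline the nodes so no new jobs are started on them
    plan += [f"pbsnodes -o {n}" for n in nodes]
    # b. suspend the jobs still running on those nodes
    plan += [f"qsig -s admin-suspend {j}" for j in running_jobs]
    # c. maintenance happens outside of PBS
    plan.append("# perform maintenance")
    # d. resume one (or a few) jobs and check the system
    plan += [f"qsig -s admin-resume {j}" for j in canaries]
    plan.append("# verify the system is healthy before continuing")
    # e. resume the remaining jobs, then bring the nodes back online
    plan += [f"qsig -s admin-resume {j}" for j in rest]
    plan += [f"pbsnodes -r {n}" for n in nodes]
    return plan


if __name__ == "__main__":
    # Node and job names here are made up for the example.
    for line in maintenance_plan(["node01", "node02"], ["101.svr", "102.svr", "103.svr"]):
        print(line)
```

Keeping the "canary" resume as a separate step is what makes the per-job interface necessary: a purely node-based interface could not resume one job while leaving the others suspended.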

There are two use cases that I failed to mention; I have added them to the ticket. There is no requirement that scheduling be disabled or that the nodes be offlined in order to suspend the jobs. Doing so will be our recommendation, to avoid potential race conditions, but it is not a requirement.

I have an altogether different question. The current design has a new node attribute called “maintenance_jobs”. Why is this a node attribute and not a job state or job attribute? Wouldn’t that make it easier for the admin to list all the jobs that are suspended due to maintenance and resume them?

The main reason for the maintenance_jobs node attribute is an internal one. Once a job is suspended, it is removed from the node. The node needs to know the maintenance jobs on that node so it knows when to take the node out of maintenance. I could have kept the attribute hidden, but I thought it could be of use to the admin.
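To illustrate the bookkeeping, here is a toy model (not PBS code). The class `Vnode` and its methods are hypothetical; only the `maintenance_jobs` attribute name comes from the design. It assumes a resumed job goes back onto the node's running list and that the node leaves maintenance once `maintenance_jobs` is empty.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Vnode:
    """Toy stand-in for a node; not the real PBS data structure."""
    name: str
    running_jobs: List[str] = field(default_factory=list)
    maintenance_jobs: List[str] = field(default_factory=list)
    in_maintenance: bool = False

    def admin_suspend(self, job_id: str) -> None:
        # The suspended job is removed from the node's job list, so the node
        # remembers it separately in maintenance_jobs.
        self.running_jobs.remove(job_id)
        self.maintenance_jobs.append(job_id)
        self.in_maintenance = True

    def admin_resume(self, job_id: str) -> None:
        self.maintenance_jobs.remove(job_id)
        self.running_jobs.append(job_id)
        # The node can only leave maintenance when no admin-suspended
        # jobs remain on it.
        if not self.maintenance_jobs:
            self.in_maintenance = False


if __name__ == "__main__":
    node = Vnode("node01", running_jobs=["101.svr", "102.svr"])
    node.admin_suspend("101.svr")
    node.admin_suspend("102.svr")
    node.admin_resume("101.svr")
    print(node.in_maintenance, node.maintenance_jobs)
```

Without the list, the node would have no record of the suspended jobs and could not tell when the last one had been resumed.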

As for a job state, I thought about it, but I didn’t think a new job state was warranted. The admin can still do what you are suggesting. All of the maintenance jobs are suspended. If the admin does a qsig -s admin-resume with qselect -s S, PBS will try to resume all of the suspended jobs. Any job that was not admin-suspended will get rejected.
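The behavior described here can be modeled in a few lines. This is a simplified sketch, not the server's implementation: the function `admin_resume_all` and the `admin_suspended` flag are hypothetical names used only to show which jobs are resumed and which are rejected.

```python
from typing import Dict, List, Tuple


def admin_resume_all(jobs: Dict[str, dict]) -> Tuple[List[str], List[str]]:
    """Model `qsig -s admin-resume` applied to the output of `qselect -s S`.

    `jobs` maps a job id to {"state": ..., "admin_suspended": ...}.
    Returns (resumed, rejected) job ids.
    """
    # Equivalent of qselect -s S: every job in the suspended state,
    # no matter how it was suspended.
    selected = [jid for jid, job in jobs.items() if job["state"] == "S"]

    resumed: List[str] = []
    rejected: List[str] = []
    for jid in selected:
        job = jobs[jid]
        if job["admin_suspended"]:
            job["state"] = "R"
            job["admin_suspended"] = False
            resumed.append(jid)
        else:
            # Suspended by some other means: admin-resume is rejected
            # and the job stays suspended.
            rejected.append(jid)
    return resumed, rejected


if __name__ == "__main__":
    # Job ids are made up for the example.
    jobs = {
        "101.svr": {"state": "S", "admin_suspended": True},
        "102.svr": {"state": "S", "admin_suspended": False},
        "103.svr": {"state": "R", "admin_suspended": False},
    }
    print(admin_resume_all(jobs))
```

The rejected jobs are left untouched, which is why selecting on the existing S state is enough and no separate job state is needed for this use.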

I would like to request that the EDD be updated with the following caveat as well:
Since a subjob goes to the Q state on server restart from a running or suspended state, the node state will also be set back to free.

Hey Anamika,
It appears the EDD already covers this. The last bullet of the Misc section says that the node will go into state free if only admin-suspended subjobs are on it. If you want me to update this bullet, please let me know. I’m happy to.

Bhroam

Ah yes. I missed it.

Hi,

There was a design change (v.25) today that says “If a job is running on some but not all of the vnodes of a multi-vnoded host, only the vnodes the job is running on will be put into maintenance.”

I do not understand how this change supports the goal of this feature.

In any case, since the design has been stable for a long time (with no changes and no comments from the community), it is appropriate to provide an explanation and an opportunity for the community to comment on the change. Please add some explanation (here in the forum).

Thanks!

I’m sorry, this wasn’t made as a change to the design. It just clarifies what happens when a job is sent an admin-suspend signal. Since it was just a clarification, I didn’t bother posting about it. I will do so in the future.

This really just clarifies what interface 2 says: when a job is sent an admin-suspend signal, all of its nodes are put in the maintenance state. Maybe I should have used the term vnode instead of node, but to PBS a node is a vnode.

Bhroam

Thanks for the explanation. Unfortunately, I always interpreted “node” in this enhancement to mean the usual HPC definition (not the PBS Pro meaning of “vnode”). Since it is possible that others also interpreted “node” as “node” and not “vnode”, I suggest you open this up for comments, and especially ask any actual target users.

If consensus says it should be “node” (not “vnode”), then I would suggest at least filing a bug and providing a documentation workaround to describe how one could achieve the “node” behavior.

Thanks again!

OK, I updated the document again. I changed all references to nodes to say vnodes. I also added the permissions needed to use the admin-suspend/admin-resume signals (op/mgr, just like suspend/resume). I added some advice on how to use the feature on a multi-vnoded host. Lastly, I clarified that, like all other pseudo-signals, these do not have signal numbers associated with them.

Once again, nothing is changing. The development work for this feature was completed 9 months ago. I know of at least one site that is currently using the feature.

Bhroam

My site is using the feature, and I don’t see any problems with the updated document (v. 28).

Great to hear @gmatthew, thanks! Does your site have multi-vnoded hosts (e.g., either hosts that use multiple vnodes to handle GPUs or systems like an SGI UV)? If not, then vnode = node, so there is no issue.

@bhroam: For a system like an SGI UV where some vnodes are free (not running any jobs), how can an admin put all the vnodes (the whole system) into maintenance mode? In particular, how does one ensure that no new jobs get started on those nodes (including while the system is being slowly brought back out of maintenance mode), and that other admins/operators can see that all vnodes are in maintenance mode?

Thx again!

@billnitzberg: I believe the information is in the document. First you admin-suspend all jobs on the vnode. You then offline any vnodes in the free state. I don’t see any issue when the machine is slowly coming back out of maintenance. Vnodes will slowly move back into the job-busy or free state.
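As a rough sketch of that procedure for a multi-vnoded host: given which jobs are on which vnodes, work out the jobs to admin-suspend and the free vnodes to offline. The function `host_maintenance_actions` and its input format are hypothetical; it only computes the two lists and does not call PBS.

```python
from typing import Dict, List, Tuple


def host_maintenance_actions(vnode_jobs: Dict[str, List[str]]) -> Tuple[List[str], List[str]]:
    """Return (jobs_to_admin_suspend, vnodes_to_offline) for one host.

    `vnode_jobs` maps each vnode name on the host to the jobs running on it;
    a vnode with an empty list is free.
    """
    # Every job on any vnode of the host gets admin-suspended; those vnodes
    # then go into the maintenance state.
    jobs_to_suspend = sorted({job for jobs in vnode_jobs.values() for job in jobs})
    # Free vnodes run nothing, so they never enter maintenance on their own
    # and have to be offlined to keep new jobs off them.
    vnodes_to_offline = sorted(v for v, jobs in vnode_jobs.items() if not jobs)
    return jobs_to_suspend, vnodes_to_offline


if __name__ == "__main__":
    # Vnode and job names are made up for the example.
    host = {
        "uv[0]": ["101.svr"],
        "uv[1]": ["101.svr", "102.svr"],
        "uv[2]": [],
    }
    print(host_maintenance_actions(host))
```

After both steps, every vnode on the host is either in maintenance or offline, so nothing new should be started there.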

I don’t understand your last question. Maintenance is a node state. It shows up in pbsnodes.

Bhroam

Yes, we have SGI UVs, but we’re not currently set up to make use of this feature on those machines. Even so, the design as presented in v.28 seems acceptable to me for eventual use on our UVs.

Thanks @bhroam & @gmatthew for the useful clarifications!