PP-389: Allow the admin to suspend jobs for node maintenance

Hey All,
I am working on PP-389. This feature will allow admins to suspend jobs on nodes so that maintenance can be performed. Unlike a normal suspension, the jobs' nodes are not freed up. Instead, the nodes are put into a new 'maintenance' state until the admin is finished. The admin can then resume the jobs.
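
To make the intended workflow concrete, here is a rough sketch of how I picture an admin using it. The admin-suspend and admin-resume names are the new pseudo signals proposed in the EDD, and the job and node IDs are just made-up examples:

  qsig -s admin-suspend 1234.server   # suspend the job; every node it is running on enters the maintenance state
  pbsnodes n1                         # the node state should now show the new maintenance state
  # ... perform the maintenance ...
  qsig -s admin-resume 1234.server    # resume the job directly; a node becomes schedulable again once its last admin-suspended job is resumed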

I thought I’d see if I could spark a conversation about it here.

Please see the following External Design Confluence Page for more details.

The Interface 1 details say "when the last job is resumed with the new resume pseudo signal." Shouldn't this read "when the last admin-suspended job on the node is resumed with the new resume pseudo signal"?

In the Interactions section you say, "If there are multiple jobs on a node, it is not recommended to mix and match suspend signals. If this happens it is possible for a node to be put back into a schedulable state prior to all of the jobs being resumed. The scheduler could then run jobs on the resources owned by the jobs that are still suspended." Is this only possible if all of the admin-suspended jobs have been resumed? I think it should read something like: "If there are multiple jobs on a node, it is not recommended to mix and match suspend signals. If this happens it is possible for a node to be put back into a schedulable state prior to all of the non-admin-suspended jobs being resumed. The scheduler could then run jobs on the resources owned by the non-admin-suspended jobs that are still suspended."

Good clarifications, Jon. I’ve made the updates.

Maybe I am just confused, but these two sentences seem at odds with one another:

The job’s substate is changed to let the scheduler know to resume the job.
The admin-resume pseudo signal will directly resume the job (no waiting for the scheduler).

If the job is resumed without waiting for the scheduler, what is that first sentence actually saying the scheduler does or will eventually do once it sees the new substate on the job?

Also, minor, but this: “…state when the admin-suspended last job is resumed…” should be reworded to: “…state when the last admin-suspended job is resumed…”

Hey Scott,

Thank you for your comments.

What I am trying to differentiate here is a normal resume versus an admin-resume. During a normal resume, the job's substate goes from 43 (suspended by user) to 45 (suspended by scheduler). The scheduler ignores jobs in substate 43, but it will look at jobs in substate 45 and resume them. During an admin-resume, the job is resumed directly, without having to wait for the scheduler.
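
Roughly, the two flows look like this (the substate numbers are the ones the server uses today; the admin-* names are the new pseudo signals from the EDD, and the jobid is just a placeholder):

  qsig -s suspend <jobid>         # substate -> 43; the scheduler ignores the job
  qsig -s resume <jobid>          # substate -> 45; the scheduler resumes the job in a later cycle

  qsig -s admin-suspend <jobid>   # job is suspended and its nodes go into maintenance
  qsig -s admin-resume <jobid>    # the server resumes the job directly; no scheduling cycle needed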

I've updated the wording along the lines of what you suggested. I didn't want to use the word admin-suspended yet because I haven't introduced the term; it's introduced in the next interface.

Thanks,
Bhroam

The EDD is looking good. Here are a couple of suggestions:

If a job that is suspended via admin-suspend is resumed with the normal resume pseudo signal, the job will not resume.

I would assume there'll be a message coming out of qsig when this happens. Please add the message to the EDD. The same goes for the case where a normally suspended job is admin-resumed: there'll be an error message, and it should be documented as well.

Before admin-suspending jobs, it is recommended to disable scheduling and wait for the current scheduling cycle to finish.

I recommend that the code check, when this is issued, whether scheduling is True, and if so emit a warning message telling the admin to disable scheduling.
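
For reference, the workflow I read the EDD as recommending would look roughly like this (a sketch; the qmgr syntax is standard, the admin-suspend signal name comes from the EDD, and the jobid is a placeholder):

  qmgr -c "set server scheduling = False"   # stop new scheduling cycles
  # wait for the current scheduling cycle to finish, then for each job on the nodes to be maintained:
  qsig -s admin-suspend <jobid>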

Hey Al,
Thanks for reviewing my EDD.

You make a good point here. I wasn’t planning on printing an error message since the job is suspended as normal. I can’t come up with any use cases where an admin would want to suspend a job with admin-suspend and resume it with resume. The reverse does have a weak use case. Currently there is no way to resume a job without having the scheduler do it for you. Using admin-resume would directly resume the job suspended with the suspend signal. It’d be kind of like a qsig -Wforce. I don’t like it though. If I wanted to have a qsig -Wforce, I’d implement qsig -Wforce. I’ll reject the signals and emit an error message.

This is a recommendation and not a hard and fast rule. I’d make the same recommendation when an admin offlines a node. I’d rather not emit a warning here. One isn’t emitted for offlining a node.

Bhroam

Hi,

I really like this new capability – resilience and tools for troubleshooting and maintenance are really important in HPC – and it’s great to see the design being discussed and iterated on. Now a comment…

From a user experience perspective, the interface appears backward. If the goal is to perform maintenance on a set of nodes, it would be more natural to specify the set of nodes on which to perform maintenance (e.g., like setting nodes offline via pbsnodes). The initial design requires the admin to specify jobs (after figuring out which jobs are running on the nodes).

Specifying the set of nodes (on which to perform maintenance) would also eliminate some potential race conditions between when the admin determines the node-job associations (e.g., via qstat) and when jobs end, the scheduler starts new jobs, or jobs release nodes (via the new feature that allows jobs to free unused nodes while running).

Also, if the interface is job-based, how does it behave with job arrays and subjobs?

As you have pointed out, there are two different approaches to this problem: one comes at it from the node perspective and one from the job perspective. The question is how an admin will approach the problem.

  1. Will they first set a subset of node(s) into a maintenance state and then have PBS suspend all of the jobs still running on these nodes?

  2. Will they schedule a dedicated time using the dedicated time feature and then suspend the jobs still running? Then, when the maintenance is done, resume the jobs?

  3. Will they disable scheduling and then suspend all currently running jobs?

  4. Will they offline a set of nodes and then suspend all currently running jobs?

For option 1, the issue I see is with jobs sharing nodes. Once a node is put into maintenance mode, we would have to track down the other nodes belonging to the jobs on that node and put those nodes into maintenance mode too, so the scheduler cannot give away those jobs' resources. If those nodes have multiple jobs on them, then we would need to place still more nodes into maintenance mode, and so on. Now how does the admin know which nodes they placed in maintenance mode and which ones PBS placed in maintenance mode? At this point it can start to get messy.

For option 2, the scheduler is already not starting new jobs that would cross the boundary, so the race condition no longer exists. Once the dedicated time starts, the admin will know which jobs need to be suspended and can act accordingly. Once the dedicated time has expired, the scheduler can resume scheduling on the nodes that were not put into maintenance mode, and the admin can resume the jobs on his/her own time scale.

Options 3 and 4 are essentially option 2, except that the admin is creating the maintenance window manually by either disabling scheduling or offlining the node(s).

The way I see it, options 2, 3, and 4 are much more common scenarios, so coming at it from the job perspective seems to make more sense to me.
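
Option 2, for example, would look roughly like this (a sketch; it assumes the standard dedicated_time file in the scheduler's sched_priv directory, and the admin-suspend/admin-resume names are the new pseudo signals from the EDD):

  1. Add the maintenance window to PBS_HOME/sched_priv/dedicated_time (and HUP the scheduler), e.g.:
       04/15/2018 08:00 04/15/2018 12:00
  2. When the window starts, admin-suspend each job still running on the affected nodes:
       qsig -s admin-suspend <jobid>
  3. Perform the maintenance, then resume the jobs on the admin's own schedule:
       qsig -s admin-resume <jobid>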

I'm not sure how this would change anything, since a subjob behaves like a regular job once it is running, as far as I know (except across a server restart, that is).

I am sure I am missing something here, so please help me understand. For option 1: if we go with the "mark node for maintenance" approach and suspend the jobs that are running any ranks on each such node, that should suffice, right? Why would we need to track down which other nodes belong to the jobs on those nodes and put those nodes in maintenance mode as well?

I discussed this with Jon a few times last week so I’ll chime in while he’s on the road. At least one reason is that presumably you do not want the scheduler to start other jobs using the admin-suspended job’s “resources-that-were-freed-by-admin-suspend” on any of the nodes that the job was running on. In other words, the idea is that the admin will want to be able to admin-resume the job after the node maintenance without having to worry about what the scheduler may have done with any of the resources originally allocated to the job…

There are other ways to accomplish that internal to the scheduler, but putting all of the nodes that the admin-suspended job is running on into maintenance mode seems the most transparent, since it will show up in pbsnodes, rather than having the scheduler silently (unless sufficient logging is enabled) decline to allocate resources belonging to a job that is in an admin-suspended state.

Sure! That is understood.

My question was, why go after a job and "mark" the nodes that are running the job? Why not go at it node-wise, i.e., just put the nodes that the admin needs directly into maintenance mode (and then suspend the jobs running on those nodes)? That should suffice, no? Why do we then have to find additional nodes that were running parts of the jobs running on these admin-selected nodes?

Subhasis

Ah, the thinking there is that the application is likely to react badly if you suspend only the processes on a single node of the job rather than all of the job's processes on all of its nodes. It'd be application dependent, of course, and would also depend on whether you are performing maintenance on the primary exec host for the job or one of the secondaries.

Taking the admin-suspend action on a per-job basis saves the admin from having to compile a list of not only the nodes they want to perform maintenance on, but also all of the other nodes involved in jobs running on those nodes, and then somehow placing those other nodes into the maintenance state (which presumably would admin-suspend all of the job processes on them for any job(s) running on them). But what if those other nodes are running multiple jobs as well? The admin would then have to put the rest of the nodes that THOSE jobs are running on into the maintenance state, and so on.

When the action is taken at the job level rather than the node level, it does not spiral out of control in this way, and more power is given to the admin to do the "right thing". If a single node is running parts of two multi-node jobs, the processes belonging to job1 (which has been admin-suspended) can be suspended and the node put into maintenance mode so that job1's resources on that node are not allocated to some other job(s), while the processes belonging to job2 can be allowed to continue running, since none of the nodes that job2 is running on will ACTUALLY be undergoing maintenance (of course, the admin can, and indeed is 100% responsible for, admin-suspending job2 as well if maintenance is ACTUALLY planned for the node in question, but that may not be the case).

I agree that it SEEMS like we are coming at this backwards as well (I actually mentioned it to Jon last week, before Bill's note), but I have been convinced that coming from the job perspective rather than the node perspective makes the admin's job easier.

The EDD looks good to me. I like the idea of admin-resume operating without the need for the scheduler, but this raises an implementation question that will need some testing: how quickly do we expect admin-resume operations to occur in the life-is-good case (all/most jobs resume without trouble) and in the not-so-good case (all/most jobs fail to resume because one or more of their nodes were unfortunately rebooted or powered down during the maintenance activity)?

Regarding job-centric versus node-centric targeting of maintenance - I think in the general case the admin may want either or both. The majority of our use cases are related to filesystem maintenance, and this leans toward job-centric suspend at NAS because of user/job partitioning among filesystems. Recently, though, we’ve made progress on maintenance of infrastructure systems (e.g. rack leaders) that leans toward node-centric suspend. I’d like to see Altair’s work give early results on job-centric suspend, without causing harm to the possibility of future node-centric suspend functionality.

That explains it. Thanks, Greg.

Scott, I was not suggesting we suspend only a few processes of a multi-node job. I was asking why we are not going about this the node way. If the admin wants a few nodes freed up for maintenance, we could do the following:

for each node in admin list for maintenance
{
    for each job in $node
    {
        suspend the entire job   (a multi-node job will get suspended on all nodes automatically)
    }
    mark the node as a maintenance node
    (no need to go to the other nodes that a multi-node job would have touched - the job is suspended there anyway)
}

Jon explained that if we do not also mark the other nodes that the job is running on as in maintenance, then there is a chance of the scheduler over-subscribing those nodes. That over-subscribing won't happen if we let the scheduler itself handle the resume of the jobs when the nodes are brought back from maintenance mode.

But it looks like NAS has a different workflow, so this approach would work for them.

Hey Subhasis,

If I am understanding you correctly, I believe your suggestion would cause some problems. When a job is suspended, the server releases the resources held by the job, and the scheduler is free to use those resources to run other jobs (think preemption). If we only put the nodes the admin tells us about into maintenance, all of the other nodes the jobs are running on will be freed up. This will delay restarting the jobs when the nodes are released from maintenance (or cause over-subscription). The way the feature is currently designed, this won't happen, because all of a job's nodes will be marked as in maintenance. Any node which is in maintenance will be ignored by the scheduler.

At one point I thought doing this feature on a per-node basis would be better. Thinking through all of the strange side effects it caused changed my mind. First off, if you put node X into maintenance, all of the jobs on X would need to be suspended, which would put all of those jobs' nodes into maintenance. You've asked for one thing and a whole lot more happened, which might be confusing. That is if you stop right there. The other choice is to continue suspending: if you suspend one job on a node, all of the jobs on that node need to be suspended (and so on). This could easily grow until you've suspended all of the jobs. That's an extreme case, but it could happen.
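
To make the cascade concrete, here is a made-up example (the node and job names are hypothetical):

  put nodeX into maintenance
      -> J1 (on nodeX and nodeY) and J2 (on nodeX and nodeZ) must be suspended
      -> nodeY and nodeZ go into maintenance as well
      -> J3 (also on nodeZ, plus nodeW) must now be suspended too
      -> nodeW goes into maintenance ... and so on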

It may be a bit more work to suspend all of the jobs on a node with the current design, but I think it's better. At least the admin knows exactly what is going on.

NAS's use case is a bit simpler than what is suggested above. All of their nodes are assigned exclusively, so there will only be one job per node. If we provided a per-node suspend, it would only ever suspend one job.

Bhroam

The EDD looks good to me. I approve of what you currently have.

Thanks @Bhroam. That makes it very clear.

Regards,
Subhasis

Hmmm, I feel there is some confusion regarding interface versus implementation…

If the maintenance is to be performed on a per-node basis, as is being proposed, then the admin interface is more naturally expressed per-node. The implementation can still be per-job (having PBS do “the right thing” automatically, even marking additional nodes as being in maintenance mode, without burdening the admin).

For example, assume the admin wants to perform maintenance on nodes n1-n5, and that jobs A, B, and C are running on these nodes. A natural per-node interface would be something like:

pbsnodes --maintenance-mode-on n1 n2 n3 n4 n5

This would tell PBS to automatically detect that jobs A, B, and C are the only ones affected and then take all the right actions for jobs A, B, and C, as well as any additional actions necessary on other nodes also used by jobs A, B, and C, etc. It is just a different (more natural) way to express { A, B, C } to PBS. It also has the advantage that PBS can automatically take care of blocking new jobs from starting (without having the admin explicitly turn off scheduling), and it mitigates the race condition of jobs shrinking (resulting in too few nodes being marked for maintenance).
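
In pseudocode, the semantics behind that command might look something like this (a sketch of the intended behavior, not of an implementation; the --maintenance-mode-on option itself is only a suggestion):

maintenance_set = { n1, n2, n3, n4, n5 }
for each job J with any node in maintenance_set   (here: A, B, and C)
{
    admin-suspend J                               (suspends J's processes on all of J's nodes)
    add all of J's nodes to maintenance_set       (so the scheduler cannot reuse J's freed resources)
}
mark every node in maintenance_set as in maintenance   (the scheduler will not start new jobs on them)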

The suggestion is to match the interface to the use case (versus matching the interface to the implementation).