Offline_vnodes should only offline vnodes belonging to more than one Mom when all the MoMs are offline

Please take a look at my proposed behavior change when the hook fail_action is offline_vnodes.
https://pbspro.atlassian.net/wiki/spaces/PD/pages/195002450/Draft+offline+vnodes+should+only+offline+vnodes+belonging+to+more+than+one+Mom+when+all+the+MoMs+are+offline

Let me know what you think.

Hi, can you give an example (or a couple examples) of what’s happening now (that’s bad) and what the new behavior will be (that’s good)? Also, what happens when only some (but not all) of the MOMs are offline? Thx!

I would suggest a different approach to handling the issue. I know that on a Cray XC system, where vnodes are shared among MoMs, offlining one MoM has the negative behavior of setting the whole cluster offline. I would propose that we change the behavior to offline the vnodes associated with the host and not offline the Cray MoM. Thoughts?

Hi @billnitzberg, what’s happening now that’s bad is that only one MoM has to be marked offline before all of its child vnodes are also marked offline. This is bad because PBS is marking far more vnodes offline than it needs to. Those child vnodes could still do work as long as at least one MoM reporting them is still free. I tried to explain this in my design doc, and I have now added an example. How can I make it even clearer? Thanks.

Hi @jon, that’s exactly what I’m proposing.
However, I take that same concept forward to the case when all of the MoMs are offline: at that point there are no MoMs left that can be used to reach the shared vnodes, so PBS will also mark those shared vnodes offline.
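In rough code terms, the rule boils down to something like the following. This is a minimal sketch, not the actual server implementation; Mom, Vnode, and offline_vnodes_for are hypothetical names used only to illustrate the decision.

```python
# Minimal sketch of the proposed rule; Mom, Vnode, and offline_vnodes_for
# are hypothetical structures used only to illustrate the decision.

from dataclasses import dataclass, field

@dataclass
class Mom:
    name: str
    offline: bool = False

@dataclass
class Vnode:
    name: str
    parent_moms: list = field(default_factory=list)  # MoMs reporting this vnode
    offline: bool = False

def offline_vnodes_for(failed_mom, vnodes):
    """fail_action=offline_vnodes: always offline the failing MoM, but offline
    a shared vnode only once every MoM reporting it is offline."""
    failed_mom.offline = True
    for vn in vnodes:
        if failed_mom in vn.parent_moms and all(m.offline for m in vn.parent_moms):
            vn.offline = True
```

So with two MoMs reporting a vnode, the first failure leaves the vnode up, and only the second failure takes it down.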

Clearly the design isn’t clear enough. I took a stab at modifying the external design…please take a look.

I personally think this is just fixing a bug. It isn’t really a design change. I think the original design was for single-MoM hosts; I don’t think the idea of multi-MoM hosts was taken into account.

I agree with @lisa-altair. When a mom goes offline, don’t mark all the vnodes offline until all the moms reporting those vnodes go offline.

Bhroam

That makes more sense. Thanks for clarifying it. I have no further comments.

Thanks. Got it, sounds good.

Just to double check…

My understanding is that a vnode will be marked offline if all the MOMs handling that vnode are “bad”; if there is at least one “good” MOM handling a vnode, then the vnode is not marked offline.

The original design supported multiple MoMs, and actually worked quite well. Of course, that was before we had hooks. Now that a hook can offline a mom, the design needs to take that into account.

So let me see if I understand the logic here… two MoMs, moma and momb, running on different login nodes on a Cray system, sharing the same set of compute nodes (cn1, cn2, cn3, …). A hook is configured on both MoMs with offline_vnodes as the fail_action. The MoMs know they are running on a Cray. The hook fails and moma is marked offline. Because the MoM is running on a Cray, it only marks itself offline, not the vnodes. The other MoM and the compute nodes are still available. Now the hook fails on momb, and momb is marked offline. Again, the MoM knows it’s running on a Cray and only marks itself offline, not the vnodes. The server would then recognize that ALL MoMs providing access to the compute nodes are down and proceed to mark the compute nodes (vnodes) offline.

Do I have that right?
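In code terms, the sequence I’m describing would be roughly this. A hypothetical sketch only; the dictionaries and the hook_fails_on helper below are made up just to show the progression, using the names from the scenario above.

```python
# Hypothetical walk-through of the Cray scenario above; the names and data
# structures here are illustrative, not the real server's.

moms = {"moma": False, "momb": False}            # MoM name -> offline?
vnode_parents = {"cn1": ["moma", "momb"],        # shared compute vnodes and
                 "cn2": ["moma", "momb"],        # the MoMs reporting them
                 "cn3": ["moma", "momb"]}

def hook_fails_on(mom):
    """fail_action fires: offline only the MoM, then see which vnodes follow."""
    moms[mom] = True
    # A vnode goes offline only once every MoM reporting it is offline.
    return [vn for vn, parents in vnode_parents.items()
            if all(moms[p] for p in parents)]

print(hook_fails_on("moma"))  # [] -> cn1..cn3 stay up, momb still serves them
print(hook_fails_on("momb"))  # ['cn1', 'cn2', 'cn3'] -> no MoM left, offline them
```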

I agree that the multi-MoM design worked quite well. When a MoM is needed, the one with the fewest jobs is chosen, and that does the job well. I was talking about the design for MoMs knocking vnodes offline. I view the case Lisa is discussing as a bug rather than an RFE.

Bhroam

Thanks, with the example, v.6 is clear to me and looks good.

It isn’t just hooks, though. Given the nodes presented in the example in the EDD, running pbsnodes -o mom1 will today offline mom1 AND vn1, vn2, and vn3, even though mom2 is just fine and could continue running jobs on vn1, vn2, and vn3. Using qmgr -c “s n mom1 state+=offline” will NOT result in vn1, vn2, or vn3 getting offlined. The present EDD does not mention the pbsnodes -o case, so I assume that case is not covered/changed here. Or is it just missing from the EDD, and will the same underlying change “fix” (in my opinion) the pbsnodes -o problem as well? This behavior has caused customers to unintentionally halt jobs from running across their entire system in the past.

Yes @mkaro, that’s right.

I agree with the design and share @scc’s concerns regarding pbsnodes.

Correct, this change is only for the hooks offline_vnodes case.

Design looks good @lisa-altair

@lisa-altair, the EDD looks good to me.

I have modified the design to also include pbsnodes -o behavior.
Please have a look and provide comments. Thanks!