Offline_vnodes should only offline vnodes belonging to more than one Mom when all the MoMs are offline

Please take a look at my proposed behavior change when the hook fail_action is offline_vnodes.
https://pbspro.atlassian.net/wiki/spaces/PD/pages/195002450/Draft+offline+vnodes+should+only+offline+vnodes+belonging+to+more+than+one+Mom+when+all+the+MoMs+are+offline

Let me know what you think.

Hi, can you give an example (or a couple examples) of what’s happening now (that’s bad) and what the new behavior will be (that’s good)? Also, what happens when only some (but not all) of the MOMs are offline? Thx!

I would suggest a different approach to handling the issue. I know that on a Cray XC system, where vnodes are shared among MoMs, offlining one MoM has the negative behavior of setting the whole cluster offline. I would propose that we change the behavior to offline the vnodes associated with the host and not offline the Cray MoM. Thoughts?

Hi @billnitzberg, what’s happening now that’s bad is that only one MoM has to be marked offline before all of its child vnodes are also marked offline. This is bad because PBS is marking far more vnodes offline than it needs to. Those child vnodes could still do work as long as at least one MoM reporting them is still free. I tried to explain this in my design doc, and I have now added an example. How can I make it even clearer? Thanks.

Hi @jon, that’s exactly what I’m proposing.
However, I take that same concept forward to the case when all of the MoMs are offline: at that point there are no MoMs left that can be used to reach the shared vnodes, so PBS will also mark those shared vnodes offline.
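In rough code terms, the rule boils down to something like the following. This is a minimal sketch, not the actual server implementation; Mom, Vnode, and offline_vnodes_for are hypothetical names used only to illustrate the decision.

```python
# Minimal sketch of the proposed rule; Mom, Vnode, and offline_vnodes_for
# are hypothetical structures used only to illustrate the decision.

from dataclasses import dataclass, field

@dataclass
class Mom:
    name: str
    offline: bool = False

@dataclass
class Vnode:
    name: str
    parent_moms: list = field(default_factory=list)  # MoMs reporting this vnode
    offline: bool = False

def offline_vnodes_for(failed_mom, vnodes):
    """fail_action=offline_vnodes: always offline the failing MoM, but offline
    a shared vnode only once every MoM reporting it is offline."""
    failed_mom.offline = True
    for vn in vnodes:
        if failed_mom in vn.parent_moms and all(m.offline for m in vn.parent_moms):
            vn.offline = True
```

So with two MoMs reporting a vnode, the first failure leaves the vnode up, and only the second failure takes it down.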

Clearly the design isn’t clear enough. I took a stab at modifying the external design…please take a look.

I personally think this is just fixing a bug. It isn’t really a design change. I think the original design was for single-MoM hosts; I don’t think the idea of multi-MoM hosts was taken into account.

I agree with @lisa-altair. When a mom goes offline, don’t mark all the vnodes offline until all the moms reporting those vnodes go offline.

Bhroam

That makes more sense. Thanks for clarifying it. I have no further comments.

Thanks. Got it, sounds good.

Just to double check…

My understanding is that a vnode will be marked offline if all the MOMs handling that vnode are “bad”; if there is at least one “good” MOM handling a vnode, then the vnode is not marked offline.

The original design supported multiple MoMs, and actually worked quite well. Of course, that was before we had hooks. Now that a hook can offline a mom, the design needs to take that into account.

So let me see if I understand the logic here… two MoMs, moma and momb, running on different login nodes on a Cray system, sharing the same set of compute nodes (cn1, cn2, cn3, …). A hook is configured on both MoMs with offline_vnodes as the fail_action. The MoMs know they are running on a Cray. The hook fails and moma is marked offline. Because the MoM is running on a Cray, it only marks itself offline, not the vnodes. The other MoM and the compute nodes are still available. Now the hook fails on momb, and momb is marked offline. Again, the MoM knows it’s running on a Cray and only marks itself offline, not the vnodes. The server would then recognize that ALL MoMs providing access to the compute nodes are down and proceed to mark the compute nodes (vnodes) offline.

Do I have that right?
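In code terms, the sequence I’m describing would be roughly this. A hypothetical sketch only; the dictionaries and the hook_fails_on helper below are made up just to show the progression, using the names from the scenario above.

```python
# Hypothetical walk-through of the Cray scenario above; the names and data
# structures here are illustrative, not the real server's.

moms = {"moma": False, "momb": False}            # MoM name -> offline?
vnode_parents = {"cn1": ["moma", "momb"],        # shared compute vnodes and
                 "cn2": ["moma", "momb"],        # the MoMs reporting them
                 "cn3": ["moma", "momb"]}

def hook_fails_on(mom):
    """fail_action fires: offline only the MoM, then see which vnodes follow."""
    moms[mom] = True
    # A vnode goes offline only once every MoM reporting it is offline.
    return [vn for vn, parents in vnode_parents.items()
            if all(moms[p] for p in parents)]

print(hook_fails_on("moma"))  # [] -> cn1..cn3 stay up, momb still serves them
print(hook_fails_on("momb"))  # ['cn1', 'cn2', 'cn3'] -> no MoM left, offline them
```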

I agree that the multi-MoM design worked quite well. When a MoM is needed, the one with the fewest jobs is chosen, and that does the job well. I was talking about the design for MoMs knocking vnodes offline. I view the case Lisa is discussing as a bug rather than an RFE.

Bhroam

Thanks, with the example, v.6 is clear to me and looks good.

It isn’t just hooks, though. Given the nodes presented in the example in the EDD, running pbsnodes -o mom1 will today offline mom1 AND vn1, vn2, and vn3, even though mom2 is just fine and could continue running jobs on vn1, vn2, and vn3. Using qmgr -c “s n mom1 state+=offline” will NOT result in vn1, vn2, or vn3 getting offlined. The present EDD does not mention the pbsnodes -o case, so I assume that case is not covered/changed here. Or is it just missing from the EDD, and will the same underlying change “fix” (in my opinion) the pbsnodes -o problem as well? This behavior has caused customers to unintentionally halt jobs from running across their entire system in the past.

Yes @mkaro, that’s right.

I agree with the design and share @scc’s concerns regarding pbsnodes.

Correct, this change is only for the hooks offline_vnodes case.

Design looks good @lisa-altair

@lisa-altair, the EDD looks good to me.

I have modified the design to also include pbsnodes -o behavior.
Please have a look and provide comments. Thanks!