PP-832: Which scheduler to talk to while taking over from Primary

Hi All,

PP-832 describes an issue in a failover setup: while taking over from the primary, the secondary checks only once whether it can communicate with the scheduler on the primary host, and proceeds accordingly.

I have written a summary of this issue and proposed two approaches to fix it over here.

I request the community to provide feedback.

Thanks,
Prakash

@prakashcv13,
Thanks for writing the EDD. I vote for solution 1. The second step in this solution is

  • Any time the scheduler on the primary goes down, the secondary server will spawn a local scheduler.

I am assuming the primary server is up in this case even though the scheduler went down. If so, I just wanted to know why the primary server itself can't start the scheduler locally. (This way we don't lose fairshare usage for one cycle, other scheduler configuration, etc.)

If the second step in solution 1 applies only when the secondary becomes active, then we need to rephrase the above statement to reflect this.

Multisched is already checked in, so if we are not considering it here, we need to mention in the EDD that this applies only to the default scheduler for now, but can be enhanced to consider multiple schedulers as part of the Multisched failover interface.

I see that I had not phrased step 2 in solution 1 properly; I have corrected it now.

done!

Thanks,
Prakash

Thanks for starting this discussion, Prakash. I like solution 2 better: the secondary should just spawn its own scheduler. It's cleaner and more robust than solution 1, in my opinion.

Hi Prakash,

Thanks for this discussion. I vote for solution 1 because with it we keep communicating with the scheduler on the primary host, and that will preserve run-time fairshare information.

Regards,
Lovely

Thank you @Klovely and @agrawalravi90 for the inputs. Fairshare is the main reason that the secondary tries connecting to the scheduler on the primary host.

Ok, I thought, perhaps naively, that since fairshare data is simply stored in a file called 'usage', the secondary could just copy that file over and use the information, or, if it's on a shared filesystem, it's available to the secondary as its only usage file. So, even if the secondary doesn't contact the primary's scheduler and always spawns its own scheduler, it can just use the fairshare data that was already there. If this is valid, then approach 2 would be as good as approach 1 for fairshare. If it's not valid and approach 1 is far better than approach 2 for fairshare, then yes, we should go with 1.

I favor approach #2. If the reason for failover is network "flakiness" then it's possible packets are being dropped along the way. We would not want the secondary server relying on the primary scheduler, which could lead to unexpected and confusing behavior. Best to send SCH_QUIT to the primary scheduler and spawn a local instance, which is unlikely to be affected by network problems.

Please keep in mind all of the code you are working with was written when RPP was used for communication. Perhaps the reason the secondary server attempts to contact the primary scheduler only once could have something to do with how RPP behaved. We may want to do things differently now that TPP is our primary means of communication.

While I was focussing only on the issue at hand, I like the idea of having the usage file on the shared filesystem.

@mkaro - Your reason has made me inclined towards solution 2, so I have updated it to have the "usage" file on the shared filesystem.


I'd also like to get @bhroam's inputs on this.

@prakashcv13 is correct. The reason we have the secondary talk with the primary is for fairshare reasons. If we make a switch from one scheduler to another, we will lose some amount of fairshare data. The amount is one cycle's worth. At the time we made this decision, it made sense to try and talk with the primary.

A conversation about this topic with @prakashcv13 and @subhasisb changed my mind on this subject. Failover is an exception to the rule. It's a very complex exception. We're talking about having PBS switch from one host to another and keep running smoothly. Any complexity we add to this system is another way everything can fail. I now look at having the primary talk to the secondary as added complexity.

Option two has the secondary tell the primary to quit. We're once again in an exceptional case here. If the reason we're failing over is due to network issues between the primary and the secondary, the primary might miss being told to quit. We are then in the worst-case scenario. The primary scheduler is still up and running with its own view of the fairshare usage. The secondary takes over and runs for a while. When the primary takes back over, the primary scheduler's stale view of the usage takes back over. We lose all usage accumulated while the secondary was up.
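To make the stale-view scenario concrete, here is a toy model of it. This is only a sketch under assumptions: `ToyScheduler`, `simulate`, and the dict standing in for the shared usage file are all hypothetical, and the assumption that a scheduler persists its whole in-memory view after a cycle is mine, not something taken from the PBS code.

```python
class ToyScheduler:
    """Hypothetical stand-in for a scheduler that caches fairshare usage in memory."""

    def __init__(self, usage_file):
        self.usage_file = usage_file    # dict standing in for the shared usage file
        self.usage = dict(usage_file)   # in-memory view, read once at startup

    def run_cycle(self, entity, amount):
        # Accumulate usage in memory, then persist the whole in-memory view
        # (assumed write behaviour, for illustration only).
        self.usage[entity] = self.usage.get(entity, 0) + amount
        self.usage_file.clear()
        self.usage_file.update(self.usage)

    def reread_usage(self):
        # Drop the cached view and pick up whatever is in the shared file.
        self.usage = dict(self.usage_file)


def simulate(reread_on_takeback):
    shared = {"alice": 10}
    primary = ToyScheduler(shared)      # missed the quit request, stays up
    secondary = ToyScheduler(shared)    # spawned when the secondary takes over
    secondary.run_cycle("alice", 5)     # usage accumulated during failover
    if reread_on_takeback:
        primary.reread_usage()
    primary.run_cycle("alice", 1)       # primary takes back over
    return shared["alice"]


print(simulate(reread_on_takeback=False))  # stale view overwrites the file
print(simulate(reread_on_takeback=True))   # failover usage is preserved
```

In this model the first call ends with less recorded usage than the second, because the primary's cached view overwrites what the secondary accumulated unless it rereads the file first.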

I vote for a hybrid between one and two. Option one has the added complexity of having the secondary talk with the primary. Option two has the problem that the primary might not receive the signal to quit. I suggest the secondary ignores the primary and starts up its own scheduler. When the primary takes back over, it will tell the primary scheduler to reread the usage data.

I'm not sure you want the init script to always restart the scheduler. There are times when only one of the daemons is down and the admin will use the init script to start it. Since the other daemons are up, they are ignored. If we always restart the scheduler, we will be creating a situation where we lose fairshare data when we don't need to.

There is another way around this. The server runs a special cycle when it initially comes up. Itā€™s called SCH_SCHEDULE_FIRST. The scheduler can reread the fairshare usage on this special cycle.
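A rough sketch of what rereading on that special cycle could look like. This illustrates the proposed behaviour, not existing scheduler code: `ToySched`, `OTHER_CYCLE`, and `demo` are hypothetical names, and a JSON file is used here only as a stand-in for the real usage file format.

```python
import json
import os
import tempfile

SCH_SCHEDULE_FIRST = "SCH_SCHEDULE_FIRST"  # command name from the discussion
OTHER_CYCLE = "OTHER_CYCLE"                # hypothetical placeholder for any other cycle


class ToySched:
    """Hypothetical scheduler that caches usage and rereads it on the first cycle."""

    def __init__(self, usage_path):
        self.usage_path = usage_path
        self.usage = self._load()          # read once at startup

    def _load(self):
        with open(self.usage_path) as f:
            return json.load(f)

    def handle(self, cmd):
        # Proposed behaviour: refresh the cached usage only on the special
        # first cycle; every other cycle keeps using the in-memory view.
        if cmd == SCH_SCHEDULE_FIRST:
            self.usage = self._load()
        return self.usage


def demo():
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "usage")
        with open(path, "w") as f:
            json.dump({"alice": 10}, f)
        sched = ToySched(path)
        with open(path, "w") as f:
            json.dump({"alice": 15}, f)    # another scheduler updated the file
        before = sched.handle(OTHER_CYCLE)["alice"]
        after = sched.handle(SCH_SCHEDULE_FIRST)["alice"]
        return before, after


print(demo())
```

The point of the sketch is that an ordinary cycle keeps the cached value, while the first-cycle command picks up what was written to the shared file in the meantime.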

Bhroam

@bhroam - thank you for your inputs. How about we add @agrawalravi90's suggestion to have the "usage" file in a shared location to the hybrid solution you are suggesting?

@prakashcv13 failover works with a shared PBS_HOME. The usage file is already stored there. There is no need to change anything to make it work :)

Bhroam

:). I should have looked at the code before asking the question. If that is the case, we should also make the scheduler on the secondary read the usage at the time of taking over by explicitly sending SCH_SCHEDULE_FIRST. I have updated the design with a single solution now; please let me know your thoughts.

Thanks,
Prakash

The first time the secondary takes over, we'll be starting the secondary scheduler, so no problems there. The scheduler will read the usage file when it starts up. The question is what happens to the secondary scheduler when the primary server takes back over? If we take the secondary scheduler down, then we won't have a problem. It'll be started again when the secondary takes over again. If we leave the secondary scheduler up, we'll need to do what you said. It'll need to reread the usage.

I read your new design. I just wanted to be clear that there is work to be done to make the scheduler reread its usage on a SCH_SCHEDULE_FIRST. It doesn't reread the usage now. I don't know if you want to make that more clear in your solution.

Bhroam

Thank you @bhroam, I have updated the design to reflect that we need to make the scheduler re-read usage data on receiving SCH_SCHEDULE_FIRST.

@prakashcv13 thanks for making the changes to your document. We should probably decide on the fate of the secondary scheduler when the primary server takes back over. Right now your solution is silent on that. Do you want to tell the scheduler to quit?

@bhroam, the current behavior already is that the scheduler on the secondary goes away when the primary comes up.

I never knew that. It sounds like the right thing to do.

Hey Prakash,

Just clarifying something: the doc mentions that the primary will issue SCH_SCHEDULE_FIRST when it comes back up. Don't we need to do SCH_CONFIGURE instead? That's where it frees conf.fairshare and calls schedinit(), which reads the usage file to recreate conf.fairshare. SCH_SCHEDULE_FIRST might also lead to that, but it doesn't seem obvious from the code, so I wanted to clarify it once.

Thanks,
Ravi