PTL tests often suffer from race conditions. Many of them seem like they wouldn’t have happened if PTL did an expect=True while doing operations like setting Scheduling to False. So, how about we do expect=True by default? Right now it’s set to False by default.
The call I think you are talking about is the self.server.manager() call. It defaults to False. I never quite understood why. Most of our calls that validate default to true (like setting options in the sched_config file). This one is different.
I like the idea of changing the default to True. Especially since an easy to make mistake is to turn scheduling off, submit some jobs or change some attributes and turn scheduling on and think nothing will take effect until scheduling is turned back on. If you turn scheduling off and you are in the middle of a cycle, the cycle will finish. This creates a race condition where the new jobs or attribute changes can be partially seen in current cycle and fail the test. When turning scheduling false with expect=True, manager() will wait until the current cycle finishes before returning.
Hi @agrawalravi90, your proposal looks good to me. I think server.manager is the only place where expect=False by default. Besides manager() it is also present in create_vnodes but it is True by default there.
So, I came across a complication while working on this: manager() call today can either return the return code of qmgr command (0 for success, non-0 for failure), or, if expect is True, the return value of expect() call, which is True(1) or False(0). So, if you call this function with expect=True, the definition of success, as determined by its return value, gets changed from 0 to 1. I’m thinking about changing this very confusing behavior to just 0 for Success and 1/non-zero for failure, even if expect is set to True. Please let me know if this sounds okay.
Hi @agrawalravi90, There is one other related race condition that I faced. This is when using manager() and status() calls in tests back to back. Even with an expect=True in a manager() call for setting scheduling to True, there seems to a few seconds delay for job attributes to get updated with the newer values. So the subsequent status() calls (that immediately appear after scheduling=True via manager()) sometimes do not get the latest values. Some existing tests have been working around this, by matching for ‘Leaving Scheduling Cycle’ message in the scheduler logs before making a status() call. But I feel it would be good for manager() call itself to validate this before returning. Any thoughts?
Sorry for the delay in responding to this Latha. I see what you are saying and like your suggestion. I spent a few minutes debating whether we should delay the test until the sched cycle is finished, and couldn’t come up with a scenario where it would be harmful to delay the test and verify that the sched cycle got finished when expect=True. But I think we should get inputs from others. Would you like to start a new discussion thread on this?