Minimum value for max_attempts in PTL functions

PTL's expect() and log_match() functions have a "max_attempts" parameter through which test writers can specify the number of attempts to:

  • retry until an attribute matches the given value (for expect()), or
  • find a matching log entry (for log_match()).

Test writers are free to specify any desired max_attempts value. However, it is often observed that when a very small max_attempts value is specified in these functions, tests fail intermittently on slower test machines.

Hence the proposal is to enforce a minimum value of 20 for max_attempts and ignore any user-specified max_attempts value that is less than 20. For example, if a test writer sets max_attempts to 5, these functions will internally raise it to 20; any value greater than 20 will not be modified. This applies to both the positive and negative cases of expect() and log_match(). Since tracejob_match() and accounting_match() use log_match() internally, the docstrings of all four functions (expect(), log_match(), tracejob_match() and accounting_match()) will be updated to reflect this change.
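To make the behaviour concrete, the clamping inside expect()/log_match() could look roughly like the sketch below (the constant and helper names are purely illustrative, not existing PTL code):

    # Illustrative sketch only; MIN_MAX_ATTEMPTS and the helper are hypothetical.
    MIN_MAX_ATTEMPTS = 20

    def _effective_max_attempts(max_attempts):
        """Return the attempt count expect()/log_match() would actually use."""
        if max_attempts is None:
            return MIN_MAX_ATTEMPTS
        # Values below the floor are raised to the floor; larger values
        # are left untouched.
        return max(max_attempts, MIN_MAX_ATTEMPTS)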

Please let me know your comments on this proposal.

I do not agree with this change for expect(). In some test cases I wrote, PTL should mark the test as failed if the first expect fails, no matter how slow the machine is. To achieve this I use expect() with max_attempts=1. If this proposal is implemented for expect(), I fear PTL might mark such test cases as passed when they have actually failed. I'm fine if there is some other way to achieve this hard requirement of a one-time validation.

I have copied below an example test case where max_attempts=1 is heavily used.

    def test_running_subjob_survive_restart(self):
        """
        Test to check if a running subjob of an array job survive a
        pbs_server restart
        """
        a = {'resources_available.ncpus': 1}
        self.server.manager(MGR_CMD_SET, NODE, a, self.mom.shortname)
        j = Job(TEST_USER, attrs={
            ATTR_J: '1-3', 'Resource_List.select': 'ncpus=1'})

        j.set_sleep_time(20)

        j_id = self.server.submit(j)
        subjid_2 = j.create_subjob_id(j_id, 2)

        # 1. check job array has begun
        self.server.expect(JOB, {'job_state': 'B'}, j_id)

        # 2. wait till subjob 2 starts running
        self.server.expect(JOB, {'job_state': 'R'}, subjid_2, offset=20)

        # 3. Kill and restart the server
        self.kill_and_restart_svr()

        # 4. array job should be B
        self.server.expect(JOB, {'job_state': 'B'}, j_id, max_attempts=1)

        # 5. subjob 1 should be X
        self.server.expect(JOB, {'job_state': 'X'},
                           j.create_subjob_id(j_id, 1), max_attempts=1)

        # 6. subjob 2 should be R
        self.server.expect(JOB, {'job_state': 'R'}, subjid_2, max_attempts=1)

        # 7. subjob 3 should be Q
        self.server.expect(JOB, {'job_state': 'Q'},
                           j.create_subjob_id(j_id, 3), max_attempts=1)

We see a lot of issues with tests failing on slow machines. How about we do this:

  • Add a new option to pbs_benchpress, let's call it --timeunit. The value of 'timeunit' will be a multiplier applied to time.sleep's sleep value. (We can simply override time.sleep with our own function that multiplies the sleep value by timeunit.)
  • By default, timeunit = 1 second; for slower machines, increase it to 2 or 3 seconds, as appropriate. This will automatically make PTL's expect()/log_match() try for a longer duration (x attempts * 3 seconds vs x attempts * 1 second).

That way, we won’t have to use approaches which increase the run time of PTL tests on fast systems.
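A rough sketch of the override being suggested (how timeunit would be plumbed in from pbs_benchpress is an assumption for illustration):

    import time

    # Illustrative only: scale every sleep by a machine-specific multiplier
    # that would come from the proposed pbs_benchpress --timeunit option.
    TIME_UNIT = 3              # e.g. 3 on a slow machine, 1 (the default) elsewhere
    _real_sleep = time.sleep

    def _scaled_sleep(seconds):
        _real_sleep(seconds * TIME_UNIT)

    # Monkey-patch so every wait inside expect()/log_match() stretches accordingly.
    time.sleep = _scaled_sleep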

@Shrini-h, thanks for pointing out this test case. But I don't quite understand the need for max_attempts=1 here; I may be missing something. Why should PTL mark this as failed if the first expect fails? I can see that if there were more attempts to match the job state to 'B' or 'X' in the first two expect calls, the third one might fail because subjob 2 (a 20-second sleep job) might have finished while we expect it to be running. In my opinion, such tests have to be dealt with differently instead of keeping max_attempts at 1. Please correct me if my understanding is incorrect.

@agrawalravi90, thanks for looking into this. Why do you think this approach will increase the run time of PTL tests on fast systems? expect() and log_match() return as soon as the expected value or match is found, even if max_attempts was set high. I agree there will be an increase in run time for cases like checking for the non-existence of a log message. Are those the cases you are worried about?

I dislike this for existence=False checks. One check in particular bothers me: the scheduler's revert_to_defaults() checks for the non-existence of an error message in the scheduler log, with max_attempts=2. This means you'll increase that check from 2 seconds to 20 for every test. Furthermore, the same check is done when you do a self.scheduler.set_sched_config(). Scheduler tests will start taking a long time for a not very good reason; the vast majority of the time, the error message will not be printed.
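The check in question is roughly of this shape (the message text here is a placeholder, not the actual log line):

    # Placeholder message; illustrates the kind of negative (existence=False)
    # check done by revert_to_defaults()/set_sched_config().
    self.scheduler.log_match("some error message", existence=False,
                             max_attempts=2)
    # Forcing max_attempts up to 20 would make this absence check wait
    # roughly 10x longer in every test that reverts scheduler defaults.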

I don’t have a problem with a minimum max_attempts=20 for the existence=True case.

Bhroam

Yes, I don't think such cases are rare.

OK, this proposal has drawbacks.

I see @agrawalravi90's idea as quite similar to what we already achieve via the 'interval' parameter in log_match() and expect() (though with this idea the customization can be done per test machine). We could do this even without overriding time.sleep. @agrawalravi90, do you see any other use cases for overriding time.sleep? Maybe the test "timeout"? It's another candidate that causes intermittent failures on slow test machines.

I am open to other suggestions too. Any other thoughts?

Hey Latha, I hadn't really given it too much thought when I mentioned that my solution could be implemented simply by overriding time.sleep. Now that I think about it, it's not a good idea, because I remember seeing tests which set up things like reservations or dedicated time and sleep until that time window starts. Overriding time.sleep would adversely affect such tests. So I take that back: we should not implement it by overriding time.sleep; just changing the default interval between attempts is probably the right way to do it.
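For example, the interval-only version could look something like this (the constant names and how the multiplier would be passed in are assumptions for illustration):

    # Illustrative only: scale just the polling interval used between
    # expect()/log_match() attempts, leaving time.sleep itself untouched.
    DEFAULT_INTERVAL = 0.5     # PTL's default interval between attempts, in seconds
    INTERVAL_MULTIPLIER = 3    # would come from a pbs_benchpress option on slow machines

    def _effective_interval(interval=None):
        if interval is None:
            interval = DEFAULT_INTERVAL
        return interval * INTERVAL_MULTIPLIER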

I agree with Bhroam; only put the minimum for existence=True. If existence=False, then don’t change it.

@lsubramanian, Sorry for the late reply.

The TC is to check whether the subjobs are recovered to the same states they were in just before the server restart. Imagine if I had not used max_attempts=1 and the system under test wrongly recovered the subjobs to the Q state instead of X/R: instead of failing the test case, expect() would keep trying for up to 60 attempts, by which time the subjob could transition to the expected state, and expect() would wrongly mark the TC as passed.

You are right that the expect on the third subjob might fail. I agree I could have designed the test case better.

Though I still feel max_attempts=1 is a useful case.

Thanks everyone for giving your suggestions.
I like the approach of adding a new option to pbs_benchpress to set a customized 'interval' time for slower test machines, which could then use a value greater than the default interval of 0.5 seconds. This approach seems more complete and robust than enforcing a minimum max_attempts value for the existence=True scenarios alone.


Sounds good, Latha. I suggest starting a new thread when you have a design ready. Thanks.