PP-302: Implement save of PBS data for post-run analysis

Hi All,

We (myself, @kjakkali, @arungrover, @iestockdale, @anamika, @subhasisb, @crjayadev) had a discussion on how PTL should save post-analysis data on failure; here are the notes from that discussion.

Notes from discussion:

  • Make use of the pbs_diag command from PBS to save post-analysis data on failure
  • Disable saving of post-analysis data for a test suite if that test suite has more than 3 failures
  • The count (i.e. 3) mentioned in the above note should be configurable (sketched below)
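
As a rough illustration of the last two notes (hypothetical names, not actual PTL code), the configurable per-testsuite counter could look like this:

    # Hypothetical sketch of the configurable per-testsuite counter; not PTL code.
    POST_DATA_FAILURE_LIMIT = 3   # the configurable count from the notes above

    def should_save_post_data(testsuite, failure_counts):
        """Count failures per test suite and stop saving post data past the limit."""
        failure_counts[testsuite] = failure_counts.get(testsuite, 0) + 1
        return failure_counts[testsuite] <= POST_DATA_FAILURE_LIMIT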

@kjakkali @arungrover @iestockdale @anamika @subhasisb @crjayadev Please let me know if I missed any point from our discussion.

Thanks,
Hiren

Hi All,

Based on the above points I have updated the design for saving post-analysis data on failure in PTL.
Please have a look and provide your comments.

Also, the pbs_diag command currently doesn't store the output of the netstat (only with -p), uptime, vmstat and df -h commands. I have logged another ticket (https://pbspro.atlassian.net/browse/PP-609) for the same.

Hi Hiren, now that I am thinking more about it, I would prefer to have a formula instead of the number 3; for example, the default could be 10% of all test cases in that test suite.
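
For example (just an illustration of the formula, with made-up names):

    import math

    def post_data_limit(num_tests_in_suite, percent=10):
        """Default: save post data for up to 10% of the test cases in the suite."""
        return max(1, math.ceil(num_tests_in_suite * percent / 100.0))

    # a suite of 42 test cases would save post data for at most 5 failures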

Also, for --max-post-analysis-data, I think this can only be set if --post-analysis-data is set. Would there be an error or warning if it is set without --post-analysis-data? Kindly update the EDD accordingly.
There should also be a warning/error when a wrong format is provided.

Is there a maximum limit on the --max-post-analysis-data value?

" After saving post data for 3rd failure in testsuite PTL will disable saving post data for that testsuite"
should not this be “it will stop saving data per test case level for that test suite but it will still save the data per test suite level”. I think it would be still helpful for saving data per test suite level at the end of the test if there are failures.

Whether we use a test case counter or a formula, one key point from our discussion was that the threshold should be tunable. For example, have a default of 3, but provide an interface so that you can easily reset it to some other number.

Hi @anamika

Why a formula for --max-post-analysis-data? Why not a simple number? What do we gain by using a formula instead of a simple number? Don't you think it adds too much complexity for the user to work out a percentage when overriding it?

Regarding validation of --max-post-analysis-data: currently PTL does not validate values or conflicting options (it accepts unknown arguments). So I will keep it that way for now. If you think it should be done, then please file a separate issue covering all options in PTL. With this, I think there is no need to update the design document.

No, there is no limit on the --max-post-analysis-data value. I will update the design document to say the same.

What do we gain by saving data at the test suite level when we have '--max-post-analysis-data'? If you want more post data, just override the max count by providing '--max-post-analysis-data' with a larger number.

Well, I actually like @anamika's suggestion of having it as a percentage rather than a number. This makes it more dynamic in nature, as not all test suites are going to have the same number of test cases.

About saving PBS_HOME data after we hit a failure limit, I was of the same opinion that after we hit the limit we will still store data at the test suite level (if more failures are detected) at the end. This is because we cannot assume all failures are going to be the same.

The simple counter is a good fit if you believe that the common use case is a failure wherein most or all test cases fail for the same reason. If you believe that the common use case is a large number of different failures, then the formula may be more suitable.

I tend to think that most failures fall into the former class, so I would suggest starting with the simple counter.

I agree, we should start with a simple counter.

We had some more discussion and we were thinking that we should have 3 simple counters, so I am summarizing them here.

While we want to save space by limiting the amount of failure analysis data collected, we also want to save time by stopping the rest of the test cases when there are too many failures. We could use the same threshold both for limiting the saving of failure data and for stopping the test cases, but there may be cases where we want to limit saving data but still want to run more test cases. So we may, at times, want different values for these thresholds.

These thresholds are at the test suite level. At the regression level, we may want to stop running the regression if there are too many failures. All three can be understood and implemented at the same time, so I am broadening the scope of this discussion.

Have 3 global parameters:

TC_FAILURE_THRESHOLD <default - 5>- If number of failures in a test suite exceed this count, the execution of the test suite does not progress for the test suite
CUMULATIVE_TC_FAILURE_THRESHOLD <default - 50> - If the number of failures in the whole regression exceeds this count, the execution of the regression is stopped.
MAX_DIAG_THRESHOLD <default=TC_FAILURE_THRESHOLD> - The pbs_diag data is collected per failure. When the number of failure exceeds MAX_DIAG_THRESHOLD within a test suite, we shall stop collecting the diagnostic data.

TC_FAILURE_THRESHOLD & CUMULATIVE_TC_FAILURE_THRESHOLD is intended to save time in running regression, if there are too many failures at a test suite level or at an overall level respectively.
MAX_DIAG_THRESHOLD is intended to save space when there is too many failures in a build.

We should also have command line options in benchpress to override these values.

example:
--tc_failure_threshold=20 if you want 20 as the threshold
--cumulative_tc_failure_threshold=150

"0" can be specified as the value if you do not want any threshold to apply.
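
To make that concrete, here is a sketch of how the three counters could interact (the function names are placeholders, not the real PTL/benchpress code):

    # Sketch only; a threshold value of 0 disables the corresponding limit.
    TC_FAILURE_THRESHOLD = 5
    CUMULATIVE_TC_FAILURE_THRESHOLD = 50
    MAX_DIAG_THRESHOLD = TC_FAILURE_THRESHOLD

    suite_failures = 0   # reset at the start of every test suite
    total_failures = 0   # accumulated across the whole regression

    def exceeded(count, threshold):
        """A threshold of 0 means 'no limit'."""
        return threshold != 0 and count > threshold

    def collect_diag_data():
        """Placeholder for the per-failure pbs_diag collection step."""

    def skip_remaining_tests_in_suite():
        """Placeholder: skip the rest of this test suite (saves time)."""

    def stop_regression():
        """Placeholder: stop the whole regression run (saves time)."""

    def on_test_case_failure():
        global suite_failures, total_failures
        suite_failures += 1
        total_failures += 1
        # MAX_DIAG_THRESHOLD saves space: stop collecting diag data once exceeded
        if not exceeded(suite_failures, MAX_DIAG_THRESHOLD):
            collect_diag_data()
        # TC_FAILURE_THRESHOLD saves time at the test suite level
        if exceeded(suite_failures, TC_FAILURE_THRESHOLD):
            skip_remaining_tests_in_suite()
        # CUMULATIVE_TC_FAILURE_THRESHOLD saves time at the regression level
        if exceeded(total_failures, CUMULATIVE_TC_FAILURE_THRESHOLD):
            stop_regression()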

In my experience the common case is different types of failures. I won't say "large", as I am not sure whether that means all tests or more than 50% of them. I still feel that allowing a percentage is a good way to define what we mean by "large".
Also, if the use case is really to save space by avoiding saving data for similar failures, then we might want to add more intelligence to our framework, say by comparing the error messages: if they are the same for the first 3 or so failures, then stop (see the sketch below).
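
Purely as an illustration of that idea (hypothetical helper, not a concrete proposal):

    def should_save_post_data(error_message, seen_errors, limit=3):
        """Stop saving once the first `limit` failures report the same error text."""
        seen_errors.append(error_message)
        if len(seen_errors) < limit:
            return True
        return len(set(seen_errors[:limit])) > 1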

@iestockdale @crjayadev @arungrover @kjakkali @anamika @agrawalravi90

I have updated the design document as per our last discussion for the three new options '--tc-failure-threshold', '--cumulative-tc-failure-threshold' and '--max-postdata-threshold'.

Please review it and let me know your comments/suggestions ASAP.

Overall looks good. Only some cosmetic changes.

"After taking pbs_diag output for core file, core file will be deleted from PBS_HOME directory"
If this is pbs_diag functionality, we can be specific that this is the behaviour of pbs_diag and not a change requested by this design.

"Testcases failure for this testsuite count exceeded testcase failure threshold (count)"
This could read better as "Testcases failed for this testsuite exceeded the testcase failure threshold (count)".

"If it is not an integer then PTL will bail out with the error message "Value for testcase failure threshold should be integer"."

A suggestion for the message reported: "ERROR: Invalid TESTCASE_FAILURE_THRESHOLD provided, please provide integer value". Similar changes can be considered for the other messages defined later.
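
For what it is worth, the check behind that message could be as simple as (hypothetical helper, not the actual PTL option parsing):

    def parse_threshold(value, name="TESTCASE_FAILURE_THRESHOLD"):
        """Bail out with the suggested message if the value is not an integer."""
        try:
            return int(value)
        except (TypeError, ValueError):
            raise SystemExit(
                "ERROR: Invalid %s provided, please provide integer value" % name)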

PBS_<hostname>.tar.gz

This file is nothing but the PBS diag data saved by the pbs_diag command

pbs_diag generates a 'pbs_diag_yymmdd_hhmmss.tar.gz' file. After running the pbs_diag command, does PTL rename the pbs_diag output file to 'PBS_<hostname>.tar.gz'?

            Also PTL will run the pbs_diag command with -c <path to core file> for all core files found in the PBS_HOME directory
                The output of the above command will be stored in <core file name>.out
                After taking the pbs_diag output for a core file, the core file will be deleted from the PBS_HOME directory

-c is ONLY for the cpuset information
-g (as in 'pbs_diag -g core_file') is for the core file
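
To make the corrected flow concrete, here is a rough sketch of the per-core-file handling and the tarball rename discussed above (the glob patterns and helper names are assumptions; only the -g usage, the <core file name>.out capture, and the rename come from this discussion):

    # Illustrative sketch only; not the actual PTL implementation.
    import glob
    import os
    import socket
    import subprocess

    def collect_core_diag(pbs_home):
        """Run 'pbs_diag -g <core>' for each core file and keep only the .out."""
        # assumption: core files match 'core*' somewhere under PBS_HOME
        for core in glob.glob(os.path.join(pbs_home, '**', 'core*'), recursive=True):
            with open(core + '.out', 'w') as out:
                subprocess.run(['pbs_diag', '-g', core], stdout=out,
                               stderr=subprocess.STDOUT, check=False)
            os.remove(core)  # the core file is deleted from PBS_HOME afterwards

    def rename_diag_tarball(dest_dir):
        """Rename pbs_diag_yymmdd_hhmmss.tar.gz to PBS_<hostname>.tar.gz."""
        for tarball in glob.glob(os.path.join(dest_dir, 'pbs_diag_*.tar.gz')):
            os.rename(tarball, os.path.join(
                dest_dir, 'PBS_%s.tar.gz' % socket.gethostname()))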

Will PTL exit with an error if the user passes --tc_failure_threshold=5 (less than the default value), since the MAX_DIAG_THRESHOLD default is 10?

Yes, it will exit with the error message "Value for max-postdata-threshold should be less or equal to 'tc-failure-threshold'".
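
In other words (hypothetical names), the check would be something like:

    def validate_thresholds(tc_failure_threshold, max_postdata_threshold):
        """max-postdata-threshold must not exceed tc-failure-threshold."""
        if max_postdata_threshold > tc_failure_threshold:
            raise SystemExit("Value for max-postdata-threshold should be less "
                             "or equal to 'tc-failure-threshold'")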

@crjayadev @kjakkali

I have updated the design document as per your comments regarding the error messages, using -g instead of -c, and the rename of the tar.gz after pbs_diag.

Please review them.

Change looks good to me!

Thanks @arungrover for sign off.

@kjakkali @crjayadev Thanks for the review and comments; please sign off if you don't have any further comments.

@iestockdale @anamika @agrawalravi90 Please review the design document and sign off if you don't have any comments.

I sign off on the spec to save PBS data for post-run analysis on test case failures.

@hirenvadalia: Can you please update the design doc to cover "What should the user do if he/she doesn't want any threshold to apply?"

Hiren, the latest design looks good. Thank you for making all the changes. I just have one minor comment: for "--max-postdata-threshold", can you explicitly add that the data for the first N failures will be saved? Though I assume that is the case, saying it explicitly would leave no further doubt.

@kjakkali I have updated the design document per your comments; please have a look again.