Based on the above points, I have updated the design for saving post-analysis data on failure in PTL.
Please have a look and provide your comments.
Also, the pbs_diag command currently does not store the output of the netstat (only with -p), uptime, vmstat, and df -h commands. I have logged another ticket (https://pbspro.atlassian.net/browse/PP-609) for this.
Hi Hiren, thinking about it more, I would prefer a formula instead of the number 3; for example, the default could be 10% of all test cases in that test suite.
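To make that concrete, here is a minimal sketch of how such a per-suite default could be computed; the function name, the rounding choice, and the 10% figure are purely illustrative, not anything PTL has today:

    import math

    def postdata_failure_limit(num_tests_in_suite, percent=10):
        # How many failures in this suite may save post-analysis data.
        # Round up so that small suites still save data for at least one failure.
        return max(1, math.ceil(num_tests_in_suite * percent / 100.0))

    # e.g. a 25-test suite would save post data for the first 3 failures
    print(postdata_failure_limit(25))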
Also, for --max-post-analysis-data, I think this can only be set if --post-analysis-data is set. Would there be an error or warning if it is set without --post-analysis-data? Kindly update the EDD accordingly.
Also, a warning/error when a wrong format is provided.
Is there a maximum limit for the --max-post-analysis-data value?
" After saving post data for 3rd failure in testsuite PTL will disable saving post data for that testsuite"
Should this not be "it will stop saving data at the test case level for that test suite, but it will still save the data at the test suite level"? I think it would still be helpful to save data at the test suite level at the end of the run if there are failures.
Whether we use a test case counter or a formula, one key point from our discussion was that the threshold be tunable. For example, have a default of 3, but provide an interface so that you can easily reset that to some other number.
Why a formula for --max-post-analysis-data? Why not a simple number? What do we gain by using a formula instead of a simple number? Don't you think it adds too much complexity for the user to work out a percentage when overriding it?
Regarding validation of --max-post-analysis-data, PTL currently does not validate values or conflicting options (it accepts unknown arguments), so I will keep it that way for now. If you think it should be done, please file a separate issue covering all options in PTL. Given this, I do not think the design document needs updating.
No, there is no limit on the --max-post-analysis-data value. I will update the design document accordingly.
What do we gain by saving data at the test suite level when we have '--max-post-analysis-data'? If you want more post data, just override the max count by providing '--max-post-analysis-data' with a larger number.
Well, I actually like @anamika's suggestion of having it as a percentage rather than a number. This makes it more dynamic, as not all test suites have the same number of test cases.
Regarding saving PBS_HOME data after we hit a failure limit, I was of the same opinion: after we hit the limit, we should still store data at the test suite level (if more failures are detected) at the end. This is because we cannot assume all failures are going to be the same.
The simple counter is a good fit if you believe that the common use case is a failure wherein most or all test cases fail for the same reason. If you believe that the common use case is a large number of different failures, then the formula may be more suitable.
I tend to think that most failures fall into the former class, so would suggest starting with the simple counter.
We had some more discussion and concluded that we should have 3 simple counters; they are summarized below.
While we want to save space by limiting how much failure analysis data is collected, we also want to save time by stopping the remaining test cases when there are too many failures. We could use the same threshold for limiting the saved failure data and for stopping the test cases, but there may be cases where we want to limit saving data yet still run more test cases. So we may, at times, want different values for these thresholds.
These thresholds are at the test suite level. At the regression level, we may want to stop running the regression if there are too many failures. All three can be understood and implemented at the same time, so I am broadening the scope of this discussion.
Have 3 global parameters:
TC_FAILURE_THRESHOLD <default: 5> - If the number of failures in a test suite exceeds this count, execution of that test suite does not continue.
CUMULATIVE_TC_FAILURE_THRESHOLD <default: 50> - If the number of failures in the whole regression exceeds this count, execution of the regression is stopped.
MAX_DIAG_THRESHOLD <default: TC_FAILURE_THRESHOLD> - The pbs_diag data is collected per failure. When the number of failures within a test suite exceeds MAX_DIAG_THRESHOLD, we stop collecting the diagnostic data.
TC_FAILURE_THRESHOLD and CUMULATIVE_TC_FAILURE_THRESHOLD are intended to save time in running the regression when there are too many failures at the test suite level or at the overall level, respectively.
MAX_DIAG_THRESHOLD is intended to save space when there are too many failures in a build.
We should also have command-line options to benchpress to override these values.
Example:
--tc_failure_threshold=20 if you want 20 as the threshold
--cumulative_tc_failure_threshold=150
'0' can be specified as the value if you do not want any threshold to apply.
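To show how the three counters could fit together, here is a rough sketch of the intended control flow; the helper names (run_test, collect_pbs_diag) and the suite/result objects are placeholders rather than actual PTL interfaces, and a value of 0 is treated as "no threshold" per the note above:

    TC_FAILURE_THRESHOLD = 5
    CUMULATIVE_TC_FAILURE_THRESHOLD = 50
    MAX_DIAG_THRESHOLD = TC_FAILURE_THRESHOLD

    def run_regression(suites):
        total_failures = 0
        for suite in suites:
            suite_failures = 0
            for test in suite.tests:
                result = run_test(test)                     # placeholder
                if result.passed:
                    continue
                suite_failures += 1
                total_failures += 1
                # Save space: stop collecting diag data past MAX_DIAG_THRESHOLD.
                if not MAX_DIAG_THRESHOLD or suite_failures <= MAX_DIAG_THRESHOLD:
                    collect_pbs_diag(test)                  # placeholder
                # Save time: give up on this suite past TC_FAILURE_THRESHOLD.
                if TC_FAILURE_THRESHOLD and suite_failures > TC_FAILURE_THRESHOLD:
                    break
            # Save time: give up on the whole regression past the cumulative limit.
            if (CUMULATIVE_TC_FAILURE_THRESHOLD and
                    total_failures > CUMULATIVE_TC_FAILURE_THRESHOLD):
                break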
In my experience, the common case is different types of failures. I would not say "large", as I am not sure whether that means all tests or more than 50% of them. I still feel that allowing a percentage is a good way to define what "large" means.
Also, if the use case is really to save space by avoiding saving data for similar failures, then we might want to add more intelligence to the framework, say by comparing the error messages: if the message is the same for the first 3 or so failures, then stop.
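Purely as an illustration of that idea, the sketch below keys the "stop saving" decision on the error message rather than on a plain count; the 3-repeat cutoff and the save_post_data() helper are assumptions, not existing PTL behaviour:

    from collections import Counter

    SAME_ERROR_LIMIT = 3
    error_counts = Counter()

    def maybe_save_post_data(test_name, error_message):
        error_counts[error_message] += 1
        if error_counts[error_message] > SAME_ERROR_LIMIT:
            # Seen this failure signature enough times; skip to save space.
            return False
        save_post_data(test_name)   # placeholder for the real collection step
        return True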
I have updated the design document as per our last discussion for the three new options '--tc-failure-threshold', '--cumulative-tc-failure-threshold', and '--max-postdata-threshold'.
Please review it and let me know your comments/suggestions as soon as possible.
After taking the pbs_diag output for a core file, the core file will be deleted from the PBS_HOME directory.
If this is pbs_diag functionality, we should be specific that this is the behaviour of pbs_diag and not a change requested by this design.
"Testcases failure for this testsuite count exceeded testcase failure threshold (count)"
"Testcases failed for this testsuite exceeded the testcase failure threshold (count)"
If the value is not an integer, then PTL will bail out with the error message "Value for testcase failure threshold should be integer".
A suggestion for the message reported: "ERROR: Invalid TESTCASE_FAILURE_THRESHOLD provided, please provide integer value" - similar changes can be considered for the other messages defined later.
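For what that validation could look like, here is a small sketch assuming the option is parsed with argparse; the option name, default, and exit behaviour are illustrative rather than how pbs_benchpress actually parses its arguments:

    import argparse
    import sys

    parser = argparse.ArgumentParser()
    parser.add_argument("--tc-failure-threshold", default=5)
    args = parser.parse_args()

    try:
        threshold = int(args.tc_failure_threshold)
    except (TypeError, ValueError):
        # Suggested wording from the comment above
        sys.exit("ERROR: Invalid TESTCASE_FAILURE_THRESHOLD provided, "
                 "please provide integer value")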
This file is simply the PBS diag output saved by the pbs_diag command.
pbs_diag generates a 'pbs_diag_yymmdd_hhmmss.tar.gz' file. After running the pbs_diag command, does PTL rename the pbs_diag output file to 'PBS_hostname.tar.gz'?
Also, PTL will run the pbs_diag command with -c <path to core file> for all core files found in the PBS_HOME directory.
The output of the above command will be stored in <core file name>.out.
After taking the pbs_diag output for a core file, the core file will be deleted from the PBS_HOME directory.
-c is for ONLY the cpuset information.
-g (as in -g core_file) is for the core file.
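For clarity, here is a sketch of the per-core-file step as described, using -g for the core file per the correction above; the core-file search pattern, the output naming, and the deletion step are assumptions about the intended design, not existing PTL code:

    import glob
    import os
    import subprocess

    pbs_home = os.environ.get("PBS_HOME", "/var/spool/pbs")

    # Look for core files anywhere under PBS_HOME (search pattern is assumed).
    for core in glob.glob(os.path.join(pbs_home, "**", "core*"), recursive=True):
        out_file = core + ".out"        # output stored as <core file name>.out
        with open(out_file, "w") as out:
            subprocess.call(["pbs_diag", "-g", core], stdout=out,
                            stderr=subprocess.STDOUT)
        os.remove(core)                 # core deleted after diag is taken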
Will PTL exit with an error if the user passes --tc-failure-threshold=5 (less than the default value), given that the MAX_DIAG_THRESHOLD default is 10?
Hiren, the latest design looks good. Thank you for making all the changes. I just have one minor comment: for '--max-postdata-threshold', can you explicitly state that the data for the first N failures will be saved? I assume that is the case, but saying it explicitly would leave no further doubts.