Improve system and process monitoring for tests in PTL

Hi,

I am proposing to add a few additional sensors for system and process monitoring in PTL.
Please review the proposal and suggest improvements to the test monitoring process for stress/load tests: https://pbspro.atlassian.net/wiki/spaces/PD/pages/1285947399/Improve+system+and+process+monitoring+for+tests+in+PTL

Regards,
Vishesh

I presume the test results are written to disk periodically, after every test or set of tests. Long-running tests are intended to run for several days, so we should be able to review how they are progressing while they run.

Hi @crjayadev,

After benchpress finishes running the tests, the monitored data is stored in a JSON file.
While the tests are running, there is currently no way to see the data. I agree this is very important.
I propose adding an additional output file that is written after every interval, so that the results can be viewed while the test is in progress.
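
To make the idea concrete, here is a minimal sketch of such an interim output file, written as one JSON object per line so a partially completed run can still be read. This is only an illustration under assumptions: the helper names (`append_sample`, `read_samples`, `monitor`), the `collect` callback, and the sample fields are all hypothetical and are not taken from PTL or the PR.

```python
import json
import os
import time


def append_sample(path, sample):
    """Append one monitoring sample as a single JSON line and push it to disk."""
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(sample) + "\n")
        fh.flush()
        os.fsync(fh.fileno())  # make the sample durable even if the run hangs later


def read_samples(path):
    """Read back every complete sample written so far."""
    samples = []
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            try:
                samples.append(json.loads(line))
            except ValueError:
                # A line that is still being written; skip it.
                continue
    return samples


def monitor(path, collect, interval, iterations):
    """Call collect() every `interval` seconds and record each result."""
    for _ in range(iterations):
        sample = dict(collect())  # collect() is a hypothetical sensor callback
        sample["time"] = time.time()
        append_sample(path, sample)
        time.sleep(interval)
```

Because each interval appends a complete line, someone checking on a multi-day run could call `read_samples()` (or simply `tail` the file) without waiting for the final JSON report.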

Yes, having some means of intermediate reporting is critical for long-running tests. We don’t want to run tests for several days and then have no results because benchpress hung or there was some system issue. That would be bad :slight_smile:

Hi Vishesh,

Thank you for the EDD. I have a few comments:

  1. Adding open_fds as a parameter for process monitoring might help.
  2. Will the results be added to other db types which PTL supports?
  3. If my understanding is correct, your PR has a flag that checks that benchpress is not present in the process name. If I wanted to use this API explicitly to monitor the benchpress process, I would not be able to do that. Please confirm this.
  4. I am not sure about this, but if we add the process start time at the testsuite level, it might help to verify that the daemon was not restarted during a long-running test, with no need to explicitly use the ps command.
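
As a rough illustration of the open_fds suggestion in item 1: on Linux, the open file descriptors of a process are listed under `/proc/<pid>/fd`, so they can be counted without an external command. This is only a sketch, not what the PR does; the function name `count_open_fds` is hypothetical, and the approach assumes a Linux-style `/proc` filesystem and sufficient permissions to inspect the target process.

```python
import os


def count_open_fds(pid):
    """Return the number of open file descriptors for `pid`, or None if unknown.

    Assumption: a Linux-style /proc filesystem is available.
    """
    fd_dir = "/proc/%d/fd" % pid
    try:
        return len(os.listdir(fd_dir))
    except OSError:
        # No /proc, the process has exited, or we may not inspect it.
        return None
```

Returning `None` instead of raising keeps a monitoring loop running when a process disappears between samples.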

Hi @sujatapatnaik52,

Thanks for your comments.

  1. Adding open_fds as a parameter for process monitoring might help.
    Vishesh -> Will add it.

  2. Will the results be added to other db types which PTL supports?
    Vishesh -> No. For reporting of performance/load tests we have chosen JSON, as it is easily machine-readable and will therefore be helpful for reporting purposes.
    The data itself is stored in a dict in the framework, so it can be stored in any db format.

  3. If my understanding is correct, your PR has a flag that checks that benchpress is not present in the process name. If I wanted to use this API explicitly to monitor the benchpress process, I would not be able to do that. Please confirm this.
    Vishesh -> I had added a condition to filter it out because it was showing up unnecessarily in the grep output. Now that I know the use case, I will remove the condition, so benchpress will always be present by default.

  4. I am not sure about this, but if we add the process start time at the testsuite level, it might help to verify that the daemon was not restarted during a long-running test, with no need to explicitly use the ps command.
    Vishesh -> I don’t think I fully understood this. When the daemon restarts, the PID will change. Do we still need the process start time?

Thanks,
Vishesh

Consider a case where I don’t want the daemon to be restarted during a long-running test, say one that runs for months. In that case it would be good to have this option.
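
One general point in favour of recording a start time: the OS can reuse PIDs, so over a run lasting months the pair (PID, start time) identifies a process instance more reliably than the PID alone. Below is a minimal sketch of that check, assuming a Linux `/proc` filesystem, where field 22 of `/proc/<pid>/stat` is the process start time in clock ticks since boot. The helper names `process_identity` and `was_restarted` are hypothetical and are not PTL APIs.

```python
def process_identity(pid):
    """Return (pid, start_ticks) for a running process, or None if unavailable.

    Assumption: Linux /proc/<pid>/stat, where field 22 is the start time
    in clock ticks since boot.
    """
    try:
        with open("/proc/%d/stat" % pid) as fh:
            data = fh.read()
    except OSError:
        return None
    # The command name (field 2) is in parentheses and may contain spaces,
    # so split only what follows the last ')'. That remainder starts at field 3.
    fields = data.rpartition(")")[2].split()
    try:
        return (pid, int(fields[19]))  # field 22 -> index 22 - 3
    except (IndexError, ValueError):
        return None


def was_restarted(baseline, current):
    """Compare the identity recorded at testsuite setup with the current one."""
    return current != baseline
```

A testsuite could record `process_identity(pid)` for each daemon during setup and compare it at teardown or at every monitoring interval. A changed PID, a changed start time, or a missing process would all be reported as a restart.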