PP-302: Implement save of PBS data for post-run analysis

Hi @developers,

There is a new design document available for review:
https://pbspro.atlassian.net/wiki/display/PD/PP-302%3A+Implement+save+of+PBS+data+for+post-run+analysis

Please have a look and provide your comments/suggestions to the community.

Thanks,
Hiren


Hiren,

Having the core file for post analysis may also help.

Thanks,
Prakash

PS: Please add a link on the design page to the community discussion.

@prakashcv13 AFAIK the core file will be created in the PBS_HOME/*_priv directory, and PBS_HOME is already part of the post-analysis data.
I think I have already added a link to the design page in this discussion; see my first post.

The request is for a link from the design page to this page, not vice versa :slight_smile:
And you are right about PBS_HOME having the core file.

Thanks,
Prakash

@prakashcv13 Done. I have added a link to this discussion in the design doc.

Looks good. This data will be useful.


@hirenvadalia If PTL does not reset the log files present in PBS_HOME on each test case failure, then wouldn’t it be sufficient to have just one copy of “PBS_<hostname>.tar.gz” per test suite?

@arungrover Yes, PTL does not reset log files, and initially I had the same thought of copying the PBS_HOME tarball at the test suite level. But then I realized that almost every test case makes changes to the PBS configuration (e.g. one test case enables RR in sched_config while another disables it). In that case, how would we know from the post-analysis data what the change or content of sched_config was, if the data is stored only at the test suite level? Also, a test suite may contain single-node as well as multi-node test cases, so a single-node test case will have only one tarball while a multi-node test case will have multiple tarballs based on the number of nodes. Considering the above cases, I chose to copy the PBS_HOME tarball at the test case level instead of the test suite level.
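
Just to make the idea concrete, here is a rough sketch of what the per-test-case, per-host capture could look like. This is plain Python and not actual PTL code; the function name, arguments and directory layout are placeholders I made up for illustration:

```python
import os
import socket
import tarfile

def save_pbs_home_snapshot(pbs_home, dest_dir, test_name):
    """Illustrative sketch only, not actual PTL code.

    Tar up PBS_HOME for one host after a test case failure.
    pbs_home  -- value of PBS_HOME on that host, e.g. /var/spool/pbs
    dest_dir  -- base directory for post-analysis data
    test_name -- used to build a per-test-case subdirectory
    """
    target_dir = os.path.join(dest_dir, test_name)
    os.makedirs(target_dir, exist_ok=True)
    # One tarball per host, so multi-node test cases end up with
    # one "PBS_<hostname>.tar.gz" per machine they ran on.
    tar_path = os.path.join(target_dir,
                            "PBS_%s.tar.gz" % socket.gethostname())
    with tarfile.open(tar_path, "w:gz") as tar:
        tar.add(pbs_home, arcname=os.path.basename(pbs_home))
    return tar_path
```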

Well, changes in the content of files like sched_config can easily be judged by looking at the failing test case and checking what modifications it made to the config file.
Regarding taking backups for multi-node test cases… why don’t we consider the worst-case scenario and store one backup from all the machines where these test cases might have run? If we do this, we can still relate the failing test case to the logs.

What do you think?

@arungrover Well, we can judge by looking at failing test cases, but not always and not for all files, especially the database directory inside PBS_HOME.

Why would we want to store data from a machine where the failing test case has not run?

I would prefer test case level data saving instead of test suite level data saving.

Well, since the database and other files there ideally do not change from case to case within a suite, I feel it will contain a lot of redundant data.

Now, if we can design something which can smartly collect PBS_HOME from only the machines where a test case has failed, that would be great. It would be ironic if we agree to make a copy of PBS_HOME for every failed test case but disagree on taking a copy from every machine when a test suite encounters any failure.

@arungrover Yes, the database will definitely differ from case to case, as every test case has a different configuration (jobs, resvs, nodes, etc.), so considering the database directory I don’t see any redundant data.

AFAIK PTL currently does not have any interface which can tell exactly on which machine a case failed (it may be possible to implement this, but as far as I know PTL it would mean refactoring it heavily). If you have any idea how to implement this without refactoring PTL, please do suggest it; I would love to implement that.

Well, the state of jobs/nodes is already what you store using pbsnodes, qmgr and qstat, so the db is not going to have anything different.
One example of redundancy would be a product bug that makes a binary dump core. Now consider a test suite with 10 test cases: on each test case failure we store PBS_HOME, and since each successive tarball re-captures all the earlier cores, at the end of the test suite the tarballs contain a total of 1 + 2 + … + 10 = 55 core files.

About the way to check for failure: I’d assume that it is the test case that fails, so when you know a test case has failed, you store PBS_HOME from all the machines it was running on. If that is hard to identify, then just collect logs from all the moms when the test suite finishes.

While writing my previous comment I realized that we don’t collect anything for reservations. I think “pbs_rstat -f” should be sufficient to collect all reservation-specific information.

I agree with Arun here that taking a datastore backup for every test case may not be worth it, as you already have configuration, node, job and reservation details captured using commands; the datastore will have the same data. Also, when logs and configurations are readily available, I would prefer to refer to them before getting into the DB and running queries.

@arungrover Good catch. I will update the design document for “pbs_rstat -f” as well as for “qmgr -c ‘p h’” and “qmgr -c ‘p pbshook’” (thanks to @ashwathraop for catching this).
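
For illustration only, a rough sketch of how those command outputs could be dumped next to the PBS_HOME tarball. The command list and output file names here are placeholders; the authoritative list is whatever ends up in the design doc:

```python
import os
import subprocess

# Commands discussed in this thread; the exact list/flags in the
# design doc may differ.
COMMANDS = {
    "pbsnodes.out": ["pbsnodes", "-a"],
    "qstat.out": ["qstat", "-f"],
    "pbs_rstat.out": ["pbs_rstat", "-f"],
    "qmgr_hooks.out": ["qmgr", "-c", "p h"],
    "qmgr_pbshooks.out": ["qmgr", "-c", "p pbshook"],
}

def save_command_outputs(dest_dir):
    """Illustrative sketch: run each command and save its output
    alongside the post-analysis tarball."""
    os.makedirs(dest_dir, exist_ok=True)
    for fname, cmd in COMMANDS.items():
        with open(os.path.join(dest_dir, fname), "w") as out:
            # stderr is folded into the same file so failed commands
            # still leave a trace for post-run analysis.
            subprocess.call(cmd, stdout=out, stderr=subprocess.STDOUT)
```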

I have updated the design document as per my last comment.

@hirenvadalia : Overall the design looks good.

  • PBS logs are the ones that eat up disk space, so I would prefer to save logs only once per test suite instead of at the test case level.

  • For cpuset systems, we need to collect cpuset info (cpuset -s / -r) in case of failure.

  • In some cases we need the job’s error file, which exists in the user’s home dir.

  • Instead of running a list of PBS commands, can we make use of the pbs_diag command?

  • If the test case fails at the setup/teardown level, do you save the post-analysis data?

@arungrover : We can work out configuration changes from the logs and the test script file, but I feel it is an overhead. If we save them, it will make it easier to debug failures.

Well, I think people who will be working on these issues reported by PTL can easily look at the test case. But still, if you think it makes things easier, then just copy the config files for each test case and have one single tar of PBS_HOME per test suite.