PP-758: Add pbs_snapshot tool to capture state & logs from PBS

Thanks for the heads-up Scott, I’ll request you to be one of my reviewers when I implement the changes :slight_smile:

I’ve updated the core file backtrace text in the doc, please let me know if you’d like to see any more changes to it, thanks!

@agrawalravi90 Please add a link to forum in design document…

Questions:
Isn’t is redundant information to collect output of “qstat -B” and output of “qstat -Bf”? output of “qstat -Bf” will suffice here, right?
same for output of “qstat -f”, output of “qstat -t”, output of “qstat -tf”, output of “qstat -x” and output of “qstat -xf”? output of “qstat -tf” (or “qstat -txf” if job history is enabled) will suffice here, right?
same for output of “qstat -fx -F dsv” and output of “qstat -f -F dsv”? output of “qstat -xf -F dsv” will suffice here, right?

Like pbs_diag, is pbs_snapshot is restricted to run only on machine where PBS Server is running?
Like pbs_diag, is pbs_snapshot is restricted to run only by root user?
Does pbs_snapshot collects any information from mom nodes? if Yes then please mentions same in DD and if No then we should collect at-least system info + mom_logs + core info (if any) from all (or just given) nodes. We can make this option by default disabled, and if user/admin want it can enabled this option by providing some switch.

Like pbs_diag, please include self logs into snapshot.
Why we are restricting list of attributes in --obfuscate option? let’s user specified this list.
If I understand -L option correctly then currently -L collects logs for + 1 days, right? If Yes then I fill we should collects logs only for given and if No, then can you please rephrase Interface to be more clear.

Thanks for your inputs Hiren.

Scott and Steve can explain better, but from what I gather, the information, redundant as it may be, when present in different formats sometimes helps debug issues. So, as long as the information is helpful in any way, as small as it may be, I think there’s no harm capturing it.

“Like pbs_diag, is pbs_snapshot is restricted to run only on machine where PBS Server is running?” No, please see the -H option in the design doc, you can provide the hostname to PBS Server while running it on another machine.

“Like pbs_diag, is pbs_snapshot is restricted to run only by root user?” You don’t need to be root, but you do need to have sudo privileges.

Regarding mom daemon information, the doc does mention that we’ll capture mom logs and core file information. Now that you mention it, I hadn’t thought about making mom logs optional, and I do think that’s worth discussing because each mom daemon creates its own logs, so they cumulatively can be huge. @scc What do you think? How does pbs_diag do it?
Hiren, for system information from the moms, could you please be more specific about what kind of information you’d like to be captured?

“Like pbs_diag, please include self logs into snapshot.”: It’s already there right? please see the ‘-l’ option to pbs_snapshot in the doc

“Why we are restricting list of attributes in --obfuscate option? let’s user specified this list.” Actually we don’t want to restrict --obfuscate, we want it to cover everything that the users might consider sensitive. In fact, if required, I think we should instead have an option called “–dont-obfuscate” to let us know if we should not obfuscate something, as that list will probably be smaller. What do you think?

About the -L option, I’m not sure that I completely understand what you meant, could you please go through the interface documentation for -L option and let me know what you’d like to change?

Thanks for the feedback!

@scc and @sgombosi can you please provide more information on why we need all these redundant information in snapshot?

Sorry @agrawalravi90 but I don’t agree with this because along with information, size of final snapshot also matters and AFAIK size is the reason we are not including whole core files in current pbs_diag’s output tarball.

Please mention this in design document.

Whatever system section in design document mentions, all those information should be captured from all machine.
Also please include ‘vmstat’, ‘df -h’ (see PP-609) and capturing system logs in ‘system’ section in design document.

I mean capture those logs in some file and include that file in final snapshot tarball. see ‘diag_log’ file in current pbs_diag’s output.
Also add missing log levels in -l interface document. see https://github.com/PBSPro/pbspro/blob/master/test/fw/ptl/utils/pbs_cliutils.py#L50 for list of allowed log levels in PTL.
Also I think INFOCLI2 should be default log level instead of ERROR.

I see, we can have that list which we can call it default obfuscate list but still I think we should allow user to override that list.
Assume that I have one custom resource/attribute which contains sensitive information, so now while taking snapshot I need to obfuscate that, how can I do it?

I think we can have both option and let user decide which one to use.

See 2nd and 3rd usage in Sample Usage, where you mentioned -L 10 in command and in description you are saying that “along with 11 days of logs going back from the present day (including the present day’s logs)”,
Means -L will store given + 1 day right? so my question is why ‘+ 1 day’ ?
When I say -L 10 mean I want to capture 10 days of logs starting from current day.

Could you please be more specific about system logs? Did you mean syslog? Would the output of ‘dmesg’ suffice for this?

Why?

We have been talking about obfuscating all custom resources by default, but this would probably be a future enhancement to ‘–obfuscate’ and may not be part of the first version, unless it is deemed critical.

Ah, I see what you mean. Personally, I feel that when a user says -L 1, they could be thinking “capture all logs generated in the last 24 hours” and not just “capture all logs generated today”, so you can’t be sure. And capturing an additional day’s logs doesn’t hurt. But if others agree that -L 1 is pretty clear in stating “capture only today’s logs”, then I’ll make it so.

Thanks for the feedback!

I’ve now changed the interface a bit so that mom and comm logs will be captured only if explicitly instructed. I’ve also taken in Hiren’s comments to make some more changes to the doc, I request you all to review the changes and provide further feedback. Thanks!

If no --add-comm-logs= is used but there are comm logs on the local host will they be collected even without this option being used (for example, if pbs_snapshot is run on the PBS server host itself)? I strongly believe they should be.

With a diagnostic tool, because you don’t know exactly what you might need to look at to solve a mystery, it is better to collect possibly redundant information as seen from different sources/perspectives. You may not have a chance to collect the same data later if it is needed, and this minimizes back and forth to the customer to collect more data when it turns out it is needed.

When a support engineer looks at pbs_diag output to try to figure out what is going on they will want to essentially use the same tools that the actual admin can use to look at the system. They may want to look at what qstat -ns is showing to get an overview, then dive into qstat -f output for specific jobs based on what they see there. So essentially even though the information in qstat -ns is a subset of qstat -f, we collect it in different formats for the diag for the same reason we provide the qstat different qstat command options in the first place (same with the various pbsnodes information we collect). People want to see the information in different formats, and for the diagnostic use case of pbs_snapshot the main consumer of the data is people.

I do think we could live without qstat -f output and just collect qstat -xf output since that is TRULY a subset, rather than differently formatted. On the other hand, for large systems qstat -xf can be many times larger than qstat -f, and if the problem being diagnosed does not relate to finished job, the qstat -f file is likely to be many times smaller than qstat -f and more manageable in a text editor. And since it is a true subset, it’ll compress nicely, let’s collect it.

If a daemon crashed happened as a result of a memory leak, the core files can be really really big (in a different class of “big” than qstat -xf mentioned above). Yes, they’d usually compress nicely, but they can be a burden to uncompress. As previously stated, in the past we have more often than not NOT needed to collect the core files themselves, the stack trace is often good enough. Also, obfuscating the contents of the core file is at best impractical. File size is just one consideration here.

Oh, ok. How about this: we will by default only capture all the logs available on the host which is running PBS Server. Then, an additional option ‘–additional-hosts=’ will cause pbs_snapshot to capture all logs from the hostname specified (which could be a comma separated list, or blank, in which case pbs_snapshot will capture logs from all hosts running PBS daemons) ?

Yes, that sounds good to me. Thanks.

Alright, I’ve modified the doc to reflect that, please provide further feedback. Thanks!

Yes I mean syslog only. I think dmesg will suffice here.

Why ERROR?
And INFOCLI2 will give little verbose output about what’s happen when you run this command.

  • AFAIK all PTL command has INFOCLI2 as default log level (see pbs_benchpress as an example)

Why all? Why not give option to choose user which resource to obfuscate? There are use cases where I don’t want to obfuscate any of the data which this tool collects, and one of them is collecting snapshot of PBS whenever test failure happens in PTL.
Basically in my view obfuscating will only make sense when you are collecting any data from customer sites. If you are collecting any data from internal systems then obfuscating data doesn’t make sense and I am pretty sure that this tool (or supported utilities of this tool) will be used in-house also.

@scc Thanks for explanation, I understand your point.
Still I feel redundant information doesn’t make any sense to me, but I leave it to you and other reviewers to decide whether we need it or not. :slight_smile:

The word snapshot is associated with timestamp. “Snapshot is taken at xxxx time”. May be recording start time and end time of snapshot would be good. This will set the expectation to the end user about missing information.

Also, here snapshot is set of commands in python script. Two snapshot script can be executed in parallel. Do u want to you document on behavior when two snapshots ran simultaneously?

Thinking about this more, I’ve now removed ‘cmds.err’ and changed the default level to INFOCLI2, it doesn’t make sense to capture pbs commands run separately if we are going to capture pbs_snapshot’s log anyways, so might we well just capture a more verbose pbs_snapshot log.

Well, if you don’t want anything obfuscated, you could just not include the ‘–obfuscate’ flag and it won’t do any obfuscation whatsoever :slight_smile:
Specifically about custom resources, as I said, we are not planning to implement that right now, so when we do decide to put it in, we can discuss how we should do it (whether obfuscate all if --obfuscate is given, or ask the user to explicitly list them out).

Please let me know if you have any other concerns. Thanks!

Thanks for the inputs Vinod. We are going to generate and store the logs from pbs_snapshot’s run inside the snapshot itself as ‘pbs_snapshot.log’. Do you think that suffices for them to look at the start and end time of the snapshot?

If two snapshots are run in parallel then you’ll get two unique snapshots, they wouldn’t interfere with one another any more than PBS commands being run by other programs/users would. So, I personally don’t think it needs to be documented, but please let me know if you feel differently.

@scc, @hirenvadalia guys any more feedback on the design ? Or does it look good to go?

Hi Ravi, since we recently fixed the bug which was causing qmgr “print server” and “print sched” to output read-only attributes I’d like to add collection of “qmgr list sched”. If we wanted to get rid of qmgr_psched.out I’d be ok with that at present, but that seems like something that we’d want to collect once multiple sched objects can exist in qmgr (assuming that command prints them all).

I do not believe we need to add “qmgr list server” since I think that will always be the same as “qstat -Bf”, and I don’t believe we need to add “qmgr list node @default” since I think it will always be the same as “pbsnodes -av”.

I think that is the end of my feedback on the EDD. Thanks for doing this!

Thanks Scott, I’ve replaced qmgr print sched with list. I think we should add in print sched if it becomes relevant after multi-sched and wait until then, but please let me know if you’d like it in now.

Thanks for all the feedback! Please do provide your sign-off if it looks good to go.

Hey @agrawalravi90
I think pbs_snapshot.log is not optional right? because if user does not provide -l on command line then there is still default value for it, so in any case pbs_snapshot.log will be saved.

When you say ‘sudo pbs_snapshot’ means pbs_snapshot will execute as root right?
I understand need of sudo (or root) privileges but there are two ways to get it… one you run pbs_snapshot using sudo (aka as root user) and pbs_snapshot will execute internal commands as current user whenever you need sudo priv… and another way is that you run pbs_snapshot as normal user and internal commands will be executed using sudo whenever sudo priv is required…
Now which way we are going? and why?