Capture "sub" snapshots for pbs_snapshot --additional-hosts

agrawalravi90 · September 14, 2018, 6:22am

Hi,

I’m proposing a design change for pbs_snapshot’s “--additional-hosts” option. Right now, when given a list of hosts, pbs_snapshot captures things like comm_logs, mom_logs, mom_priv and some system commands from each of those hosts (whatever can be captured). This has two disadvantages:

A lot of data gets copied over the network from the various hosts to the machine where pbs_snapshot was run.
Copying protected data (e.g., mom_priv) from remote hosts is very tricky when pbs_snapshot is run as a non-root user with the new “--with-sudo” option.

So, I propose the following:

When capturing data from remote hosts, pbs_snapshot will run pbs_snapshot directly on the remote hosts.
Once the “sub” snapshots are captured completely, the main pbs_snapshot program will copy over the compressed snapshot tarballs as _snapshot.tgz and keep them in the first level of the main snapshot that’s being captured.
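A minimal sketch of the proposed flow, not the actual implementation. The function name and the two callables (`run_remote_snapshot`, `copy_from_host`) are hypothetical placeholders for however pbs_snapshot ends up invoking itself remotely and copying files back. The `<hostname>_snapshot.tgz` naming is also an assumption, based on the idea of naming each sub snapshot after its host.

```python
from pathlib import Path
from typing import Callable, Iterable, List


def capture_sub_snapshots(
    hosts: Iterable[str],
    main_snapshot_dir: Path,
    run_remote_snapshot: Callable[[str], str],
    copy_from_host: Callable[[str, str, Path], None],
) -> List[Path]:
    """Capture one "sub" snapshot per host and collect the tarballs.

    run_remote_snapshot(host) is assumed to run pbs_snapshot on the
    remote host and return the path of the compressed tarball there.
    copy_from_host(host, remote_path, local_path) is assumed to copy
    that tarball back to the machine running the main pbs_snapshot.
    """
    collected = []
    for host in hosts:
        # Step 1: the remote host captures and compresses its own data.
        remote_tarball = run_remote_snapshot(host)
        # Step 2: only the finished tarball crosses the network and is
        # placed in the first level of the main snapshot directory.
        local_tarball = main_snapshot_dir / f"{host}_snapshot.tgz"  # assumed name
        copy_from_host(host, remote_tarball, local_tarball)
        collected.append(local_tarball)
    return collected
```

Passing the remote-run and copy steps in as callables keeps the sketch independent of the transport (ssh, pbs_dsh, sudo or not), which is still open in this proposal.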

Please let me know what you guys think. @scc, requesting you specifically to provide some feedback.

Thanks!

agrawalravi90 · September 17, 2018, 9:38pm

Gentle reminder, @scc could you please review this proposal?

@bhroam and @arungrover, requesting you guys to also review this since we came up with this together. Thanks!

arungrover · September 18, 2018, 4:12pm

Sounds good to me @agrawalravi90

scc · September 18, 2018, 4:58pm

@agrawalravi90, I like this idea, but have a few questions:

The current pbs_snapshot EDD does not define what the snapshot will contain if pbs_snapshot is run directly on an execution host, and in fact snapshot tries to collect all of the same information as though it had been run on the server itself. This is fine for when it is invoked normally, but would lead to lots of redundant server queries (pbsnodes, qstat, etc.) that are already being collected by the top level snapshot invocation. Did you all discuss introducing a new mode of operation that would skip these queries and only collect the relevant local information (logs and all v1 and v2 configuration files, essentially)?
What will the snapshots be named? I think the .tgz filename of each “sub” snapshot should include the name of the host it came from.

agrawalravi90 · September 18, 2018, 9:39pm

Thanks for your feedback, Scott. We didn’t discuss adding a separate mode for capturing data from nodes that are not PBS servers. I was just thinking that we’d call it with -H <hostname> set to the mom host, so it would issue qstat etc. but wouldn’t get anything back, as the host is a mom/comm. But we should probably add a new mode to pbs_snapshot instead, as that’ll be more elegant and reliable.

About what the “sub” snapshots will be named, yes, we were thinking the same thing, to name them by their host’s hostname.

agrawalravi90 · September 19, 2018, 8:51pm

I created a design document for this: https://pbspro.atlassian.net/wiki/spaces/PD/pages/718635011/Enhance+pbs+snapshot+to+capture+remote+host+data+as+sub+snapshots

Please review it and provide feedback. Thanks!

scc · September 20, 2018, 11:47pm

Thanks @agrawalravi90, looks good!

arungrover · September 21, 2018, 6:14pm

@agrawalravi90 I think we should introduce a new argument to snapshot for capturing only mom related data. There could be a use case where one would want to capture snapshot from mom node only (along with server queries). If you change ‘-H’ to match hostname with server name to issue additional server queries then one has to run the snapshot from server only to get server related data.

How about when additional hosts is issued we run snapshot on these hosts with this new argument which will capture information only from that host?

agrawalravi90 · September 21, 2018, 6:25pm

I think I didn’t understand you completely. If the -H argument is a valid PBS server, then all of the pbs commands will be executed; if it is not a valid PBS server, they won’t be. So, even with this change, somebody can run pbs_snapshot remotely; they won’t have to run it from the server node, right?
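Roughly, the rule I have in mind looks like the sketch below. It is only an illustration: `plan_capture` and `is_pbs_server` are hypothetical names, and the two item lists just reuse examples mentioned earlier in this thread rather than the full set pbs_snapshot captures.

```python
from typing import Callable, List

# Example items taken from this discussion, not an exhaustive list.
LOCAL_ITEMS = ["comm_logs", "mom_logs", "mom_priv", "system commands"]
SERVER_QUERIES = ["pbsnodes", "qstat"]


def plan_capture(target_host: str,
                 is_pbs_server: Callable[[str], bool]) -> List[str]:
    """Decide what to capture for the host given via -H.

    is_pbs_server is a hypothetical check that reports whether the
    target host is a valid PBS server.
    """
    plan = list(LOCAL_ITEMS)
    if is_pbs_server(target_host):
        # Server queries are only issued against a valid PBS server,
        # so mom/comm hosts skip them entirely.
        plan.extend(SERVER_QUERIES)
    return plan
```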

arungrover · September 21, 2018, 6:30pm

ah… got it, didn’t realize it takes server name as an argument. I guess what you have will work.
Sorry about the confusion.

anamika · September 21, 2018, 7:14pm

Design doc looks good to me. Thanks for coming up with that.

agrawalravi90 · September 21, 2018, 8:19pm

Thanks guys, seems like there’s enough consensus, so I’ll go ahead and implement the code.

agrawalravi90 · September 27, 2018, 7:21am

Hey guys, while implementing the code for this, I realized that I had to make another, hopefully smallish, interface change. I’ve updated the EDD to reflect it, but here’s the text for quick reference:

Primary host captured will now be local host by default:

Earlier, pbs_snapshot would actually parse the pbs.conf on the local host, find the pbspro server host and capture that. Now, pbs_snapshot will capture the local host by default. The -H option should be used to point to the remote pbs server if pbs_snapshot is invoked from a client host. This is needed to prevent the child pbs_snapshot invocations from capturing data from the main pbs server host.
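As a rough sketch of the new default (the function name `primary_host` is hypothetical and not taken from the PR):

```python
import socket
from typing import Optional


def primary_host(cli_host: Optional[str] = None) -> str:
    """Pick the host whose data the snapshot captures.

    cli_host stands for the value given with -H, if any. Without it,
    the local host is captured; pbs.conf is no longer consulted to
    find the server host.
    """
    if cli_host:
        return cli_host
    return socket.gethostname()
```

With this rule, a child pbs_snapshot started on a mom host without -H captures that mom host, while a user on a client host passes -H to reach the remote PBS server.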

Please let me know if you guys have any concerns regarding this change. Thanks and sorry that I didn’t think about this before.

agrawalravi90 · September 27, 2018, 7:21am

btw, here’s the PR: https://github.com/PBSPro/pbspro/pull/825
