PP-758: Add pbs_snapshot tool to capture state & logs from PBS

Hi,

This is to inform the community about a new tool ‘pbs_snapshot’ to capture the state and logs from a PBS system. This tool is meant to replace the script pbs_diag which is currently the only means to capture PBS data.

Some details:
“pbs_snapshot” will be written in Python and will make use of PTL libraries to interact with the PBS system that it is capturing. This will mean that any major changes to PBS will need very minor (if any) refactoring of pbs_snapshot as PTL gets updated in tandem with PBS now, so pbs_snapshot will automatically work with the latest version of PBSPro.
Also, a new set of utilities (PBSSnapUtils) will be added to PTL itself for this tool, which will be directly available for PTL test writers and developers to write PTL tests/debugging tools which may need the ability to take snapshots of PBS.
The tool will also come with the ability to anonymize/obfuscate PBS data to enable users with sensitive data to obfuscate and share snapshots for bug reporting and debugging.

Here’s the design document for it:
https://pbspro.atlassian.net/wiki/pages/viewpage.action?pageId=51614810

Please provide your valuable feedback!

Thanks,
Ravi

I think this does exactly what you intended it to do: provide input for the simulator from a functional PBS installation. That’s not the same as replacing pbs_diag as a diagnostic tool for support:

  • No pbs_comm logs
  • No mom logs
  • No Postgresql logs
  • No pbs_hostn output
  • No pbs_probe output
  • No mom_priv capture
  • no “ps” output so we can see what else is happening on the server that might be interfering with PBS (or even see if all the daemons are actually up)
  • server_priv output doesn’t appear to capture everything (no subdirectories, e.g. hooks, tmp, topology, etc.). We see lots of ill-conceived or broken hooks out there in customer-land.
  • no uname output (it’s surprisingly helpful to know what OS/release a site is running)
  • will it capture core files in the various *_priv directories?
  • Does it capture finished jobs? Sometimes it’s helpful to know there are a half-million jobs in F state gumming up the works
  • I know the various “qstat” commands captured by pbs_diag might be redundant (and indeed they are, for your purposes), but sometimes it’s handy to just be able to do a “wc -l” and get a quick idea of how many jobs are in the system before you start looking at things in detail.
  • output (I just ran the beta Jon sent out to one of my customers) is a directory tree, not a single .tgz file that a site can attach to a ticket (I know that sounds silly, but you’d be surprised how many smaller sites don’t have any personnel who are *nix-literate - there are some sites where it would take an hour to explain to the site contact how to do a “tar”, and they’d probably still screw it up).

There’s additional stuff I wish pbs_diag captured (lsof for the daemons, “ps -aux” for the daemons so we can see what the process state is, a copy of nsswitch.conf and /etc/hosts since name resolution issues are the worst things we have to deal with, probably some other stuff that’s slipped my mind), but in general it captures a lot of useful information in a very compact format. Please don’t take it away.

Hello @sgombosi, you are correct that there is not a one to one match in the functionality at present. I aim to identify all of the gaps between what pbs_diag collects and what is being proposed here and give that as feedback on this proposal, though you beat me to much of it :slight_smile: .

There are also the other modes of operation of pbs_diag as well, off the top of my head there is “gather stack trace info from a core file”, “gather information about a misbehaving process”, and “gather information about cpusets”. I plan to file separate tickets for those, but it is still relevant to the discussion here (though for the last one it’ll be focused on cgroups rather than cpusets).

This is a list of the core PBS Pro information that pbs_diag collects in its default information gathering mode, but which is missing from this EDD, which proposes to replace it. There IS some redundancy in what gets captured by pbs_diag, but it has proven to be useful to have the same information available in different formats when investigating problems:

pbs_probe -v
qmgr -c "print server"
qmgr -c "print node @default"
qmgr -c "p h @default"
qmgr -c "list pbshook"
pbs_hostn -v $(hostname)
pbs_rstat
qstat -t
qstat -tf
qstat -x
qstat -xf
qstat -ns
pbsnodes -a
pbsnodes -avSj
pbsnodes -aSj
pbsnodes -avS
pbsnodes -aS
pbsnodes -aFdsv
pbsnodes -avFdsv
capture $PBS_HOME/pbs_environment
capture the local pbs_comm logs for the same days as server and sched logs are collected for
dump all of the mom’s vnode defs into a file
capture the entire $PBS_HOME/datastore/pg_log/ directory
captures these files from the local mom_priv directory: config prologue epilogue mom.lock
pbs_diag logs the stderr of all of the commands it runs in case the person examining the diag needs to know what (if anything) was written there.
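For illustration, here is a minimal Python sketch of the "run a command and keep its stderr too" behaviour described above. It is not how pbs_diag or pbs_snapshot is implemented; the function names, the `.out`/`.err` file naming, and the `COMMANDS` table are assumptions made up for this example, and only the command lines themselves come from the list.

```python
import subprocess
from pathlib import Path


def capture_command(cmd, out_dir, name, timeout=60):
    """Run cmd (a list of arguments) and save stdout and stderr to separate files.

    Returns the exit code, or None if the command could not be run at all.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    try:
        result = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
    except (OSError, subprocess.TimeoutExpired) as exc:
        # Record the failure so whoever reads the snapshot knows why output is missing.
        (out_dir / (name + ".err")).write_text(str(exc) + "\n")
        return None
    (out_dir / (name + ".out")).write_text(result.stdout)
    (out_dir / (name + ".err")).write_text(result.stderr)
    return result.returncode


# Hypothetical output names mapped to a few of the commands from the list.
COMMANDS = {
    "qmgr_print_server": ["qmgr", "-c", "print server"],
    "qstat_tf": ["qstat", "-tf"],
    "pbsnodes_a": ["pbsnodes", "-a"],
}


def capture_all(out_dir):
    """Run every command in COMMANDS and return a dict of name -> exit code."""
    return {name: capture_command(cmd, out_dir, name) for name, cmd in COMMANDS.items()}
```

Keeping stderr in its own file means an empty `.out` file can be told apart from a command that failed.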

Current pbs_diag does NOT capture the following core product information, but they have been on my list to add and I’d like to see them in pbs_snapshot (the qstat options are only just now being merged into the product) :

capture the local pbs_mom logs for the same days as server and sched (and comm) logs are collected for
qmgr -c "p r"
qstat -fx -F dsv
qstat -f -F dsv

These are Linux specific things that pbs_diag currently does in configuration gathering mode (Windows alternatives should be added if available, pbs_diag is Linux only):

cat /etc/release
ps -ef | grep pbs | grep -v grep
uname -a
tar.gz the output

These would be new but useful Linux specific operations that would be nice to have (Windows alternatives should be added if available, pbs_diag is Linux only):

lsof | grep pbs
ps -aux | grep pbs | grep -v grep
capture /etc/hosts
capture /etc/nsswitch.conf

I am ignoring the extra feature of pbs_diag where one can specify particular jobs to gather information on. All that really does is provide tracejob output for the job, qstat -f output for that job alone, pbs_dtj output for the job, and make sure that the log files covering the lifetime of the job got copied into the diag. I think this is safe to not implement in a replacement utility. tracejob adds nothing beyond the logs, qstat -f is provided elsewhere, pbs_dtj is not a supported tool and should not be used from pbs_snapshot, and the log copies can be covered with the -L option.

@sgombosi, your previous note says:
"server_priv output doesn’t appear to capture everything (no subdirectories, e.g. hooks, tmp, topology, etc.). "

but the EDD says:
server_priv sub-directory: a copy of the ‘server_priv’ directory inside PBS_HOME

So we can expect the pbs_snapshot that gets added to the product to do a complete capture of server_priv.

Thanks a lot for your replies and all the specific details guys! I’ll change the design to enhance pbs_snapshot with all the suggestions.

Just one clarification Scott, you mentioned the following:
cat /etc/release
ps -ef | grep pbs | grep -v grep
uname -a
tar.gz the output

When you say tar.gz the output, did you mean to say that we should create a tarball for the whole snapshot, or did you mean to say that we should package the specific information mentioned here (starting with cat)?

Thanks again for all the feedback!

One more thing guys, seeing as how this is a lot of information and parts of it are redundant, do you think it would be valuable to have two levels in pbs_snapshot for filtering information? So, something like “--verbose” would print out everything, but by default (or with a --compact flag) it would skip redundant information and print out only the most commonly used information? Or do you think it’s always better to have everything when you do diagnostics?

Thanks!

I’ve updated the design doc to capture the extra information that you guys mentioned was needed. I’ve also changed the format of the snapshot to organize things better. Please let me know what you think. Thanks!

Hi Ravi, thanks for the updates!

I meant tar/gzip the whole snapshot.

I think it is better to grab everything all the time. You often don’t know at the outset what you want to look at when troubleshooting a problem, so gathering more is helpful (and if it is redundant, it’ll compress nicely anyway).
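As a rough sketch of what "tar/gzip the whole snapshot" could look like in Python, using only the standard library. The function name and the `.tgz` naming next to the snapshot directory are assumptions for illustration, not something the EDD specifies.

```python
import tarfile
from pathlib import Path


def pack_snapshot(snapshot_dir):
    """Create <snapshot_dir>.tgz next to the snapshot directory and return its path."""
    snapshot_dir = Path(snapshot_dir).resolve()
    archive = snapshot_dir.parent / (snapshot_dir.name + ".tgz")
    with tarfile.open(archive, "w:gz") as tar:
        # arcname keeps paths in the archive relative to the snapshot's top-level directory.
        tar.add(snapshot_dir, arcname=snapshot_dir.name)
    return archive
```

The site then has a single file to attach to a ticket, and the redundant text output compresses well.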

Gathering “more” likely makes the job of the --obfuscate flag more difficult, though, since there are now more output formats to consider. I’d be open to discussing possibly grabbing just a core set of information if --obfuscate is used if you feel it greatly simplifies things on the implementation end.

The pbs_diag tool never actually copied the core files themselves historically, but it does have a -g option such that if you point it at a core file it’ll figure out which daemon produced it, load it up in gdb, and run “bt” (though I should have done “bt full”…) on each of the threads and capture the results. In my experience this is more useful than capturing the core file itself, at least at the initial stage of an investigation, and certainly a lot smaller size-wise than the potentially very large core file. I’d suggest that instead of gathering the core file at all, pbs_snapshot should “bt full” the threads for any core file(s) it finds, capture the output, and try to copy the relevant daemon log from that day. If support needs the core file itself they can obtain it separately.

And speaking of --obfuscate and core files… if we do go with pbs_snapshot collecting the core file itself in standard operation, then the utility should NOT collect any core files if --obfuscate is used (unless you want to clean them out, which I doubt you’d want to, and they’d be useless if you tried). If we collect stack trace info there could also be needed obfuscation work in the output, but at least we are just talking about text at that point.

An old trick for avoiding the second call to grep…

ps -ef | grep [p]bs
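Why this works: the regular expression [p]bs matches the text "pbs", but the grep process itself shows up in the ps listing with the literal text "[p]bs" on its command line, which does not contain "pbs". A small Python check of that reasoning, with made-up ps lines:

```python
import re

pattern = re.compile(r"[p]bs")  # the same regular expression that grep receives

# Made-up lines of the kind "ps -ef" might print.
ps_lines = [
    "root  3053     1  0 09:24 ?      00:00:01 /opt/pbs/sbin/pbs_server.bin",
    "root  4120  3990  0 09:30 pts/0  00:00:00 grep [p]bs",
]

# Only the daemon line contains the substring "pbs"; the grep line does not.
matches = [line for line in ps_lines if pattern.search(line)]
```

In a shell it is safer to quote the pattern, as in grep '[p]bs', so the shell cannot expand it as a glob if a file named pbs happens to exist in the current directory.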


Just curious, why collect the mom.lock file? It just contains the pid of the currently running mom. You’re not collecting the other lock files.

Bhroam

Hi Bhroam, the other lock files would be collected under server_priv and sched_priv as the EDD is currently written. I’d suggest re-writing the “mom/” line to match the others and just grab everything in mom_priv/ on the host where pbs_snapshot is run (hooks/, jobs/, config.d/, etc.).

I just don’t see how the lock files are helpful. The ps -ef | grep pbs output is already being gathered, so getting the pid in a different way doesn’t seem useful to me. Now if you’re just collecting everything from the priv directories, it makes more sense that you don’t want to write extra code to avoid collecting it. It doesn’t hurt.

A note that if you collect the entire priv directories, you will collect the cores.

Thanks for the replies guys.
@scc are you sure about not capturing the core files themselves? In your experience, have we survived without core files from users (with just the help of “bt all”), or did we eventually ask them to provide us the core file?

About the lock files, unless we unanimously agree that they aren’t useful, I think it’s better to capture them. We are capturing redundant information in other forms as well (e.g. qstat and qstat -f), but if they are going to help debug the issue faster, I think there’s no harm adding them in.

About the ‘--obfuscate’ option, I’ve now mentioned that core files won’t be captured with --obfuscate, thanks for pointing that out. What about the ‘system’ and ‘pbs’ directories, should we also not capture them when using --obfuscate? And I’m not sure whether the different formats will cause the effort to multiply or not as the data will be captured via PTL interfaces, so it might just involve obfuscating it once and then displaying the same information in different formats. So, maybe we proceed with it as it is for now and if during implementation we find that the effort is huge, we can file separate tickets for enhancing --obfuscate to cover all of the information later?

I made some minor changes to the format of the snapshot and the details, please review the change and let me know if it looks ok. Thanks!

Hi Ravi, in 10+ years of using pbs_diag I have never considered nor been asked to have it actually collect the core files themselves. In many cases the trace info is good enough since it likely ties the incident to a known problem. There are of course cases where getting the actual core files from a customer is necessary, but that is the exception rather than the rule, and given that they may be large and/or numerous (though hopefully not!) and don’t play well with obfuscation, I’d recommend simply not collecting them. @sgombosi, what do you think?

The lock files have historically been helpful on rare occasions. If the problem at hand has to do with status reporting from the init script or possibly failover then checking the PID in the lock file against ps may be enlightening. I’d like to collect them.

As for the system/ directory and --obfuscate, I’d like to try to collect everything in there, but it is a can of worms. As currently written, the EDD shows this:

system/
-    os_info: Information about the OS: version, flavour of linux etc. (output of "uname -a" and "cat /etc/*release*" for linux)
-    process_info: List of processes running on the system when the snapshot was taken (output of "ps -ef | grep pbs | grep -v grep" for linux)
-    lsof_pbs.out: output of "lsof | grep pbs | grep -v grep", only on linux systems
-    ps_aux_pbs.out: output of "ps -aux | grep pbs | grep -v grep", only on linux systems
-    etc_hosts: Copy of "/etc/hosts" file, only on linux systems.
-    etc_nsswitch_conf: Copy of "/etc/nsswitch.conf" file, only on linux systems.

And --obfuscate is listed as:

--obfuscate: Obfuscates euser, egroup, project, account_name
                  Deletes mail endpoints, owner, managers, operators, variable_list
                  ACLs, group_list, job name, jobdir

So --obfuscate does not appear to obfuscate any hostnames (nor ip addresses), which may be problematic for some sites. I think we need to discuss the possibility of having --obfuscate obfuscate (and map) hostnames and possibly ip addresses wherever they appear in the snapshot (job IDs, logs, pbs.conf, etc.) before we decide on what needs obfuscating in the system/ directory, since hostnames are what we’d primarily be talking about, along with the name of the PBS dataservice user. What are your thoughts on having --obfuscate handle hostnames wherever possible?
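To make "obfuscate (and map) hostnames" concrete, here is a minimal Python sketch of one possible approach. It assumes the list of hostnames is already known (for example from pbsnodes output), and the `HostObfuscator` class and the `hostNNNN` placeholder format are invented for this example. IP addresses are not handled here.

```python
import re


class HostObfuscator:
    """Replace known hostnames with stable placeholders and remember the mapping."""

    def __init__(self, hostnames):
        self.mapping = {}
        # Longest names first so a fully qualified name is replaced before its short form.
        for host in sorted(set(hostnames), key=lambda h: (-len(h), h)):
            self.mapping[host] = "host%04d" % (len(self.mapping) + 1)
        escaped = [re.escape(h) for h in self.mapping]
        # Lookarounds keep "node1" from matching inside "node10" or "mynode1",
        # while still allowing a match after "." (job IDs) or "@" (user@host).
        self._pattern = re.compile(r"(?<![\w-])(?:%s)(?![\w-])" % "|".join(escaped))

    def obfuscate(self, text):
        """Return text with every known hostname replaced by its placeholder."""
        if not self.mapping:
            return text
        return self._pattern.sub(lambda m: self.mapping[m.group(0)], text)
```

Because the same object is used for every file in the snapshot, a given host gets the same placeholder in job IDs, logs, and pbs.conf, and the mapping can stay at the site so results can be translated back.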

Thanks again!

Alright, you’ve convinced me that collecting a “bt all” from the core files suffices.

I like the idea of obfuscating hostnames and ip addresses, I’ve added that to --obfuscate’s description. I’ve also added PBS dataservice username. Anything else you think we need to be concerned about?

Please review the changes and provide further feedback. Thanks!

Great! So a few details about core_file_bt/:

  1. It needs to be “bt full”, not “bt all”, I think. Unless I am misinterpreting, “bt all” is not valid and gives me “(gdb) No symbol “all” in current context.”

  2. It needs to go through and obtain a “bt full” for all of the threads. What the current pbs_diag -g command does is an initial “bt” command in gdb for the purpose of simply counting how many "[New " strings appear (IIRC this was because gdb would sometimes print “[New LWP …” and other times “[New Thread …”), then in a new gdb session it loops through those threads running the following gdb commands and this is what it actually collects (in this example there were 3 threads):

    info threads
    thread 1
    bt
    thread 2
    bt
    thread 3
    bt

(of course, we now want “bt” to be “bt full”)

  3. Just advice, really, but be careful to distinguish pbs_server.bin from pbs_comm core files since you need to point gdb at the proper binary. I mention those two specifically because pbs_comm core files would usually be found in server_priv, same as pbs_server.bin core files, so relying on the directory it is found in is not sufficient. pbs_diag uses the file command to determine the daemon:

    [root@centos7 tmp]# file /var/spool/pbs/server_priv/core.3053
    /var/spool/pbs/server_priv/core.3053: ELF 64-bit LSB core file x86-64, version 1 (SYSV), SVR4-style, from ‘/opt/pbs/sbin/pbs_server.bin’
    [root@centos7 tmp]# file /var/spool/pbs/server_priv/core.2175
    /var/spool/pbs/server_priv/core.2175: ELF 64-bit LSB core file x86-64, version 1 (SYSV), SVR4-style, from ‘/opt/pbs/sbin/pbs_comm’
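A small Python sketch of pulling the binary path out of output like the above. This is not pbs_diag's code; the function name is made up, and the regular expression assumes the "from '<path>'" wording shown in the sample. It accepts both straight quotes and the curly quotes that appear in this post.

```python
import re

# Matches the "from '<path>'" part of file(1) output; accepts straight or curly quotes.
_FROM_RE = re.compile(r"from\s+['‘]([^'’]+)['’]")


def binary_from_file_output(file_output):
    """Return the path of the executable that produced a core file, or None."""
    match = _FROM_RE.search(file_output)
    if match is None:
        return None
    # The quoted text may include command-line arguments; keep only the program path.
    parts = match.group(1).split()
    return parts[0] if parts else None
```

The returned path can then be handed to gdb together with the core file, regardless of which priv directory the core was found in.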

For the --obfuscate definition, I think it needs to explicitly say that it obfuscates all user names, not just “euser, PBS dataservice username, owner, managers, operators”. What about the user account names that appear in server log messages like this one?

Type 0 request received from user1@centos7.prog.altair.com

If the EDD were made more generic to say “any non-root user account name” it would be simpler and cover all of the necessary cases.

Thanks for the details and the advice Scott.

For capturing the backtrace of all threads, how about using “thread apply all backtrace”? (https://sourceware.org/gdb/onlinedocs/gdb/Backtrace.html)

And yes, I did have it in the back of my mind that server_priv also contains core files from comm, maybe we should rename the core_files_bt/(sched_priv, server_priv, mom_priv) directories to just core_files_bt/(server, comm, sched, mom) (separating out the comm core file backtraces), what do you think?

Hi Ravi, a slight modification to “thread apply all backtrace full” (added full) does appear to produce all of the information we are looking for.
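A sketch of how that gdb command might be driven from Python, assuming gdb is installed and the right binary for the core file has already been identified. The function names and the timeout are illustrative only; `--batch` and `-ex` are standard gdb options.

```python
import subprocess


def gdb_backtrace_command(binary, core_file):
    """Build a gdb command line that prints a full backtrace of every thread and exits."""
    return [
        "gdb", "--batch",
        "-ex", "thread apply all backtrace full",
        str(binary), str(core_file),
    ]


def collect_backtrace(binary, core_file, timeout=300):
    """Run gdb and return its combined stdout and stderr as text."""
    result = subprocess.run(
        gdb_backtrace_command(binary, core_file),
        stdout=subprocess.PIPE, stderr=subprocess.STDOUT,
        text=True, timeout=timeout,
    )
    return result.stdout
```

A single batch invocation like this avoids the two-pass approach of counting threads first and then looping over them.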

Also, one more little thing to look out for… the core files may have one of two formats (or more, other Linux distros may have a different default naming convention to begin with) since the pbs init script renames them in check_core():

[root@centos7 tmp]# ls -lrt /var/spool/pbs/server_priv/core*
-rw-------. 1 root root 77881344 May  4 09:24 /var/spool/pbs/server_priv/core_0002
-rw-------. 1 root root 68612096 May  4 09:46 /var/spool/pbs/server_priv/core_0001
-rw-------. 1 root root 70922240 May  4 11:26 /var/spool/pbs/server_priv/core.14443
[root@centos7 tmp]# /etc/init.d/pbs start
Starting PBS
Warning: PBS Professional has detected core file(s) in PBS_HOME that require attention!!!
Warning: Please inform your administrator immediately or contact Altair customer support
PBS comm already running.
PBS scheduler already running.
Connecting to PBS dataservice....connected to PBS dataservice@centos7
Using license server at 6200@trlicsrv03
PBS server
[root@centos7 tmp]# ls -lrt /var/spool/pbs/server_priv/core*
-rw-------. 1 root root 77881344 May  4 09:24 /var/spool/pbs/server_priv/core_0002
-rw-------. 1 root root 68612096 May  4 09:46 /var/spool/pbs/server_priv/core_0001
-rw-------. 1 root root 70922240 May  4 11:26 /var/spool/pbs/server_priv/core_0003

Checking for core* should be fine though. Just FYI.
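For example, a "check for core*" pass over the priv directories could be sketched in Python as below. The function name and the default directory list are assumptions for illustration; PBS_HOME is passed in by the caller.

```python
from pathlib import Path


def find_core_files(pbs_home, priv_dirs=("server_priv", "sched_priv", "mom_priv")):
    """Return regular files whose names start with "core" in the given priv directories."""
    found = []
    for name in priv_dirs:
        directory = Path(pbs_home) / name
        if not directory.is_dir():
            # A host may not run every daemon, so a missing directory is not an error.
            continue
        found.extend(sorted(p for p in directory.glob("core*") if p.is_file()))
    return found
```

This picks up both the core.<pid> and the renamed core_NNNN forms. Since the pattern could also match an unrelated file whose name happens to start with "core", it is worth confirming each hit with the file command before handing it to gdb.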