
Samply reports different results based on perf.data compared to perf or Inferno #290

Closed
marxin opened this issue Jul 12, 2024 · 13 comments

@marxin

marxin commented Jul 12, 2024

I've got an application that scans RPM files and builds a build-id debuginfod index from them. If I run perf, I get the following profile (the results are expected based on what the program does):

$ perf record -g --call-graph dwarf target/release/debuginfod-rs /home/marxin/Downloads/RPM
...
$ perf report --no-children
+   12.07%  debuginfod-rs    libc.so.6             [.] __memmove_avx512_unaligned_erms
+    9.40%  debuginfod-rs    debuginfod-rs         [.] rpm::rpm::headers::header::Header<T>::parse
+    7.90%  debuginfod-rs    libc.so.6             [.] _int_free
+    7.12%  debuginfod-rs    libc.so.6             [.] malloc
+    7.10%  debuginfod-rs    debuginfod-rs         [.] <core::str::lossy::Utf8Chunks as core::iter::traits::iterator::Iterator>::next
+    4.82%  debuginfod-rs    debuginfod-rs         [.] rpm::rpm::package::PackageMetadata::get_file_entries
+    4.45%  debuginfod-rs    debuginfod-rs         [.] rpm::rpm::headers::header::Header<rpm::constants::IndexSignatureTag>::parse_signature
+    4.42%  debuginfod-rs    libc.so.6             [.] malloc_consolidate
+    4.11%  debuginfod-rs    libc.so.6             [.] unlink_chunk.isra.0
+    2.30%  debuginfod-rs    libc.so.6             [.] cfree@GLIBC_2.2.5
+    2.15%  debuginfod-rs    debuginfod-rs         [.] alloc::string::String::from_utf8_lossy
+    2.03%  debuginfod-rs    libc.so.6             [.] __memset_avx512_unaligned_erms

If I run Inferno, I get the following flame graph (note that I collapse the call stacks because the application uses Rayon):

perf script | inferno-collapse-perf --skip-after=debuginfod_rs::Server::analyze_file | inferno-flamegraph --minwidth=1
[flame graph: debuginfod]

However, if I try Samply, I get the following Flamegraph:
[Screenshot from 2024-07-12 20-20-24: samply flame graph]

Most of the profile is taken up by functions (e.g. std::sys::pal::unix::fs::stat at 22%) which I cannot see in the perf report or the Inferno flame graph.

Can you please take a look?

Needed files:
https://splichal.eu/tmp/debuginfod-rs.zst
https://splichal.eu/tmp/perf.data.zst

@mstange
Owner

mstange commented Jul 12, 2024

https://splichal.eu/tmp/perf.data.zst gives me an "Access forbidden!" error.

@mstange
Owner

mstange commented Jul 12, 2024

Likely what's happening is that samply is just giving you much more detailed information, broken down by thread.

In the samply flamegraph, at the bottom you can see debuginfod_rs::main calling debuginfod_rs::Server::walk. I can find those two functions in the SVG flamegraph on the very left end (a small red stack). The rest of the work shown in the SVG flamegraph is probably happening on different threads, which you can switch through individually in the Firefox Profiler UI.

The SVG flame graph says there were 3598 samples inside debuginfod_rs::Server::walk. The SVG gives you just a single small box for those 3598 samples. samply and the Firefox Profiler break those 3598 samples down into all the detail from your samply screenshot. A lot of that detail is probably due to having inline stacks. Also, I think inferno throws out boxes that would be smaller than a certain percentage.

@marxin
Author

marxin commented Jul 12, 2024

https://splichal.eu/tmp/perf.data.zst gives me an "Access forbidden!" error.

Please try it now! I've used scp, but forgot to use chown.

@marxin
Author

marxin commented Jul 12, 2024

In the samply flamegraph, at the bottom you can see debuginfod_rs::main calling debuginfod_rs::Server::walk. I can find those two functions in the SVG flamegraph on the very left end (a small red stack). The rest of the work shown in the SVG flamegraph is probably happening on different threads, which you can switch through individually in the Firefox Profiler UI.

OK, I've just tried selecting one thread and filtering for the parse_header function. I can confirm I can find it:

[Screenshot from 2024-07-12 20-55-17]

But it seems to suffer from the very same problem as Inferno. Even for a single thread, the number of call stacks leading to the filtered function is enormous, and one gets a gazillion vertical lines, each representing its own stack.

What's also interesting is that if I select a single thread, I can't see any value in the Self column:

[Screenshot from 2024-07-12 20-59-51]

Is it correct?

Based on the files I provided, can you open the perf profile on your box?

@mstange
Owner

mstange commented Jul 12, 2024

The flame graph and call tree visualizations aren't a great fit for workloads which make a lot of use of recursion. Rayon uses recursion extensively. I agree it makes for a very messy flame graph.
You can use the "collapse recursion" transform from the right-click menu to improve things a little bit. But I agree it's still quite messy. I had more success once I right-clicked debuginfod_rs::Server::walk::{{closure}} and chose "Focus on function".

A better view for these types of profiles would be a "top functions" view (firefox-devtools/profiler#15 is filed about this).

@mstange
Owner

mstange commented Jul 13, 2024

Ah, I see your use of --skip-after=debuginfod_rs::Server::analyze_file now. The "Focus on function" transform is similar to that. The difference is that "Focus on function" will also throw out any stacks that don't include the selected function, whereas --skip-after keeps them.

@mstange
Owner

mstange commented Jul 13, 2024

What's also interesting is that if I select a single thread, I can't see any value in the Self column:

Is it correct?

Yes, it is correct. If you drill down deeper into the stack you will eventually end up at call nodes with non-zero self times. See https://profiler.firefox.com/docs/#/./guide-stack-samples-and-call-trees?id=self-time-in-the-call-tree for a more detailed explanation.

@mstange
Owner

mstange commented Jul 13, 2024

Everything's working as intended, closing. Feel free to keep asking questions here, I'm just marking this as "no work needs to be done".

@mstange mstange closed this as completed Jul 13, 2024
@mstange
Owner

mstange commented Jul 13, 2024

I couldn't resist looking at your profile and had some ideas for optimizations.

These are all about RPM parsing inside the rpm crate:

  • rpm::rpm::headers::header::Header::parse_signature is calling .push() for every single signature byte; IndexData::Bin handling should be rewritten to avoid nom. It can just get the slice and put it into bin with one call to extend_from_slice, for example: let bin_bytes = remaining.get(..num_items as usize).ok_or_else(|| Error::Nom("Insufficient bytes for IndexData::Bin entry".into()))?; bin.extend_from_slice(bin_bytes); (a standalone sketch of this and of the memchr idea two points below follows this list)
  • rpm::rpm::package::PackageMetadata::get_file_entries collects a lot of metadata that you don't need; you only need the paths and the links. There's already a method called get_file_paths to get just the paths; maybe the rpm crate can add another one to get just the links, e.g. get_file_links.
  • IndexData::StringArray uses nom's complete::take_till to find the nul byte. It could instead use the memchr crate for SIMD-accelerated nul byte finding, and then it can do the slicing without nom.
  • This line inside get_file_paths does two allocations: acc.push(PathBuf::from(dir).join(basename));. It might be more efficient to do let mut path = PathBuf::from(dir); path.push(basename); acc.push(path);
  • String::from_utf8_lossy is called for all strings in all header entries, even for entry types that you don't look at. And then all these strings get converted into unix paths anyway, which don't need to be utf-8. Furthermore, the string bytes are copied twice: first from the file into a temporary Vec, and then from that Vec into the header entries. I think one could rework header parsing as follows, with some breaking changes to the API of the rpm crate: During Header::parse, only read the bytes from the input and store them in a Vec that becomes a permanent part of the header. Also parse the IndexHeader and the list of IndexEntrys, and sanity-check the size of the remaining bytes against the sizes expected by the entries. But don't do any entry-type-specific parsing here. Only once somebody calls Header::get_entry_data_as_string_array or one of its friends, get the slice for the entry data from the Vec that's stored in the header. At this point the bytes can be converted into the expected format. For the string array cases, I would just compute byte slices into the header Vec and not convert those slices to strings (don't copy, don't utf-8 validate). Then, when get_file_paths makes the paths, it can just call Path::new(OsStr::from_bytes(byte_slice)) (at least if the target is unix), which would let it skip the utf-8 parsing.
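
A minimal, hypothetical sketch of the extend_from_slice and memchr points above, written as standalone helpers (take_bin and take_nul_terminated are made-up names, and a plain String stands in for the crate's Error::Nom):

use memchr::memchr;

// IndexData::Bin: take the whole blob with one bounds-checked slice and a
// single extend_from_slice call instead of pushing byte by byte through nom.
fn take_bin(remaining: &[u8], num_items: usize) -> Result<Vec<u8>, String> {
    let bin_bytes = remaining
        .get(..num_items)
        .ok_or_else(|| "insufficient bytes for IndexData::Bin entry".to_string())?;
    let mut bin = Vec::with_capacity(num_items);
    bin.extend_from_slice(bin_bytes);
    Ok(bin)
}

// IndexData::StringArray: find the nul terminator with memchr (SIMD-accelerated)
// and split the slice directly, with no nom take_till involved.
fn take_nul_terminated(chunk: &[u8]) -> Result<(&[u8], &[u8]), String> {
    let nul = memchr(0, chunk).ok_or_else(|| "missing nul terminator".to_string())?;
    Ok((&chunk[..nul], &chunk[nul + 1..]))
}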

On the debuginfod-rs side, I think it's worth thinking about a way to kick off the RPM parsing before all the upfront directory traversal is done. This would get you improved parallelism. Also, unrelated to performance, I noticed that get_build_id_for_dwz could make use of memchr::memmem::find.
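
A hedged sketch of the overlap idea, assuming the walker can yield paths as an iterator and a hypothetical analyze_file does the per-file RPM work; each path is handed to the rayon pool as soon as it is discovered, so parsing overlaps with the rest of the traversal:

use std::path::PathBuf;

// Hypothetical: spawn per-file work as paths are discovered instead of
// collecting the full file list up front.
fn walk_and_analyze(paths: impl Iterator<Item = PathBuf> + Send) {
    rayon::scope(|s| {
        for path in paths {
            s.spawn(move |_| {
                // analyze_file(&path); // hypothetical per-file RPM parsing
                let _ = &path;
            });
        }
    });
}

And for the memmem point, memchr::memmem::find does a SIMD-accelerated substring search; a prebuilt Finder can be reused when the same needle is searched across many files (find_marker and marker_finder are made-up helper names):

use memchr::memmem;

// One-off search: byte offset of the first occurrence of `needle`, if any.
fn find_marker(haystack: &[u8], needle: &[u8]) -> Option<usize> {
    memmem::find(haystack, needle)
}

// Reusable searcher for scanning many files with the same needle.
fn marker_finder(needle: &[u8]) -> memmem::Finder<'_> {
    memmem::Finder::new(needle)
}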

@marxin
Author

marxin commented Jul 17, 2024

Really appreciate your observations; they are all good learning material, and I've just copied your reply to the RPM project so that the maintainer is aware of it.

When it comes to debuginfod-rs: it's just a hobby project (that is ~30x faster than the official elfutils implementation even with the aforementioned issues) and the bottleneck would be the disk drive.

Thank you!

@mstange
Owner

mstange commented Jul 17, 2024

When it comes to debuginfod-rs: it's just a hobby project (that is ~30x faster than the official elfutils implementation even with the aforementioned issues)

That's pretty impressive!

and the bottleneck would be the disk drive.

I would have expected this too, but that's not what your profile indicated. The threads were busy at 100% CPU usage, and they weren't spending any time in the kernel pagefaulting. It looked really quite CPU bound.

@marxin
Author

marxin commented Jul 17, 2024

I would have expected this too, but that's not what your profile indicated. The threads were busy at 100% CPU usage, and they weren't spending any time in the kernel pagefaulting. It looked really quite CPU bound.

Oh yeah! The profile I presented was actually collected after my page cache had been filled by the disk reads. A cold start takes about 4x longer, and that's with an extremely fast SSD (~10 GB/s).

@mstange
Owner

mstange commented Jul 17, 2024

That makes sense. Thanks for clearing that up :)
