
Samply reports different results based on perf.data compared to perf or Inferno #290

Closed
marxin opened this issue Jul 12, 2024 · 13 comments

@marxin

marxin commented Jul 12, 2024

I've got an application that scans RPM files and builds a build-id debuginfod index from them. If I run perf, I get the following profile (the results are expected based on what the program does):

$ perf record -g --call-graph dwarf target/release/debuginfod-rs /home/marxin/Downloads/RPM
...
$ perf report --no-children
+   12.07%  debuginfod-rs    libc.so.6             [.] __memmove_avx512_unaligned_erms
+    9.40%  debuginfod-rs    debuginfod-rs         [.] rpm::rpm::headers::header::Header<T>::parse
+    7.90%  debuginfod-rs    libc.so.6             [.] _int_free
+    7.12%  debuginfod-rs    libc.so.6             [.] malloc
+    7.10%  debuginfod-rs    debuginfod-rs         [.] <core::str::lossy::Utf8Chunks as core::iter::traits::iterator::Iterator>::next
+    4.82%  debuginfod-rs    debuginfod-rs         [.] rpm::rpm::package::PackageMetadata::get_file_entries
+    4.45%  debuginfod-rs    debuginfod-rs         [.] rpm::rpm::headers::header::Header<rpm::constants::IndexSignatureTag>::parse_signature
+    4.42%  debuginfod-rs    libc.so.6             [.] malloc_consolidate
+    4.11%  debuginfod-rs    libc.so.6             [.] unlink_chunk.isra.0
+    2.30%  debuginfod-rs    libc.so.6             [.] cfree@GLIBC_2.2.5
+    2.15%  debuginfod-rs    debuginfod-rs         [.] alloc::string::String::from_utf8_lossy
+    2.03%  debuginfod-rs    libc.so.6             [.] __memset_avx512_unaligned_erms

If I run Inferno, I get the following flame graph (note that I collapse the call stacks because the application uses Rayon):

perf script | inferno-collapse-perf --skip-after=debuginfod_rs::Server::analyze_file | inferno-flamegraph --minwidth=1
[flame graph: debuginfod]

However, if I try Samply, I get the following Flamegraph:
[Screenshot from 2024-07-12 20-20-24: samply flame graph]

Most of the profile is taken up by functions (e.g. std::sys::pal::unix::fs::stat at 22%) which I cannot see in the perf report or the Inferno flame graph.

Can you please take a look?

Needed files:
https://splichal.eu/tmp/debuginfod-rs.zst
https://splichal.eu/tmp/perf.data.zst

@mstange
Owner

mstange commented Jul 12, 2024

https://splichal.eu/tmp/perf.data.zst gives me an "Access forbidden!" error.

@mstange
Owner

mstange commented Jul 12, 2024

Likely what's happening is that samply is just giving you much more detailed information, broken down by thread.

In the samply flamegraph, at the bottom you can see debuginfod_rs::main calling debuginfod_rs::Server::walk. I can find those two functions in the SVG flamegraph on the very left end (a small red stack). The rest of the work shown in the SVG flamegraph is probably happening on different threads, which you can switch through individually in the Firefox Profiler UI.

The SVG flame graph says there were 3598 samples inside debuginfod_rs::Server::walk. The SVG gives you just a single small box for those 3598 samples. samply and the Firefox Profiler break those 3598 samples down into all the detail from your samply screenshot. A lot of that detail is probably due to having inline stacks. Also, I think inferno throws out boxes that would be smaller than a certain percentage.

@marxin
Author

marxin commented Jul 12, 2024

https://splichal.eu/tmp/perf.data.zst gives me an "Access forbidden!" error.

Please try it now! I've used scp, but forgot to use chown.

@marxin
Author

marxin commented Jul 12, 2024

In the samply flamegraph, at the bottom you can see debuginfod_rs::main calling debuginfod_rs::Server::walk. I can find those two functions in the SVG flamegraph on the very left end (a small red stack). The rest of the work shown in the SVG flamegraph is probably happening on different threads, which you can switch through individually in the Firefox Profiler UI.

OK, I've just tried selecting one thread and filtering for the parse_header function. I can confirm I can find it:

[Screenshot from 2024-07-12 20-55-17]

But it seems to suffer from the very same problem as Inferno. Even for a single thread, the number of call stacks leading to the filtered function is enormous, and one gets a gazillion vertical lines, each representing its own stack.

What's also interesting is that if I select a single thread, I can't see any value in the Self column:

[Screenshot from 2024-07-12 20-59-51]

Is it correct?

Based on the files I provided, can you open the perf profile on your box?

@mstange
Owner

mstange commented Jul 12, 2024

The flame graph and call tree visualizations aren't a great fit for workloads which make a lot of use of recursion. Rayon uses recursion extensively. I agree it makes for a very messy flame graph.
You can use the "collapse recursion" transform from the right-click menu to improve things a little bit. But I agree it's still quite messy. I had more success once I right-clicked debuginfod_rs::Server::walk::{{closure}} and chose "Focus on function".

A better view for these types of profiles would be a "top functions" view (firefox-devtools/profiler#15 is filed about this).

@mstange
Owner

mstange commented Jul 13, 2024

Ah, I see your use of --skip-after=debuginfod_rs::Server::analyze_file now. The "Focus on function" transform is similar to that. The difference is that "Focus on function" will also throw out any stacks that don't include the selected function, whereas --skip-after keeps them.

@mstange
Owner

mstange commented Jul 13, 2024

What's also interesting is that if I select a single thread, I can't see any value in the Self column:

Is it correct?

Yes, it is correct. If you drill down deeper into the stack you will eventually end up at call nodes with non-zero self times. See https://profiler.firefox.com/docs/#/./guide-stack-samples-and-call-trees?id=self-time-in-the-call-tree for a more detailed explanation.

@mstange
Owner

mstange commented Jul 13, 2024

Everything's working as intended, closing. Feel free to keep asking questions here, I'm just marking this as "no work needs to be done".

@mstange mstange closed this as completed Jul 13, 2024
@mstange
Owner

mstange commented Jul 13, 2024

I couldn't resist looking at your profile and had some ideas for optimizations.

These are all about RPM parsing inside the rpm crate:

  • rpm::rpm::headers::header::Header::parse_signature is calling .push() for every single signature byte; IndexData::Bin handling should be rewritten to avoid nom. It can just get the slice and put it into bin with one call to extend_from_slice, for example: let bin_bytes = remaining.get(..num_items as usize).ok_or_else(|| Error::Nom("Insufficient bytes for IndexData::Bin entry".into()))?; bin.extend_from_slice(bin_bytes); (a standalone sketch of this and of the memchr idea two points below follows this list)
  • rpm::rpm::package::PackageMetadata::get_file_entries collects a lot of metadata that you don't need; you only need the paths and the links. There's already a method called get_file_paths to get just the paths; maybe the rpm crate can add another one to get just the links, e.g. get_file_links.
  • IndexData::StringArray uses nom's complete::take_till to find the nul byte. It could instead use the memchr crate for SIMD-accelerated nul byte finding, and then it can do the slicing without nom.
  • This line inside get_file_paths does two allocations: acc.push(PathBuf::from(dir).join(basename));. It might be more efficient to do let mut path = PathBuf::from(dir); path.push(basename); acc.push(path);
  • String::from_utf8_lossy is called for all strings in all header entries, even for entry types that you don't look at. And then all these strings get converted into unix paths anyway, which don't need to be utf-8. Furthermore, the string bytes are copied twice: first from the file into a temporary Vec, and then from that Vec into the header entries. I think one could rework header parsing as follows, with some breaking changes to the API of the rpm crate: During Header::parse, only read the bytes from the input and store them in a Vec that becomes a permanent part of the header. Also parse the IndexHeader and the list of IndexEntrys, and sanity-check the size of the remaining bytes against the sizes expected by the entries. But don't do any entry-type-specific parsing here. Only once somebody calls Header::get_entry_data_as_string_array or one of its friends, get the slice for the entry data from the Vec that's stored in the header. At this point the bytes can be converted into the expected format. For the string array cases, I would just compute byte slices into the header Vec and not convert those slices to strings (don't copy, don't utf-8 validate). Then, when get_file_paths makes the paths, it can just call Path::new(OsStr::from_bytes(byte_slice)) (at least if the target is unix), which would let it skip the utf-8 parsing.
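
A minimal, hypothetical sketch of the extend_from_slice and memchr points above, written as standalone helpers (take_bin and take_nul_terminated are made-up names, and a plain String stands in for the crate's Error::Nom):

use memchr::memchr;

// IndexData::Bin: take the whole blob with one bounds-checked slice and a
// single extend_from_slice call instead of pushing byte by byte through nom.
fn take_bin(remaining: &[u8], num_items: usize) -> Result<Vec<u8>, String> {
    let bin_bytes = remaining
        .get(..num_items)
        .ok_or_else(|| "insufficient bytes for IndexData::Bin entry".to_string())?;
    let mut bin = Vec::with_capacity(num_items);
    bin.extend_from_slice(bin_bytes);
    Ok(bin)
}

// IndexData::StringArray: find the nul terminator with memchr (SIMD-accelerated)
// and split the slice directly, with no nom take_till involved.
fn take_nul_terminated(chunk: &[u8]) -> Result<(&[u8], &[u8]), String> {
    let nul = memchr(0, chunk).ok_or_else(|| "missing nul terminator".to_string())?;
    Ok((&chunk[..nul], &chunk[nul + 1..]))
}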

On the debuginfod-rs side, I think it's worth thinking about a way to kick off the RPM parsing before all the upfront directory traversal is done. This would get you improved parallelism. Also, unrelated to performance, I noticed that get_build_id_for_dwz could make use of memchr::memmem::find.
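
A hedged sketch of the overlap idea, assuming the walker can yield paths as an iterator and a hypothetical analyze_file does the per-file RPM work; each path is handed to the rayon pool as soon as it is discovered, so parsing overlaps with the rest of the traversal:

use std::path::PathBuf;

// Hypothetical: spawn per-file work as paths are discovered instead of
// collecting the full file list up front.
fn walk_and_analyze(paths: impl Iterator<Item = PathBuf> + Send) {
    rayon::scope(|s| {
        for path in paths {
            s.spawn(move |_| {
                // analyze_file(&path); // hypothetical per-file RPM parsing
                let _ = &path;
            });
        }
    });
}

And for the memmem point, memchr::memmem::find does a SIMD-accelerated substring search; a prebuilt Finder can be reused when the same needle is searched across many files (find_marker and marker_finder are made-up helper names):

use memchr::memmem;

// One-off search: byte offset of the first occurrence of `needle`, if any.
fn find_marker(haystack: &[u8], needle: &[u8]) -> Option<usize> {
    memmem::find(haystack, needle)
}

// Reusable searcher for scanning many files with the same needle.
fn marker_finder(needle: &[u8]) -> memmem::Finder<'_> {
    memmem::Finder::new(needle)
}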

@marxin
Author

marxin commented Jul 17, 2024

Really appreciate your observations; they are all good learning material, and I've just copied your reply to the RPM project so that the maintainer is aware of it.

When it comes to debuginfod-rs: it's just a hobby project (that is ~30x faster than the official elfutils implementation even with the aforementioned issues) and the bottleneck would be the disk drive.

Thank you!

@mstange
Owner

mstange commented Jul 17, 2024

When it comes to debuginfod-rs: it's just a hobby project (that is ~30x faster than the official elfutils implementation even with the aforementioned issues)

That's pretty impressive!

and the bottleneck would be the disk drive.

I would have expected this too, but that's not what your profile indicated. The threads were busy at 100% CPU usage, and they weren't spending any time in the kernel pagefaulting. It looked really quite CPU bound.

@marxin
Author

marxin commented Jul 17, 2024

I would have expected this too, but that's not what your profile indicated. The threads were busy at 100% CPU usage, and they weren't spending any time in the kernel pagefaulting. It looked really quite CPU bound.

Oh yeah! The profile I presented was actually collected after my page cache had been filled by the disk reads. A cold start takes about 4x longer, and that's with an extremely fast SSD (~10 GB/s).

@mstange
Owner

mstange commented Jul 17, 2024

That makes sense. Thanks for clearing that up :)
