
Optimize ToString implementation for integers #136264

Open
wants to merge 2 commits into base: master

Conversation

GuillaumeGomez (Member)

Part of #135543.

Follow-up of #133247 and #128204.

Rather than writing pretty bad benches like I did last time, @workingjubilee: do you have a suggestion on how to check the impact on performance for this PR? Thanks in advance!

r? @workingjubilee

@rustbot (Collaborator)

rustbot commented Jan 29, 2025

Could not assign reviewer from: workingjubilee.
User(s) workingjubilee are either the PR author, already assigned, or on vacation. Please use r? to specify someone else to assign.

@rustbot (Collaborator)

rustbot commented Jan 29, 2025

r? @Mark-Simulacrum

rustbot has assigned @Mark-Simulacrum.
They will have a look at your PR within the next two weeks and either review your PR or reassign to another reviewer.

Use r? to explicitly pick a reviewer

@rustbot added the S-waiting-on-review (Status: Awaiting review from the assignee but also interested parties) and T-libs (Relevant to the library team, which will review and decide on the PR/issue) labels on Jan 29, 2025
@rust-log-analyzer (this comment has been minimized)

@GuillaumeGomez (Member, Author)

I'm very confused by the CI errors. Did I uncover a very weird bug somehow? Like 0 != 0. 😮

@workingjubilee (Member)

specialization strikes again, perhaps

@theemathas (Contributor)

It seems to me that the tests are testing for a known miscompilation: #107975

@theemathas (Contributor)

theemathas commented Jan 30, 2025

I'm guessing that this PR caused the compiler to be able to figure out that calling to_string on a number doesn't cause that number to change, allowing more optimizations to happen, causing the miscompilation to behave differently.

@workingjubilee (Member)

lovely.

@GuillaumeGomez (Member, Author)

Fixed CI. So now, about benchmarking the change: any suggestions, @workingjubilee?

In the meantime:

@bors try
@rust-timer queue

@rust-timer (this comment has been minimized)

@rustbot added the S-waiting-on-perf (Status: Waiting on a perf run to be completed) label on Jan 30, 2025
bors added a commit to rust-lang-ci/rust that referenced this pull request Jan 30, 2025
…string, r=<try>

Optimize `ToString` implementation for integers

Part of rust-lang#135543.

Follow-up of rust-lang#133247 and rust-lang#128204.

Rather than writing pretty bad benchers like I did last time, `@workingjubilee:` do you have a suggestion on how to check the impact on performance for this PR? Thanks in advance!

r? `@workingjubilee`
@bors (Contributor)

bors commented Jan 30, 2025

⌛ Trying commit 83dc76e with merge 97c5b4b...

@bors (Contributor)

bors commented Jan 30, 2025

☀️ Try build successful - checks-actions
Build commit: 97c5b4b (97c5b4b9bc9a34c7dde8738389ba12cb733cd54e)

@rust-timer (this comment has been minimized)

@rust-timer (Collaborator)

Finished benchmarking commit (97c5b4b): comparison URL.

Overall result: ✅ improvements - no action needed

Benchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. While you can manually mark this PR as fit for rollup, we strongly recommend not doing so since this PR may lead to changes in compiler perf.

@bors rollup=never
@rustbot label: -S-waiting-on-perf -perf-regression

Instruction count

This is the most reliable metric that we have; it was used to determine the overall result at the top of this comment. However, even this metric can sometimes exhibit noise.

|                             | mean  | range          | count |
|-----------------------------|-------|----------------|-------|
| Regressions ❌ (primary)    | -     | -              | 0     |
| Regressions ❌ (secondary)  | -     | -              | 0     |
| Improvements ✅ (primary)   | -0.4% | [-0.6%, -0.1%] | 2     |
| Improvements ✅ (secondary) | -     | -              | 0     |
| All ❌✅ (primary)          | -0.4% | [-0.6%, -0.1%] | 2     |

Max RSS (memory usage)

Results (primary 0.3%, secondary 2.1%)

This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

|                             | mean  | range          | count |
|-----------------------------|-------|----------------|-------|
| Regressions ❌ (primary)    | 3.2%  | [3.2%, 3.2%]   | 1     |
| Regressions ❌ (secondary)  | 2.1%  | [2.1%, 2.1%]   | 1     |
| Improvements ✅ (primary)   | -2.6% | [-2.6%, -2.6%] | 1     |
| Improvements ✅ (secondary) | -     | -              | 0     |
| All ❌✅ (primary)          | 0.3%  | [-2.6%, 3.2%]  | 2     |

Cycles

This benchmark run did not return any relevant results for this metric.

Binary size

Results (primary 0.1%, secondary 0.2%)

This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

|                             | mean  | range          | count |
|-----------------------------|-------|----------------|-------|
| Regressions ❌ (primary)    | 0.1%  | [0.0%, 1.0%]   | 22    |
| Regressions ❌ (secondary)  | 0.5%  | [0.5%, 0.5%]   | 3     |
| Improvements ✅ (primary)   | -0.1% | [-0.2%, -0.0%] | 6     |
| Improvements ✅ (secondary) | -0.1% | [-0.1%, -0.1%] | 3     |
| All ❌✅ (primary)          | 0.1%  | [-0.2%, 1.0%]  | 28    |

Bootstrap: 776.287s -> 777.396s (0.14%)
Artifact size: 328.44 MiB -> 328.53 MiB (0.03%)

@rustbot removed the S-waiting-on-perf (Status: Waiting on a perf run to be completed) label on Jan 30, 2025
@workingjubilee (Member)

I don't believe the tests should be changed in this way.

@GuillaumeGomez (Member, Author)

I'm not sure if we should keep the code and modify it (like I did) or just comment out the parts that are failing. Or maybe you see a third way?

@workingjubilee (Member)

The point of these tests is the series of asserts; changing just the one or two lines that fail undermines the point.

@GuillaumeGomez (Member, Author)

The point was testing that calling to_string on them somehow changed the value. The new checks ensure that it doesn't, and I commented as such. I'm not too sure what else to do with them.

@workingjubilee (Member)

These are known-bug tests.

They are testing for the existence of a bug.

We did not fix the bug, because there are other ways to reach the compilation quirks in question; they have little to nothing to do with to_string in particular. We changed the code the test was exercising out from under it, so the test has to use different code in order to still be useful, as what we want is to verify when LLVM stops miscompiling a particular sequence of LLVM IR.

@GuillaumeGomez (Member, Author)

Oh I see. So in short, calling any method on the type should still trigger the original bug iiuc. Let me give it a try. In the meantime, any idea on how to bench this PR?

@GuillaumeGomez force-pushed the optimize-integers-to-string branch from 83dc76e to 1713773 on January 30, 2025 at 17:27
@GuillaumeGomez (Member, Author)

You were right, just calling another method did the trick. Tests are now back to what they were.

@workingjubilee (Member)

Thank you. I will think about that a bit.

@bors (Contributor)

bors commented Feb 4, 2025

☔ The latest upstream changes (presumably #135265) made this pull request unmergeable. Please resolve the merge conflicts.

@GuillaumeGomez force-pushed the optimize-integers-to-string branch from 1713773 to 86056b0 on February 5, 2025 at 13:46
@GuillaumeGomez (Member, Author)

Fixed merge conflicts.

Also cc @tgross35 since you mentioned you were also working on this.

@GuillaumeGomez marked this pull request as ready for review on February 5, 2025 at 15:18
@Mark-Simulacrum (Member)

r? @workingjubilee (I guess) -- happy to get re-rolled but seems like you might have more context here?

@rustbot (Collaborator)

rustbot commented Feb 9, 2025

Could not assign reviewer from: workingjubilee.
User(s) workingjubilee are either the PR author, already assigned, or on vacation. Please use r? to specify someone else to assign.

@tgross35 (Contributor)

> Also cc @tgross35 since you mentioned you were also working on this.

If I am, I'm not aware of it :)

@tgross35 (Contributor)

Oh, there was some confusion on another issue - I'm working on float string conversions, not integers.

@GuillaumeGomez force-pushed the optimize-integers-to-string branch from 86056b0 to 372d69d on February 26, 2025 at 20:05
@GuillaumeGomez (Member, Author)

Came back to this. I wrote this code:

```rust
use std::io::{self, Write};

fn main() {
    let a = 0u8.to_string();
    let b = 0i8.to_string();
    io::stdout().write_all(a.as_bytes()).unwrap();
    io::stdout().write_all(b.as_bytes()).unwrap();

    let a = 0u16.to_string();
    let b = 0i16.to_string();
    io::stdout().write_all(a.as_bytes()).unwrap();
    io::stdout().write_all(b.as_bytes()).unwrap();

    let a = 0u32.to_string();
    let b = 0i32.to_string();
    io::stdout().write_all(a.as_bytes()).unwrap();
    io::stdout().write_all(b.as_bytes()).unwrap();

    let a = 0u64.to_string();
    let b = 0i64.to_string();
    io::stdout().write_all(a.as_bytes()).unwrap();
    io::stdout().write_all(b.as_bytes()).unwrap();
}
```

Then I generated assembly (with --emit asm) with and without this PR's changes.

| with this PR's changes | without this PR's changes |
|------------------------|---------------------------|
| 125967 (125.9 KB)      | 129671 (129.6 KB)         |

So if anything, it at least generates smaller assembly (as expected).

So based on this, is there anything else to be done?

#[doc(hidden)]
#[unstable(
feature = "fmt_internals",
reason = "internal routines only exposed for testing",
Member

This seems inaccurate? This is being used for non-test code -- is there precedent for that? Maybe this should be core_internals or some other internal feature?

Member Author

Bad copy-pasting, good catch. Going to fix it.

@@ -2828,6 +2875,7 @@ impl SpecToString for u8 {
}

#[cfg(not(no_global_oom_handling))]
#[cfg(feature = "optimize_for_size")]
impl SpecToString for i8 {
#[inline]
fn spec_to_string(&self) -> String {
Member

IIUC, it looks like there are two separate places we now have size and non-size optimized code for printing integers (cfgs in core/src/fmt/num.rs, and here). Could we perhaps unify on just one place where the full set of code lives?

Part of why I'm asking is that it seems like there are some strange choices (IMO):

  • fast ToString for i{8 to 64} => String with capacity for maximum sized integer (e.g., 0u64.to_string() will give me a String with capacity for at least 20 bytes)
  • fast ToString for u{8 to 64} => dispatches through &str to String, so will perfectly size the heap buffer based on the actual length
  • small ToString for u8/i8 => maximum sized integer allocations
    • these override the support in core for _fmt on u8/i8

So for the byte types (u8/i8) there's actually 4 separate pieces of code that we are maintaining:

  • Size-optimized core::fmt::num Display impl (used IIUC for {} formatting)
  • Fast-optimized core::fmt::num Display impl (used IIUC for {} formatting)
  • Size-optimized alloc SpecToString impl (for .to_string())
  • Fast-optimized alloc SpecToString impl (for .to_string()) -- defers now to core::fmt::num

Plus, IIUC the signed core::fmt::num impl is now only reachable via Display, never via .to_string(), which also seems like an odd decision.

I also don't see much in the way of rationale for why we are making certain tradeoffs (e.g., why single-byte types are special-cased here, but not for Display). Maybe we can file a tracking issue of some kind and lay out a plan for what we're envisioning the end state to be? The individual changes here are, I guess, fine, but it doesn't seem like we're moving towards a specific vision, but rather tweaking to optimize a particular metric.
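To make the dispatch being discussed easier to follow, here is a minimal, standalone sketch of the specialization pattern involved (nightly-only). The SpecToString/spec_to_string names mirror the ones visible in the diff, but the impl bodies and the main function are illustrative assumptions, not the actual library code.

```rust
// Standalone sketch of the specialization pattern (nightly-only); the trait
// and method names mirror the diff, the bodies are illustrative only.
#![feature(min_specialization)]

trait SpecToString {
    fn spec_to_string(&self) -> String;
}

// Generic fallback: goes through the Display/Formatter machinery.
impl<T: std::fmt::Display> SpecToString for T {
    default fn spec_to_string(&self) -> String {
        format!("{self}")
    }
}

// Specialized fast path for one concrete type, skipping Formatter entirely.
impl SpecToString for u8 {
    fn spec_to_string(&self) -> String {
        let mut n = *self;
        let mut buf = [0u8; 3]; // u8::MAX is 255: at most 3 decimal digits
        let mut pos = buf.len();
        loop {
            pos -= 1;
            buf[pos] = b'0' + n % 10;
            n /= 10;
            if n == 0 {
                break;
            }
        }
        // Only the digits actually produced end up in the heap allocation.
        std::str::from_utf8(&buf[pos..]).unwrap().to_owned()
    }
}

fn main() {
    assert_eq!(200u8.spec_to_string(), "200"); // specialized path
    assert_eq!(1234u32.spec_to_string(), "1234"); // generic Display path
}
```

The questions above are essentially about how many such specialized paths exist, for which types, and under which cfg flags.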

Member Author

Do you want it to be part of this PR or as a follow-up?

Member

I'd personally rather see a plan and cleanup work instead of "random" changes.

Member Author

So the plan is to make integer-to-string conversion faster. There were a few problems when I started working on this:

  1. The buffer used to store the string output for the integer was always the size of the biggest integer (64 bits), which is suboptimal for smaller integers.
  2. We had an extra loop which was never entered for smaller integers (i8/u8), but since we were casting all integers into u64 before converting to string, this optimization was missed.
  3. The ToString implementation uses the same code, which relies on Formatter, meaning that all the Formatter code (in short, checking the internal flags) was still run even though it was never actually used.

Points 1 and 2 were fixed in #128204; this PR fixes the last one.
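To make point 1 concrete, here is a tiny self-contained illustration (the constant names are made up for this example): the maximum decimal length of an integer is fixed by its width, so the conversion buffer and the resulting String capacity never need the 20 bytes required by the largest u64 value.

```rust
// Sketch only (not the library code): maximum decimal lengths per unsigned
// integer type, derived from each type's MAX value.
const MAX_DEC_LEN_U8: usize = 3; // 255
const MAX_DEC_LEN_U16: usize = 5; // 65535
const MAX_DEC_LEN_U32: usize = 10; // 4294967295
const MAX_DEC_LEN_U64: usize = 20; // 18446744073709551615

fn main() {
    assert_eq!(u8::MAX.to_string().len(), MAX_DEC_LEN_U8);
    assert_eq!(u16::MAX.to_string().len(), MAX_DEC_LEN_U16);
    assert_eq!(u32::MAX.to_string().len(), MAX_DEC_LEN_U32);
    assert_eq!(u64::MAX.to_string().len(), MAX_DEC_LEN_U64);
}
```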

Now, about the optimize_for_size feature usage: these optimizations require specialized code, which means a lot more code. And because the code to convert integers to strings isn't the same depending on whether the optimize_for_size feature is enabled, I can't share a single implementation: the internal API changes.

Does it answer your question? Don't hesitate if something isn't clear.

Member

What benchmarks are we using to evaluate "make it faster"? I don't see results in this PR (and e.g. the description explicitly calls out not being sure how to check).

Does the formatter flag checking not get optimized out after inlining? If not, maybe we can improve on that, for example by dispatching early on defaults or similar?

I'm not personally convinced that having N different implementations (can you confirm all of the cases are at least equally covered by tests?) is worth what I'd expect to be marginal improvements (this is where concrete numbers would be useful to justify this), especially when I'd expect that in most cases if you want the fastest possible serialization, ToString is not what you want -- it's pretty unlikely that an owned String containing just the integer is all you need, and the moment you want more than that you're going to be reaching for e.g. itoa to avoid heap allocations etc.

Member Author

> What benchmarks are we using to evaluate "make it faster"? I don't see results in this PR (and e.g. the description explicitly calls out not being sure how to check).

I posted a comparison with and without these changes in this comment. A lot less assembly code is generated; however, this difference alone is not enough to know the impact on performance. I wrote benchmarks, but this is not my specialty, and all I could get was a 1-2% performance difference (which is already significant, but did I write the benches correctly? That's another question).

> Does the formatter flag checking not get optimized out after inlining? If not, maybe we can improve on that, for example by dispatching early on defaults or similar?

No, and unless you add a new field to Formatter that lets you know that no field was updated and that the checks can be skipped, I don't see how you could get this optimization.

> I'm not personally convinced that having N different implementations (can you confirm all of the cases are at least equally covered by tests?) is worth what I'd expect to be marginal improvements (this is where concrete numbers would be useful to justify this), especially when I'd expect that in most cases if you want the fastest possible serialization, ToString is not what you want -- it's pretty unlikely that an owned String containing just the integer is all you need, and the moment you want more than that you're going to be reaching for e.g. itoa to avoid heap allocations etc.

I'm not convinced either, but since the optimize_for_size feature flag exists, I need to deal with it. As for test coverage: no idea for optimize_for_size, but the conversions are tested in the "normal" case. In any case, this code doesn't change the behaviour of optimize_for_size, so on that side we're good.

Also, I'm not looking for the fastest implementation; I'm looking to improve the current situation, which is really suboptimal. We could add a new write_into<W: Write>(self, &mut W) method to have something as fast as itoa, but that's a whole other discussion and I don't plan to start it. My plan ends with this PR. Also to be noted: with this PR, the only remaining difference from itoa is that we don't allow writing an integer directly into a String; everything else is exactly the same.
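For illustration only, here is a rough sketch of what such a write_into API could look like; the WriteInto trait, its method, and the impl below are assumptions made for this example and do not exist in std. The idea is itoa-style: format straight into any fmt::Write sink chosen by the caller, so the API forces no intermediate String allocation.

```rust
use std::fmt::{self, Write};

// Hypothetical sketch of the `write_into` idea mentioned above; nothing like
// this exists in std today.
trait WriteInto {
    fn write_into<W: Write>(self, out: &mut W) -> fmt::Result;
}

impl WriteInto for u32 {
    fn write_into<W: Write>(self, out: &mut W) -> fmt::Result {
        let mut buf = [0u8; 10]; // enough for u32::MAX (4294967295)
        let mut pos = buf.len();
        let mut n = self;
        loop {
            pos -= 1;
            buf[pos] = b'0' + (n % 10) as u8;
            n /= 10;
            if n == 0 {
                break;
            }
        }
        out.write_str(std::str::from_utf8(&buf[pos..]).unwrap())
    }
}

fn main() {
    // The caller picks the destination, e.g. an existing String.
    let mut s = String::from("value = ");
    42u32.write_into(&mut s).unwrap();
    assert_eq!(s, "value = 42");
}
```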

Anyway, the PR is ready. It has a visible impact on at least the generated assembly, which is notably smaller because all the Formatter code can be skipped. It doesn't change the behaviour of optimize_for_size and adds a very small amount of code. Having this kind of small optimization in places like integer-to-string conversion is always very welcome.

Member

> A lot less assembly code is generated; however, this difference alone is not enough to know the impact on performance. I wrote benchmarks, but this is not my specialty, and all I could get was a 1-2% performance difference (which is already significant, but did I write the benches correctly? That's another question).

Can you provide these benchmarks and the raw numbers you produced? Perhaps add them as benches to the code, so they can be run by others, or extend rustc-perf's runtime benchmark suite.

Smaller assembly is (as you say) no real indicator of performance (though is nice) so I'm not sure it really means much by itself.

> I'm not convinced either, but since the optimize_for_size feature flag exists, I need to deal with it. As for test coverage: no idea for optimize_for_size, but the conversions are tested in the "normal" case. In any case, this code doesn't change the behaviour of optimize_for_size, so on that side we're good.

This PR is still adding implementations that could get called (regardless of optimize_for_size) that didn't exist before it (taking us from 2 to 4 impls IIUC). Can you point concretely at some test coverage for each of the 4 impls (source links)? If not, then we really ought to add it, especially when there's a bunch of specialization involved in dispatch.

> adds a very small amount of code. Having this kind of small optimization in places like integer-to-string conversion is always very welcome.

I disagree with this assertion. We have to balance the cost of maintenance, and while this code is important, it sounds like we don't actually hit the itoa perf anyway with these changes. It's nice to be a bit faster, but I'm not convinced that a few percent is worth an extra 2 code paths (presuming I counted correctly) in this code, especially with rust-lang/libs-team#546 / #138215 expected to come soon and possibly add 2 more paths (or at least become the "high performance" path).

Member Author

> Can you provide these benchmarks and the raw numbers you produced? Perhaps add them as benches to the code, so they can be run by others, or extend rustc-perf's runtime benchmark suite.

Considering how specific this is, not sure it's worth adding to rustc-perf. As for adding them into the codebase, I'll need to ensure that they're correctly written first.

Here is the code I used:

```rust
#![feature(test)]

extern crate test;

use test::{Bencher, black_box};

#[inline(always)]
fn convert_to_string<T: ToString>(n: T) -> String {
    n.to_string()
}

macro_rules! decl_benches {
    ($($name:ident: $ty:ident,)+) => {
        $(
            #[bench]
            fn $name(c: &mut Bencher) {
                c.iter(|| convert_to_string(black_box({ let nb: $ty = 20; nb })));
            }
        )+
    }
}

decl_benches! {
    bench_u8: u8,
    bench_i8: i8,
    bench_u16: u16,
    bench_i16: i16,
    bench_u32: u32,
    bench_i32: i32,
    bench_u64: u64,
    bench_i64: i64,
}
```

The results are:

| name      | 1.87.0-nightly (3ea711f 2025-03-09) | With this PR             | diff |
|-----------|-------------------------------------|--------------------------|------|
| bench_i16 | 32.06 ns/iter (+/- 0.12)            | 17.62 ns/iter (+/- 0.03) | -45% |
| bench_i32 | 31.61 ns/iter (+/- 0.04)            | 15.10 ns/iter (+/- 0.06) | -52% |
| bench_i64 | 31.71 ns/iter (+/- 0.07)            | 15.02 ns/iter (+/- 0.20) | -52% |
| bench_i8  | 13.21 ns/iter (+/- 0.14)            | 14.93 ns/iter (+/- 0.16) | +13% |
| bench_u16 | 31.20 ns/iter (+/- 0.06)            | 16.14 ns/iter (+/- 0.11) | -48% |
| bench_u32 | 33.27 ns/iter (+/- 0.05)            | 16.18 ns/iter (+/- 0.10) | -51% |
| bench_u64 | 31.44 ns/iter (+/- 0.06)            | 16.62 ns/iter (+/- 0.21) | -47% |
| bench_u8  | 10.57 ns/iter (+/- 0.30)            | 13.00 ns/iter (+/- 0.43) | +22% |

I have to admit I'm a bit surprised, as I didn't remember the difference being this big... But in any case, seeing how big the difference is, I wonder if the benches are correctly written (hence why I asked for help with them).

> Smaller assembly is (as you say) no real indicator of performance (though is nice) so I'm not sure it really means much by itself.

Yep, hence why I wrote benches. :)

> This PR is still adding implementations that could get called (regardless of optimize_for_size) that didn't exist before it (taking us from 2 to 4 impls IIUC). Can you point concretely at some test coverage for each of the 4 impls (source links)? If not, then we really ought to add it, especially when there's a bunch of specialization involved in dispatch.

There is no complete test for this as far as I can see, but there are some small checks like tests/ui/traits/to-str.rs and test_simple_types in library/alloc/tests/string.rs.

Might be worth adding one?
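For what it's worth, here is a sketch of the kind of coverage test that could be added (hypothetical test name; boundary values chosen so every integer width and sign goes through .to_string()):

```rust
// Hypothetical coverage test; something along these lines could sit next to
// test_simple_types in library/alloc/tests/string.rs.
#[test]
fn test_int_to_string_boundaries() {
    assert_eq!(u8::MIN.to_string(), "0");
    assert_eq!(u8::MAX.to_string(), "255");
    assert_eq!(i8::MIN.to_string(), "-128");
    assert_eq!(i8::MAX.to_string(), "127");
    assert_eq!(u16::MAX.to_string(), "65535");
    assert_eq!(i16::MIN.to_string(), "-32768");
    assert_eq!(u32::MAX.to_string(), "4294967295");
    assert_eq!(i32::MIN.to_string(), "-2147483648");
    assert_eq!(u64::MAX.to_string(), "18446744073709551615");
    assert_eq!(i64::MIN.to_string(), "-9223372036854775808");
}
```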

> I disagree with this assertion. We have to balance the cost of maintenance, and while this code is important, it sounds like we don't actually hit the itoa perf anyway with these changes. It's nice to be a bit faster, but I'm not convinced that a few percent is worth an extra 2 code paths (presuming I counted correctly) in this code, especially with rust-lang/libs-team#546 / #138215 expected to come soon and possibly add 2 more paths (or at least become the "high performance" path).

I can help with maintaining this code. The plan is not to be as good as itoa (which isn't possible anyway with the current API) but to be as good as possible within the current limitations. The improvements seem noticeable and I think they are worth it. I also think that integer-to-string conversion is very common, and even if we provide new APIs to handle it (which would be nice), this code is common enough to make a nice impact in existing codebases.

@GuillaumeGomez force-pushed the optimize-integers-to-string branch from 86a3315 to bf99f6f on March 10, 2025 at 15:33
@GuillaumeGomez (Member, Author)

I also rebased the code (I did it to match the nightly I used for the benchmark comparison).

Labels
S-waiting-on-review (Status: Awaiting review from the assignee but also interested parties), T-libs (Relevant to the library team, which will review and decide on the PR/issue)
9 participants