Optimize `ToString` implementation for integers #136264
base: master
Conversation
Could not assign reviewer from: …

rustbot has assigned @Mark-Simulacrum. Use …
I'm very confused by the CI errors. Did I uncover a very weird bug somehow? Like …

specialization strikes again, perhaps
It seems to me that the tests are testing for a known miscompilation: #107975

I'm guessing that this PR allowed the compiler to figure out that calling to_string on a number doesn't change that number, enabling more optimizations and causing the miscompilation to behave differently.

lovely.

Fixed CI. So now about benching the change: any suggestion @workingjubilee? In the meantime: @bors try @rust-timer queue
…string, r=<try> Optimize `ToString` implementation for integers Part of rust-lang#135543. Follow-up of rust-lang#133247 and rust-lang#128204. Rather than writing pretty bad benchers like I did last time, `@workingjubilee:` do you have a suggestion on how to check the impact on performance for this PR? Thanks in advance! r? `@workingjubilee`
☀️ Try build successful - checks-actions
Finished benchmarking commit (97c5b4b): comparison URL.

Overall result: ✅ improvements - no action needed

Benchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. While you can manually mark this PR as fit for rollup, we strongly recommend not doing so since this PR may lead to changes in compiler perf.

@bors rollup=never

Instruction count: This is the most reliable metric that we have; it was used to determine the overall result at the top of this comment. However, even this metric can sometimes exhibit noise.

Max RSS (memory usage): Results (primary 0.3%, secondary 2.1%). This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

Cycles: This benchmark run did not return any relevant results for this metric.

Binary size: Results (primary 0.1%, secondary 0.2%). This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

Bootstrap: 776.287s -> 777.396s (0.14%)
I don't believe the tests should be changed in this way.

I'm not sure if we should keep the code and modify it (like I did) or just comment out the parts that are failing. Or maybe you see a third way?

The point of these tests is the series of asserts; changing just the one or two lines that fail undermines the point.

The point was testing that calling …

These are known-bug tests. They are testing for the existence of a bug. We did not fix the bug, because there are other ways to reach the compilation quirks in question. They have little to nothing to do with …

Oh I see. So in short, calling any method on the type should still trigger the original bug, IIUC. Let me give it a try. In the meantime, any idea on how to bench this PR?
(force-pushed: 83dc76e to 1713773)
You were right, just calling another method did the trick. Tests are now back to what they were.

Thank you. I will think about that a bit.

☔ The latest upstream changes (presumably #135265) made this pull request unmergeable. Please resolve the merge conflicts.
(force-pushed: 1713773 to 86056b0)
Fixed merge conflicts. Also cc @tgross35 since you mentioned you were also working on this.

r? @workingjubilee (I guess) -- happy to get re-rolled but seems like you might have more context here?

Could not assign reviewer from: …

If I am, I'm not aware of it :)

Oh, there was some confusion on another issue - I'm working on float string conversions, not integers.
(force-pushed: 86056b0 to 372d69d)
Came back to this. I wrote this code:

```rust
use std::io::{self, Write};

fn main() {
    let a = 0u8.to_string();
    let b = 0i8.to_string();
    io::stdout().write_all(a.as_bytes()).unwrap();
    io::stdout().write_all(b.as_bytes()).unwrap();
    let a = 0u16.to_string();
    let b = 0i16.to_string();
    io::stdout().write_all(a.as_bytes()).unwrap();
    io::stdout().write_all(b.as_bytes()).unwrap();
    let a = 0u32.to_string();
    let b = 0i32.to_string();
    io::stdout().write_all(a.as_bytes()).unwrap();
    io::stdout().write_all(b.as_bytes()).unwrap();
    let a = 0u64.to_string();
    let b = 0i64.to_string();
    io::stdout().write_all(a.as_bytes()).unwrap();
    io::stdout().write_all(b.as_bytes()).unwrap();
}
```

Then I generated assembly (with …)
So if anything, it at least generates smaller assembly code (as expected). So based on this, is there anything else to be done?
```rust
#[doc(hidden)]
#[unstable(
    feature = "fmt_internals",
    reason = "internal routines only exposed for testing",
```
This seems inaccurate? This is being used for non-test code -- is there precedent for that? Maybe this should be core_internals or some other internal feature?
Bad copy-pasting, good catch. Going to fix it.
```diff
@@ -2828,6 +2875,7 @@ impl SpecToString for u8 {
 }

+#[cfg(not(no_global_oom_handling))]
 #[cfg(feature = "optimize_for_size")]
 impl SpecToString for i8 {
     #[inline]
     fn spec_to_string(&self) -> String {
```
IIUC, it looks like there are two separate places we now have size and non-size optimized code for printing integers (cfgs in core/src/fmt/num.rs, and here). Could we perhaps unify on just one place where the full set of code lives?

Part of why I'm asking is that it seems like there is some amount of strange choices (IMO):

- fast ToString for i{8 to 64} => String with capacity for maximum sized integer (e.g., `0u64.to_string()` will give me a String with capacity for at least 20 bytes)
- fast ToString for u{8 to 64} => dispatches through &str to String, so will perfectly size the heap buffer based on the actual length
- small ToString for u8/i8 => maximum sized integer allocations
- these override the support in core for _fmt on u8/i8

So for the byte types (u8/i8) there's actually 4 separate pieces of code that we are maintaining:

- Size-optimized core::fmt::num Display impl (used IIUC for `{}` formatting)
- Fast-optimized core::fmt::num Display impl (used IIUC for `{}` formatting)
- Size-optimized alloc SpecToString impl (for `.to_string()`)
- Fast-optimized alloc SpecToString impl (for `.to_string()`) -- defers now to core::fmt::num

Plus, IIUC the signed core::fmt::num impl is now only reachable via Display, never via `.to_string()`, which also seems like an odd decision.

I also don't see much in the way of rationale for why we are making certain tradeoffs (e.g., why single-byte types are special-cased here, but not for Display). Maybe we can file a tracking issue of some kind and lay out a plan for what we're envisioning the end state to be? The individual changes here are, I guess, fine, but it doesn't seem like we're moving towards a specific vision, rather tweaking to optimize a particular metric.
Do you want it to be part of this PR or as follow-up?
I'd personally rather see a plan and cleanup work, followed by "random" changes.
So the plan is: making integer-to-string conversion faster. There were a few problems when I started working on this:

1. The buffer used to store the string output for the integer was always the size of the biggest integer (64 bits), which is suboptimal for smaller integers.
2. We had an extra loop which was never entered for smaller integers (`i8`/`u8`), but since we were casting all integers into `u64` before converting to string, this optimization was missed.
3. The `ToString` implementation uses the same code, which relies on `Formatter`, meaning that all the `Formatter` code (checking the internal flags, in short) was still run even though it was never actually used.

Points 1 and 2 were fixed in #128204. This PR fixes the last one.
Now about the `optimize_for_size` feature usage: considering these optimizations require specialized code, it also means a lot more code. And because the code to convert integers to string isn't the same depending on whether or not the `optimize_for_size` feature is enabled, I can't share the same code because the internal API changes.

Does that answer your question? Don't hesitate if something isn't clear.
What benchmarks are we using to evaluate "make it faster"? I don't see results in this PR (and e.g. the description explicitly calls out not being sure how to check).

Does the formatter flag checking not get optimized out after inlining? If not, maybe we can improve on that, for example by dispatching early on defaults or similar?

I'm not personally convinced that having N different implementations (can you confirm all of the cases are at least equally covered by tests?) is worth what I'd expect to be marginal improvements (this is where concrete numbers would be useful to justify this), especially when I'd expect that in most cases if you want the fastest possible serialization, ToString is not what you want -- it's pretty unlikely that an owned String containing just the integer is all you need, and the moment you want more than that you're going to be reaching for e.g. `itoa` to avoid heap allocations etc.
> What benchmarks are we using to evaluate "make it faster"? I don't see results in this PR (and e.g. the description explicitly calls out not being sure how to check).

I posted a comparison with and without these changes in this comment. A lot less assembly code is generated; however, this difference alone is not enough to know the impact on performance. I wrote benchmarks, but this is not my specialty and all I could get was a 1-2% performance difference (which is already significant, but did I write the benches correctly? That's another question).

> Does the formatter flag checking not get optimized out after inlining? If not, maybe we can improve on that, for example by dispatching early on defaults or similar?

No, and unless you add a new field to `Formatter` allowing you to know that no field was updated and that the checks should be skipped, I don't see how you could get this optimization.
> I'm not personally convinced that having N different implementations (can you confirm all of the cases are at least equally covered by tests?) is worth what I'd expect to be marginal improvements (this is where concrete numbers would be useful to justify this), especially when I'd expect that in most cases if you want the fastest possible serialization, ToString is not what you want -- it's pretty unlikely that an owned String containing just the integer is all you need, and the moment you want more than that you're going to be reaching for e.g. itoa to avoid heap allocations etc.

I'm not convinced either, but since the `optimize_for_size` feature flag exists, I need to deal with it. As for test coverage, no idea for `optimize_for_size`, but the conversions are tested in the "normal" case. In any case, this code doesn't change the behaviour of `optimize_for_size`, so on this side we're good.
Also, I'm not looking for the fastest implementation, I'm looking at improving the current situation, which is really suboptimal. We could add a new `write_into<W: Write>(self, &mut W)` method to have something as fast as `itoa`, but that's a whole other discussion and I don't plan to start it. My plan ends with this PR. Also to be noted: with this PR, the only remaining difference with `itoa` is that we don't allow writing an integer into a `String`; everything else is the exact same.
Anyway, the PR is ready. It has a visible impact on at least the generated assembly, which is notably smaller because it skips all the `Formatter` code. It doesn't change the behaviour of `optimize_for_size` and adds a very small amount of code. Having this kind of small optimization for something as common as integer-to-string conversion is always very welcome.
> A lot less assembly code is generated; however, this difference alone is not enough to know the impact on performance. I wrote benchmarks, but this is not my specialty and all I could get was a 1-2% performance difference (which is already significant, but did I write the benches correctly? That's another question).
Can you provide these benchmarks and the raw numbers you produced? Perhaps add them as benches to the code, so they can be run by others, or extend rustc-perf's runtime benchmark suite.
Smaller assembly is (as you say) no real indicator of performance (though is nice) so I'm not sure it really means much by itself.
> I'm not convinced either but since the optimize_for_size feature flag exists, I need to deal with it. As for test coverage, no idea for optimize_for_size but the conversions are tested in the "normal" case. But in any case, this code doesn't change the behaviour of optimize_for_size so on this side we're good.
This PR is still adding implementations that could get called (regardless of optimize_for_size) that didn't exist before it (taking us from 2 to 4 impls IIUC). Can you point concretely at some test coverage for each of the 4 impls (source links)? If not, then we really ought to add it, especially when there's a bunch of specialization involved in dispatch.
> adds a very small amount of code. Having this kind of small optimization in places like integers to string optimization is always very welcome.
I disagree with this assertion. We have to balance the cost of maintenance, and while this code is important, it sounds like we don't actually hit the itoa perf anyway with these changes. It's nice to be a bit faster, but I'm not convinced that a few percent is worth an extra 2 code paths (presuming I counted correctly) in this code, especially with rust-lang/libs-team#546 / #138215 expected to come soon and possibly add 2 more paths (or at least become the "high performance" path).
> Can you provide these benchmarks and the raw numbers you produced? Perhaps add them as benches to the code, so they can be run by others, or extend rustc-perf's runtime benchmark suite.
Considering how specific this is, not sure it's worth adding to rustc-perf. As for adding them into the codebase, I'll need to ensure that they're correctly written first.
Here is the code I used:

```rust
#![feature(test)]
extern crate test;

use test::{Bencher, black_box};

#[inline(always)]
fn convert_to_string<T: ToString>(n: T) -> String {
    n.to_string()
}

macro_rules! decl_benches {
    ($($name:ident: $ty:ident,)+) => {
        $(
            #[bench]
            fn $name(c: &mut Bencher) {
                c.iter(|| convert_to_string(black_box({ let nb: $ty = 20; nb })));
            }
        )+
    }
}

decl_benches! {
    bench_u8: u8,
    bench_i8: i8,
    bench_u16: u16,
    bench_i16: i16,
    bench_u32: u32,
    bench_i32: i32,
    bench_u64: u64,
    bench_i64: i64,
}
```
The results are:

| name | 1.87.0-nightly (3ea711f 2025-03-09) | With this PR | diff |
|---|---|---|---|
| bench_i16 | 32.06 ns/iter (+/- 0.12) | 17.62 ns/iter (+/- 0.03) | -45% |
| bench_i32 | 31.61 ns/iter (+/- 0.04) | 15.10 ns/iter (+/- 0.06) | -52% |
| bench_i64 | 31.71 ns/iter (+/- 0.07) | 15.02 ns/iter (+/- 0.20) | -52% |
| bench_i8 | 13.21 ns/iter (+/- 0.14) | 14.93 ns/iter (+/- 0.16) | +13% |
| bench_u16 | 31.20 ns/iter (+/- 0.06) | 16.14 ns/iter (+/- 0.11) | -48% |
| bench_u32 | 33.27 ns/iter (+/- 0.05) | 16.18 ns/iter (+/- 0.10) | -51% |
| bench_u64 | 31.44 ns/iter (+/- 0.06) | 16.62 ns/iter (+/- 0.21) | -47% |
| bench_u8 | 10.57 ns/iter (+/- 0.30) | 13.00 ns/iter (+/- 0.43) | +22% |
I have to admit I'm a bit surprised as I didn't remember the difference to be this big... But in any case, seeing how big the difference is, I wonder if the benches are correctly written (hence why I asked help for it).
> Smaller assembly is (as you say) no real indicator of performance (though is nice) so I'm not sure it really means much by itself.
Yep, hence why I wrote benches. :)
> This PR is still adding implementations that could get called (regardless of optimize_for_size) that didn't exist before it (taking us from 2 to 4 impls IIUC). Can you point concretely at some test coverage for each of the 4 impls (source links)? If not, then we really ought to add it, especially when there's a bunch of specialization involved in dispatch.
There is no complete test for this as far as I can see, but there are some small checks like tests/ui/traits/to-str.rs and `test_simple_types` in library/alloc/tests/string.rs. Might be worth adding one?
> I disagree with this assertion. We have to balance the cost of maintenance, and while this code is important, it sounds like we don't actually hit the itoa perf anyway with these changes. It's nice to be a bit faster, but I'm not convinced that a few percent is worth an extra 2 code paths (presuming I counted correctly) in this code, especially with rust-lang/libs-team#546 / #138215 expected to come soon and possibly add 2 more paths (or at least become the "high performance" path).
I can help with maintaining this code. The plan is not to be as good as `itoa` (which isn't possible anyway with the current API) but to be as good as possible within the current limitations. The improvements seem noticeable and I think are worth it. I also think that integer-to-string conversion is very common, and even if we provide new APIs to handle it (which would be nice), this code is common enough to make a nice impact in existing codebases.
(force-pushed: 86a3315 to bf99f6f)
I also rebased the code (did it to match the nightly I used to compare for the benches).
Part of #135543.
Follow-up of #133247 and #128204.
Rather than writing pretty bad benches like I did last time, @workingjubilee: do you have a suggestion on how to check the impact on performance for this PR? Thanks in advance!
r? @workingjubilee