
Optimize ToString implementation for integers #136264

Open
wants to merge 2 commits into base: master

Conversation

GuillaumeGomez (Member)

Part of #135543.

Follow-up of #133247 and #128204.

Rather than writing pretty bad benches like I did last time, @workingjubilee: do you have a suggestion on how to check the impact on performance for this PR? Thanks in advance!

r? @workingjubilee

@rustbot (Collaborator)

rustbot commented Jan 29, 2025

Could not assign reviewer from: workingjubilee.
User(s) workingjubilee are either the PR author, already assigned, or on vacation. Please use r? to specify someone else to assign.

@rustbot (Collaborator)

rustbot commented Jan 29, 2025

r? @Mark-Simulacrum

rustbot has assigned @Mark-Simulacrum.
They will have a look at your PR within the next two weeks and either review your PR or reassign to another reviewer.

Use r? to explicitly pick a reviewer

@rustbot added the S-waiting-on-review (Status: Awaiting review from the assignee but also interested parties) and T-libs (Relevant to the library team, which will review and decide on the PR/issue) labels on Jan 29, 2025
@rust-log-analyzer (this comment has been minimized)

@GuillaumeGomez (Member, Author)

I'm very confused by the CI errors. Did I uncover a very weird bug somehow? Like 0 != 0. 😮

@workingjubilee (Member)

specialization strikes again, perhaps

@theemathas (Contributor)

It seems to me that the tests are testing for a known miscompilation: #107975

@theemathas (Contributor)

theemathas commented Jan 30, 2025

I'm guessing that this PR caused the compiler to be able to figure out that calling to_string on a number doesn't cause that number to change, allowing more optimizations to happen, causing the miscompilation to behave differently.

@workingjubilee (Member)

lovely.

@GuillaumeGomez (Member, Author)

Fixed CI. So now, about benchmarking the change: any suggestions, @workingjubilee?

In the meantime:

@bors try
@rust-timer queue

@rust-timer (this comment has been minimized)

@rustbot added the S-waiting-on-perf (Status: Waiting on a perf run to be completed) label on Jan 30, 2025
bors added a commit to rust-lang-ci/rust that referenced this pull request Jan 30, 2025
…string, r=<try>

Optimize `ToString` implementation for integers

Part of rust-lang#135543.

Follow-up of rust-lang#133247 and rust-lang#128204.

Rather than writing pretty bad benchers like I did last time, `@workingjubilee:` do you have a suggestion on how to check the impact on performance for this PR? Thanks in advance!

r? `@workingjubilee`
@bors (Contributor)

bors commented Jan 30, 2025

⌛ Trying commit 83dc76e with merge 97c5b4b...

@bors (Contributor)

bors commented Jan 30, 2025

☀️ Try build successful - checks-actions
Build commit: 97c5b4b (97c5b4b9bc9a34c7dde8738389ba12cb733cd54e)

@rust-timer (this comment has been minimized)

@rust-timer (Collaborator)

Finished benchmarking commit (97c5b4b): comparison URL.

Overall result: ✅ improvements - no action needed

Benchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. While you can manually mark this PR as fit for rollup, we strongly recommend not doing so since this PR may lead to changes in compiler perf.

@bors rollup=never
@rustbot label: -S-waiting-on-perf -perf-regression

Instruction count

This is the most reliable metric that we have; it was used to determine the overall result at the top of this comment. However, even this metric can sometimes exhibit noise.

|                             | mean  | range          | count |
|-----------------------------|-------|----------------|-------|
| Regressions ❌ (primary)    | -     | -              | 0     |
| Regressions ❌ (secondary)  | -     | -              | 0     |
| Improvements ✅ (primary)   | -0.4% | [-0.6%, -0.1%] | 2     |
| Improvements ✅ (secondary) | -     | -              | 0     |
| All ❌✅ (primary)          | -0.4% | [-0.6%, -0.1%] | 2     |

Max RSS (memory usage)

Results (primary 0.3%, secondary 2.1%)

This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

|                             | mean  | range          | count |
|-----------------------------|-------|----------------|-------|
| Regressions ❌ (primary)    | 3.2%  | [3.2%, 3.2%]   | 1     |
| Regressions ❌ (secondary)  | 2.1%  | [2.1%, 2.1%]   | 1     |
| Improvements ✅ (primary)   | -2.6% | [-2.6%, -2.6%] | 1     |
| Improvements ✅ (secondary) | -     | -              | 0     |
| All ❌✅ (primary)          | 0.3%  | [-2.6%, 3.2%]  | 2     |

Cycles

This benchmark run did not return any relevant results for this metric.

Binary size

Results (primary 0.1%, secondary 0.2%)

This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

|                             | mean  | range          | count |
|-----------------------------|-------|----------------|-------|
| Regressions ❌ (primary)    | 0.1%  | [0.0%, 1.0%]   | 22    |
| Regressions ❌ (secondary)  | 0.5%  | [0.5%, 0.5%]   | 3     |
| Improvements ✅ (primary)   | -0.1% | [-0.2%, -0.0%] | 6     |
| Improvements ✅ (secondary) | -0.1% | [-0.1%, -0.1%] | 3     |
| All ❌✅ (primary)          | 0.1%  | [-0.2%, 1.0%]  | 28    |

Bootstrap: 776.287s -> 777.396s (0.14%)
Artifact size: 328.44 MiB -> 328.53 MiB (0.03%)

@rustbot removed the S-waiting-on-perf (Status: Waiting on a perf run to be completed) label on Jan 30, 2025
@workingjubilee (Member)

I don't believe the tests should be changed in this way.

@GuillaumeGomez (Member, Author)

I'm not sure if we should keep the code and modify it (like I did) or just comment out the parts that are failing. Or maybe you see a third way?

@workingjubilee (Member)

The point of these tests is the series of asserts; changing just the one or two lines that fail undermines the point.

@GuillaumeGomez (Member, Author)

The point was testing that calling to_string on them somehow changed the value. The new checks ensure that it doesn't, and I commented as such. I'm not too sure what else to do with them.

@workingjubilee (Member)

These are known-bug tests.

They are testing for the existence of a bug.

We did not fix the bug, because there are other ways to reach the compilation quirks in question; they have little to nothing to do with to_string in particular. We changed the code the test was exercising out from under it, so the test has to use different code in order to still be useful, as what we want is to verify when LLVM stops miscompiling a particular sequence of LLVM IR.

@GuillaumeGomez (Member, Author)

Oh I see. So in short, calling any method on the type should still trigger the original bug iiuc. Let me give it a try. In the meantime, any idea on how to bench this PR?

@GuillaumeGomez force-pushed the optimize-integers-to-string branch from 83dc76e to 1713773 on January 30, 2025 at 17:27
@GuillaumeGomez (Member, Author)

You were right, just calling another method did the trick. Tests are now back to what they were.

@workingjubilee (Member)

Thank you. I will think about that a bit.

@bors (Contributor)

bors commented Feb 4, 2025

☔ The latest upstream changes (presumably #135265) made this pull request unmergeable. Please resolve the merge conflicts.

@GuillaumeGomez force-pushed the optimize-integers-to-string branch from 1713773 to 86056b0 on February 5, 2025 at 13:46
@GuillaumeGomez (Member, Author)

Fixed merge conflicts.

Also cc @tgross35 since you mentioned you were also working on this.

@GuillaumeGomez marked this pull request as ready for review on February 5, 2025 at 15:18
@Mark-Simulacrum (Member)

r? @workingjubilee (I guess) -- happy to get re-rolled but seems like you might have more context here?

@rustbot (Collaborator)

rustbot commented Feb 9, 2025

Could not assign reviewer from: workingjubilee.
User(s) workingjubilee are either the PR author, already assigned, or on vacation. Please use r? to specify someone else to assign.

@tgross35 (Contributor)

> Also cc @tgross35 since you mentioned you were also working on this.

If I am, I'm not aware of it :)

@tgross35 (Contributor)

Oh, there was some confusion on another issue - I'm working on float string conversions, not integers.

@GuillaumeGomez force-pushed the optimize-integers-to-string branch from 86056b0 to 372d69d on February 26, 2025 at 20:05
@GuillaumeGomez (Member, Author)

Came back to this. I wrote this code:

```rust
use std::io::{self, Write};

fn main() {
    let a = 0u8.to_string();
    let b = 0i8.to_string();
    io::stdout().write_all(a.as_bytes()).unwrap();
    io::stdout().write_all(b.as_bytes()).unwrap();

    let a = 0u16.to_string();
    let b = 0i16.to_string();
    io::stdout().write_all(a.as_bytes()).unwrap();
    io::stdout().write_all(b.as_bytes()).unwrap();

    let a = 0u32.to_string();
    let b = 0i32.to_string();
    io::stdout().write_all(a.as_bytes()).unwrap();
    io::stdout().write_all(b.as_bytes()).unwrap();

    let a = 0u64.to_string();
    let b = 0i64.to_string();
    io::stdout().write_all(a.as_bytes()).unwrap();
    io::stdout().write_all(b.as_bytes()).unwrap();
}
```

Then I generated assembly (with --emit asm) with and without this PR's changes.

| with this PR's changes | without this PR's changes |
|------------------------|---------------------------|
| 125967 (125.9 KB)      | 129671 (129.6 KB)         |

So if anything, it at least generates smaller assembly (as expected).

So based on this, is there anything else to be done?

#[doc(hidden)]
#[unstable(
feature = "fmt_internals",
reason = "internal routines only exposed for testing",
Member

This seems inaccurate? This is being used for non-test code -- is there precedent for that? Maybe this should be core_internals or some other internal feature?

Member Author

Bad copy-pasting, good catch. Going to fix it.

@@ -2828,6 +2875,7 @@ impl SpecToString for u8 {
}

#[cfg(not(no_global_oom_handling))]
#[cfg(feature = "optimize_for_size")]
impl SpecToString for i8 {
#[inline]
fn spec_to_string(&self) -> String {
Member

IIUC, it looks like there are two separate places we now have size and non-size optimized code for printing integers (cfgs in core/src/fmt/num.rs, and here). Could we perhaps unify on just one place where the full set of code lives?

Part of why I'm asking is that it seems like there are some strange choices (IMO):

  • fast ToString for i{8 to 64} => String with capacity for maximum sized integer (e.g., 0u64.to_string() will give me a String with capacity for at least 20 bytes)
  • fast ToString for u{8 to 64} => dispatches through &str to String, so will perfectly size the heap buffer based on the actual length
  • small ToString for u8/i8 => maximum sized integer allocations
    • these override the support in core for _fmt on u8/i8

So for the byte types (u8/i8) there's actually 4 separate pieces of code that we are maintaining:

  • Size-optimized core::fmt::num Display impl (used IIUC for {} formatting)
  • Fast-optimized core::fmt::num Display impl (used IIUC for {} formatting)
  • Size-optimized alloc SpecToString impl (for .to_string())
  • Fast-optimized alloc SpecToString impl (for .to_string()) -- defers now to core::fmt::num

Plus, IIUC the signed core::fmt::num impl is now only reachable via Display, never via .to_string(), which also seems like an odd decision.

I also don't see much in the way of rationale for why we are making certain tradeoffs (e.g., why single-byte types are special-cased here, but not for Display). Maybe we can file a tracking issue of some kind and lay out a plan for what we're envisioning the end state to be? The individual changes here are, I guess, fine, but it doesn't seem like we're moving towards a specific vision, but rather tweaking to optimize a particular metric.
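To make the dispatch being discussed easier to follow, here is a minimal, standalone sketch of the specialization pattern involved (nightly-only). The SpecToString/spec_to_string names mirror the ones visible in the diff, but the impl bodies and the main function are illustrative assumptions, not the actual library code.

```rust
// Standalone sketch of the specialization pattern (nightly-only); the trait
// and method names mirror the diff, the bodies are illustrative only.
#![feature(min_specialization)]

trait SpecToString {
    fn spec_to_string(&self) -> String;
}

// Generic fallback: goes through the Display/Formatter machinery.
impl<T: std::fmt::Display> SpecToString for T {
    default fn spec_to_string(&self) -> String {
        format!("{self}")
    }
}

// Specialized fast path for one concrete type, skipping Formatter entirely.
impl SpecToString for u8 {
    fn spec_to_string(&self) -> String {
        let mut n = *self;
        let mut buf = [0u8; 3]; // u8::MAX is 255: at most 3 decimal digits
        let mut pos = buf.len();
        loop {
            pos -= 1;
            buf[pos] = b'0' + n % 10;
            n /= 10;
            if n == 0 {
                break;
            }
        }
        // Only the digits actually produced end up in the heap allocation.
        std::str::from_utf8(&buf[pos..]).unwrap().to_owned()
    }
}

fn main() {
    assert_eq!(200u8.spec_to_string(), "200"); // specialized path
    assert_eq!(1234u32.spec_to_string(), "1234"); // generic Display path
}
```

The questions above are essentially about how many such specialized paths exist, for which types, and under which cfg flags.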

Member Author

Do you want it to be part of this PR or as a follow-up?

Member

I'd personally rather see a plan and cleanup work instead of "random" changes.

Member Author

So the plan is to make integer-to-string conversion faster. There were a few problems when I started working on this:

  1. The buffer used to store the string output for the integer was always the size of the biggest integer (64 bits), which is suboptimal for smaller integers.
  2. We had an extra loop which was never entered for smaller integers (i8/u8), but since we were casting all integers into u64 before converting to string, this optimization was missed.
  3. The ToString implementation uses the same code, which relies on Formatter, meaning that all the Formatter code (in short, checking the internal flags) was still run even though it was never actually used.

Points 1 and 2 were fixed in #128204; this PR fixes the last one.
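To make point 1 concrete, here is a tiny self-contained illustration (the constant names are made up for this example): the maximum decimal length of an integer is fixed by its width, so the conversion buffer and the resulting String capacity never need the 20 bytes required by the largest u64 value.

```rust
// Sketch only (not the library code): maximum decimal lengths per unsigned
// integer type, derived from each type's MAX value.
const MAX_DEC_LEN_U8: usize = 3; // 255
const MAX_DEC_LEN_U16: usize = 5; // 65535
const MAX_DEC_LEN_U32: usize = 10; // 4294967295
const MAX_DEC_LEN_U64: usize = 20; // 18446744073709551615

fn main() {
    assert_eq!(u8::MAX.to_string().len(), MAX_DEC_LEN_U8);
    assert_eq!(u16::MAX.to_string().len(), MAX_DEC_LEN_U16);
    assert_eq!(u32::MAX.to_string().len(), MAX_DEC_LEN_U32);
    assert_eq!(u64::MAX.to_string().len(), MAX_DEC_LEN_U64);
}
```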

Now, about the optimize_for_size feature usage: these optimizations require specialized code, which means a lot more code. And because the code to convert integers to strings isn't the same depending on whether the optimize_for_size feature is enabled, I can't share a single implementation: the internal API changes.

Does it answer your question? Don't hesitate if something isn't clear.

Member

What benchmarks are we using to evaluate "make it faster"? I don't see results in this PR (and e.g. the description explicitly calls out not being sure how to check).

Does the formatter flag checking not get optimized out after inlining? If not, maybe we can improve on that, for example by dispatching early on defaults or similar?

I'm not personally convinced that having N different implementations (can you confirm all of the cases are at least equally covered by tests?) is worth what I'd expect to be marginal improvements (this is where concrete numbers would be useful to justify this), especially when I'd expect that in most cases if you want the fastest possible serialization, ToString is not what you want -- it's pretty unlikely that an owned String containing just the integer is all you need, and the moment you want more than that you're going to be reaching for e.g. itoa to avoid heap allocations etc.

Member Author

> What benchmarks are we using to evaluate "make it faster"? I don't see results in this PR (and e.g. the description explicitly calls out not being sure how to check).

I posted a comparison with and without these changes in this comment. A lot less assembly code is generated; however, this difference alone is not enough to know the impact on performance. I wrote benchmarks, but this is not my specialty, and all I could get was a 1-2% performance difference (which is already significant, but did I write the benches correctly? That's another question).

> Does the formatter flag checking not get optimized out after inlining? If not, maybe we can improve on that, for example by dispatching early on defaults or similar?

No, and unless you add a new field to Formatter that lets you know that no field was updated and that the checks can be skipped, I don't see how you could get this optimization.

> I'm not personally convinced that having N different implementations (can you confirm all of the cases are at least equally covered by tests?) is worth what I'd expect to be marginal improvements (this is where concrete numbers would be useful to justify this), especially when I'd expect that in most cases if you want the fastest possible serialization, ToString is not what you want -- it's pretty unlikely that an owned String containing just the integer is all you need, and the moment you want more than that you're going to be reaching for e.g. itoa to avoid heap allocations etc.

I'm not convinced either, but since the optimize_for_size feature flag exists, I need to deal with it. As for test coverage: no idea for optimize_for_size, but the conversions are tested in the "normal" case. In any case, this code doesn't change the behaviour of optimize_for_size, so on that side we're good.

Also, I'm not looking for the fastest implementation; I'm looking to improve the current situation, which is really suboptimal. We could add a new write_into<W: Write>(self, &mut W) method to have something as fast as itoa, but that's a whole other discussion and I don't plan to start it. My plan ends with this PR. Also to be noted: with this PR, the only remaining difference from itoa is that we don't allow writing an integer directly into a String; everything else is exactly the same.
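For illustration only, here is a rough sketch of what such a write_into API could look like; the WriteInto trait, its method, and the impl below are assumptions made for this example and do not exist in std. The idea is itoa-style: format straight into any fmt::Write sink chosen by the caller, so the API forces no intermediate String allocation.

```rust
use std::fmt::{self, Write};

// Hypothetical sketch of the `write_into` idea mentioned above; nothing like
// this exists in std today.
trait WriteInto {
    fn write_into<W: Write>(self, out: &mut W) -> fmt::Result;
}

impl WriteInto for u32 {
    fn write_into<W: Write>(self, out: &mut W) -> fmt::Result {
        let mut buf = [0u8; 10]; // enough for u32::MAX (4294967295)
        let mut pos = buf.len();
        let mut n = self;
        loop {
            pos -= 1;
            buf[pos] = b'0' + (n % 10) as u8;
            n /= 10;
            if n == 0 {
                break;
            }
        }
        out.write_str(std::str::from_utf8(&buf[pos..]).unwrap())
    }
}

fn main() {
    // The caller picks the destination, e.g. an existing String.
    let mut s = String::from("value = ");
    42u32.write_into(&mut s).unwrap();
    assert_eq!(s, "value = 42");
}
```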

Anyway, the PR is ready. It has a visible impact on at least the generated assembly, which is notably smaller because all the Formatter code can be skipped. It doesn't change the behaviour of optimize_for_size and adds a very small amount of code. Having this kind of small optimization in places like integer-to-string conversion is always very welcome.

Member

> A lot less assembly code is generated; however, this difference alone is not enough to know the impact on performance. I wrote benchmarks, but this is not my specialty, and all I could get was a 1-2% performance difference (which is already significant, but did I write the benches correctly? That's another question).

Can you provide these benchmarks and the raw numbers you produced? Perhaps add them as benches to the code, so they can be run by others, or extend rustc-perf's runtime benchmark suite.

Smaller assembly is (as you say) no real indicator of performance (though is nice) so I'm not sure it really means much by itself.

> I'm not convinced either, but since the optimize_for_size feature flag exists, I need to deal with it. As for test coverage: no idea for optimize_for_size, but the conversions are tested in the "normal" case. In any case, this code doesn't change the behaviour of optimize_for_size, so on that side we're good.

This PR is still adding implementations that could get called (regardless of optimize_for_size) that didn't exist before it (taking us from 2 to 4 impls IIUC). Can you point concretely at some test coverage for each of the 4 impls (source links)? If not, then we really ought to add it, especially when there's a bunch of specialization involved in dispatch.

> adds a very small amount of code. Having this kind of small optimization in places like integer-to-string conversion is always very welcome.

I disagree with this assertion. We have to balance the cost of maintenance, and while this code is important, it sounds like we don't actually hit the itoa perf anyway with these changes. It's nice to be a bit faster, but I'm not convinced that a few percent is worth an extra 2 code paths (presuming I counted correctly) in this code, especially with rust-lang/libs-team#546 / #138215 expected to come soon and possibly add 2 more paths (or at least become the "high performance" path).

Member Author

> Can you provide these benchmarks and the raw numbers you produced? Perhaps add them as benches to the code, so they can be run by others, or extend rustc-perf's runtime benchmark suite.

Considering how specific this is, not sure it's worth adding to rustc-perf. As for adding them into the codebase, I'll need to ensure that they're correctly written first.

Here is the code I used:

```rust
#![feature(test)]

extern crate test;

use test::{Bencher, black_box};

#[inline(always)]
fn convert_to_string<T: ToString>(n: T) -> String {
    n.to_string()
}

macro_rules! decl_benches {
    ($($name:ident: $ty:ident,)+) => {
        $(
            #[bench]
            fn $name(c: &mut Bencher) {
                c.iter(|| convert_to_string(black_box({ let nb: $ty = 20; nb })));
            }
        )+
    }
}

decl_benches! {
    bench_u8: u8,
    bench_i8: i8,
    bench_u16: u16,
    bench_i16: i16,
    bench_u32: u32,
    bench_i32: i32,
    bench_u64: u64,
    bench_i64: i64,
}
```

The results are:

| name      | 1.87.0-nightly (3ea711f 2025-03-09) | With this PR             | diff |
|-----------|-------------------------------------|--------------------------|------|
| bench_i16 | 32.06 ns/iter (+/- 0.12)            | 17.62 ns/iter (+/- 0.03) | -45% |
| bench_i32 | 31.61 ns/iter (+/- 0.04)            | 15.10 ns/iter (+/- 0.06) | -52% |
| bench_i64 | 31.71 ns/iter (+/- 0.07)            | 15.02 ns/iter (+/- 0.20) | -52% |
| bench_i8  | 13.21 ns/iter (+/- 0.14)            | 14.93 ns/iter (+/- 0.16) | +13% |
| bench_u16 | 31.20 ns/iter (+/- 0.06)            | 16.14 ns/iter (+/- 0.11) | -48% |
| bench_u32 | 33.27 ns/iter (+/- 0.05)            | 16.18 ns/iter (+/- 0.10) | -51% |
| bench_u64 | 31.44 ns/iter (+/- 0.06)            | 16.62 ns/iter (+/- 0.21) | -47% |
| bench_u8  | 10.57 ns/iter (+/- 0.30)            | 13.00 ns/iter (+/- 0.43) | +22% |

I have to admit I'm a bit surprised, as I didn't remember the difference being this big... But in any case, seeing how big the difference is, I wonder if the benches are correctly written (hence why I asked for help with them).

> Smaller assembly is (as you say) no real indicator of performance (though is nice) so I'm not sure it really means much by itself.

Yep, hence why I wrote benches. :)

> This PR is still adding implementations that could get called (regardless of optimize_for_size) that didn't exist before it (taking us from 2 to 4 impls IIUC). Can you point concretely at some test coverage for each of the 4 impls (source links)? If not, then we really ought to add it, especially when there's a bunch of specialization involved in dispatch.

There is no complete test for this as far as I can see, but there are some small checks like tests/ui/traits/to-str.rs and test_simple_types in library/alloc/tests/string.rs.

Might be worth adding one?
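For what it's worth, here is a sketch of the kind of coverage test that could be added (hypothetical test name; boundary values chosen so every integer width and sign goes through .to_string()):

```rust
// Hypothetical coverage test; something along these lines could sit next to
// test_simple_types in library/alloc/tests/string.rs.
#[test]
fn test_int_to_string_boundaries() {
    assert_eq!(u8::MIN.to_string(), "0");
    assert_eq!(u8::MAX.to_string(), "255");
    assert_eq!(i8::MIN.to_string(), "-128");
    assert_eq!(i8::MAX.to_string(), "127");
    assert_eq!(u16::MAX.to_string(), "65535");
    assert_eq!(i16::MIN.to_string(), "-32768");
    assert_eq!(u32::MAX.to_string(), "4294967295");
    assert_eq!(i32::MIN.to_string(), "-2147483648");
    assert_eq!(u64::MAX.to_string(), "18446744073709551615");
    assert_eq!(i64::MIN.to_string(), "-9223372036854775808");
}
```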

> I disagree with this assertion. We have to balance the cost of maintenance, and while this code is important, it sounds like we don't actually hit the itoa perf anyway with these changes. It's nice to be a bit faster, but I'm not convinced that a few percent is worth an extra 2 code paths (presuming I counted correctly) in this code, especially with rust-lang/libs-team#546 / #138215 expected to come soon and possibly add 2 more paths (or at least become the "high performance" path).

I can help with maintaining this code. The plan is not to be as good as itoa (which isn't possible anyway with the current API) but to be as good as possible within the current limitations. The improvements seem noticeable and I think they are worth it. I also think that integer-to-string conversion is very common, and even if we provide new APIs to handle it (which would be nice), this code is common enough to make a nice impact in existing codebases.

@GuillaumeGomez force-pushed the optimize-integers-to-string branch from 86a3315 to bf99f6f on March 10, 2025 at 15:33
@GuillaumeGomez (Member, Author)

I also rebased the code (I did it to match the nightly I used for the benchmark comparison).

Labels
S-waiting-on-review (Status: Awaiting review from the assignee but also interested parties), T-libs (Relevant to the library team, which will review and decide on the PR/issue)
9 participants