Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Inline String::utf8 and String::utf16 for their simplicity. #101356

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

Ivorforce
Copy link
Member

Continuing from #101352 (comment) - might as well get it out of the way now.

The functions are trivial, and inlining them allows the compiler to reason better about them. Technically it's a tiny optimization since the compiler may now optimize the intermediate String away (though it won't make a practical difference).

@Ivorforce Ivorforce requested a review from a team as a code owner January 9, 2025 16:15
@@ -523,11 +523,19 @@ class String {
CharString ascii(bool p_allow_extended = false) const;
CharString utf8() const;
Error parse_utf8(const char *p_utf8, int p_len = -1, bool p_skip_cr = false);
static String utf8(const char *p_utf8, int p_len = -1);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe _FORCE_INLINE_, even if the compiler will probably do it either way

Copy link
Member Author

@Ivorforce Ivorforce Jan 9, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Personally I'm not a big fan of _FORCE_INLINE_; for most purposes I think it is unnecessary (the compiler will usually inline trivial functions like this anyway).

But you're right, I do know it's quite common in the codebase and often requested by reviewers. I'll leave this open for now to let others weigh in.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That could be useful sometime. For example, debug build will not inline them (debug build do not inline anything by default), but if your debug build is too slow, and what you want to debug is not in an inline function, you can force all those FORCE_INLINE to inline even in debug with /d2Obforceinline or /Ob1 (in MSVC) to regain some speed.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a good point actually. Making non-optimizing builds bearable in speed is a worthwhile endeavour.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm also not a big fan of _FORCE_INLINE_, I think Godot overuses it in general and it should be more targeted, there is little considerations of the downsides.

But I did do a test a few years ago and runtime performance was admittedly a tad faster with it overall (versus being macrod out).

Note it's compiled out in DEV builds.

@Nazarwadim
Copy link
Contributor

LTO does inlining, so I don't think it's necessary.

@Ivorforce
Copy link
Member Author

Ivorforce commented Jan 9, 2025

LTO does inlining, so I don't think it's necessary.

macOS, iOS, Android and MSVC builds do not have LTO enabled. Plus, by the time LTO comes around, many opportunities for optimization have already passed, so at that stage it's much less guaranteed to actually help.

@Nazarwadim
Copy link
Contributor

macOS, iOS, Android and MSVC builds do not have LTO enabled. Plus, by the time LTO comes around, many opportunities for optimization have already passed, so at that stage it's much less guaranteed to actually help.

LTO can be used on these platforms.
macOS, iOS, Android LTO is available through Clang.
MSVC has Link Time Code Generation. I don't know if Godot uses it, but it definitely exists.

Plus, by the time LTO comes around, many opportunities for optimization have already passed

I previously used _FORCE_INLINE_ for Vector2 #88935 and godot benchmarks on LTO builds only showed performance degradation due to increased binary size -> more cache misses.

I also ran micro benchmarks with LTO and the linker optimized it at the same level.

@Ivorforce
Copy link
Member Author

Ivorforce commented Jan 9, 2025

LTO can be used on these platforms. macOS, iOS, Android LTO is available through Clang. MSVC has Link Time Code Generation. I don't know if Godot uses it, but it definitely exists.

This discussion is outside the scope of this PR. I do agree that checking whether LTO should be enabled on other platforms might be warranted. Feel free to make a godot-proposal for this. I can point you in the right direction - searching for env["lto"] = "none" should get you to all 4 spots it's been explicitly disabled.

I previously used _FORCE_INLINE_ for Vector2 #88935 and godot benchmarks on LTO builds only showed performance degradation due to increased binary size -> more cache misses.

For most situations, adding or removing _FORCE_INLINE_ should not result in observable changes in performance. It's a micro optimization that should only be considered for the most extreme of cases. How cache misses in unrelated spots due to overwrites will affect the final performance of a binary is extremely hard to measure. More likely, you'll make mistakes in your setup than measure something meaningful.

I also ran micro benchmarks with LTO and the linker optimized it at the same level.

I suppose this is expected in most situations, but again - by LTO time, a lot of opportunities for optimizations have passed. Even if the functions were compiled to the same binary in some situations, it may unexpectedly affect others. LTO is not a 'it behaves as though both are the same module' drop-in thing.

@Repiteo Repiteo added this to the 4.x milestone Jan 9, 2025
@Nazarwadim
Copy link
Contributor

Hah, I don't want to get into all this micro-optimization.

But maybe you didn't look well.
LTO is disabled by default and enabled when we gave lto=full/thin.

if env["lto"] == "auto": # Disable by default as it makes linking in Xcode very slow.
env["lto"] = "none"
if env["lto"] != "none":
if env["lto"] == "thin":
env.Append(CCFLAGS=["-flto=thin"])
env.Append(LINKFLAGS=["-flto=thin"])
else:
env.Append(CCFLAGS=["-flto"])
env.Append(LINKFLAGS=["-flto"])

In general, I'm against these inlines because they increase compile time and are almost useless since LTO already does it.

@Ivorforce
Copy link
Member Author

LTO is disabled by default and enabled when we gave lto=full/thin.

Sure, but afaik this is not performed in official builds :)

In general, I'm against these inlines because they increase compile time and are almost useless since LTO already does it.

If LTO inlines anyway then forcing or encouraging an inline won't affect compile time.

@Nazarwadim
Copy link
Contributor

Sure, but afaik this is not performed in official builds :)

Ah. Now I see it here.
https://github.com/godotengine/godot-build-scripts/blob/e7210ecc92b17f38b4a607b755cb01576bd95f79/build-ios/build.sh#L8

@Nazarwadim
Copy link
Contributor

If LTO inlines anyway then forcing or encouraging an inline won't affect compile time.

I mean most builds are done without LTO to see if everything works, run unit tests and check for regressions.

@akien-mga
Copy link
Member

Just to be clear, official builds are made with production=yes, so whether or not LTO is used depends on how lto=auto is treated in each platform.

Depending on the platform and compiler we have mixed results with LTO, where it seems to be quite beneficial with GCC (increases performance and reduces binary size) while it's less clear cut with Clang (increases performance but also increases size significantly with heavy inlining, and Clang full LTO is egregiously slow - no parallel linking - while its ThinLTO makes binaries even bigger), and it's seamingly unusable with MSVC LTCG.

See #96785 and #96851.

@Ivorforce Ivorforce force-pushed the inline-utf-creation-functions branch from 33a34fe to 67cdb1d Compare March 11, 2025 13:23
Copy link
Member

@Calinou Calinou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tested locally, it works as expected. Code looks good to me.

Binary size for a Linux x86_64 stripped optimized build (production=yes lto=full) goes up by 8 KB for the editor, 4 KB for a release export template.

This PR is a bit faster to start up and shut down the editor on an empty project:

❯ hyperfine -m25 -iw1 "bin/godot.linuxbsd.editor.x86_64.master /tmp/4/project.godot --quit" "bin/godot.linuxbsd.editor.x86_64 /tmp/4/project.godot --quit"
Benchmark 1: bin/godot.linuxbsd.editor.x86_64.master /tmp/4/project.godot --quit
  Time (mean ± σ):      4.099 s ±  0.470 s    [User: 2.517 s, System: 0.555 s]
  Range (min … max):    3.661 s …  4.640 s    25 runs
 
Benchmark 2: bin/godot.linuxbsd.editor.x86_64 /tmp/4/project.godot --quit
  Time (mean ± σ):      3.994 s ±  0.434 s    [User: 2.522 s, System: 0.562 s]
  Range (min … max):    3.648 s …  4.654 s    25 runs

@Ivorforce
Copy link
Member Author

Ivorforce commented Mar 11, 2025

Binary size for a Linux x86_64 stripped optimized build (production=yes lto=full) goes up by 8 KB for the editor, 4 KB for a release export template.

That's odd, I wonder where that's coming from.
To be honest, I'd expect neither an observable speed up nor a binary size increase.

I would expect the functions to get inlined, String being initialized to 00 (nullptr), and a function call to be made. Since the previous version was basically the same (allocate String, call function), it shouldn't bring any size increases (and I'd expect the speed up to be negligible).

Perhaps it's a bunch of domino effects adding up to the difference? 🤔
I may investigate this as a follow up. Thanks for testing!

Edit: I recreated this locally. Seems like this PR is 13kb larger than master for me (macos, with debug symbols enabled in -O2). They measure at about the same speed for me (with master possibly being faster by 0.5%).
For me, it still makes sense to merge, because compilers can be finnicky with small numbers, and possibly there's stuff going on behind the scenes that affect the outcome unexpectedly. I'd love to know what it is, but i don't know the proper tools to dissect it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants