Support PGO in the compiler's build system #79562

michaelwoerister · 2020-11-30T14:32:58Z

One of the first steps in bringing PGO to rustc is adding support for it to the compiler's build system. Concretely this means

support building an instrumented compiler (-Cprofile-generate phase), and
support building a compiler that makes use of profile data (-Cprofile-use phase)

Optionally, the build system could also

support profile data collection (by running the instrumented compiler through a standard set of workloads)

In rust-lang/cargo#7618 (comment), @luser suggests that the three phases should be operable separately. I think that makes a lot of sense since on CI we'll probably want to either only run the "profile-use" phase with downloaded profile data, or only the "profile-generate" phase that uploads the profile data.

I suggest adding the following new config.toml settings:

llvm.profile-generate = ( "<path>" | "default-path" )
llvm.profile-use = ( "<path>" | "default-path" | "download-profdata" )
rust.profile-generate = ( "<path>" | "default-path" )
rust.profile-use = ( "<path>" | "default-path" | "download-profdata" )
rust.profile-collect = (true | false)

The paths specify the argument to the -Cprofile-generate and -Cprofile-use flags. default-path means some known location within the build directory. download-profdata means that the build system will try to find and download the most recent profile data available for the given commit. If rust.profile-collect is true, the build system will run a standard data collection scenario with the instrumented compiler (or will error if none of profile-generate flags is set).

Note that this setup has the following implications:

One can turn PGO on or off independently for LLVM and Rust.

There is no single command that runs all three phases at once. If one wants to build a PGOed compiler completely locally, one has to run something like:

./configure 
  --llvm.profile-generate=default-path
  --rust.profile-generate=default-path
  --llvm.profile-collect=true

./x.py build

./x.py clean # including LLVM?

./configure 
  --llvm.profile-use=default-path
  --rust.profile-use=default-path

./x.py build

The "default" scenario where "generate" and "use" phases are not run on the same machine can be done with a single ./configure && ./x.py build though.

I'd like to hear if somebody can come up with a different, better setup. Maybe something that allows the entire three-phase build with a single ./x.py pgo-build command?

The text was updated successfully, but these errors were encountered:

Mark-Simulacrum · 2020-11-30T14:44:20Z

I don't think we need to optimize for full collection + use, I'm not even sure how many people would use it if available -- it's just not really a common thing I expect.

One question is that for the Rust part of the PGO, I imagine we don't want to collect or use the PGO data for the first round of compilation, right? Since that'll have differing codegen from the second round (potentially even using a different LLVM version). That's fine -- rustbuild can handle that for us -- but something to keep in mind.

If we don't support simultaneously using and generating PGO data we should probably check that at startup but I'm not otherwise all that worried.

michaelwoerister · 2020-11-30T15:28:49Z

One question is that for the Rust part of the PGO, I imagine we don't want to collect or use the PGO data for the first round of compilation, right? Since that'll have differing codegen from the second round (potentially even using a different LLVM version). That's fine -- rustbuild can handle that for us -- but something to keep in mind.

Hm, I wonder if it would be possible to make stage 1 instrumented, use it to collect profile data, and then make stage 2 use that data. The profdata format looks rather robust, so that might even work for different LLVM versions. But I would not want to rely on it.

Are our release artifacts stage 2 builds? Then the instrumented compiler should also be a stage 2 build.

Mark-Simulacrum · 2020-11-30T16:35:19Z

Right, stage 2 is basically the artifacts we'd be using for this, though they're built as "stage 1 artifacts" -- the second rustc binary that gets produced. I'm not too worried about getting the language precise here, I think we have agreement on what we need to profile :)

I expect that we can build stage 0 artifacts as normal -- no need for special PGO handling, though eventually the beta they're built with will be PGO'd so that should give a nice speed up. Then the second round of compilation (stage 1 artifacts) is built, instrumented, we run our workloads to gather data, and then rebuild the second round of compilation using that data. This gives us a single compiler rebuild in terms of overhead.

Thinking along those lines, perhaps the right thing is actually for rustbuild to always do this - i.e., the pgo-build option you suggest is not actually necessary. Specifically, rustbuild will collect data if the profile-generate options are given and use it with profile-use; just profile-use will only use it, and both will do the full thing.

luser · 2020-11-30T18:07:14Z

Hm, I wonder if it would be possible to make stage 1 instrumented, use it to collect profile data, and then make stage 2 use that data. The profdata format looks rather robust, so that might even work for different LLVM versions. But I would not want to rely on it.

If this works on a technical level with the profile data format then this sounds like a pretty great idea. You've already got a multi-phase build, so reusing one of those phases to generate the profile data would be a clever trick! Per my comment elsewhere, this would impact the reproducibility of the builds. To preserve that it would be nice to publish the profile data as a build artifact and provide rustbuild support for producing an optimized build given an existing set of profile data.

michaelwoerister · 2020-12-03T10:10:59Z

I expect that we can build stage 0 artifacts as normal -- no need for special PGO handling, though eventually the beta they're built with will be PGO'd so that should give a nice speed up. Then the second round of compilation (stage 1 artifacts) is built, instrumented, we run our workloads to gather data, and then rebuild the second round of compilation using that data. This gives us a single compiler rebuild in terms of overhead.

Let me see if I understand this correctly: Right now we have:

beta --> stage0 --> stage1
                    ^^^^^^
                     dist

that would become:

beta --> stage0 --> stage1 --> stage2
                    ^^^^^^     ^^^^^^
                instrumented    PGO-ed dist

So, PGO would mean one additional build, right?

michaelwoerister · 2020-12-03T10:44:52Z

I'm wondering if that isn't actually too complicated since we'd actually never use that build in CI. What I suggested originally would look like:

# case 1: generate profile data

beta --> stage0 --> stage1 =run=> profdata (for uploading)
                    ^^^^^^
               dist (instrumented)


# case 2: use profile data (downloaded, from some previous build)

beta --> stage0 --> stage1
                    ^^^^^^
               dist (profile-use)

Cases (1) and (2) would always run in parallel, and case (2) would use profile data from a previous run of case (1). The advantage is that we don't have to worry about building LLVM twice.

michaelwoerister · 2020-12-03T15:49:59Z

Another wrinkle here that I just realized is that LLVM is built only once -- so for the profile-generate case we would work with an instrumented LLVM already when building the instrumented rustc. That should not be much of a problem though since instrumentation can be disabled at runtime via an environment variable, if I remember correctly. Or we just collect profile data for LLVM twice. The corresponding .profraw are automatically updated in place by the instrumentation runtime.

Mark-Simulacrum · 2020-12-14T14:55:34Z

I think your layout makes sense for the case where someone sets the profile use config but not profile generate. If both are set, I think it makes sense to do the layout in #79562 (comment). If sccache does not properly handle PGO flags for caching then we'll need to leave LLVM out of scope for now, otherwise I expect there should be no problem with building LLVM twice for the immediate reuse case. I am still a bit skeptical about using old profile data wrt to perf.rlo, but I think we can tackle that problem when we get to it.

luser · 2020-12-14T15:51:24Z

sccache should handle clang's PGO flags fine since Firefox builds use them. However, it won't cache the optimized -fprofile-instr-use compile, only the -fprofile-instr-generate one, so that may not help you as much as you'd like. It's possible that we could add support for caching those compiles to sccache by including the hash of the profile data in the cache key. We never considered that previously because the Firefox build always runs a profiling run for every PGO build so the profile data would never be the same for two different builds anyway.

Interestingly, reading the clang documentation reminded me that LLVM added support for using the output of a sampling profiler as the profiling input for PGO. This is likely out of scope as it's not interoperable with the instrumentation-based profile data and thus wouldn't fit neatly into what's already been implemented in rustc, but it's an interesting thing to consider since it doesn't require building instrumented binaries.

michaelwoerister · 2020-12-15T10:18:19Z

I am still a bit skeptical about using old profile data wrt to perf.rlo, but I think we can tackle that problem when we get to it.

I don't think we should use the PGOed compiler for perf.rlo. It would just be confusing otherwise.

michaelwoerister · 2020-12-15T10:31:08Z

However, it won't cache the optimized -fprofile-instr-use compile

Oh, good to know! Yes, support for that would have to be added to sccache. The -fprofile-instr-use case would be the more common one, I think.

michaelwoerister · 2020-12-15T10:38:18Z

I think your layout makes sense for the case where someone sets the profile use config but not profile generate. If both are set, I think it makes sense to do the layout in #79562 (comment).

I think I'd prefer if we just didn't support doing a PGOed build with a single ./x.py invocation. It's a rare use case and it should be easy enough to provide a helper script somewhere that executes the steps described in the OP, right?

Mark-Simulacrum · 2020-12-15T15:47:03Z

While working on #80033, I ran into LLVM Profile Warning: Unable to track new values: Running out of static counters. Consider using option -mllvm -vp-counters-per-site=<n> to allocate more value profile counters at compile time. - I ended up using 4 for the value, which seems to have fixed this for now, but it would be good to know what exactly we're specifying here (and what the tradeoffs are); ideally I'd want that warning to be an error I imagine?

luser · 2020-12-15T16:19:35Z

I don't think we should use the PGOed compiler for perf.rlo. It would just be confusing otherwise.

I owe you a longer-form writeup, but for Firefox we had perf testing on both PGO and non-PGO builds, and it was possible to compare the results between them. This was really useful because the PGO builds were what we were actually shipping to users, but being able to compare between the two was good and also allowed for quicker iteration when the PGO perf difference didn't matter. (IIRC enabling PGO builds was opt-in for Firefox try pushes.)

michaelwoerister · 2020-12-17T13:10:45Z

@luser Yes, it would be great if we could benchmark both PGO and non-PGO builds. If we can afford that (which is something @Mark-Simulacrum would know better than me) then all the better. If we had to choose one of the two, I'd opt for benchmarking just non-PGO builds since they would not contain the "random" effects of partially out-dated profiling data.

michaelwoerister · 2020-12-17T13:30:16Z

While working on #80033, I ran into LLVM Profile Warning: Unable to track new values: Running out of static counters. Consider using option -mllvm -vp-counters-per-site=<n> to allocate more value profile counters at compile time. - I ended up using 4 for the value, which seems to have fixed this for now, but it would be good to know what exactly we're specifying here (and what the tradeoffs are); ideally I'd want that warning to be an error I imagine?

This is what I've been able to find out so far: "value profiling" allows to record some of the actual values that a variable takes during execution. To do that the profiling runtime has to reserve some space in the profdata. Apparently one does not need many of these counters in the common case:

https://github.com/llvm/llvm-project/blob/37f99a56065209627020428cecb78f55dfa90580/llvm/lib/Transforms/Instrumentation/InstrProfiling.cpp#L81-L84

It looks like an overflow here can safely be ignored. It might just lead to less accurate profile data:

https://github.com/llvm/llvm-project/blob/e7a3c4c11e84ba99c3682ae6cf20c398f16cf3f5/compiler-rt/lib/profile/InstrProfilingValue.c#L210-L212

But it also looks like increasing the number of counters via -vp-counters-per-site should not have much of an adverse effect either, except for a little more memory being used by the instrumented binary.

Mark-Simulacrum · 2020-12-18T22:13:21Z

Ok, useful to know. I don't feel like it's critical that we avoid the warnings but there seems to not be any significant disadvantage to bumping slightly so I'll leave that in the code for now.

michaelwoerister · 2021-01-20T11:42:41Z

@luser What's the best way of getting support for -fprofile-use into sccache? Should I open an issue? Or try to implement it myself and make a PR?

luser · 2021-01-25T18:57:44Z

Opening an issue would be a good start, but the project is a bit under-maintained right now so it's likely that doing the work yourself would be the quickest path to success. I'd be happy to give you pointers on how to implement it. I'm on the Rust Discord as tedmielczarek, feel free to drop me a line.

michaelwoerister · 2021-02-16T13:19:11Z

FYI: I've made a PR to sccache that implements the missing functionality. I haven't heard back from the maintainers yet.

FilipAndersson245 · 2021-04-20T20:39:41Z

Any update on this issue?

michaelwoerister · 2021-04-21T11:48:52Z

The required sccache changes have not been merged yet.

Mark-Simulacrum · 2021-09-23T13:43:42Z

We ended up foregoing the caching for the time being; LLVM and rustc are both PGO'd on x86_64-unknown-linux-gnu as of #88069.

The sccache changes would still be helpful, though.

jyn514 · 2022-07-11T00:54:15Z

The sccache changes have been merged.

Mark-Simulacrum · 2022-07-11T02:00:34Z

I think at this point this issue can be closed; we've long had PGO for x86_64-unknown-linux-gnu, and are in the process of adding it to Windows (#96978).

#79442 is a better issue here regardless, with a much more complete task list.

michaelwoerister mentioned this issue Nov 30, 2020

Exploring PGO for the Rust compiler #79442

Open

16 tasks

jyn514 added the WG-compiler-performance Working group: Compiler Performance label Nov 30, 2020

Mark-Simulacrum removed the T-release Relevant to the release subteam, which will review and decide on the PR/issue. label Nov 30, 2020

michaelwoerister mentioned this issue Feb 11, 2021

Support caching Clang invocations with -fprofile-use and -fprofile-instr-use. mozilla/sccache#952

Merged

Mark-Simulacrum closed this as completed Jul 11, 2022

davidhewitt mentioned this issue Nov 6, 2023

PGO build for MacOS M1 pydantic/pydantic-core#1063

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Support PGO in the compiler's build system #79562

Support PGO in the compiler's build system #79562

michaelwoerister commented Nov 30, 2020

Mark-Simulacrum commented Nov 30, 2020

michaelwoerister commented Nov 30, 2020

Mark-Simulacrum commented Nov 30, 2020

luser commented Nov 30, 2020

michaelwoerister commented Dec 3, 2020

michaelwoerister commented Dec 3, 2020 •

edited

Loading

michaelwoerister commented Dec 3, 2020

Mark-Simulacrum commented Dec 14, 2020

luser commented Dec 14, 2020

michaelwoerister commented Dec 15, 2020

michaelwoerister commented Dec 15, 2020

michaelwoerister commented Dec 15, 2020

Mark-Simulacrum commented Dec 15, 2020 •

edited

Loading

luser commented Dec 15, 2020

michaelwoerister commented Dec 17, 2020

michaelwoerister commented Dec 17, 2020 •

edited

Loading

Mark-Simulacrum commented Dec 18, 2020

michaelwoerister commented Jan 20, 2021

luser commented Jan 25, 2021

michaelwoerister commented Feb 16, 2021

FilipAndersson245 commented Apr 20, 2021

michaelwoerister commented Apr 21, 2021

Mark-Simulacrum commented Sep 23, 2021

jyn514 commented Jul 11, 2022

Mark-Simulacrum commented Jul 11, 2022

Support PGO in the compiler's build system #79562

Support PGO in the compiler's build system #79562

Comments

michaelwoerister commented Nov 30, 2020

Mark-Simulacrum commented Nov 30, 2020

michaelwoerister commented Nov 30, 2020

Mark-Simulacrum commented Nov 30, 2020

luser commented Nov 30, 2020

michaelwoerister commented Dec 3, 2020

michaelwoerister commented Dec 3, 2020 • edited Loading

michaelwoerister commented Dec 3, 2020

Mark-Simulacrum commented Dec 14, 2020

luser commented Dec 14, 2020

michaelwoerister commented Dec 15, 2020

michaelwoerister commented Dec 15, 2020

michaelwoerister commented Dec 15, 2020

Mark-Simulacrum commented Dec 15, 2020 • edited Loading

luser commented Dec 15, 2020

michaelwoerister commented Dec 17, 2020

michaelwoerister commented Dec 17, 2020 • edited Loading

Mark-Simulacrum commented Dec 18, 2020

michaelwoerister commented Jan 20, 2021

luser commented Jan 25, 2021

michaelwoerister commented Feb 16, 2021

FilipAndersson245 commented Apr 20, 2021

michaelwoerister commented Apr 21, 2021

Mark-Simulacrum commented Sep 23, 2021

jyn514 commented Jul 11, 2022

Mark-Simulacrum commented Jul 11, 2022

michaelwoerister commented Dec 3, 2020 •

edited

Loading

Mark-Simulacrum commented Dec 15, 2020 •

edited

Loading

michaelwoerister commented Dec 17, 2020 •

edited

Loading