Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Consider defaulting to linkshared for libraries when bazel-building on Linux to improve debug build performance #5752

Closed
EricCousineau-TRI opened this issue Apr 6, 2017 · 12 comments
Assignees
Labels

Comments

@EricCousineau-TRI
Copy link
Contributor

EricCousineau-TRI commented Apr 6, 2017

If building a set of tests or binaries in debug mode using bazel, you may notice slow read warnings such as this when linking or executing binaries:

Linking drake/examples/contact_model/rigid_grippSlow read: a 799007872-byte read from .../execroot/drake-distro/bazel-out/clang-3.9-linux-dbg/bin/drake/examples/schunk_wsg/schunk_wsg_simulation took 5999ms.

This could possibly be mitigated by defaulting to shared libraries on Linux.
Spoke to @david-german-tri, and the rationale for using static libraries was to make things simpler on Mac OSX and avoid issues such as #4952 and bazelbuild/bazel#507.
He mentioned that this could be accomplished with a select switch inside the drake_cc_* macros in tools/drake.bzl.

The workaround in this case is to follow the standard workflow to generally build in release mode (without -c dbg in your ~/.bazelrc or bazel command), and only build in debug mode when needed.

@jwnimmer-tri
Copy link
Collaborator

I've also had this on my mind for a while; glad to have it in issue for now.

I've wondered if we might be able to nibble away at this, by e.g. making RigidBodyTree and/or RigidBodyPlant (and maybe the parsers) into shared libraries, but leaving other things alone. Their object code is exceedingly huge, so once we shove it out of the way, everything else might be okay for a while.

I'm not sure if getting 1..3 libraries working is any easier or harder that getting them all working, but maybe it's a worthwhile experiment.

@EricCousineau-TRI
Copy link
Contributor Author

Related: #5900

@jwnimmer-tri
Copy link
Collaborator

All issues must have owners; feel free to reassign elsewhere as needed (or close), etc.

@jwnimmer-tri
Copy link
Collaborator

This is floating nearer to the top of things I might work on. What I'm stuck on now is some good benchmarking regime.

Which conops should we work on? Cold cache building from scratch? Change something in /drake/common and then bazel test //...? (Similarly for say, //drake/solvers?) Or just bazel build //... instead of test? Given the aggressive parallelization, in past experiments like this with Bazel, I've seen confounding factors that made it difficult to perceive the performance gains.

Which machine types should we benchmark? The TRI 48 vCPU workstations? Lenovo laptops? Some representative 4 vCPU student laptop? (I'm assuming we'll keep the optimizations guarded as Linux-only, given the OS X fragility.)

@sherm1
Copy link
Member

sherm1 commented Jun 9, 2017

This should have a more noticeable effect for Debug builds since the generated static libraries are much bigger.

I suspect the result will be very noticeable in my 8 CPU VMWare Ubuntu with limited RAM and disk. I can run smallish benchmarks but I can't actually build all of Drake from scratch in this box without crashing if all the RigidBodyTree mega-modules try to compile in parallel.

@jwnimmer-tri
Copy link
Collaborator

Proposed scenarios:

  • (1) Cold rebuild from scratch.
  • (2) Change a single cc file relatively low in the build, such as common or systems/framework.
  • (3) Change a single h file relatively low in the build, such as common or systems/framework.
  • (4) Change a single h file in drake/examples.

For each scenario, separately measure:

  • (a) bazel build //...,
  • (b) bazel test //... (i.e., test & build concurrently, not as a second step).

Proposed configurations:

  • (1) -c opt --define=NO_IPOPT=ON.
  • (2) -c dbg --define=NO_IPOPT=ON.

Proposed platforms:

  • (1) TRI's T460p laptop (8 vCPU, 32 GB RAM), Ubuntu 16.04, Clang.
  • (2) TRI's workstation (48 vCPU, 128 GB RAM), Ubuntu 16.04, Clang.

@jwnimmer-tri
Copy link
Collaborator

If I'm reading bazelbuild/bazel#492 correctly, if we remove linkstatic = 1 from some cc_library rules, then our cc_tests will load the shared library, but our cc_binarys will still link statically. When experimenting with this issue, we should pay attention to how the demo programs in drake/examples are being linked, and possibly rephrase their build rules in order to make best use of any improvements.

@jwnimmer-tri
Copy link
Collaborator

For the record, I think this will get bumped off my plate, though hopefully the ideas above are a good starting point. Might also be worth testing using the newest released Bazel, instead of 0.4.5.

@jwnimmer-tri
Copy link
Collaborator

jwnimmer-tri commented Sep 7, 2017

I started a few experiments with this today. A few hints to help guide the exploration:

  1. --define=NO_IPOPT=ON was intended to remove the confounding IPOPT autotools effects, but of course the tests no longer pass when we do that. For now, I will focus on bazel build (without running tests) because of that. We can secondarily assess the effect on bazel test once we have a sense of the build-only times.
  2. At least in my first test configuration, I didn't see much difference in -c opt mode; I suspect we should focus on -c dbg more as the primary measure.
  3. The libdrake.so rule, :install-related rules, and pydrake-related rules will probably need to be measured separately from the rest of the build; they appear to substantially confound the measurement (by triggering static build steps that we are trying to disable). I will plan to disable all of those rules during the first measurement pass as well.

@EricCousineau-TRI
Copy link
Contributor Author

Per #6760, it looks like this may be an easy switch over. The main caveat to look out for is when deploying things via install(...), the appropriate binaries should be used; the additional challenge here is trying to carve these dependencies and still test them appropriately.

EricCousineau-TRI added a commit to EricCousineau-TRI/drake that referenced this issue Jun 4, 2018
This is a temporary solution pending a more long-term solution to have
`find_resource` be its own shared library (RobotLocomotion#5752 + RobotLocomotion#7294); it is difficult
at present to carve out `find_resource` as it has upstream dependencies that
would need to be carved out into shared libraries.
EricCousineau-TRI added a commit to EricCousineau-TRI/drake that referenced this issue Jun 4, 2018
This is a temporary solution pending a more long-term solution to have
`find_resource` be its own shared library (RobotLocomotion#5752 + RobotLocomotion#7294); it is difficult
at present to carve out `find_resource` as it has upstream dependencies that
would need to be carved out into shared libraries.
@jwnimmer-tri jwnimmer-tri removed their assignment Jun 14, 2018
@jwnimmer-tri jwnimmer-tri added the component: build system Bazel, CMake, dependencies, memory checkers, linters label Apr 28, 2020
@EricCousineau-TRI
Copy link
Contributor Author

Closing this for now. This is a low-level incantation of trying to address #6760; best handled by that issue.

@EricCousineau-TRI
Copy link
Contributor Author

Er, I guess also for size optimization, but it has not been readily cited as an acute pain point yet.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants