Build failing on remote cache problems unexpectedly #22119

Closed
guw opened this issue Apr 25, 2024 · 9 comments
Labels: P1, team-Remote-Exec, type: bug

@guw
Contributor

guw commented Apr 25, 2024

Description of the bug:

Our build is unreliable. The culprit seems to be remote cache problems (we use Google Cloud storage).

(22:37:24) ERROR: /...BUILD.bazel:3:17: Building ....jar (229 source files, 1 source jar) failed: unable to finalize action: Connection reset
...
(22:37:25) ERROR: Build did NOT complete successfully

This is unexpected because we have the following in .bazelrc:

common --remote_local_fallback

Remote cache reliability issues should not impact the Bazel build. In particular, intermittent network issues should not fail a build, because failed builds are expensive. If a cache upload or download fails, the build should treat the remote cache as unreliable and continue without it.

Which operating system are you running Bazel on?

Linux

What is the output of bazel info release?

release 7.1.1

@guw guw changed the title from "Build failing on remote cache problems with --remote_local_fallback" to "Build failing on remote cache problems unexpectedly" Apr 25, 2024
@iancha1992 iancha1992 added the team-Remote-Exec Issues and PRs for the Execution (Remote) team label Apr 25, 2024
@iancha1992
Member

@guw Could you please provide complete steps to reproduce this issue?

@guw
Contributor Author

guw commented Apr 26, 2024

@iancha1992 I am not sure how. This seems to depend on network issues within Google Cloud. All we do is run bazel build //... from within a Google Cloud compute instance, with the remote cache being a GCS bucket.

We did notice one detail: it seems to fail only when compiling unit tests (within java_test). We see multiple connection reset problems in the build logs for other compiles, and most seem to recover.

Example:

(19:24:39) WARNING: Remote Cache: Connection reset
 com.google.devtools.build.lib.remote.common.BulkTransferException: Connection reset
 	at com.google.devtools.build.lib.remote.util.RxUtils$BulkTransferExceptionCollector.onResult(RxUtils.java:112)
 	at io.reactivex.rxjava3.internal.operators.flowable.FlowableCollectSingle$CollectSubscriber.onNext(FlowableCollectSing
...
 	Suppressed: java.net.SocketException: Connection reset
 		at java.base/sun.nio.ch.SocketChannelImpl.throwConnectionReset(SocketChannelImpl.java:401)
 		at java.base/sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:434)
 		at io.netty.buffer.PooledByteBuf.setBytes(PooledByteBuf.java:256)
 		at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:1132)
 		at io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:357)
 		at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:151)
 		... 8 more

But it still worked:
(19:26:29) INFO: Elapsed time: 684.726s, Critical Path: 199.22s
(19:26:29) INFO: 86024 processes: 43402 remote cache hit, 24489 internal, 16892 processwrapper-sandbox, 1241 worker.
(19:26:29) INFO: Build completed successfully, 86024 total actions

But the ones that fail are usually compiling a unit test class, and they also don't print a stack trace.

@meisterT meisterT added P1 I'll work on this now. (Assignee required) and removed untriaged labels Apr 30, 2024
@coeuvre
Member

coeuvre commented May 21, 2024

I think this is a duplicate of #20123.

@guw
Contributor Author

guw commented May 21, 2024

I think some clarification is needed in this case. #20123 mentions the "unable to finalize action" problem but then later mentions #22387, which seems to connect the issue to .d files only. This is confusing because we are compiling Java only.

@coeuvre
Member

coeuvre commented May 21, 2024

"unable to finalize action" and the .d issues are separate issues. As stated in #20123 (comment), --remote_local_fallback was not designed for the case of local execution + remote cache. Making it work requires structural changes to the code and will be tracked at #20123.

@guw
Contributor Author

guw commented May 21, 2024

Is it expected at this point that "unable to finalize action" fails the build when the remote cache fails? The behavior we see is inconsistent. Sometimes it's a warning and sometimes it fails.

@coeuvre
Member

coeuvre commented May 21, 2024

It's inconsistent because an action can access the remote cache at different stages.

Before executing the action locally, Bazel first checks the AC in the remote cache. If this check fails (e.g. connection reset), Bazel treats it as CacheNotFound (and prints the warning as in #22119 (comment)) and continues with local execution.

Otherwise, Bazel will retrieve the outputs from the CAS in the remote cache. If the connection is broken at this point, you will see the build error in #22119 (comment). It would be nice to have it fall back to local execution, but that requires non-trivial changes.
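
The two stages described above can be sketched as a toy model. This is not Bazel's code: CacheUnavailable, run_action and the three callables passed to it are hypothetical names, used only to illustrate why the same network error shows up as a warning in one stage and as a build failure in the other.

```python
class CacheUnavailable(Exception):
    """Hypothetical stand-in for a network failure such as 'Connection reset'."""


def run_action(lookup_action_cache, download_outputs, execute_locally):
    """Toy model of one action touching the remote cache in two stages.

    All three arguments are hypothetical callables, not Bazel APIs.
    """
    # Stage 1: action cache (AC) lookup. A failure here is downgraded to a
    # cache miss, so only a warning is printed and the action runs locally.
    try:
        cached_result = lookup_action_cache()
    except CacheUnavailable as err:
        print(f"WARNING: Remote Cache: {err}")
        cached_result = None

    if cached_result is None:
        return execute_locally()

    # Stage 2: the AC lookup hit, so the outputs are fetched from the CAS.
    # A failure here is not caught, so it propagates as a build error.
    return download_outputs(cached_result)
```

In this sketch, falling back to local execution in stage 2 would mean catching the error around download_outputs and calling execute_locally, which is the behavior the comment above describes as desirable but non-trivial in the real code base.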

@guw
Contributor Author

guw commented May 21, 2024

Would it be possible to return with an exit code that encourages retries? We implemented retry logic by parsing the Bazel output for unable to finalize action: Connection reset, which works quite well but might break if the output string changes.

https://bazel.build/run/scripts
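
A minimal sketch of what such output-parsing retry logic might look like follows. It is not the script used in this report: RETRYABLE_MARKER, is_retryable, run_with_retries and the runner parameter are hypothetical names, and the marker string is copied from the error message quoted in this issue. The sketch buffers all output until the command exits instead of streaming it.

```python
import subprocess
import sys

# Error text copied from the build log in this issue. Matching on it is
# fragile because the wording may change between Bazel releases.
RETRYABLE_MARKER = "unable to finalize action: Connection reset"


def is_retryable(output: str) -> bool:
    """Return True if the captured output contains the known cache error."""
    return RETRYABLE_MARKER in output


def run_with_retries(command, max_attempts=3, runner=subprocess.run):
    """Run command, retrying only when it fails with the retryable marker.

    runner is injectable (an assumption for testability) and must accept the
    same keyword arguments as subprocess.run.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        # Merge stderr into stdout so the error message is captured either way.
        result = runner(
            command,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            text=True,
        )
        sys.stdout.write(result.stdout)
        if result.returncode == 0 or not is_retryable(result.stdout):
            return result.returncode
        print(
            f"attempt {attempt}/{max_attempts} failed with a cache connection reset",
            file=sys.stderr,
        )
    # Every attempt failed with the retryable error; report the last exit code.
    return result.returncode
```

A CI step could call run_with_retries(["bazel", "build", "//..."]) and exit with the returned code. Failures that do not contain the marker, such as compile errors, are returned immediately without a retry.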

@coeuvre
Member

coeuvre commented May 22, 2024

I am inclined to not add an exit code only for this specific case. I would like to fix the fundamental problem instead, which will be tracked in #20123 and #19904. Closing.

@coeuvre coeuvre closed this as not planned May 22, 2024