Bazel fails for cache deadline exceeded and cache unavailable errors #24120

Open
hauserx opened this issue Oct 28, 2024 · 0 comments
Labels: P2 (We'll consider working on this in future. Assignee optional), team-Remote-Exec (Issues and PRs for the Execution (Remote) team), type: bug


hauserx commented Oct 28, 2024

Description of the bug:

Bazel seems to be unable to recover if inputs to a given action cannot be downloaded from cache, while it recovers in other cases.

Bazel fails with either:

  • Failed to fetch blobs because of a remote cache error.: io.grpc.StatusRuntimeException: DEADLINE_EXCEEDED
  • Failed to fetch blobs because of a remote cache error.: io.grpc.StatusRuntimeException: UNAVAILABLE: io exception

Tested with several Bazel versions (7.1.1, 7.4.0, last_green).
On last_green, Bazel does retry the action, possibly because of the --experimental_remote_cache_eviction_retries flag, but the build still fails whenever the errors above recur.

The examples below make Bazel fail every time.
Note that neither of these options helps: --remote_local_fallback, --experimental_remote_cache_eviction_retries.

Which category does this issue belong to?

Remote Execution

What's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.

MODULE.bazel:

module(name = "simple", version = "1.0")

BUILD:

genrule(
    name = "first",
    outs = ["first.txt"],
    cmd = "echo -n first_content > \"$@\"",
)

genrule(
    name = "second",
    srcs = ["first.txt"],
    outs = ["second.txt"],
    cmd = "echo -n second_content > \"$@\"",
    tags = ["no-cache"],
)

A patched version of the remote cache (https://github.com/buchgr/bazel-remote) that sleeps when asked for the blob containing "first_content":

$ git diff
diff --git a/server/grpc_bytestream.go b/server/grpc_bytestream.go
index b24969c..856f13b 100644
--- a/server/grpc_bytestream.go
+++ b/server/grpc_bytestream.go
@@ -7,6 +7,7 @@ import (
        "io"
        "strconv"
        "strings"
+       "time"

        "google.golang.org/genproto/googleapis/bytestream"
        "google.golang.org/grpc/codes"
@@ -53,6 +54,12 @@ func (s *grpcServer) Read(req *bytestream.ReadRequest,
                return err
        }

+       if hash == "b8219593f9d14091f88892f0861f4e3af53c4db510950d32fae459241fc83017" {
+               s.accessLogger.Printf("---------- SLEEP START --------------- %s", req.ResourceName)
+               time.Sleep(2 * time.Minute)
+               s.accessLogger.Printf("---------- SLEEP STOP --------------- %s", req.ResourceName)
+       }
+
        if size == 0 {
                if cmp == casblob.Identity {
                        s.accessLogger.Printf("GRPC BYTESTREAM READ COMPLETED %s", req.ResourceName)
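
The hash hard-coded in the patch is presumably the digest of the content that the first genrule writes. As a minimal sketch, assuming Bazel's default SHA-256 digest function and a Linux sha256sum binary, the value to match can be recomputed as follows. The helper name cas_digest is hypothetical and is not part of Bazel or bazel-remote.

```shell
#!/usr/bin/env bash
# cas_digest is a hypothetical helper: it prints the hex SHA-256 of its
# first argument, hashed without a trailing newline (the genrule uses
# `echo -n`, so the blob has no trailing newline either).
cas_digest() {
  printf '%s' "$1" | sha256sum | cut -d' ' -f1
}

# Digest of the content written by //:first. Compare it with the hash
# used in the patch before relying on the reproduction.
cas_digest 'first_content'
```

If the printed value differs from the one in the patch, the sleep branch would never trigger and the build would succeed.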

Running cache:

bazel run //:bazel-remote -- --dir /tmp/bazel-cache --max_size 10000 --grpc_address 127.0.0.1:9092

Running bazel:

bazel clean && bazel build :second --remote_cache="grpc://127.0.0.1:9092" --remote_timeout=1s

On the second execution of that Bazel command:

ERROR: /home/hauser/work/bazel_remote_cache_fail_issues/BUILD:7:8: Executing genrule //:second failed: Failed to fetch blobs because of a remote cache error.: io.grpc.StatusRuntimeException: DEADLINE_EXCEEDED: deadline exceeded after 0.999691193s. [closed=[], open=[[buffered_nanos=74747, remote_addr=127.0.0.1/127.0.0.1:9092]]]
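
The two-run sequence can be sketched as a small script. This assumes the patched bazel-remote is already listening on 127.0.0.1:9092; the BAZEL and CACHE_ADDR variables and the function names are hypothetical conveniences, not part of the report. The comments on why only the second run fails are an interpretation, not something the report confirms.

```shell
#!/usr/bin/env bash
# Hypothetical overrides: BAZEL=echo gives a dry run that only prints commands.
BAZEL="${BAZEL:-bazel}"
CACHE_ADDR="${CACHE_ADDR:-grpc://127.0.0.1:9092}"

# One clean build of //:second against the remote cache, using the
# flags from the report.
build_once() {
  "$BAZEL" clean && "$BAZEL" build :second \
    --remote_cache="$CACHE_ADDR" --remote_timeout=1s
}

repro() {
  # Run 1: presumably builds first.txt locally and uploads it, so the
  # patched Read path is not exercised.
  build_once
  # Run 2: //:first is presumably a cache hit, and the no-cache action
  # //:second then needs first.txt fetched from the cache, which stalls.
  build_once
}
```

Calling repro after starting the cache should show the error on the second build; BAZEL=echo prints the commands without running them.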

For a cache that fails during blob retrieval, use this patch of bazel-remote:

$ git diff
diff --git a/server/grpc_bytestream.go b/server/grpc_bytestream.go
index b24969c..feaf024 100644
--- a/server/grpc_bytestream.go
+++ b/server/grpc_bytestream.go
@@ -5,6 +5,7 @@ import (
        "errors"
        "fmt"
        "io"
+       "os"
        "strconv"
        "strings"

@@ -53,6 +54,10 @@ func (s *grpcServer) Read(req *bytestream.ReadRequest,
                return err
        }

+       if hash == "b8219593f9d14091f88892f0861f4e3af53c4db510950d32fae459241fc83017" {
+               os.Exit(42)
+       }
+
        if size == 0 {
                if cmp == casblob.Identity {
                        s.accessLogger.Printf("GRPC BYTESTREAM READ COMPLETED %s", req.ResourceName)

Running bazel:

ERROR: /home/hauser/work/bazel_remote_cache_fail_issues/BUILD:7:8: Executing genrule //:second failed: Failed to fetch blobs because of a remote cache error.: io.grpc.StatusRuntimeException: UNAVAILABLE: io exception

Which operating system are you running Bazel on?

Ubuntu 20.04

What is the output of bazel info release?

release 7.4.0

Have you found anything relevant by searching the web?

No local fallback after cache timeout - This issue is possibly related, and the current one may be a duplicate of it (although the error messages are slightly different)
Rethink spawn strategies - A bag of ideas how caching can change going forward

@github-actions github-actions bot added the team-Remote-Exec Issues and PRs for the Execution (Remote) team label Oct 28, 2024
@coeuvre coeuvre added P2 We'll consider working on this in future. (Assignee optional) and removed more data needed untriaged labels Oct 29, 2024