Mistake in a BackendTrafficPolicy causes all routes to return 404 #5147

Closed
dghubble opened this issue Jan 25, 2025 · 5 comments · Fixed by #5176
Labels: area/xds-translator, kind/bug (Something isn't working)

Comments

@dghubble

dghubble commented Jan 25, 2025

Description:

A colleague and I found that a subtle mistake in a single BackendTrafficPolicy can make envoy proxy instances return 404s for ALL routes.

apiVersion: gateway.envoyproxy.io/v1alpha1
kind: BackendTrafficPolicy
metadata:
  name: hellogo
  namespace: default
spec:
  targetRefs:
    - group: gateway.networking.k8s.io
      kind: HTTPRoute
      name: hello
  retry:
    numRetries: 1
    perRetry:
      backOff:
        baseInterval: 0s      # 0s breaks everything, 1s is ok

Repro steps:

Create a BackendTrafficPolicy as shown above. Nothing stops a developer from setting baseInterval: 0s.

At first, nothing is wrong. Then, if you restart the envoy proxies, you'll find ALL HTTPRoutes return 404s immediately. Logs show route_not_found for all requests, but no mention of why or which resource causes this. Inspecting the raw envoy config via the admin portal, the dynamic_route_configs section is never generated (usually it's populated).

To find the offending resource, we had to delete resources until discovering the problematic thing was this one BackendTrafficPolicy and this one value within it. Pretty scary to us. Questions:

  • What values should be allowed in baseInterval?
  • What validations can be done to stop misconfigurations like this?
  • Supposing there are other (perhaps future) resource misconfigs / validation issues, how can those be scoped to avoid breaking all routes?
  • How can a user identify the problematic resources, either in envoy-gateway or the envoy proxy? Here we had to guess and test

Environment:

envoy-gateway: v1.2.5

Logs:

@arkodg arkodg added kind/bug Something isn't working area/xds-translator and removed triage labels Jan 25, 2025
@arkodg arkodg added this to the v1.3.0 milestone Jan 25, 2025
@arkodg arkodg added the help wanted Extra attention is needed label Jan 25, 2025
@arkodg
Contributor

arkodg commented Jan 25, 2025

looks like this one managed to escape all the checks, here's the error from the envoy proxy

[2025-01-25 03:01:44.281][1][warning][config] [source/extensions/config_subscription/grpc/grpc_subscription_impl.cc:138] gRPC config for type.googleapis.com/envoy.config.route.v3.RouteConfiguration rejected: Proto constraint validation failed (RouteConfigurationValidationError.VirtualHosts[0]: embedded message failed validation | caused by VirtualHostValidationError.Routes[0]: embedded message failed validation | caused by RouteValidationError.Route: embedded message failed validation | caused by RouteActionValidationError.RetryPolicy: embedded message failed validation | caused by RetryPolicyValidationError.RetryBackOff: embedded message failed validation | caused by RetryBackOffValidationError.BaseInterval: **value must be greater than 0s**): name: "default/eg/http"

We have 3 levels of validation:

  1. Apply time
     • Validated by the Kubernetes API server against the CRD OpenAPI schema, which is generated from Kubebuilder and CEL tags. Config that fails validation is rejected.
  2. Runtime
     • Validated by the gateway-api runner in Envoy Gateway.
     • More complex validations happen here. If any validation fails:
       • A negative status is added to the Policy resource.
       • A 500 Direct Response is attached to the targeted Route, implementing a fail-closed state: requests targeting this route fail, which should make the problem faster to debug.
  3. xDS translation
     • We run Validate on the xDS resource, which executes the proto validations for the resources defined by envoy proxy. If this fails, it should be logged in envoy-gateway. We plan on bubbling this up as a status too in the future.
     • The bad value should have been caught here, since we didn't add the "must be greater than 0s" validation anywhere else. It wasn't, so the config was pushed to envoy proxy, which rejected the entire route config.
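
To make the fail-closed behaviour of the runtime level concrete, here is a minimal Go sketch. It is not Envoy Gateway's code: the `policy` and `route` types, the `failClosed` function, and the millisecond field are all hypothetical stand-ins. It only illustrates the idea that an invalid policy should turn its own target route into a 500 while leaving every other route alone.

```go
package main

import "fmt"

// policy and route are simplified, hypothetical stand-ins for the real resources.
type policy struct {
	targetRoute        string
	baseIntervalMillis int
}

type route struct {
	name           string
	directResponse int // 0 means "forward to the backend"
}

// failClosed returns a copy of routes in which every route targeted by an
// invalid policy answers with a 500 direct response. Other routes are untouched.
func failClosed(routes []route, policies []policy) []route {
	out := append([]route(nil), routes...)
	for _, p := range policies {
		if p.baseIntervalMillis > 0 {
			continue // policy is valid, nothing to do
		}
		for i := range out {
			if out[i].name == p.targetRoute {
				out[i].directResponse = 500
			}
		}
	}
	return out
}

func main() {
	routes := []route{{name: "hello"}, {name: "other"}}
	policies := []policy{{targetRoute: "hello", baseIntervalMillis: 0}}
	for _, r := range failClosed(routes, policies) {
		fmt.Printf("%s -> directResponse=%d\n", r.name, r.directResponse)
	}
}
```

With this scoping, a mistake like the one in this issue would affect only the route named in the policy's targetRefs instead of the whole route config.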

@arkodg
Contributor

arkodg commented Jan 25, 2025

@zhaohuabing any idea why the xDS validation didn't kick in?

this issue can be fixed by adding a CEL validation for this case
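
The check itself is small. The sketch below is not the CEL rule and not the code from the eventual fix; it is a plain Go illustration of the same constraint, with `validateBaseInterval` as a hypothetical helper name. It assumes the value arrives as a duration string such as `0s` or `100ms`, which Go's `time.ParseDuration` can parse.

```go
package main

import (
	"fmt"
	"time"
)

// validateBaseInterval is a hypothetical helper, not Envoy Gateway's code.
// It rejects any retry back-off base interval that is not strictly positive,
// mirroring the "value must be greater than 0s" constraint reported by envoy.
func validateBaseInterval(raw string) error {
	d, err := time.ParseDuration(raw)
	if err != nil {
		return fmt.Errorf("baseInterval %q is not a valid duration: %w", raw, err)
	}
	if d <= 0 {
		return fmt.Errorf("baseInterval must be greater than 0s, got %q", raw)
	}
	return nil
}

func main() {
	for _, v := range []string{"0s", "1s", "100ms"} {
		if err := validateBaseInterval(v); err != nil {
			fmt.Println("rejected:", err)
			continue
		}
		fmt.Println("accepted:", v)
	}
}
```

Expressing the rule at apply time (as CEL on the CRD) has the advantage that the API server rejects the resource before it ever reaches the translator.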

@zhaohuabing
Member

zhaohuabing commented Jan 25, 2025

@arkodg the validation is done in the ResourceVersionTable.AddXdsResource function, but Routes are added to the RouteConfiguration after tCtx.AddXdsResource(resourcev3.RouteType, xdsRouteCfg) is called. So the validations for RouteConfiguration are skipped.

    if xdsRouteCfg == nil {
        xdsRouteCfg = &routev3.RouteConfiguration{
            IgnorePortInHostMatching: true,
            Name:                     httpListener.Name,
        }
        // Validation runs here, while the RouteConfiguration is still empty.
        if err = tCtx.AddXdsResource(resourcev3.RouteType, xdsRouteCfg); err != nil {
            errs = errors.Join(errs, err)
        }
    }
    // Generate xDS virtual hosts and routes for the given HTTPListener,
    // and add them to the xDS route config.
    if err = t.addRouteToRouteConfig(tCtx, xdsRouteCfg, httpListener, metrics, http3Enabled); err != nil {
        errs = errors.Join(errs, err)
    }

This may happen in other xDS validations as well. I'm going to send a PR to fix it.
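
The ordering problem can be reproduced in isolation. The following Go sketch uses made-up types (`route`, `routeConfig`, `table`) rather than the real go-control-plane ones, and integer milliseconds instead of proto durations. It only shows why validating a resource when it is added, and then mutating it afterwards, lets an invalid route slip through.

```go
package main

import "fmt"

// route and routeConfig are simplified stand-ins for the xDS types.
type route struct {
	name               string
	baseIntervalMillis int
}

type routeConfig struct {
	name   string
	routes []route
}

// validate mimics a proto constraint check over the whole message.
func (c *routeConfig) validate() error {
	for _, r := range c.routes {
		if r.baseIntervalMillis <= 0 {
			return fmt.Errorf("route %q: base interval must be greater than 0", r.name)
		}
	}
	return nil
}

// table validates a resource at the moment it is added, and never again.
type table struct{ configs []*routeConfig }

func (t *table) add(c *routeConfig) error {
	if err := c.validate(); err != nil {
		return err
	}
	t.configs = append(t.configs, c)
	return nil
}

// addThenPopulate follows the ordering described above: the config is
// validated while empty, and routes are appended only afterwards.
func addThenPopulate(t *table, routes []route) error {
	cfg := &routeConfig{name: "default/eg/http"}
	err := t.add(cfg) // an empty config always passes
	cfg.routes = append(cfg.routes, routes...)
	return err
}

// populateThenAdd fills in the routes first, so validation sees them.
func populateThenAdd(t *table, routes []route) error {
	cfg := &routeConfig{name: "default/eg/http", routes: routes}
	return t.add(cfg)
}

func main() {
	bad := []route{{name: "hello", baseIntervalMillis: 0}}
	fmt.Println("add then populate:", addThenPopulate(&table{}, bad))
	fmt.Println("populate then add:", populateThenAdd(&table{}, bad))
}
```

In the first ordering the invalid route ends up in the table without an error; in the second the same route is rejected before it is stored.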

@zhaohuabing zhaohuabing self-assigned this Jan 25, 2025
@zhaohuabing zhaohuabing removed the help wanted Extra attention is needed label Jan 25, 2025
@zhaohuabing zhaohuabing removed their assignment Jan 25, 2025
@zhaohuabing zhaohuabing added the help wanted Extra attention is needed label Jan 25, 2025
@zhaohuabing
Member

Created #5148 to add missing validations. The CEL validation/Gateway API translator validation for baseInterval can be addressed in a separate PR.

@arkodg arkodg self-assigned this Jan 27, 2025
@arkodg arkodg removed the help wanted Extra attention is needed label Jan 27, 2025
arkodg added a commit to arkodg/gateway that referenced this issue Jan 30, 2025
Fixes: envoyproxy#5147

Signed-off-by: Arko Dasgupta <arko@tetrate.io>
@arkodg arkodg closed this as completed in 4844d9a Jan 30, 2025
guydc pushed a commit to guydc/gateway that referenced this issue Jan 31, 2025
* fail validation if baseInterval is 0s

Fixes: envoyproxy#5147

Signed-off-by: Arko Dasgupta <arko@tetrate.io>

* more validations

Signed-off-by: Arko Dasgupta <arko@tetrate.io>

---------

Signed-off-by: Arko Dasgupta <arko@tetrate.io>
guydc pushed a commit to guydc/gateway that referenced this issue Jan 31, 2025
* fail validation if baseInterval is 0s

Fixes: envoyproxy#5147

Signed-off-by: Arko Dasgupta <arko@tetrate.io>

* more validations

Signed-off-by: Arko Dasgupta <arko@tetrate.io>

---------

Signed-off-by: Arko Dasgupta <arko@tetrate.io>
(cherry picked from commit 4844d9a)
Signed-off-by: Guy Daich <guy.daich@sap.com>
guydc added a commit that referenced this issue Jan 31, 2025
* doc: response compression (#5071)

compression docs

Signed-off-by: Huabing Zhao <zhaohuabing@gmail.com>
(cherry picked from commit 549fdde)
Signed-off-by: Guy Daich <guy.daich@sap.com>

* docs: how to specify a self-signed ca for the remote jwks host in the SP JWT settings. (#5085)

* docs for jwt self-signed ca

Signed-off-by: Huabing Zhao <zhaohuabing@gmail.com>

* fix gen

Signed-off-by: Huabing Zhao <zhaohuabing@gmail.com>

* update docs

Signed-off-by: Huabing Zhao <zhaohuabing@gmail.com>

---------

Signed-off-by: Huabing Zhao <zhaohuabing@gmail.com>
(cherry picked from commit fdc7849)
Signed-off-by: Guy Daich <guy.daich@sap.com>

* chore: fix gen (#5166)

fix gen

Signed-off-by: Huabing (Robin) Zhao <zhaohuabing@gmail.com>
(cherry picked from commit 34db8af)
Signed-off-by: Guy Daich <guy.daich@sap.com>

* docs: add api key auth instructions (#5097)

* docs: add api key auth instruction

Signed-off-by: Taufik Mulyana <nothinux@gmail.com>

* fix: remove unrelated links

Signed-off-by: Taufik Mulyana <nothinux@gmail.com>

---------

Signed-off-by: Taufik Mulyana <nothinux@gmail.com>
(cherry picked from commit b5cf087)
Signed-off-by: Guy Daich <guy.daich@sap.com>

* add SECURITY.md (#5167)

Signed-off-by: Arko Dasgupta <arko@tetrate.io>
(cherry picked from commit f7a10eb)
Signed-off-by: Guy Daich <guy.daich@sap.com>

* chore: link SECURITY.md (#5168)

Signed-off-by: Arko Dasgupta <arko@tetrate.io>
(cherry picked from commit ac9026f)
Signed-off-by: Guy Daich <guy.daich@sap.com>

* build(deps): bump actions/stale from 9.0.0 to 9.1.0 (#5162)

Bumps [actions/stale](https://github.com/actions/stale) from 9.0.0 to 9.1.0.
- [Release notes](https://github.com/actions/stale/releases)
- [Changelog](https://github.com/actions/stale/blob/main/CHANGELOG.md)
- [Commits](actions/stale@28ca103...5bef64f)

---
updated-dependencies:
- dependency-name: actions/stale
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Arko Dasgupta <arkodg@users.noreply.github.com>
(cherry picked from commit 57d4aa8)
Signed-off-by: Guy Daich <guy.daich@sap.com>

* docs: rm sectionName from some of the examples (#5173)

adds whats left off from #4868

deleted the sectionName in these examples because the Service spec does
not define a port `Name`

Signed-off-by: Arko Dasgupta <arko@tetrate.io>
(cherry picked from commit 45804e2)
Signed-off-by: Guy Daich <guy.daich@sap.com>

* ci(fix): osv-scanner PR mode (#5174)

fix: osv-scanner PR mode

Signed-off-by: shahar-h <shahar.harari@sap.com>
Co-authored-by: Guy Daich <guy.daich@sap.com>
(cherry picked from commit e904d3f)
Signed-off-by: Guy Daich <guy.daich@sap.com>

* wip: docs: add standalone in container instruction (#5172)

* docs: add standalone in container instruction

Signed-off-by: Denis Shatokhin <d_shatokhin@outlook.com>

* docs: update headings and image tag

Signed-off-by: Denis Shatokhin <d_shatokhin@outlook.com>

---------

Signed-off-by: Denis Shatokhin <d_shatokhin@outlook.com>
(cherry picked from commit a3448c1)
Signed-off-by: Guy Daich <guy.daich@sap.com>

* docs: update prerequisites files with installation and connectivity t… (#5094)

* docs: update prerequisites files with installation and connectivity testing steps

Signed-off-by: DeeBi9 <deepanshudb1@gmail.com>

* lint

Signed-off-by: DeeBi9 <deepanshudb1@gmail.com>

* docs: remove the Note

Signed-off-by: DeeBi9 <deepanshudb1@gmail.com>

* remove redundant code

Signed-off-by: DeeBi9 <deepanshudb1@gmail.com>

---------

Signed-off-by: DeeBi9 <deepanshudb1@gmail.com>
(cherry picked from commit 3253339)
Signed-off-by: Guy Daich <guy.daich@sap.com>

* [release/v1.3] fix 1.3.0-rc.1 release note (#5175)

* fix 1.3.0-rc.1 release note

Signed-off-by: Guy Daich <guy.daich@sap.com>

* more fixes

Signed-off-by: Guy Daich <guy.daich@sap.com>

---------

Signed-off-by: Guy Daich <guy.daich@sap.com>
(cherry picked from commit 4fba2bf)
Signed-off-by: Guy Daich <guy.daich@sap.com>

* fail validation if baseInterval is 0s (#5176)

* fail validation if baseInterval is 0s

Fixes: #5147

Signed-off-by: Arko Dasgupta <arko@tetrate.io>

* more validations

Signed-off-by: Arko Dasgupta <arko@tetrate.io>

---------

Signed-off-by: Arko Dasgupta <arko@tetrate.io>
(cherry picked from commit 4844d9a)
Signed-off-by: Guy Daich <guy.daich@sap.com>

* [release/1.3] release notes (#5177)

Signed-off-by: Guy Daich <guy.daich@sap.com>
(cherry picked from commit c2215b2)
Signed-off-by: Guy Daich <guy.daich@sap.com>

---------

Signed-off-by: Huabing Zhao <zhaohuabing@gmail.com>
Signed-off-by: Guy Daich <guy.daich@sap.com>
Signed-off-by: Huabing (Robin) Zhao <zhaohuabing@gmail.com>
Signed-off-by: Taufik Mulyana <nothinux@gmail.com>
Signed-off-by: Arko Dasgupta <arko@tetrate.io>
Signed-off-by: dependabot[bot] <support@github.com>
Signed-off-by: shahar-h <shahar.harari@sap.com>
Signed-off-by: Denis Shatokhin <d_shatokhin@outlook.com>
Signed-off-by: DeeBi9 <deepanshudb1@gmail.com>
Co-authored-by: Huabing (Robin) Zhao <zhaohuabing@gmail.com>
Co-authored-by: Taufik Mulyana <17433202+nothinux@users.noreply.github.com>
Co-authored-by: Arko Dasgupta <arkodg@users.noreply.github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: shahar-h <shahar.harari@sap.com>
Co-authored-by: Denis Shatokhin <d_shatokhin@outlook.com>
Co-authored-by: Deepanshu Bisht <113498676+DeeBi9@users.noreply.github.com>
@dda104

dda104 commented Jan 31, 2025

Same problem with all routes 404 when using filters

---
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: something
spec:
  parentRefs:
    - name: eg
      namespace: something
      sectionName: something
  hostnames:
    - something.com
  rules:
    - backendRefs:
        - group: ""
          kind: Service
          name: something
          port: 123
          weight: 1
      filters:
        - type: RequestHeaderModifier
          requestHeaderModifier:
            add:
              - name: Host
                value: something.com
              - name: X-Forwarded-Proto
                value: https
              - name: X-Forwarded-Host
                value: something.com
      matches:
        - path:
            type: PathPrefix
            value: /

I understand that the filters are not needed here and may be written incorrectly. I'm just reporting that a problem in one HTTPRoute affects all HTTPRoutes in the cluster.
