control: Ensure endpoints are driven to readiness #1014

Merged

olix0r merged 1 commit into main from ver/control-spawn-ready on May 21, 2021

Conversation

@olix0r (Member) commented on May 20, 2021

When there are multiple replicas of a controller--especially the
destination controller--the proxy creates a load balancer to distribute
requests across all controller pods.

linkerd/linkerd2#6146 describes a situation where controller connections
fail to be established because the client stalls for 50s+ between
initiating a connection and sending a TLS ClientHello, long after the
server has timed out the idle connection.

As it turns out, the controller client does not necessarily drive all of
its endpoints to readiness. Because load balancers are designed to
process requests when only a subset of endpoints are available, the load
balancer cannot be responsible for driving all endpoints in a service to
readiness; instead, a `SpawnReady` layer is needed to drive
individual endpoints to readiness. While the outbound proxy's
balancers are instrumented with this layer, the controller clients were
not configured this way when load balancers were introduced.

We likely have not encountered this previously because the balancer
should effectively hide this problem in most cases: as long as a single
endpoint is available, requests should be processed as expected; and if
there are no endpoints available, the balancer would drive at least one
to readiness in order to process requests.
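
For illustration, here is a minimal, self-contained sketch of the technique using tower's `SpawnReady` middleware; it is not linkerd2-proxy's actual controller-client stack, and the `service_fn` endpoint, request type, and crate features (`tower` 0.4 with the `spawn-ready` and `util` features, `tokio` 1.x) are assumptions for the example:

```rust
// Minimal sketch, assuming tower = "0.4" (features: "spawn-ready",
// "util") and tokio = "1" (features: "full"); not linkerd's real stack.
use std::convert::Infallible;

use tower::{service_fn, spawn_ready::SpawnReady, Service, ServiceExt};

#[tokio::main]
async fn main() -> Result<(), tower::BoxError> {
    // A stand-in endpoint service; in the proxy this would be the
    // client for a single controller pod.
    let endpoint = service_fn(|req: &'static str| async move {
        Ok::<_, Infallible>(format!("handled: {req}"))
    });

    // `SpawnReady` wraps the endpoint: when `poll_ready` is pending,
    // the readiness future is spawned onto the runtime as a background
    // task, so the endpoint keeps making progress even if its caller
    // (e.g. a balancer that already has another ready endpoint) stops
    // polling it.
    let mut svc = SpawnReady::new(endpoint);

    // Drive the wrapped service to readiness and send one request.
    let rsp = svc.ready().await?.call("hello").await?;
    println!("{rsp}");
    Ok(())
}
```

The change in this PR configures the controller client stack with this kind of layer so that every controller endpoint in the balancer is driven to readiness, not only the endpoints the balancer happens to poll.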

@kleimkuhler (Contributor) left a comment:

Looks good

@hawkw (Contributor) left a comment:

lgtm!

@olix0r merged commit 3d02866 into main on May 21, 2021
@olix0r deleted the ver/control-spawn-ready branch on May 21, 2021 at 00:17
olix0r added a commit to linkerd/linkerd2 that referenced this pull request on May 27, 2021
* Controller clients of components with more than one replica could fail
  to drive all connections to completion. This could result in timeouts
  showing up in logs, but would not have prevented proxies from
  communicating with controllers. #6146
* linkerd/linkerd2-proxy#992 made the `l5d-dst-override` header required
  for ingress-mode proxies. This behavior has been reverted so that
  requests without this header are forwarded to their original
  destination.
* OpenCensus trace spans for HTTP requests no longer include query
  parameters.

---

* ci: Update/pin action dependencies (linkerd/linkerd2-proxy#1012)
* control: Ensure endpoints are driven to readiness (linkerd/linkerd2-proxy#1014)
* Make span name without query string (linkerd/linkerd2-proxy#1013)
* ingress: Restore original dst address routing (linkerd/linkerd2-proxy#1016)
* ci: Restrict permissions in Actions (linkerd/linkerd2-proxy#1019)
* Forbid unsafe code in most modules (linkerd/linkerd2-proxy#1018)