control: Ensure endpoints are driven to readiness #1014

Merged

olix0r merged 1 commit into main from ver/control-spawn-ready on May 21, 2021

Conversation

@olix0r (Member) commented on May 20, 2021

When there are multiple replicas of a controller--especially the
destination controller--the proxy creates a load balancer to distribute
requests across all controller pods.

linkerd/linkerd2#6146 describes a situation where controller connections
fail to be established because the client stalls for 50s+ between
initiating a connection and sending a TLS ClientHello, long after the
server has timed out the idle connection.

As it turns out, the controller client does not necessarily drive all of
its endpoints to readiness. Because load balancers are designed to
process requests when only a subset of endpoints are available, the load
balancer cannot be responsible for driving all endpoints in a service to
readiness; instead, a `SpawnReady` layer is needed to drive
individual endpoints to readiness. While the outbound proxy's
balancers are instrumented with this layer, the controller clients were
not configured this way when load balancers were introduced.

We likely have not encountered this previously because the balancer
should effectively hide this problem in most cases: as long as a single
endpoint is available, requests should be processed as expected; and if
there are no endpoints available, the balancer would drive at least one
to readiness in order to process requests.
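
For illustration, here is a minimal, self-contained sketch of the technique using tower's `SpawnReady` middleware; it is not linkerd2-proxy's actual controller-client stack, and the `service_fn` endpoint, request type, and crate features (`tower` 0.4 with the `spawn-ready` and `util` features, `tokio` 1.x) are assumptions for the example:

```rust
// Minimal sketch, assuming tower = "0.4" (features: "spawn-ready",
// "util") and tokio = "1" (features: "full"); not linkerd's real stack.
use std::convert::Infallible;

use tower::{service_fn, spawn_ready::SpawnReady, Service, ServiceExt};

#[tokio::main]
async fn main() -> Result<(), tower::BoxError> {
    // A stand-in endpoint service; in the proxy this would be the
    // client for a single controller pod.
    let endpoint = service_fn(|req: &'static str| async move {
        Ok::<_, Infallible>(format!("handled: {req}"))
    });

    // `SpawnReady` wraps the endpoint: when `poll_ready` is pending,
    // the readiness future is spawned onto the runtime as a background
    // task, so the endpoint keeps making progress even if its caller
    // (e.g. a balancer that already has another ready endpoint) stops
    // polling it.
    let mut svc = SpawnReady::new(endpoint);

    // Drive the wrapped service to readiness and send one request.
    let rsp = svc.ready().await?.call("hello").await?;
    println!("{rsp}");
    Ok(())
}
```

The change in this PR configures the controller client stack with this kind of layer so that every controller endpoint in the balancer is driven to readiness, not only the endpoints the balancer happens to poll.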

@kleimkuhler (Contributor) left a comment:

Looks good

@hawkw (Contributor) left a comment:

lgtm!

@olix0r merged commit 3d02866 into main on May 21, 2021
@olix0r deleted the ver/control-spawn-ready branch on May 21, 2021 at 00:17
olix0r added a commit to linkerd/linkerd2 that referenced this pull request on May 27, 2021
* Controller clients of components with more than one replica could fail
  to drive all connections to completion. This could result in timeouts
  showing up in logs, but would not have prevented proxies from
  communicating with controllers. #6146
* linkerd/linkerd2-proxy#992 made the `l5d-dst-override` header required
  for ingress-mode proxies. This behavior has been reverted so that
  requests without this header are forwarded to their original
  destination.
* OpenCensus trace spans for HTTP requests no longer include query
  parameters.

---

* ci: Update/pin action dependencies (linkerd/linkerd2-proxy#1012)
* control: Ensure endpoints are driven to readiness (linkerd/linkerd2-proxy#1014)
* Make span name without query string (linkerd/linkerd2-proxy#1013)
* ingress: Restore original dst address routing (linkerd/linkerd2-proxy#1016)
* ci: Restrict permissions in Actions (linkerd/linkerd2-proxy#1019)
* Forbid unsafe code in most modules (linkerd/linkerd2-proxy#1018)