Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bastion: Close backend connections on failure #20

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

niels-moller
Copy link

This change aims to close backend connections whenever its Roundtrip method fails.

Please have a look if it makes sense; I'm not familiar with the related http and http2 packages.

Comment on lines +185 to +189
go func() {
ctx, cancel := context.WithTimeout(context.Background(), 60*time.Second)
defer cancel()
cc.Shutdown(ctx)
}()
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Duplicated from the handleBackend method below, could use some better abstraction.

@FiloSottile
Copy link
Owner

What failure scenario does this address?

In #11 we see a RoundTrip error at 2024/07/02 09:56:10 and then at 2024/07/02 09:56:11 handleBackend returns, meaning the connection was closed. Then the witness doesn't reconnect until one minute later, at which point the p.conns entry is replaced.

I think in that case the only difference this change would make is dropping the p.conns entry for the closed connection, turning whatever error a closed connection causes into a http: proxy error: backend unavailable?

@niels-moller
Copy link
Author

I don't quite remember my analysis when I came up with this change. But I think my theory was that the connection between bastion and backend (i.e., one of the witnesses) has somehow gotten into a bad state, then the RoudTrip call will return error. If at this point we throw out the backend, disconnecting the underlying tls connection, the witness is likely to reconnect and we will recover from the bad state. Is there ever a good reason to keep calling the RoundTrip method on a http2.ClientConn after that method have started failing?

It's also unclear to my how the error from RoundtTrip is propagated after this code returns, is it sent over the wire to the client (i.e., the log), or written to the bastion's log file?

@FiloSottile
Copy link
Owner

I think my theory was that the connection between bastion and backend (i.e., one of the witnesses) has somehow gotten into a bad state, then the RoudTrip call will return error.

That might be, but it's not consistent with the logs in #11, where the connection is already closed immediately.

Is there ever a good reason to keep calling the RoundTrip method on a http2.ClientConn after that method have started failing?

RoundTrip can fail because the stream times out, or is aborted for any other reason, while the connection is still healthy. Remember that a HTTP/2 backend connection is shared by multiple clients (logs). If we close the whole connection just because e.g. a log sent a request that's too long and the backend aborted it, we are introducing a DoS.

It's also unclear to my how the error from RoundtTrip is propagated after this code returns, is it sent over the wire to the client (i.e., the log), or written to the bastion's log file?

It's written to the log file prefixed with http:. There is a TODO to send it over the wire, but the problem is that the response might already be halfway through, so it's not as simple as writing a 400/500 status.

@niels-moller
Copy link
Author

Ah, I haven't thought about DoS attacks on the bastion or via the bastion (would be nice with some analysis in the bastion spec). I can think of some different and obvious flavors:

  1. Huge amount of requests.
  2. Excessive amount of headers (client sends endless headers limited only by bandwidth and tcp flow control)
  3. Slow headers (say, client sends a byte at a time, with a sleep of k seconds before the kth byte.
  4. Excessive amount of body data.
  5. Slow body data.

Not clear to me how each of these types of abuse will be handled by the bastion, which ones will reach the Roundtrip method of interest, and how that method might report failure (also depending also on the behavior of the connected backend). And I'm also not familiar with the details of http/2 multiplexing.

I hope the new logging will provide some more info on what's going wrong.

@niels-moller
Copy link
Author

Two additional observations:

  1. It seems the problems have been temporarily solved by restarting the log server. To me, that indicates a problem with the client-bastion connection (which could also be a long-lived http or http2 connection) rather than the backend-bastion connection. I'd like to understand why, but one potential workaround could be to have the log server tear down the connection when an add-checkpoint request fails, and reconnect at next attempt. Strings like "INTERNAL_ERROR" in the logged error messages, which I suspect originate in the http2 stack, gives me the impression that long-lived http2 isn't stable enough?
  2. On tearing down backend connections: when I looked in the code (which was a few months ago, so might have changed), I failed to find any code path that detects that a backend has disconnected, and drops it from the map of connected backends. I only saw the code to replace any old map entry when a backend reconnects.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants