bastion: Close backend connections on failure #20

niels-moller · 2024-12-02T09:52:43Z

This change aims to close a backend connection whenever its RoundTrip method fails.

Please have a look if it makes sense; I'm not familiar with the related http and http2 packages.

niels-moller · 2024-12-02T09:54:23Z

bastion/bastion.go

+			go func() {
+				ctx, cancel := context.WithTimeout(context.Background(), 60*time.Second)
+				defer cancel()
+				cc.Shutdown(ctx)
+			}()


This is duplicated from the handleBackend method below; it could use a better abstraction.
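
One possible shape for that abstraction, as a minimal sketch: the helper name `shutdownInBackground`, the `shutdowner` interface, the `shutdownTimeout` constant and the `fakeConn` type are all hypothetical and not part of bastion.go. The interface only assumes a `Shutdown(ctx) error` method, which is what the diff above calls on `cc`.

```go
package main

import (
	"context"
	"log"
	"time"
)

// shutdownTimeout mirrors the 60-second bound used in the diff above.
const shutdownTimeout = 60 * time.Second

// shutdowner is the only capability this sketch needs from a connection.
type shutdowner interface {
	Shutdown(ctx context.Context) error
}

// shutdownInBackground (hypothetical name) starts a graceful shutdown of c in
// a new goroutine, bounded by shutdownTimeout, and returns immediately.
// The returned channel is closed once Shutdown has returned.
func shutdownInBackground(c shutdowner) <-chan struct{} {
	done := make(chan struct{})
	go func() {
		defer close(done)
		ctx, cancel := context.WithTimeout(context.Background(), shutdownTimeout)
		defer cancel()
		if err := c.Shutdown(ctx); err != nil {
			log.Printf("backend shutdown: %v", err)
		}
	}()
	return done
}

// fakeConn is a stand-in used only to exercise the helper in this sketch.
type fakeConn struct{ shutdowns int }

func (f *fakeConn) Shutdown(ctx context.Context) error {
	f.shutdowns++
	return nil
}
```

Both call sites could then call `shutdownInBackground(cc)` and ignore the returned channel; the channel is only there so that callers or tests can wait for completion if they want to.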

FiloSottile · 2025-01-14T11:41:14Z

What failure scenario does this address?

In #11 we see a RoundTrip error at 2024/07/02 09:56:10 and then at 2024/07/02 09:56:11 handleBackend returns, meaning the connection was closed. Then the witness doesn't reconnect until one minute later, at which point the p.conns entry is replaced.

I think in that case the only difference this change would make is dropping the p.conns entry for the closed connection, turning whatever error a closed connection causes into a http: proxy error: backend unavailable?

niels-moller · 2025-01-14T13:34:30Z

I don't quite remember my analysis when I came up with this change. But I think my theory was that when the connection between bastion and backend (i.e., one of the witnesses) has somehow gotten into a bad state, the RoundTrip call will return an error. If at this point we throw out the backend, disconnecting the underlying TLS connection, the witness is likely to reconnect and we will recover from the bad state. Is there ever a good reason to keep calling the RoundTrip method on an http2.ClientConn after that method has started failing?

It's also unclear to me how the error from RoundTrip is propagated after this code returns: is it sent over the wire to the client (i.e., the log), or written to the bastion's log file?

FiloSottile · 2025-01-15T13:15:56Z

I think my theory was that when the connection between bastion and backend (i.e., one of the witnesses) has somehow gotten into a bad state, the RoundTrip call will return an error.

That might be, but it's not consistent with the logs in #11, where the connection is already closed immediately.

Is there ever a good reason to keep calling the RoundTrip method on an http2.ClientConn after that method has started failing?

RoundTrip can fail because the stream times out, or is aborted for any other reason, while the connection is still healthy. Remember that an HTTP/2 backend connection is shared by multiple clients (logs). If we close the whole connection just because e.g. a log sent a request that's too long and the backend aborted it, we are introducing a DoS.
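
To illustrate the distinction, here is a rough sketch of classifying a RoundTrip error as stream-scoped before deciding anything about the shared connection. It is not what the bastion does. `streamError` and `errConnClosed` are hypothetical stand-ins defined locally so the example needs only the standard library (golang.org/x/net/http2 has its own StreamError type for per-stream failures), and `onlyStreamFailed` is an invented name.

```go
package main

import (
	"context"
	"errors"
)

// streamError is a hypothetical stand-in for a stream-scoped HTTP/2 error,
// e.g. the backend resetting one stream.
type streamError struct{ streamID uint32 }

func (e streamError) Error() string { return "stream error" }

// errConnClosed is a hypothetical example of a connection-level failure.
var errConnClosed = errors.New("backend connection closed")

// onlyStreamFailed reports whether err describes the failure of a single
// request or stream, in which case the shared connection can stay in use.
func onlyStreamFailed(err error) bool {
	if err == nil {
		return false
	}
	// The request's own context expired or was canceled.
	if errors.Is(err, context.DeadlineExceeded) || errors.Is(err, context.Canceled) {
		return true
	}
	// The error is, or wraps, a per-stream error.
	var se streamError
	return errors.As(err, &se)
}
```

With a check like this, one log's aborted request would not tear down a connection that other logs are still using. What to do with the remaining errors is a separate question.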

It's also unclear to me how the error from RoundTrip is propagated after this code returns: is it sent over the wire to the client (i.e., the log), or written to the bastion's log file?

It's written to the log file prefixed with http:. There is a TODO to send it over the wire, but the problem is that the response might already be halfway through, so it's not as simple as writing a 400/500 status.

niels-moller · 2025-01-16T06:46:45Z

Ah, I hadn't thought about DoS attacks on the bastion or via the bastion (some analysis in the bastion spec would be nice). I can think of some different and obvious flavors:

Huge number of requests.
Excessive number of headers (client sends endless headers, limited only by bandwidth and TCP flow control).
Slow headers (say, client sends a byte at a time, with a sleep of k seconds before the kth byte).
Excessive amount of body data.
Slow body data.

It's not clear to me how each of these types of abuse will be handled by the bastion, which ones will reach the RoundTrip method of interest, and how that method might report failure (also depending on the behavior of the connected backend). I'm also not familiar with the details of HTTP/2 multiplexing.
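
For reference, Go's net/http has knobs that map onto some of the flavors above. The sketch below shows them on a plain `http.Server`; the function name `newLimitedServer` and all three limit values are made up for illustration, and I have not checked which of these the bastion already sets or how each one behaves on HTTP/2 streams.

```go
package main

import (
	"net/http"
	"time"
)

// Illustrative limits only; none of these values come from the bastion.
const (
	readHeaderTimeout = 10 * time.Second // bounds slow headers
	maxHeaderBytes    = 16 << 10         // bounds excessive headers
	maxBodyBytes      = 1 << 20          // bounds excessive body data
)

// newLimitedServer (hypothetical name) wraps h so that request bodies larger
// than maxBodyBytes fail on read, and configures header limits on the server.
func newLimitedServer(addr string, h http.Handler) *http.Server {
	return &http.Server{
		Addr:              addr,
		Handler:           http.MaxBytesHandler(h, maxBodyBytes),
		ReadHeaderTimeout: readHeaderTimeout,
		MaxHeaderBytes:    maxHeaderBytes,
	}
}
```

This does nothing about request volume or slow body data; those would need rate limiting and read deadlines respectively.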

I hope the new logging will provide some more info on what's going wrong.

niels-moller · 2025-01-20T07:23:25Z

Two additional observations:

It seems the problems have been temporarily solved by restarting the log server. To me, that indicates a problem with the client-bastion connection (which could also be a long-lived http or http2 connection) rather than the backend-bastion connection. I'd like to understand why, but one potential workaround could be to have the log server tear down the connection when an add-checkpoint request fails, and reconnect at the next attempt. Strings like "INTERNAL_ERROR" in the logged error messages, which I suspect originate in the http2 stack, give me the impression that long-lived http2 isn't stable enough?
On tearing down backend connections: when I looked in the code (which was a few months ago, so it might have changed), I failed to find any code path that detects that a backend has disconnected and drops it from the map of connected backends. I only saw the code to replace any old map entry when a backend reconnects.
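
A sketch of what such a drop-on-disconnect path could look like, assuming a mutex-protected map similar to p.conns. The types `backendSet` and `backendConn`, the method names, and the string key are all hypothetical. The one detail worth showing is that removal compares against the current entry, so a late cleanup for an old connection cannot delete the entry of a backend that has already reconnected.

```go
package main

import "sync"

// backendConn stands in for the per-backend connection type.
type backendConn struct{ id int }

// backendSet is a hypothetical stand-in for the map of connected backends.
type backendSet struct {
	mu    sync.Mutex
	conns map[string]*backendConn
}

func newBackendSet() *backendSet {
	return &backendSet{conns: make(map[string]*backendConn)}
}

// replace installs c for key, overwriting any previous entry (the behavior
// described above for a reconnecting backend).
func (s *backendSet) replace(key string, c *backendConn) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.conns[key] = c
}

// removeIfCurrent deletes the entry for key only if it still points at c.
// It reports whether an entry was removed.
func (s *backendSet) removeIfCurrent(key string, c *backendConn) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.conns[key] != c {
		return false
	}
	delete(s.conns, key)
	return true
}

// lookup returns the current connection for key, or nil if there is none.
func (s *backendSet) lookup(key string) *backendConn {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.conns[key]
}
```

The code that serves a backend connection could call `removeIfCurrent` when it returns, which is the point where the connection is known to be gone.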

bastion: Close backend connections on failure

fcc8475
