Vector source/sink not propagating non-retryable failures #17895

Closed
dsmith3197 opened this issue Jul 6, 2023 Discussed in #17873 · 1 comment · Fixed by #17904
Labels
sink: vector Anything `vector` sink related source: vector Anything `vector` source related type: bug A code related bug.

Comments

@dsmith3197

Discussed in #17873

Originally posted by sbalmos July 5, 2023
I'm still trying to quantify exactly what the bug is, but it seems like the vector source/sink does not propagate non-retryable delivery failures from the end sink back upstream. In my setup, I have a Vector instance acting as a sort of post office delivery multiplexer, reading from Kafka and distributing to one or more exporter Vector instances, connected via the vector sink.

                            / -> (vector sink) -> (vector source) Exporter 1 -> http sink
Kafka -> (kafka source) Mux | -> (vector sink) -> (vector source) Exporter 2 -> loki sink
                            \ -> (vector sink) -> (vector source) Exporter 3 -> splunk_hec_logs sink

In exporters where some failures are non-retryable (e.g. the HTTP sink receiving a 400), the event appears to be dropped appropriately at the exporter level. However, this drop does not seem to be communicated back through the vector protocol to the Mux instance as a hard failure acknowledgement or similar signal. The vector sink on the Mux instance apparently sees a delivery failure (or rather a lack of acknowledgement) and retries delivery of the message to the exporter over and over. This continues ad nauseam until the retry count is exceeded or (more likely) the Mux's buffer for that exporter's vector sink fills up, which then has the follow-on effect of stopping the whole Mux in its tracks, halting delivery of all messages to all destination exporters.

@dsmith3197

The issue here is that the Vector source will respond with internal or data_loss error codes (code ref) on failed delivery, but the Vector sink treats those as retryable errors (code ref).

To resolve this, we need to update either the Vector source or sink to make them consistent in how they handle rejected data.
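
To make the mismatch concrete, here is a minimal sketch of the behavior described above. It is a toy model, not Vector's code: the `Code` enum, `retries_before_fix`, and `attempts_until_settled` are hypothetical names, and the only assumption taken from this thread is that the sink retried on every error code the source could return.

```rust
/// Illustrative subset of gRPC status codes; not Vector's actual types.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Code {
    Ok,
    Unavailable,
    Internal,
    DataLoss,
}

/// Hypothetical pre-fix sink policy: every non-OK response is retried,
/// including `DataLoss`, which the source uses for rejected data.
pub fn retries_before_fix(code: Code) -> bool {
    code != Code::Ok
}

/// Counts how many times a request is sent when the exporter always answers
/// with `response`, given a cap of `max_attempts`.
pub fn attempts_until_settled(response: Code, max_attempts: u32) -> u32 {
    let mut attempts = 0;
    while attempts < max_attempts {
        attempts += 1;
        if !retries_before_fix(response) {
            // The sink accepts this response as final and stops resending.
            break;
        }
    }
    attempts
}
```

Under this model, `attempts_until_settled(Code::DataLoss, 5)` uses up all five attempts even though the exporter has already dropped the event for good, which matches the retry loop reported in the original discussion.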

github-merge-queue bot pushed a commit that referenced this issue Jul 7, 2023
Fixes #17895 as discussed in #17873. The Vector source propagates
downstream non-retryable errors as either DataLoss or Internal gRPC
error codes. However, these error codes are not treated as non-retryable
by the corresponding upstream Vector sink, leading to a delivery retry
loop.

(update): Only treating DataLoss as non-retryable. As explained by
@dsmith3197 in a commit review, Internal errors occur when "At least one
event in the batch had a transient error in delivery" and DataLoss errors
occur when "At least one event in the batch had a permanent failure or
rejection", with Internal taking precedence over DataLoss. With that in
mind, we'll want to retry for Internal, but not DataLoss.


---------

Co-authored-by: Doug Smith <dsmith3197@users.noreply.github.com>
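
The rule in the commit message above can be sketched as two small functions: one on the source side that collapses per-event outcomes into a single response code, and one on the sink side that decides whether to retry. This is an illustration under assumptions, not the merged change: `EventStatus`, `Code`, `batch_code`, and `should_retry` are hypothetical names, and only the Internal/DataLoss semantics and their precedence come from the commit message.

```rust
/// Hypothetical per-event delivery outcome reported by the end sink.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum EventStatus {
    Delivered,
    /// Transient error in delivery.
    Errored,
    /// Permanent failure or rejection.
    Rejected,
}

/// Illustrative subset of gRPC status codes; not Vector's actual types.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Code {
    Ok,
    Internal,
    DataLoss,
}

/// Source side: collapse a batch into one response code.
/// A transient error anywhere in the batch wins over a rejection,
/// so `Internal` takes precedence over `DataLoss`.
pub fn batch_code(statuses: &[EventStatus]) -> Code {
    if statuses.contains(&EventStatus::Errored) {
        Code::Internal
    } else if statuses.contains(&EventStatus::Rejected) {
        Code::DataLoss
    } else {
        Code::Ok
    }
}

/// Sink side, following the updated rule: retry `Internal`, never `DataLoss`.
pub fn should_retry(code: Code) -> bool {
    matches!(code, Code::Internal)
}
```

With this split, a batch containing only rejected events maps to `DataLoss` and is not resent, while a batch with any transient error maps to `Internal` and is retried as a whole, including events that were rejected on the first attempt.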