Metrics don't ship to Datadog after 0.34.0 upgrade #19110

Closed
a26nine opened this issue Nov 10, 2023 · 13 comments · Fixed by #19138
Labels
meta: regression (This issue represents a regression)
type: bug (A code related bug.)

Comments

@a26nine

a26nine commented Nov 10, 2023

A note for the community

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

Problem

I upgraded from 0.33.1 to 0.34.0, and metrics stopped appearing in Datadog.

I see this in the changelog:

The datadog_agent source now records the “interval” on any incoming metrics that have it set rather than just rate. This is useful as metrics can be interpreted as rates later when viewing the data in Datadog, where the interval field will be used.

The config file loads fine and the healthcheck passes. However, metrics don't appear in Datadog, although they do appear at other sink destinations.

Configuration

sink_dd:
    type: datadog_metrics
    inputs:
      - source_dd
    default_api_key: xxx
    batch:
      timeout_secs: 5
      max_bytes: 4194304
    buffer:
      type: disk
      max_size: 4294967296
      when_full: drop_newest
    request:
      rate_limit_duration_secs: 5
      rate_limit_num: 100
      retry_max_duration_secs: 600

Version

0.34.0


@a26nine a26nine added the type: bug A code related bug. label Nov 10, 2023
@jszwedko jszwedko added the meta: regression This issue represents a regression label Nov 10, 2023
@StephenWakely
Contributor

StephenWakely commented Nov 10, 2023

Ok, we have figured out what is likely going on. This release updated Vector to use a new Datadog endpoint, which unfortunately accepts a much smaller payload size, while Vector is optimised for the larger size. As a result, Datadog is now rejecting the batches Vector sends.

You have two options:

  1. Don't use v0.34.0. It looks like you have downgraded already. Keep it like that, we will get a patch release out shortly.
  2. Use v0.34.0, but configure Vector to use the old v1 endpoint by setting the environment variable VECTOR_TEMP_USE_DD_METRICS_SERIES_V1_API=1
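
As a rough sketch of option 2, assuming Vector is launched from a shell, the variable can be set for a single invocation. The wrapper name run_vector_v1 is hypothetical and not part of Vector; the config path in the usage comment is a placeholder.

```shell
# Hypothetical wrapper: runs Vector with the temporary v1-endpoint flag set
# for this invocation only. All arguments are passed through unchanged.
run_vector_v1() {
  VECTOR_TEMP_USE_DD_METRICS_SERIES_V1_API=1 vector "$@"
}

# Example usage (placeholder config path):
#   run_vector_v1 -c ./vector.toml
```

If Vector runs as a managed service rather than from a shell, the variable would instead need to be set in that service's environment.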

Sorry for the inconvenience.

@a26nine
Author

a26nine commented Nov 10, 2023

Thanks for looking into this promptly.

I have a couple of questions:

  • Will the patch have any impact on the batch size limits? And will the patch be released anytime soon?
  • Will the v1 endpoint work fine without breaking anything? Was v1 being used till 0.33.x?

@jszwedko
Member

Just a small update: we are digging into this now to come up with a correct fix. Thanks for the report!

I have a couple of questions:

  • Will the patch have any impact on the batch size limits? And will the patch be released anytime soon?

The patch will likely change the batch limits for the v2 endpoints

  • Will the v1 endpoint work fine without breaking anything? Was v1 being used till 0.33.x?

The v1 endpoints were in use up until v0.34.0, so the flag just reverts the behavior to what you were using before the upgrade.

jszwedko added a commit that referenced this issue Nov 10, 2023
Ref: #19110

Signed-off-by: Jesse Szwedko <jesse.szwedko@datadoghq.com>
@a26nine
Author

a26nine commented Nov 13, 2023

Okay. Looking forward to the patch.

@neuronull
Contributor

neuronull commented Nov 13, 2023

These are the steps to repro this issue:

  • Generate the input events file
  • Start vector v0.34.0 using the vector.toml config which accepts statsd metrics and sends to Datadog.
  • Send metrics to Vector
  • Observe the 413 errors reported in the Vector debug logs, and the lack of metrics in the DD UI

Generating the input file

The metrics must be unique so as not to be aggregated.

# Write one counter per line, each with a distinct tag value.
for ((i = 1; i <= 6666000; i++))
do
  echo "foo:1|c|#tag:$i"
done > events.txt

Vector config

[sources.in]
type = "statsd"
address = "0.0.0.0:9000"
mode = "tcp"
path = "/tmp/tmp.socket"

[sinks.dd]
inputs = [ "in" ]
type = "datadog_metrics"
default_api_key = "${TEST_DATADOG_API_KEY}"
batch.max_events = 8000
batch.timeout_secs = 999999
buffer.type = "disk"
buffer.max_size = 4294967296
buffer.when_full = "drop_newest"

Start Vector

(Debug log level needs to be set in order to confirm the 413 HTTP status code)

vector -c ./vector.toml -v

Sending the metrics

socat -dd OPEN:events.txt TCP:localhost:9000

github-merge-queue bot pushed a commit that referenced this issue Nov 13, 2023
…#19122)

* chore(releasing): Add known issue for Datadog Metrics sink in v0.34.0

Ref: #19110

Signed-off-by: Jesse Szwedko <jesse.szwedko@datadoghq.com>

* Update website/cue/reference/releases/0.34.0.cue

Co-authored-by: Brett Blue <84536271+brett0000FF@users.noreply.github.com>

---------

Signed-off-by: Jesse Szwedko <jesse.szwedko@datadoghq.com>
Co-authored-by: Brett Blue <84536271+brett0000FF@users.noreply.github.com>
jszwedko added a commit that referenced this issue Nov 14, 2023
@neuronull
Contributor

👋 Hi @a26nine, I have an update on this: the patch will roll back to using the v1 endpoint. We still have work to do to sort out some strange performance we saw with the v2 endpoint payload limits, and we plan to get that ironed out and well tested for v0.35.0.

pront pushed a commit to dygfloyd/vector that referenced this issue Nov 15, 2023
pront added a commit to dygfloyd/vector that referenced this issue Nov 15, 2023
@a26nine
Author

a26nine commented Nov 16, 2023

Thanks, @neuronull! Awaiting the patch release.

@a26nine
Author

a26nine commented Nov 22, 2023

@jszwedko @neuronull 0.34.1 is still broken. The metrics are not shipped to Datadog.

@neuronull
Contributor

Hi @a26nine, would you mind opening a new issue that includes your config and debug logs? I just re-tested 0.34.1 with the payload limits test and still see metrics going through to Datadog (on the v1 endpoint).

@a26nine
Author

a26nine commented Nov 23, 2023

I noticed that the URI in 0.34.1 changed to https://0-34-1-vector.agent.datadoghq.com/api/v1/series, which doesn't look right. Also, it's passing the API key from the Datadog Agent source in the header instead of reading it from the sink config.

0.33.1

vector[33763]: 2023-11-23T11:40:41.029584Z DEBUG sink{component_kind="sink" component_id=dd_sink_dd component_type=datadog_metrics component_name=dd_sink_dd}:request{request_id=28}:http: vector::internal_events::http_client: Sending HTTP request. uri=https://0-33-1-vector.agent.datadoghq.com/api/v1/series method=POST version=HTTP/1.1 headers={"dd-api-key": "VECTOR_SINK_CONFIG_KEY", "dd-agent-payload": "4.87.0", "content-type": "application/json", "content-encoding": "deflate", "user-agent": "Vector/0.33.1 (x86_64-unknown-linux-gnu 3cc27b9 2023-10-30 16:50:49.747931844)", "accept-encoding": "identity"} body=[19568 bytes]

0.34.1

vector[31541]: 2023-11-23T11:30:40.904288Z DEBUG sink{component_kind="sink" component_id=dd_sink_dd component_type=datadog_metrics}:request{request_id=14}:http: vector::internal_events::http_client: Sending HTTP request. uri=https://0-34-1-vector.agent.datadoghq.com/api/v1/series method=POST version=HTTP/1.1 headers={"dd-api-key": "DD_AGENT_CONFIG_KEY", "dd-agent-payload": "4.87.0", "content-type": "application/json", "content-encoding": "deflate", "user-agent": "Vector/0.34.1 (x86_64-unknown-linux-gnu 86f1c22 2023-11-16 14:59:10.486846964)", "accept-encoding": "identity"} body=[21792 bytes]
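
To compare the two requests side by side, a small helper along these lines can pull out the request URI and the dd-api-key header value. This is only a sketch: extract_request_fields is a hypothetical name, and it assumes the debug log format shown in the two lines above.

```shell
# Hypothetical helper: reads Vector debug log lines on stdin and prints
# "<uri> <dd-api-key value>" for each "Sending HTTP request" line that
# contains both fields. Lines without both fields are skipped.
extract_request_fields() {
  sed -n 's/.*uri=\([^ ]*\) .*"dd-api-key": "\([^"]*\)".*/\1 \2/p'
}

# Example usage (placeholder log file name):
#   extract_request_fields < vector-debug.log
```

Running both versions' logs through it makes it easier to see whether the endpoint path (/api/v1/series) or the key in use actually differs.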

@a26nine
Author

a26nine commented Nov 23, 2023

The configuration and everything else are the same as shared in the original post; the only difference is the version. I toggled between versions multiple times to confirm the issue.

@bruceg
Member

bruceg commented Nov 23, 2023

I think I see what you are referencing, @a26nine, but that seems like a trivial difference:

uri=https://0-33-1-vector.agent.datadoghq.com/api/v1/series
uri=https://0-34-1-vector.agent.datadoghq.com/api/v1/series

The only other difference between the two is the size in bytes, which could easily be explained by different underlying events going through.

@neuronull
Contributor

Also, it's passing API key from the Datadog Agent source in the header instead of reading it from the sink config

That is intended/expected behavior that has been unchanged for a few releases.

https://vector.dev/docs/reference/configuration/sinks/datadog_metrics/#default_api_key
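
For illustration, here is how that precedence might look in config, assuming source_dd is a datadog_agent source as in the original report. A key stored on the event by the source takes precedence over the sink's default_api_key. Setting store_api_key to false on the source should make the sink fall back to its own key, but check the datadog_agent source reference for your version. The address value is a placeholder.

```yaml
sources:
  source_dd:
    type: datadog_agent
    address: 0.0.0.0:8080   # placeholder listen address
    store_api_key: false    # do not carry the Agent's API key on events

sinks:
  sink_dd:
    type: datadog_metrics
    inputs:
      - source_dd
    default_api_key: xxx    # used when events carry no API key
```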

AndrooTheChen pushed a commit to discord/vector that referenced this issue Sep 23, 2024

5 participants