Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

NETOBSERV-1955: fix wrong metrics for conversation tracking #1096

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

jotak
Copy link
Member

@jotak jotak commented Feb 3, 2025

Description

This is fixing wrong metrics that are due to accumulating bytes/packets counters from heartbits, which are not relevant for accumulation.

However, conversation tracking is still bad at providing accurate metrics for long-standing connections, as the ConnectionEnded event may be very long to come. So this is going to remain as a known limitation of Conversation Tracking.

I don't think there's any urgency to backport this on 1.8

Dependencies

n/a

Checklist

If you are not familiar with our processes or don't know what to answer in the list below, let us know in a comment: the maintainers will take care of that.

  • Is this PR backed with a JIRA ticket? If so, make sure it is written as a title prefix (in general, PRs affecting the NetObserv/Network Observability product should be backed with a JIRA ticket - especially if they bring user facing changes).
  • Does this PR require product documentation?
    • If so, make sure the JIRA epic is labeled with "documentation" and provides a description relevant for doc writers, such as use cases or scenarios. Any required step to activate or configure the feature should be documented there, such as new CRD knobs.
  • Does this PR require a product release notes entry?
    • If so, fill in "Release Note Text" in the JIRA.
  • Is there anything else the QE team should know before testing? E.g: configuration changes, environment setup, etc.
    • If so, make sure it is described in the JIRA ticket.
  • QE requirements (check 1 from the list):
    • Standard QE validation, with pre-merge tests unless stated otherwise.
    • Regression tests only (e.g. refactoring with no user-facing change).
    • No QE (e.g. trivial change with high reviewer's confidence, or per agreement with the QE team).

@openshift-ci-robot
Copy link
Collaborator

openshift-ci-robot commented Feb 3, 2025

@jotak: This pull request references NETOBSERV-1955 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the bug to target the "4.19.0" version, but no target version was set.

In response to this:

Description

This is fixing wrong metrics that are due to accumulating bytes/packets counters from heartbits, which are not relevant for accumulation.

However, conversation tracking is still bad at providing accurate metrics for long-standing connections, as the ConnectionEnded event may be very long to come. So this is going to remain as a known limitation of Conversation Tracking.

I don't think there's any urgency to backport this on 1.8

Dependencies

n/a

Checklist

If you are not familiar with our processes or don't know what to answer in the list below, let us know in a comment: the maintainers will take care of that.

  • Is this PR backed with a JIRA ticket? If so, make sure it is written as a title prefix (in general, PRs affecting the NetObserv/Network Observability product should be backed with a JIRA ticket - especially if they bring user facing changes).
  • Does this PR require product documentation?
  • If so, make sure the JIRA epic is labeled with "documentation" and provides a description relevant for doc writers, such as use cases or scenarios. Any required step to activate or configure the feature should be documented there, such as new CRD knobs.
  • Does this PR require a product release notes entry?
  • If so, fill in "Release Note Text" in the JIRA.
  • Is there anything else the QE team should know before testing? E.g: configuration changes, environment setup, etc.
  • If so, make sure it is described in the JIRA ticket.
  • QE requirements (check 1 from the list):
  • Standard QE validation, with pre-merge tests unless stated otherwise.
  • Regression tests only (e.g. refactoring with no user-facing change).
  • No QE (e.g. trivial change with high reviewer's confidence, or per agreement with the QE team).

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

Copy link

openshift-ci bot commented Feb 3, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from jotak. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Copy link

codecov bot commented Feb 3, 2025

Codecov Report

Attention: Patch coverage is 71.42857% with 6 lines in your changes missing coverage. Please review.

Project coverage is 62.66%. Comparing base (626c5a9) to head (e1ba68d).
Report is 1 commits behind head on main.

Files with missing lines Patch % Lines
pkg/metrics/predefined_metrics.go 71.42% 6 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1096      +/-   ##
==========================================
+ Coverage   62.42%   62.66%   +0.23%     
==========================================
  Files          77       77              
  Lines       11581    11599      +18     
==========================================
+ Hits         7230     7269      +39     
+ Misses       3896     3878      -18     
+ Partials      455      452       -3     
Flag Coverage Δ
unittests 62.66% <71.42%> (+0.23%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines Coverage Δ
apis/flowcollector/v1beta2/flowcollector_types.go 100.00% <ø> (ø)
pkg/metrics/predefined_metrics.go 97.36% <71.42%> (-2.64%) ⬇️

... and 5 files with indirect coverage changes

Copy link
Contributor

@jpinsonneau jpinsonneau left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM !

@Amoghrd
Copy link
Contributor

Amoghrd commented Feb 4, 2025

/ok-to-test

@openshift-ci openshift-ci bot added the ok-to-test To set manually when a PR is safe to test. Triggers image build on PR. label Feb 4, 2025
Copy link

github-actions bot commented Feb 4, 2025

New images:

  • quay.io/netobserv/network-observability-operator:4a1ca36
  • quay.io/netobserv/network-observability-operator-bundle:v0.0.0-4a1ca36
  • quay.io/netobserv/network-observability-operator-catalog:v0.0.0-4a1ca36

They will expire after two weeks.

To deploy this build:

# Direct deployment, from operator repo
IMAGE=quay.io/netobserv/network-observability-operator:4a1ca36 make deploy

# Or using operator-sdk
operator-sdk run bundle quay.io/netobserv/network-observability-operator-bundle:v0.0.0-4a1ca36

Or as a Catalog Source:

apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  name: netobserv-dev
  namespace: openshift-marketplace
spec:
  sourceType: grpc
  image: quay.io/netobserv/network-observability-operator-catalog:v0.0.0-4a1ca36
  displayName: NetObserv development catalog
  publisher: Me
  updateStrategy:
    registryPoll:
      interval: 1m

@Amoghrd
Copy link
Contributor

Amoghrd commented Feb 24, 2025

Need https://issues.redhat.com/browse/NETOBSERV-2108 fix to land to test this PR

@jotak
Copy link
Member Author

jotak commented Mar 5, 2025

@Amoghrd until NETOBSERV-2108 is done it's still possible to test the operator via the other alternatives (with operator-sdk, or with make deploy)

@Amoghrd
Copy link
Contributor

Amoghrd commented Mar 6, 2025

oh yeah its been a while since I used it, so had forgotten about it. Will test it using make deploy

@Amoghrd
Copy link
Contributor

Amoghrd commented Mar 6, 2025

@jotak Could you rebase the PR

@Amoghrd Amoghrd removed the ok-to-test To set manually when a PR is safe to test. Triggers image build on PR. label Mar 6, 2025
@Amoghrd
Copy link
Contributor

Amoghrd commented Mar 6, 2025

/ok-to-test

@openshift-ci openshift-ci bot added the ok-to-test To set manually when a PR is safe to test. Triggers image build on PR. label Mar 6, 2025
Copy link

github-actions bot commented Mar 6, 2025

New images:

  • quay.io/netobserv/network-observability-operator:b3f2bf0
  • quay.io/netobserv/network-observability-operator-bundle:v0.0.0-b3f2bf0
  • quay.io/netobserv/network-observability-operator-catalog:v0.0.0-b3f2bf0

They will expire after two weeks.

To deploy this build:

# Direct deployment, from operator repo
IMAGE=quay.io/netobserv/network-observability-operator:b3f2bf0 make deploy

# Or using operator-sdk
operator-sdk run bundle quay.io/netobserv/network-observability-operator-bundle:v0.0.0-b3f2bf0

Or as a Catalog Source:

apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  name: netobserv-dev
  namespace: openshift-marketplace
spec:
  sourceType: grpc
  image: quay.io/netobserv/network-observability-operator-catalog:v0.0.0-b3f2bf0
  displayName: NetObserv development catalog
  publisher: Me
  updateStrategy:
    registryPoll:
      interval: 1m

@Amoghrd
Copy link
Contributor

Amoghrd commented Mar 6, 2025

@jotak I am not understanding how to go about verifying this. I deployed the operator and flowcollector with Conversations enabled. I am looking at the metrics for conversation tracking below
Screenshot 2025-03-06 at 2 48 25 PM

Should I be looking at something specific here?

@jotak
Copy link
Member Author

jotak commented Mar 12, 2025

@Amoghrd just compare the metrics dashboards with & without conversation tracking. Enabling conversation tracking should not make the metrics look very different, but without this PR, it did screw them up
This is what I wrote in the JIRA description:

When Conversation Tracking is enabled with logTypes=All or logTypes=Conversations, metrics generated (such as bytes and packet rates) show much bigger numbers than expected.

@jotak
Copy link
Member Author

jotak commented Mar 12, 2025

@Amoghrd I took some screenshots to help:

  1. Without conversation tracking:
    Capture d’écran du 2025-03-12 09-23-58
    Traffic is mostly flat, except occasional spikes (here from openshift-marketplace namespace); around 2 MBps

  2. With logTypes: Conversations (and not this PR):
    Capture d’écran du 2025-03-12 09-28-45
    Traffic seems increasing out of control. It's wrong, the metrics are flawed.

  3. With this PR and still logTypes: Conversations:
    Capture d’écran du 2025-03-12 09-35-52
    Deploying this PR, traffic is back to expected levels, here around 2MBps

This is fixing wrong metrics that are due to accumulating bytes/packets
counters from heartbits, which are not relevant for accumulation.

However, conversation tracking is still bad at providing accurate
metrics for long-standing connections, as the ConnectionEnded event may
be very long to come. So this is going to remain as a known limitation
of Conversation Tracking.
@jotak jotak force-pushed the convtrack-metrics branch from e1ba68d to 625ed21 Compare March 12, 2025 08:44
@openshift-ci openshift-ci bot removed the lgtm label Mar 12, 2025
Copy link

openshift-ci bot commented Mar 12, 2025

New changes are detected. LGTM label has been removed.

@github-actions github-actions bot removed the ok-to-test To set manually when a PR is safe to test. Triggers image build on PR. label Mar 12, 2025
Copy link

openshift-ci bot commented Mar 12, 2025

@jotak: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/e2e-operator b7d6eaa link false /test e2e-operator
ci/prow/ci-index-noo-bundle b7d6eaa link true /test ci-index-noo-bundle

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants