
Automatically detect deleted resources #1386

Conversation

rafaelweingartner
Contributor

@rafaelweingartner rafaelweingartner commented May 28, 2024

This patch is built on top of #1385, #1381, #1387, and #1388, so those need to be merged first; the extra commits here will then disappear.

While executing some Gnocchi optimizations (#1307), we noticed that some deleted/removed resources do not have the "ended_at" field set to a datetime. This can cause slowness over time, as more and more "zombie" resources are left behind, and it has a direct impact on the MySQL queries executed by the aggregates API.

This patch introduces a new parameter called metric_inactive_after, which defines how long a metric can go without receiving new data points before we consider it inactive. Then, when all metrics of a resource are inactive, we can mark/consider the resource as removed.
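As a rough illustration of that rule (a minimal sketch with made-up helper names, not the actual patch code), the check boils down to comparing each metric's last measure timestamp against metric_inactive_after and only treating the resource as removed when every metric is inactive:

    import datetime

    def is_metric_inactive(last_measure_timestamp, metric_inactive_after, now=None):
        # A metric is inactive once it has gone `metric_inactive_after` seconds
        # without receiving new data points.
        now = now or datetime.datetime.utcnow()
        return (now - last_measure_timestamp).total_seconds() > metric_inactive_after

    def resource_looks_removed(metrics, metric_inactive_after):
        # `metrics` is assumed to be a list of objects exposing
        # `last_measure_timestamp`; a resource with no metrics is left untouched.
        return bool(metrics) and all(
            is_metric_inactive(m.last_measure_timestamp, metric_inactive_after)
            for m in metrics
        )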

@rafaelweingartner rafaelweingartner force-pushed the mark-resource-as-deleted-when-they-stop-receiving-measures branch 2 times, most recently from 67d710e to 9d08cc0 on May 28, 2024 at 20:24
@rafaelweingartner
Contributor Author

Thanks, @pedro-martins!

@rafaelweingartner rafaelweingartner force-pushed the mark-resource-as-deleted-when-they-stop-receiving-measures branch from 7764d2c to 6e76708 on July 3, 2024 at 15:16
@rafaelweingartner
Contributor Author

@pedro-martins and @chungg, I have rebased this PR and it is ready for your reviews.

Contributor

@pedro-martins pedro-martins left a comment

Hi Rafael, nice feature. I just left some comments here, but overall the code looks good to me. Thanks for this patch.

@rafaelweingartner
Contributor Author

Thanks @pedro-martins for your review!

@rafaelweingartner
Contributor Author

@chungg thanks for your review! I have added the code changes you suggested, and there are some remarks still open; I would like to hear your response before closing them.

@rafaelweingartner
Contributor Author

Hello @chungg, is there something else that needs to be addressed here?

@rafaelweingartner
Contributor Author

Hello guys,
Is there anything still missing before we can move on and merge this?

This patch is the base for some further optimizations that we are doing and that we would like to propose upstream.

chungg
chungg previously approved these changes Nov 22, 2024
Member

@chungg chungg left a comment

I'm OK with this, although I think it'd be better if the update logic was in SQL rather than Python. Something like the following, but in the ORM. This could also be done later.

update resource r set ended_at = %s where not exists (select 1 from metric m where m.resource_id = r.id and m.last_measure_timestamp >= %s) and r.ended_at is null;
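For reference, the same statement expressed with SQLAlchemy might look roughly like the sketch below (a hedged illustration, not the patch's actual code; `resource_model`, `metric_model`, and `inactive_cutoff` are stand-in names for the indexer's mapped classes and the now-minus-metric_inactive_after cutoff):

    import datetime
    import sqlalchemy as sa

    def end_inactive_resources(session, resource_model, metric_model, inactive_cutoff):
        # True when the resource still has at least one metric with a recent measure.
        still_active = sa.exists().where(sa.and_(
            metric_model.resource_id == resource_model.id,
            metric_model.last_measure_timestamp >= inactive_cutoff,
        ))
        stmt = (sa.update(resource_model)
                .where(sa.not_(still_active))
                .where(resource_model.ended_at.is_(None))
                .values(ended_at=datetime.datetime.utcnow()))
        session.execute(stmt)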

@mergify mergify bot dismissed chungg’s stale review February 11, 2025 16:55

Pull request has been modified.

@rafaelweingartner
Contributor Author

@chungg thanks for your review again. I guess there is nothing else to update here. Can we merge the patch? We have others on top of this that we would like to start working on.

@rafaelweingartner rafaelweingartner force-pushed the mark-resource-as-deleted-when-they-stop-receiving-measures branch from 9a7a623 to 1c4c7e9 on February 13, 2025 at 13:52
rafaelweingartner and others added 11 commits on February 13, 2025 at 10:53
While executing some Gnocchi optimizations (gnocchixyz#1307), we noticed that some deleted/removed resources do not have the "ended_at" field set to a datetime. This can cause slowness over time, as more and more "zombie" resources are left behind, and it has a direct impact on the MySQL queries executed by the aggregates API.

This patch introduces a new parameter called `metric_inactive_after`, which defines how long a metric can go without receiving new data points before we consider it inactive. Then, when all metrics of a resource are inactive, we can mark/consider the resource as removed.
Co-authored-by: gord chung <5091603+chungg@users.noreply.github.com>
Co-authored-by: gord chung <5091603+chungg@users.noreply.github.com>
Co-authored-by: gord chung <5091603+chungg@users.noreply.github.com>
Co-authored-by: gord chung <5091603+chungg@users.noreply.github.com>
@rafaelweingartner rafaelweingartner force-pushed the mark-resource-as-deleted-when-they-stop-receiving-measures branch 2 times, most recently from 0dc60e5 to 84d9d91 on February 14, 2025 at 16:36
@rafaelweingartner
Contributor Author

Hello guys,
the failing tests do not seem to be related to the patch itself. Do you have any idea how to fix them? @chungg and @tobias-urdin?

When I run them here on my machine, they work 😄

@Callum027
Contributor

Hello guys, the failing tests do not seem to be related to the patch itself. Do you have any idea how to fix them? @chungg and @tobias-urdin?

When I run them here on my machine, they work 😄

I haven't done any investigation into the cause, but it seems like there's some kind of race condition for some specific tests. It's not ideal, but I've just been force-pushing unchanged commits on my PRs until the tests pass.

@rafaelweingartner
Contributor Author

Hello guys, the failing tests do not seem to be related to the patch itself. Do you have any idea how to fix them? @chungg and @tobias-urdin?
When I run them here on my machine, they work 😄

I haven't done any investigation into the cause, but it seems like there's some kind of race condition for some specific tests. It's not ideal, but I've just been force-pushing unchanged commits on my PRs until the tests pass.

So it is not just me :)

@Callum027
Contributor

Hello guys, the failing tests do not seem to be related to the patch itself. Do you have any idea how to fix them? @chungg and @tobias-urdin?
When I run them here on my machine, they work 😄

I haven't done any investigation into the cause, but it seems like there's some kind of race condition for some specific tests. It's not ideal, but I've just been force-pushing unchanged commits on my PRs until the tests pass.

So it is not just me :)

Yep, I've got retries set up in our internal CI pipelines for our Gnocchi deployment for the same reason.

There also seems to be a recurring error where the uWSGI build fails due to some weird system-related issues such as /bin/sh failing to run. In this case it seems like flakiness specific to the GitHub Actions runners.

@rafaelweingartner rafaelweingartner force-pushed the mark-resource-as-deleted-when-they-stop-receiving-measures branch from 8330f57 to 5d0bb33 on February 14, 2025 at 18:26
@Callum027
Contributor

Looking at the code, it looks like there's a separate metricd thread that processes incoming measures in an infinite loop with a 0.1 second delay between iterations.

https://github.com/gnocchixyz/gnocchi/blob/master/gnocchi/tests/functional/fixtures.py#L264-L279

100ms is actually a decently long amount of time given the speed at which these tests are run. The particular tests that fail are probably running their checks before the metricd thread has processed them, and just fail instead of rechecking.

@rafaelweingartner rafaelweingartner force-pushed the mark-resource-as-deleted-when-they-stop-receiving-measures branch from 5d0bb33 to cd1d649 on February 17, 2025 at 12:19
@rafaelweingartner
Contributor Author

Looking at the code, it looks like there's a separate metricd thread that processes incoming measures in an infinite loop with a 0.1 second delay between iterations.

https://github.com/gnocchixyz/gnocchi/blob/master/gnocchi/tests/functional/fixtures.py#L264-L279

100ms is actually a decently long amount of time given the speed at which these tests are run. The particular tests that fail are probably running their checks before the metricd thread has processed them, and just fail instead of rechecking.

Cool, that gave me an idea. What do you think of a config like this one in the tests, then?
8fc168f

This would repeat the test if it fails and wait 1 second. Therefore, we could guarantee that MetricD will already have run by the next try.
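In spirit, the retry amounts to a polling helper like the one below (a hedged sketch only; the referenced commit 8fc168f is the authoritative change, and the helper name and delays are made up for the example): re-run the check until it passes or a deadline expires, giving MetricD time to process the pending measures.

    import time

    def eventually(check, timeout=10.0, interval=1.0):
        # Call `check` repeatedly until it returns truthy or `timeout` seconds
        # elapse; sleeping `interval` between attempts leaves room for the
        # background MetricD thread to catch up.
        deadline = time.monotonic() + timeout
        while True:
            if check():
                return True
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)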

@Callum027
Contributor

Looking at the code, it looks like there's a separate metricd thread that processes incoming measures in an infinite loop with a 0.1 second delay between iterations.
https://github.com/gnocchixyz/gnocchi/blob/master/gnocchi/tests/functional/fixtures.py#L264-L279
100ms is actually a decently long amount of time given the speed at which these tests are run. The particular tests that fail are probably running their checks before the metricd thread has processed them, and just fail instead of rechecking.

Cool, that gave me an idea. What do you think of a config like this one in the tests, then? 8fc168f

This would repeat the test if it fails and wait 1 second. Therefore, we could guarantee that MetricD will already have run by the next try.

That's a great idea. There are normally two tests affected by that race condition, so if we could retry them when they fail, that should solve it nicely without changing the current behaviour much.

@rafaelweingartner
Contributor Author

Thanks for the help @Callum027. @chungg and @tobias-urdin what do you think about this patch now?

@Callum027
Contributor

Hi @chungg and @tobias-urdin, any chance of getting this and the other PRs I have also proposed looked at sometime soon?

@Callum027
Contributor

Hi @rafaelweingartner, I have a question about this proposal.

Can a resource be "un-ended" if we need to start supplying metrics for that resource again (e.g. from Ceilometer)?

The particular use case I'm thinking about is Swift containers. The resource ID for a Swift container in Gnocchi is composed of its project ID and container name, which of course can be reused if you create a new container with the same name. Using this change to clean up old resources is quite beneficial, I think, but if you're storing metrics for Swift containers in your Gnocchi deployment, I'm not sure this would be usable as-is.

@rafaelweingartner
Contributor Author

Hi @rafaelweingartner, I have a question about this proposal.

Can a resource be "un-ended" if we need to start supplying metrics for that resource again (e.g. from Ceilometer)?

The particular use case I'm thinking about is Swift containers. The resource ID for a Swift container in Gnocchi is composed of its project ID and container name, which of course can be reused if you create a new container with the same name. Using this change to clean up old resources is quite beneficial, I think, but if you're storing metrics for Swift containers in your Gnocchi deployment, I'm not sure this would be usable as-is.

Yes, this use case is considered. See: https://github.com/gnocchixyz/gnocchi/pull/1386/files#diff-9c436196247ad6c3c12d0b085d6e2ae7b57c54684a298457fdcf74c9ff3ac63eR700.

We also added this process due to other situations we have seen in the past, such as bad Ceilometer configurations, which led the system to lose monitoring data for almost a month. Therefore, with an implementation like this one, we would mark the resources as "finished", but when collection resumes, we mark them as "alive" again.
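Conceptually, the "revive" path reduces to a check like the sketch below (illustrative only; the linked diff above is the real implementation): when measures arrive again for a resource that was ended by the inactivity check, its ended_at is simply cleared.

    def revive_if_ended(resource):
        # `resource` is assumed to expose an `ended_at` attribute; persisting
        # the change is left to the indexer layer in the real code.
        if resource.ended_at is not None:
            resource.ended_at = None  # the resource is receiving measures again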

@tobias-urdin
Contributor

I will merge this next week unless there are any objections from previous reviewers @chungg @Callum027 @pedro-martins

@tobias-urdin tobias-urdin merged commit 5818ee0 into gnocchixyz:master Mar 18, 2025
33 checks passed