
Automatically detect deleted resources #1386

Conversation

rafaelweingartner
Contributor

@rafaelweingartner rafaelweingartner commented May 28, 2024

This patch is built on top of #1385, #1381, #1387, and #1388, so those need to be merged first; the extra commits here will then disappear.

While executing some Gnocchi optimizations (#1307), we noticed that some deleted/removed resources do not have the "ended_at" field set to a datetime. This can cause slowness over time, as more and more "zombie" resources are left behind, and it has a direct impact on the MySQL queries executed by the aggregates API.

This patch introduces a new parameter called metric_inactive_after, which defines how long a metric can go without receiving new data points before we consider it inactive. Then, when all metrics of a resource are inactive, we can mark/consider the resource as removed.
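As a rough illustration of that rule (a minimal sketch with made-up helper names, not the actual patch code), the check boils down to comparing each metric's last measure timestamp against metric_inactive_after and only treating the resource as removed when every metric is inactive:

    import datetime

    def is_metric_inactive(last_measure_timestamp, metric_inactive_after, now=None):
        # A metric is inactive once it has gone `metric_inactive_after` seconds
        # without receiving new data points.
        now = now or datetime.datetime.utcnow()
        return (now - last_measure_timestamp).total_seconds() > metric_inactive_after

    def resource_looks_removed(metrics, metric_inactive_after):
        # `metrics` is assumed to be a list of objects exposing
        # `last_measure_timestamp`; a resource with no metrics is left untouched.
        return bool(metrics) and all(
            is_metric_inactive(m.last_measure_timestamp, metric_inactive_after)
            for m in metrics
        )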

@rafaelweingartner rafaelweingartner force-pushed the mark-resource-as-deleted-when-they-stop-receiving-measures branch 2 times, most recently from 67d710e to 9d08cc0 on May 28, 2024 at 20:24
@rafaelweingartner
Contributor Author

Thanks, @pedro-martins!

@rafaelweingartner rafaelweingartner force-pushed the mark-resource-as-deleted-when-they-stop-receiving-measures branch from 7764d2c to 6e76708 on July 3, 2024 at 15:16
@rafaelweingartner
Contributor Author

@pedro-martins and @chungg, I have rebased this PR and it is ready for your reviews.

Contributor

@pedro-martins pedro-martins left a comment

Hi Rafael, nice feature. I just left some comments here, but overall the code looks good to me. Thanks for this patch.

@rafaelweingartner
Contributor Author

Thanks @pedro-martins for your review!

@rafaelweingartner
Contributor Author

@chungg thanks for your review! I have added the code changes you suggested, and there are some remarks still open; I would like to hear your response before closing them.

@rafaelweingartner
Contributor Author

Hello @chungg, is there something else that needs to be addressed here?

@rafaelweingartner
Contributor Author

Hello guys,
Is there anything still missing before we can move on and merge this?

This patch is the base for some further optimizations that we are doing and that we would like to propose upstream.

chungg
chungg previously approved these changes Nov 22, 2024
Member

@chungg chungg left a comment

I'm OK with this, although I think it'd be better if the update logic was in SQL rather than Python. Something like the following, but in the ORM. This could also be done later.

update resource r set ended_at = %s where not exists (select 1 from metric m where m.resource_id = r.id and m.last_measure_timestamp >= %s) and r.ended_at is null;
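For reference, the same statement expressed with SQLAlchemy might look roughly like the sketch below (a hedged illustration, not the patch's actual code; `resource_model`, `metric_model`, and `inactive_cutoff` are stand-in names for the indexer's mapped classes and the now-minus-metric_inactive_after cutoff):

    import datetime
    import sqlalchemy as sa

    def end_inactive_resources(session, resource_model, metric_model, inactive_cutoff):
        # True when the resource still has at least one metric with a recent measure.
        still_active = sa.exists().where(sa.and_(
            metric_model.resource_id == resource_model.id,
            metric_model.last_measure_timestamp >= inactive_cutoff,
        ))
        stmt = (sa.update(resource_model)
                .where(sa.not_(still_active))
                .where(resource_model.ended_at.is_(None))
                .values(ended_at=datetime.datetime.utcnow()))
        session.execute(stmt)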

@mergify mergify bot dismissed chungg’s stale review February 11, 2025 16:55

Pull request has been modified.

@rafaelweingartner
Contributor Author

@chungg thanks for your review again. I guess there is nothing else to update here. Can we merge the patch? We have others on top of this that we would like to start working on.

@rafaelweingartner rafaelweingartner force-pushed the mark-resource-as-deleted-when-they-stop-receiving-measures branch from 9a7a623 to 1c4c7e9 on February 13, 2025 at 13:52
rafaelweingartner and others added 11 commits on February 13, 2025 at 10:53
While executing some Gnocchi optimizations (gnocchixyz#1307), we noticed that some deleted/removed resources do not have the "ended_at" field set to a datetime. This can cause slowness over time, as more and more "zombie" resources are left behind, and it has a direct impact on the MySQL queries executed by the aggregates API.

This patch introduces a new parameter called `metric_inactive_after`, which defines how long a metric can go without receiving new data points before we consider it inactive. Then, when all metrics of a resource are inactive, we can mark/consider the resource as removed.
Co-authored-by: gord chung <5091603+chungg@users.noreply.github.com>
Co-authored-by: gord chung <5091603+chungg@users.noreply.github.com>
Co-authored-by: gord chung <5091603+chungg@users.noreply.github.com>
Co-authored-by: gord chung <5091603+chungg@users.noreply.github.com>
@rafaelweingartner rafaelweingartner force-pushed the mark-resource-as-deleted-when-they-stop-receiving-measures branch 2 times, most recently from 0dc60e5 to 84d9d91 on February 14, 2025 at 16:36
@rafaelweingartner
Contributor Author

Hello guys,
the failing tests do not seem to be related to the patch itself. Do you have any idea how to fix them? @chungg and @tobias-urdin?

When I run them here on my machine, they work 😄

@Callum027
Contributor

Hello guys, the failing tests do not seem to be related to the patch itself. Do you have any idea how to fix them? @chungg and @tobias-urdin?

When I run them here on my machine, they work 😄

I haven't done any investigation into the cause, but it seems like there's some kind of race condition for some specific tests. It's not ideal, but I've just been force-pushing unchanged commits on my PRs until the tests pass.

@rafaelweingartner
Contributor Author

Hello guys, the failing tests do not seem to be related to the patch itself. Do you have any idea how to fix them? @chungg and @tobias-urdin?
When I run them here on my machine, they work 😄

I haven't done any investigation into the cause, but it seems like there's some kind of race condition for some specific tests. It's not ideal, but I've just been force-pushing unchanged commits on my PRs until the tests pass.

So it is not just me :)

@Callum027
Contributor

Hello guys, the failing tests do not seem to be related to the patch itself. Do you have any idea how to fix them? @chungg and @tobias-urdin?
When I run them here on my machine, they work 😄

I haven't done any investigation into the cause, but it seems like there's some kind of race condition for some specific tests. It's not ideal, but I've just been force-pushing unchanged commits on my PRs until the tests pass.

So it is not just me :)

Yep, I've got retries set up in our internal CI pipelines for our Gnocchi deployment for the same reason.

There also seems to be a recurring error where the uWSGI build fails due to some weird system-related issues such as /bin/sh failing to run. In this case it seems like flakiness specific to the GitHub Actions runners.

@rafaelweingartner rafaelweingartner force-pushed the mark-resource-as-deleted-when-they-stop-receiving-measures branch from 8330f57 to 5d0bb33 on February 14, 2025 at 18:26
@Callum027
Contributor

Looking at the code, it looks like there's a separate metricd thread that processes incoming measures in an infinite loop with a 0.1 second delay between iterations.

https://github.com/gnocchixyz/gnocchi/blob/master/gnocchi/tests/functional/fixtures.py#L264-L279

100ms is actually a decently long amount of time given the speed at which these tests are run. The particular tests that fail are probably running their checks before the metricd thread has processed them, and just fail instead of rechecking.

@rafaelweingartner rafaelweingartner force-pushed the mark-resource-as-deleted-when-they-stop-receiving-measures branch from 5d0bb33 to cd1d649 on February 17, 2025 at 12:19
@rafaelweingartner
Contributor Author

Looking at the code, it looks like there's a separate metricd thread that processes incoming measures in an infinite loop with a 0.1 second delay between iterations.

https://github.com/gnocchixyz/gnocchi/blob/master/gnocchi/tests/functional/fixtures.py#L264-L279

100ms is actually a decently long amount of time given the speed at which these tests are run. The particular tests that fail are probably running their checks before the metricd thread has processed them, and just fail instead of rechecking.

Cool, that gave me an idea. What do you think of a config like this one in the tests, then?
8fc168f

This would repeat the test if it fails and wait 1 second. Therefore, we could guarantee that MetricD will already have run by the next try.
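In spirit, the retry amounts to a polling helper like the one below (a hedged sketch only; the referenced commit 8fc168f is the authoritative change, and the helper name and delays are made up for the example): re-run the check until it passes or a deadline expires, giving MetricD time to process the pending measures.

    import time

    def eventually(check, timeout=10.0, interval=1.0):
        # Call `check` repeatedly until it returns truthy or `timeout` seconds
        # elapse; sleeping `interval` between attempts leaves room for the
        # background MetricD thread to catch up.
        deadline = time.monotonic() + timeout
        while True:
            if check():
                return True
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)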

@Callum027
Contributor

Looking at the code, it looks like there's a separate metricd thread that processes incoming measures in an infinite loop with a 0.1 second delay between iterations.
https://github.com/gnocchixyz/gnocchi/blob/master/gnocchi/tests/functional/fixtures.py#L264-L279
100ms is actually a decently long amount of time given the speed at which these tests are run. The particular tests that fail are probably running their checks before the metricd thread has processed them, and just fail instead of rechecking.

Cool, that gave me an idea. What do you think of a config like this one in the tests, then? 8fc168f

This would repeat the test if it fails and wait 1 second. Therefore, we could guarantee that MetricD will already have run by the next try.

That's a great idea. There are normally two tests affected by that race condition, so if we could retry them when they fail, that should solve it nicely without changing the current behaviour much.

@rafaelweingartner
Contributor Author

Thanks for the help @Callum027. @chungg and @tobias-urdin what do you think about this patch now?

@Callum027
Contributor

Hi @chungg and @tobias-urdin, any chance of getting this and the other PRs I have also proposed looked at sometime soon?

@Callum027
Contributor

Hi @rafaelweingartner, I have a question about this proposal.

Can a resource be "un-ended" if we need to start supplying metrics for that resource again (e.g. from Ceilometer)?

The particular use case I'm thinking about is Swift containers. The resource ID for a Swift container in Gnocchi is composed of its project ID and container name, which of course can be reused if you create a new container with the same name. Using this change to clean up old resources is quite beneficial, I think, but if you're storing metrics for Swift containers in your Gnocchi deployment, I'm not sure this would be usable as-is.

@rafaelweingartner
Contributor Author

Hi @rafaelweingartner, I have a question about this proposal.

Can a resource be "un-ended" if we need to start supplying metrics for that resource again (e.g. from Ceilometer)?

The particular use case I'm thinking about is Swift containers. The resource ID for a Swift container in Gnocchi is composed of its project ID and container name, which of course can be reused if you create a new container with the same name. Using this change to clean up old resources is quite beneficial, I think, but if you're storing metrics for Swift containers in your Gnocchi deployment, I'm not sure this would be usable as-is.

Yes, this use case is considered. See: https://github.com/gnocchixyz/gnocchi/pull/1386/files#diff-9c436196247ad6c3c12d0b085d6e2ae7b57c54684a298457fdcf74c9ff3ac63eR700.

We also added this process due to other situations we have seen in the past, such as bad Ceilometer configurations, which led the system to lose monitoring data for almost a month. Therefore, with an implementation like this one, we would mark the resources as "finished", but when collection resumes, we mark them as "alive" again.
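Conceptually, the "revive" path reduces to a check like the sketch below (illustrative only; the linked diff above is the real implementation): when measures arrive again for a resource that was ended by the inactivity check, its ended_at is simply cleared.

    def revive_if_ended(resource):
        # `resource` is assumed to expose an `ended_at` attribute; persisting
        # the change is left to the indexer layer in the real code.
        if resource.ended_at is not None:
            resource.ended_at = None  # the resource is receiving measures again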

@tobias-urdin
Contributor

I will merge this next week unless there are any objections from previous reviewers @chungg @Callum027 @pedro-martins

@tobias-urdin tobias-urdin merged commit 5818ee0 into gnocchixyz:master Mar 18, 2025
33 checks passed