Ground Truth for Benchmark / Semi-Supervised #583

Closed
LuSchnitt opened this issue Nov 15, 2024 · 7 comments
Labels
question Further information is requested

Comments

@LuSchnitt

  • Orion version: 0.2.7
  • Python version: 3.11
  • Operating System: Windows

Question 1:
I want to run the benchmark for some pipelines, but how do I set the ground truth in the input?
In this quickstart, we can use the evaluation function with ground_truth, but how do I use it in the benchmark function?

Question 2:

As far as I understand, all these models work only in an unsupervised way, since they are used in a regression-based manner: they try to predict values of the time series that are themselves part of the input. Models like autoencoders can also work in a semi-supervised way, where they use some labeled data (anomaly or not) to define/find a better threshold that separates the distribution of normal data from that of anomalous data.

Are all models limited to unsupervised use? And do I have access to the threshold value you use in https://sintel.dev/Orion/api_reference/api/orion.primitives.timeseries_anomalies.find_anomalies.html?

It is not explained exactly what this threshold is and how it is computed. Can you explain it or point me to a reference where it is stated?

Best Lukas

@sarahmish
Collaborator

Hi @LuSchnitt!

Q1: To use the benchmark, there is a load_anomalies function that takes the name of the signal as input and returns a dataframe with the start and end timestamps of the ground truth anomalies. To use our benchmark on your custom dataset, please provide an anomalies.csv file in your data path that looks similar to the labels dataframe provided in this notebook (a rough sketch of such a file is below).
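As a minimal sketch only, assuming the signal/events layout described later in this thread (the signal names and timestamps are placeholders, and the exact serialization of the events column should match the linked notebook), such a labels file could be produced like this:

```python
# Sketch: write a minimal anomalies.csv for two hypothetical custom signals.
# Column names follow the signal/events layout described later in this thread;
# timestamps are placeholder Unix epoch seconds, not real labels.
import pandas as pd

labels = pd.DataFrame({
    'signal': ['my-signal-1', 'my-signal-2'],
    'events': [
        [(1400000000, 1400010000)],                            # one anomaly
        [(1400000000, 1400010000), (1400050000, 1400060000)],  # two anomalies
    ],
})

# place the file where load_anomalies expects it (see the workaround further down)
labels.to_csv('anomalies.csv', index=False)
```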

Q2: All pipelines in Orion are unsupervised and do not use anomaly labels in any part of the detection process; the labels are only used for evaluation in the benchmark. Even the autoencoders we support work by reconstructing the signal and are completely unsupervised.

The find_anomalies primitive works with two approaches: a fixed threshold or a dynamic threshold.

  • The fixed threshold approach is a simple moving window where any value that lies more than 4 standard deviations away from the mean is considered anomalous (see the toy sketch after this list).
  • The dynamic threshold is the algorithm proposed by NASA; you can read about their approach here.
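As a toy illustration of the fixed-threshold idea only (this simplifies what find_anomalies actually does and is not the Orion implementation):

```python
# Toy illustration: within one window of errors, flag any value that is more
# than 4 standard deviations above the window mean. Not the actual Orion code.
import numpy as np

def fixed_threshold(window, k=4.0):
    return window.mean() + k * window.std()

errors = np.r_[np.full(50, 0.1), 5.0]    # 50 "normal" errors and one spike
thresh = fixed_threshold(errors)
print(thresh)                            # ~2.9 for this toy window
print(np.where(errors > thresh)[0])      # -> [50], the spike is flagged
```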

Hope this clarifies your questions!

@sarahmish sarahmish added the question Further information is requested label Nov 15, 2024
@LuSchnitt
Author

Hi @sarahmish,

Thank you for answering my questions so quickly. I have read a lot of the Python code of the primitives and the paper you mentioned.
I think I understand everything now; thanks for your help!

You all did a great job documenting the Python code! It could be that I missed it, but is there no option to return the dynamic threshold for each window of find_anomalies?

Best Lukas

@sarahmish
Collaborator

We don't currently keep track of the threshold values that were used for each window.

Though there is a simple modification to the function

def _find_window_sequences(window, z_range, anomaly_padding, min_percent, window_start,

where we can return the threshold that was found for that window

return window_sequences, threshold

then you can store a list of all the threshold values that were used in the find_anomalies function (a rough sketch of this pattern is below).
I believe that would make it possible for you to retrieve the threshold values.
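A runnable sketch of that pattern, using simplified stand-ins for the real functions in orion/primitives/timeseries_anomalies.py (the actual signatures and window logic differ):

```python
# Sketch: the per-window helper also returns the threshold it used, and the
# caller collects one threshold per window. These are simplified stand-ins,
# not the real Orion functions.
import numpy as np

def _find_window_sequences(window, k=4.0):
    threshold = window.mean() + k * window.std()
    sequences = np.where(window > threshold)[0]   # indices above the threshold
    return sequences, threshold

def find_anomalies(errors, window_size=100):
    thresholds, sequences = [], []
    for start in range(0, len(errors), window_size):
        window = errors[start:start + window_size]
        window_sequences, threshold = _find_window_sequences(window)
        sequences.append(window_sequences + start)
        thresholds.append(threshold)              # keep the per-window threshold
    return np.concatenate(sequences), thresholds

errors = np.r_[np.full(99, 0.1), 5.0, np.full(100, 0.1)]
anomalous_indices, thresholds = find_anomalies(errors)
print(anomalous_indices, thresholds)              # thresholds are now accessible
```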

Let me know if you have any further questions!

@LuSchnitt
Author

Hello @sarahmish,

thanks for the help, I thought about this too but wasn't sure whether I would break some of the downstream functionality with it. I did it by saving the threshold to a file instead of adapting the return values.

In regard to Q1 (evaluation), I did as you said, but unfortunately I got the same error again and again. I will add a picture with the structure of my anomalies and the error message.

[screenshot error_evaluation_orion: structure of the anomalies dataframe and the error message]

@sarahmish
Collaborator

The resulting anomalies dataframe should have two main columns:

  • signal: which contains the name of the signal
  • events: a list of tuples, each containing two elements (the start and end timestamps).

For example, in the following screenshot, S-1 has only one anomaly (its list contains only one tuple), while P-1 has more than one.
[screenshot: labels dataframe with signal and events columns]

Then, when you call load_anomalies('S-1'), you will get a dataframe containing the anomalies of S-1, as follows:
[screenshot: load_anomalies('S-1') output with start and end columns]
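Since the screenshots are not reproduced here, a hedged reconstruction of the two dataframes, with made-up timestamps, would look roughly like this:

```python
# Rough reconstruction of the two screenshots above, with made-up timestamps.
# The labels table has one row per signal; 'events' is a list of (start, end)
# tuples, so S-1 has a single anomaly while P-1 has two.
import pandas as pd

labels = pd.DataFrame({
    'signal': ['S-1', 'P-1'],
    'events': [
        [(1400000000, 1400010000)],
        [(1400000000, 1400010000), (1400050000, 1400060000)],
    ],
})

# load_anomalies('S-1') then returns the events of that signal as a
# start/end dataframe, roughly equivalent to:
s1_events = labels.loc[labels.signal == 'S-1', 'events'].iloc[0]
s1_anomalies = pd.DataFrame(s1_events, columns=['start', 'end'])
print(s1_anomalies)
```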

You raise an important issue, though: load_anomalies should take a source file as an argument to load your data's anomalies. As a workaround, you will need to place anomalies.csv in the path the function expects, which is /path/to/orion/data/anomalies.csv.

I will open an issue to mark the need for this functionality!

@LuSchnitt
Author

LuSchnitt commented Nov 21, 2024

Thanks for replying!

So if I understand correctly, the events column contains information about specific anomaly events. Since I only have information about whether a timestamp is anomalous or not, this approach is unnecessarily complex for my case.

But I found a way to use the <model>.evaluate method, which fits my case perfectly!

I just got one more question.

The evaluate function takes a data argument as well as a train_data argument.

I understand the workflow in the following way: the model gets trained with train_data and afterwards tested with data; the anomalies found in data are then compared with the ground_truth anomalies, which only correspond to the anomalies in data.
Am I correct?

edit:

I'm sorry, I forgot to ask how the ground-truth anomalies are separated into point and contextual anomalies. Simply by their length? But then information about the time between measurements would be necessary.

edit edit:

Okay, I looked at the code and apparently all anomalies are evaluated as contextual anomalies:

file: orion/core.py
line 16: from orion.evaluation import CONTEXTUAL_METRICS as METRICS
line 295: metric: METRICS[metric](ground_truth, events, data=data)

Anyway, many thanks for your support. I really like your framework and will cite it in my master's thesis, which is about finding anomalies in building energy meter data.

@sarahmish
Collaborator

Yes, that is correct @LuSchnitt, the orion.evaluate function only evaluates contextual anomalies. Moreover, the train_data argument is only needed if you want to train your model on the training data prior to evaluating!
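For reference, a hedged usage sketch of the evaluate call discussed above; the data/ground_truth/train_data names come from this thread, while the fit flag and the pipeline name are assumptions that should be checked against the Orion docs:

```python
# Hedged sketch of <model>.evaluate as discussed in this thread.
# The `fit` flag and the pipeline name are assumptions; verify against the docs.
from orion import Orion
from orion.data import load_signal, load_anomalies

data = load_signal('S-1')             # timestamp/value dataframe to detect on
ground_truth = load_anomalies('S-1')  # start/end dataframe of labeled anomalies

orion = Orion(pipeline='lstm_dynamic_threshold')

# fit on train_data first (here simply the same signal), then detect anomalies
# in `data` and score them against `ground_truth` with the contextual metrics
scores = orion.evaluate(data=data, ground_truth=ground_truth,
                        fit=True, train_data=data)
print(scores)
```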

Thank you so much, best of luck with your master's thesis!
