Ground Truth for Benchmark / Semi-Supervised #583

Closed
LuSchnitt opened this issue Nov 15, 2024 · 7 comments
Labels
question Further information is requested

Comments

@LuSchnitt

  • Orion version: 0.2.7
  • Python version: 3.11
  • Operating System: Windows

Question 1:
I want to run the benchmark for some pipelines, but how do I set the ground truth in the input?
In this quickstart, we can use the evaluation function with ground_truth, but how do I use it in the benchmark function?

Question 2:

As far as I understand, all these models work only in an unsupervised way, since they are used in a regression-based manner: they try to predict values of the time series that are themselves part of the input. Models like autoencoders can also work in a semi-supervised way, where they use some labeled data (anomaly or not) to define/find a better threshold that separates the distribution of normal data from that of anomalous data.

Are all models limited to unsupervised use? And do I have access to the threshold value you use in https://sintel.dev/Orion/api_reference/api/orion.primitives.timeseries_anomalies.find_anomalies.html?

It is not explained exactly what this threshold is and how it is computed. Can you explain it or point me to a reference where it is stated?

Best Lukas

@sarahmish
Collaborator

Hi @LuSchnitt!

Q1: To use the benchmark, there is a load_anomalies function that takes the name of the signal as input and returns a dataframe with the start and end timestamps of the ground truth anomalies. To use our benchmark on your custom dataset, please provide an anomalies.csv file in your data path that looks similar to the labels dataframe provided in this notebook (a rough sketch of such a file is below).
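As a minimal sketch only, assuming the signal/events layout described later in this thread (the signal names and timestamps are placeholders, and the exact serialization of the events column should match the linked notebook), such a labels file could be produced like this:

```python
# Sketch: write a minimal anomalies.csv for two hypothetical custom signals.
# Column names follow the signal/events layout described later in this thread;
# timestamps are placeholder Unix epoch seconds, not real labels.
import pandas as pd

labels = pd.DataFrame({
    'signal': ['my-signal-1', 'my-signal-2'],
    'events': [
        [(1400000000, 1400010000)],                            # one anomaly
        [(1400000000, 1400010000), (1400050000, 1400060000)],  # two anomalies
    ],
})

# place the file where load_anomalies expects it (see the workaround further down)
labels.to_csv('anomalies.csv', index=False)
```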

Q2: All pipelines in Orion are unsupervised and do not use anomaly labels in any part of the detection process; the labels are only used for evaluation in the benchmark. Even the autoencoders we support work by reconstructing the signal and are completely unsupervised.

The find_anomalies primitive works with two approaches: a fixed threshold or a dynamic threshold.

  • The fixed threshold approach is a simple moving window where any value that lies more than 4 standard deviations away from the mean is considered anomalous (see the toy sketch after this list).
  • The dynamic threshold is the algorithm proposed by NASA; you can read about their approach here.
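As a toy illustration of the fixed-threshold idea only (this simplifies what find_anomalies actually does and is not the Orion implementation):

```python
# Toy illustration: within one window of errors, flag any value that is more
# than 4 standard deviations above the window mean. Not the actual Orion code.
import numpy as np

def fixed_threshold(window, k=4.0):
    return window.mean() + k * window.std()

errors = np.r_[np.full(50, 0.1), 5.0]    # 50 "normal" errors and one spike
thresh = fixed_threshold(errors)
print(thresh)                            # ~2.9 for this toy window
print(np.where(errors > thresh)[0])      # -> [50], the spike is flagged
```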

Hope this clarifies your questions!

@sarahmish sarahmish added the question Further information is requested label Nov 15, 2024
@LuSchnitt
Author

Hi @sarahmish,

Thank you for answering my questions so quickly. I have read a lot of the Python code of the primitives and the paper you mentioned.
I think I understand everything now; thanks for your help!

You all did a great job documenting the Python code! It could be that I missed it, but is there no option to return the dynamic threshold for each window of find_anomalies?

Best Lukas

@sarahmish
Collaborator

We don't currently keep track of the threshold values that were used for each window.

Though there is a simple modification to the function

def _find_window_sequences(window, z_range, anomaly_padding, min_percent, window_start,

where we can return the threshold that was found for that window

return window_sequences, threshold

then you can store a list of all the threshold values that were used in the find_anomalies function (a rough sketch of this pattern is below).
I believe that would make it possible for you to retrieve the threshold values.
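A runnable sketch of that pattern, using simplified stand-ins for the real functions in orion/primitives/timeseries_anomalies.py (the actual signatures and window logic differ):

```python
# Sketch: the per-window helper also returns the threshold it used, and the
# caller collects one threshold per window. These are simplified stand-ins,
# not the real Orion functions.
import numpy as np

def _find_window_sequences(window, k=4.0):
    threshold = window.mean() + k * window.std()
    sequences = np.where(window > threshold)[0]   # indices above the threshold
    return sequences, threshold

def find_anomalies(errors, window_size=100):
    thresholds, sequences = [], []
    for start in range(0, len(errors), window_size):
        window = errors[start:start + window_size]
        window_sequences, threshold = _find_window_sequences(window)
        sequences.append(window_sequences + start)
        thresholds.append(threshold)              # keep the per-window threshold
    return np.concatenate(sequences), thresholds

errors = np.r_[np.full(99, 0.1), 5.0, np.full(100, 0.1)]
anomalous_indices, thresholds = find_anomalies(errors)
print(anomalous_indices, thresholds)              # thresholds are now accessible
```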

Let me know if you have any further questions!

@LuSchnitt
Author

Hello @sarahmish,

thanks for the help, I thought about this too but wasn't sure whether I would break some of the downstream functionality with it. I did it by saving the threshold to a file instead of adapting the return values.

In regard to Q1 (evaluation), I did as you said, but unfortunately I got the same error again and again. I will add a picture with the structure of my anomalies and the error message.

[screenshot error_evaluation_orion: structure of the anomalies dataframe and the error message]

@sarahmish
Collaborator

The resulting anomalies dataframe should have two main columns:

  • signal: which contains the name of the signal
  • events: a list of tuples, each containing two elements (the start and end timestamps).

For example, in the following screenshot, S-1 has only one anomaly (its list contains only one tuple), while P-1 has more than one.
[screenshot: labels dataframe with signal and events columns]

Then, when you call load_anomalies('S-1'), you will get a dataframe containing the anomalies of S-1, as follows:
[screenshot: load_anomalies('S-1') output with start and end columns]
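Since the screenshots are not reproduced here, a hedged reconstruction of the two dataframes, with made-up timestamps, would look roughly like this:

```python
# Rough reconstruction of the two screenshots above, with made-up timestamps.
# The labels table has one row per signal; 'events' is a list of (start, end)
# tuples, so S-1 has a single anomaly while P-1 has two.
import pandas as pd

labels = pd.DataFrame({
    'signal': ['S-1', 'P-1'],
    'events': [
        [(1400000000, 1400010000)],
        [(1400000000, 1400010000), (1400050000, 1400060000)],
    ],
})

# load_anomalies('S-1') then returns the events of that signal as a
# start/end dataframe, roughly equivalent to:
s1_events = labels.loc[labels.signal == 'S-1', 'events'].iloc[0]
s1_anomalies = pd.DataFrame(s1_events, columns=['start', 'end'])
print(s1_anomalies)
```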

You raise an important issue, though: load_anomalies should take a source file as an argument to load your data's anomalies. As a workaround, you will need to place anomalies.csv in the path the function expects, which is /path/to/orion/data/anomalies.csv.

I will open an issue to mark the need for this functionality!

@LuSchnitt
Author

LuSchnitt commented Nov 21, 2024

Thanks for replying!

So if I understand correctly, the events column contains information about specific anomaly events. Since I only have information about whether a timestamp is anomalous or not, this approach is unnecessarily complex for my case.

But I found a way to use the <model>.evaluate method, which fits my case perfectly!

I just got one more question.

The evaluate function takes a data argument as well as a train_data argument.

I understand the workflow in the following way: the model gets trained with train_data and afterwards tested with data; the anomalies found in data are then compared with the ground_truth anomalies, which only correspond to the anomalies in data.
Am I correct?

edit:

I'm sorry, I forgot to ask how the ground-truth anomalies are separated into point and contextual anomalies. Simply by their length? But then information about the time between measurements would be necessary.

edit edit:

Okay, I looked at the code and apparently all anomalies are evaluated as contextual anomalies:

file: orion/core.py
line 16: from orion.evaluation import CONTEXTUAL_METRICS as METRICS
line 295: metric: METRICS[metric](ground_truth, events, data=data)

Anyway, many thanks for your support. I really like your framework and will cite it in my master's thesis, which is about finding anomalies in building energy meter data.

@sarahmish
Collaborator

Yes, that is correct @LuSchnitt, the orion.evaluate function only evaluates contextual anomalies. Moreover, the train_data argument is only needed if you want to train your model on the training data prior to evaluating!
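For reference, a hedged usage sketch of the evaluate call discussed above; the data/ground_truth/train_data names come from this thread, while the fit flag and the pipeline name are assumptions that should be checked against the Orion docs:

```python
# Hedged sketch of <model>.evaluate as discussed in this thread.
# The `fit` flag and the pipeline name are assumptions; verify against the docs.
from orion import Orion
from orion.data import load_signal, load_anomalies

data = load_signal('S-1')             # timestamp/value dataframe to detect on
ground_truth = load_anomalies('S-1')  # start/end dataframe of labeled anomalies

orion = Orion(pipeline='lstm_dynamic_threshold')

# fit on train_data first (here simply the same signal), then detect anomalies
# in `data` and score them against `ground_truth` with the contextual metrics
scores = orion.evaluate(data=data, ground_truth=ground_truth,
                        fit=True, train_data=data)
print(scores)
```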

Thank you so much, best of luck with your master's thesis!
