Master Thesis - Anomaly Detection via Isolation Forest Embedding

See the Executive_Summary (a 6-page, paper-like summary) for a more detailed overview of the thesis.

Introduction

Anomaly Detection (AD) is a data mining task that consists of finding unusual patterns or rare observations in a dataset. Anomalies usually represent negative events, which is why anomaly detection is used in many different fields, from medicine to industry.

We approach the problem by taking a milestone AD algorithm as our starting point: Isolation Forest (iForest).

This thesis proposes using the intermediate output of iForest to create an embedding, that is, a new data representation on which known classification or anomaly detection techniques can be applied. Our empirical evaluation shows that our approach performs as well as, and sometimes better than, iForest on the same data. Our most important result, however, is a new framework that enables other techniques to improve anomaly detection performance.

Isolation Forest

iForest is an unsupervised, model-based method for anomaly detection. It represented a breakthrough: before iForest, the usual approach to AD problems was to construct a profile of normal data, then test unseen instances and flag as anomalies those that do not conform to that profile. iForest differs from previous methods because it is based on the idea of directly isolating anomalies, instead of recognizing them as far from the normal data profile.

This approach works because anomalies are more susceptible to isolation than normal instances: a normal instance requires many more partitions than an anomaly to be isolated. iForest assigns an anomaly score to each instance based on the number of splits required to isolate it.
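
As a minimal sketch of how a split count can become a score, the snippet below uses the scoring rule from the original iForest paper (Liu, Ting and Zhou): s(x, n) = 2^(-E[h(x)] / c(n)), where E[h(x)] is the mean depth over the trees and c(n) normalizes by the average path length for n points. The function names are illustrative, and it is an assumption that the thesis implementation follows exactly this formulation.

```python
import math

EULER_GAMMA = 0.5772156649  # Euler-Mascheroni constant


def average_path_length(n):
    """c(n): average path length of an unsuccessful search in a binary
    search tree built on n points, used to normalize the depths."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    harmonic = math.log(n - 1) + EULER_GAMMA  # approximation of H(n - 1)
    return 2.0 * harmonic - 2.0 * (n - 1) / n


def anomaly_score(mean_depth, n):
    """Score in (0, 1]: values near 1 suggest an anomaly, values well
    below 0.5 suggest a normal instance (requires n > 1)."""
    return 2.0 ** (-mean_depth / average_path_length(n))


# An instance isolated after few splits scores higher than one
# that needs many splits (n = 256 is the usual subsample size).
shallow = anomaly_score(2.0, 256)
deep = anomaly_score(20.0, 256)
```

With this rule, an instance whose mean depth equals c(n) scores exactly 0.5, shallower instances score above 0.5, and deeper ones score below it.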

The model is an ensemble of trees, each called an Isolation Tree (iTree). In each iTree, short paths (few splits) identify anomalies, while long paths (many splits) indicate normal instances.
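
The link between path length and anomalies can be illustrated with a small, standard-library-only sketch. It is not the thesis implementation: instead of building whole iTrees on subsamples, it follows only the branch that contains one target point and counts the random axis-aligned splits needed to isolate it. The helper names and the toy data are hypothetical.

```python
import random


def isolation_depth(points, target, rng, max_depth=50):
    """Count the random axis-aligned splits needed to isolate `target`
    (which must be one of `points`)."""
    current = list(points)
    depth = 0
    while len(current) > 1 and depth < max_depth:
        feature = rng.randrange(len(target))
        lo = min(p[feature] for p in current)
        hi = max(p[feature] for p in current)
        if lo == hi:
            # This feature cannot separate the remaining points.
            if all(p == current[0] for p in current):
                break  # only duplicates are left, nothing to split
            continue  # try another feature
        split = rng.uniform(lo, hi)
        # Keep only the side of the split that contains the target.
        if target[feature] < split:
            current = [p for p in current if p[feature] < split]
        else:
            current = [p for p in current if p[feature] >= split]
        depth += 1
    return depth


def mean_depth(points, target, rng, n_trees=100):
    """Average isolation depth over several random partitionings."""
    return sum(isolation_depth(points, target, rng) for _ in range(n_trees)) / n_trees


rng = random.Random(0)
normal = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(200)]
outlier = (8.0, 8.0)  # toy anomaly, far from the normal cluster
inlier = min(normal, key=lambda p: p[0] ** 2 + p[1] ** 2)  # most central point
points = normal + [outlier]

outlier_depth = mean_depth(points, outlier, rng)
inlier_depth = mean_depth(points, inlier, rng)
```

On this toy data the outlier is isolated after far fewer splits on average than the central point, which is the behaviour iForest exploits.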

Proposed Solution

We introduce a new embedding that gives the input data x a new representation. First, some definitions:

depths vector y: the intermediate output of iForest, where y_i is the depth returned by the i-th iTree;
histogram h: the histogram of the depths vector y, normalized so that ||h||₁ = 1.

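The following summarizes how the histogram h is obtained from an input instance x, where d is the dimension of the input, t is the number of iTrees, and n is the number of histogram bins: $x \in \mathbb{R}^d \xrightarrow{\hspace{3px}iForest\hspace{3px}} y \in \mathbb{R}^t \xrightarrow{\hspace{3px}histogram\hspace{3px}} h \in \mathbb{Q}^n$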

Using this embedding as the new representation of the input data, we expect normal and anomalous instances to yield different histograms: anomalous instances should have high frequencies in the bins representing low depths, while normal instances should have high frequencies in the bins representing high depths.
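
The step from a depths vector y to a normalized histogram h can be sketched as follows. The binning scheme (equal-width bins up to a fixed `max_depth`) and the two depths vectors are assumptions made only for illustration; the thesis may choose the bins differently. Exact fractions are used so that h stays in ℚ^n and its entries sum to exactly 1.

```python
from fractions import Fraction


def depth_histogram(depths, n_bins, max_depth):
    """Map a depths vector y (one non-negative depth per iTree) to a
    histogram h with n_bins equal-width bins and ||h||_1 = 1."""
    counts = [0] * n_bins
    width = max_depth / n_bins
    for d in depths:
        # Depths at or beyond max_depth fall into the last bin.
        bin_index = min(int(d / width), n_bins - 1)
        counts[bin_index] += 1
    total = len(depths)
    return [Fraction(c, total) for c in counts]


# Hypothetical depths vectors for t = 8 iTrees (made-up toy values).
y_anomaly = [1, 2, 2, 3, 1, 2, 4, 2]
y_normal = [7, 8, 6, 9, 8, 7, 5, 8]

h_anomaly = depth_histogram(y_anomaly, n_bins=5, max_depth=10)
h_normal = depth_histogram(y_normal, n_bins=5, max_depth=10)
```

In this toy case most of the mass of `h_anomaly` sits in the first two bins, while most of the mass of `h_normal` sits in the last two.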

We use this new embedding to represent data in a completely different way, based on the iForest output, with the goal of applying other anomaly detection techniques in this new space and achieving better results than the iForest baseline.
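
This section does not name a specific downstream detector, so the following is purely an illustration of what "another technique in the new space" could look like: a k-nearest-neighbour distance score computed on histograms with the L1 distance. The choice of detector, the function names, and the toy histograms are all assumptions, not the method evaluated in the thesis.

```python
def l1_distance(h1, h2):
    """L1 distance between two histograms of equal length."""
    return sum(abs(a - b) for a, b in zip(h1, h2))


def knn_score(h, reference, k=2):
    """Mean L1 distance from h to its k nearest reference histograms.
    Larger values mean h looks less like the reference data."""
    distances = sorted(l1_distance(h, r) for r in reference)
    return sum(distances[:k]) / k


# Toy reference histograms (mass in the high-depth bins).
reference = [
    [0.0, 0.1, 0.2, 0.3, 0.4],
    [0.0, 0.0, 0.2, 0.4, 0.4],
    [0.1, 0.0, 0.1, 0.4, 0.4],
]
h_typical = [0.0, 0.1, 0.1, 0.4, 0.4]  # resembles the reference set
h_unusual = [0.5, 0.4, 0.1, 0.0, 0.0]  # mass in the low-depth bins

typical_score = knn_score(h_typical, reference)
unusual_score = knn_score(h_unusual, reference)
```

Any detector or classifier that accepts fixed-length vectors could be substituted for `knn_score` here, since the embedding yields one n-dimensional histogram per instance.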

The Executive_Summary (a 6-page, paper-like summary) gives a more detailed overview of the thesis, including the experiments and the results obtained.
