See the Executive_Summary (a 6-page, paper-style summary) for a more detailed overview of the thesis.
Anomaly Detection (AD) is a data mining process that consists of finding unusual patterns or rare observations in a set of data. Anomalies usually represent negative events, which is why anomaly detection is used in many different fields, from medicine to industry.
We approached the problem starting from a milestone AD algorithm: Isolation Forest (iForest).
This thesis proposes using the intermediate output of iForest to create an embedding, that is, a new data representation on which known classification or anomaly detection techniques can be applied. Our empirical evaluation shows that our approach performs just as well as, and sometimes better than, iForest on the same data. Our most important result, however, is a new framework that enables other techniques to improve anomaly detection performance.
iForest is an unsupervised, model-based method for anomaly detection. It represented a breakthrough: before iForest, the usual approach to AD problems was to construct a profile of normal data, then test unseen instances and flag as anomalies those that do not conform to that profile. iForest differs from all previous methods because it is based on the idea of directly isolating anomalies, instead of recognizing them as far from the normal data profile.
This approach works because anomalies are more susceptible to isolation than normal instances: a normal instance requires many more partitions than an anomaly to be isolated. iForest assigns an anomaly score to each instance based on the number of splits required to isolate it.
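As a rough illustration of how a score can be derived from the number of splits, the sketch below follows the scoring rule of the original iForest paper (Liu, Ting and Zhou, 2008): s = 2^(-E[depth] / c(n)), where c(n) is the average path length of an unsuccessful search in a binary search tree built on n points. The function names and the example depths are illustrative assumptions, not code or data from this thesis.

```python
import math

EULER_GAMMA = 0.5772156649  # Euler-Mascheroni constant


def average_path_length(n):
    """c(n): expected path length of an unsuccessful BST search over n points."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    harmonic = math.log(n - 1) + EULER_GAMMA  # approximation of H(n - 1)
    return 2.0 * harmonic - 2.0 * (n - 1) / n


def anomaly_score(depths, subsample_size):
    """Map the mean isolation depth over all iTrees to a score in (0, 1]."""
    mean_depth = sum(depths) / len(depths)
    return 2.0 ** (-mean_depth / average_path_length(subsample_size))


# Hypothetical depths returned by 5 iTrees built on subsamples of 256 points.
easy_to_isolate = [2, 3, 2, 4, 3]        # few splits -> likely anomaly
hard_to_isolate = [11, 12, 10, 12, 11]   # many splits -> likely normal

print(anomaly_score(easy_to_isolate, 256))
print(anomaly_score(hard_to_isolate, 256))
```

Under this rule, a mean depth equal to c(n) gives a score of exactly 0.5; shallower instances score closer to 1 and deeper ones closer to 0.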
The model is an ensemble of trees, each called an Isolation Tree (iTree). In each iTree, the shortest paths (few splits) identify anomalies, while the longest ones (more splits) correspond to normal instances.
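A minimal sketch of the idea, using only the Python standard library, is given below. It grows each tree with random axis-aligned splits and counts the splits needed to reach a leaf. It is a simplification under stated assumptions: every tree is built on the full data set (no subsampling), and no correction is added for leaves that still hold several points. The names `build_itree` and `path_depth` and the toy data are hypothetical.

```python
import random


def build_itree(points, rng, depth=0, max_depth=10):
    """Grow one iTree by recursive random axis-aligned splits."""
    if len(points) <= 1 or depth >= max_depth:
        return None  # leaf: point isolated or height limit reached
    feature = rng.randrange(len(points[0]))
    values = [p[feature] for p in points]
    lo, hi = min(values), max(values)
    if lo == hi:
        return None  # no split possible on this feature (simplification)
    threshold = rng.uniform(lo, hi)
    left = [p for p in points if p[feature] < threshold]
    right = [p for p in points if p[feature] >= threshold]
    return {
        "feature": feature,
        "threshold": threshold,
        "left": build_itree(left, rng, depth + 1, max_depth),
        "right": build_itree(right, rng, depth + 1, max_depth),
    }


def path_depth(tree, x):
    """Number of splits traversed by x before reaching a leaf."""
    depth = 0
    node = tree
    while node is not None:
        side = "left" if x[node["feature"]] < node["threshold"] else "right"
        node = node[side]
        depth += 1
    return depth


# Toy data: a 2-D Gaussian cluster plus one far-away point.
rng = random.Random(0)
cluster = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(127)]
outlier = (8.0, 8.0)
data = cluster + [outlier]

forest = [build_itree(data, rng) for _ in range(100)]
mean_outlier = sum(path_depth(t, outlier) for t in forest) / len(forest)
mean_inlier = sum(
    path_depth(t, p) for t in forest for p in cluster
) / (len(forest) * len(cluster))
print(mean_outlier, mean_inlier)  # the outlier is expected to be shallower
```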
We introduce a new embedding that gives the input data x a new representation. First, some definitions:
- depths vector y: the intermediate output of iForest, where y_i is the depth returned by the i-th iTree;
- histogram h: the histogram of the depths vector y, normalized so that ||h||_1 = 1.
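To summarize, the histogram h is obtained from an input instance x in two steps: x is passed through every iTree of the iForest to collect the depths vector y, and the histogram of y is then computed and normalized.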
With the embedding, and therefore with this new representation of the input data, we expect normal and anomalous instances to yield different histograms: anomalous instances have high frequencies in the bins representing low depths, while normal instances have high frequencies in the bins representing high depths.
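The expected difference between the two kinds of histogram can be illustrated with a small sketch. It assumes one bin per integer depth from 0 to a chosen `max_depth`; the binning actually used in the thesis is not stated here, so this choice, the function name `depth_histogram`, and the example depths vectors are all hypothetical.

```python
def depth_histogram(depths, max_depth):
    """Return the L1-normalized histogram h of a depths vector y.

    One bin per integer depth 0..max_depth (a binning choice assumed here);
    depths above max_depth are counted in the last bin.
    """
    if not depths:
        raise ValueError("depths vector must not be empty")
    counts = [0] * (max_depth + 1)
    for d in depths:
        counts[min(int(d), max_depth)] += 1
    total = sum(counts)
    return [c / total for c in counts]


# Hypothetical depths vectors from a forest of 8 iTrees.
y_anomalous = [1, 2, 1, 3, 2, 1, 2, 2]
y_normal = [6, 7, 8, 7, 8, 8, 6, 7]

h_anomalous = depth_histogram(y_anomalous, max_depth=8)
h_normal = depth_histogram(y_normal, max_depth=8)
print(h_anomalous)  # mass concentrated in the low-depth bins
print(h_normal)     # mass concentrated in the high-depth bins
```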
We use this new embedding to represent data in a completely different way, based on the iForest output, with the goal of applying other anomaly detection techniques in this new space and achieving better results than the iForest baseline.
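As one possible example of a detector operating in the embedding space, the sketch below scores a histogram by its mean L1 distance to its k nearest neighbours in a reference set. The choice of k-nearest neighbours, the L1 distance, and the assumption that the reference set contains mostly normal instances are made here purely for illustration; this text does not say which techniques the thesis evaluates. All names and values are hypothetical.

```python
def l1_distance(h_a, h_b):
    """L1 distance between two histograms with the same number of bins."""
    return sum(abs(a - b) for a, b in zip(h_a, h_b))


def knn_score(h_query, reference_histograms, k=2):
    """Mean L1 distance from h_query to its k nearest reference histograms."""
    distances = sorted(l1_distance(h_query, h) for h in reference_histograms)
    return sum(distances[:k]) / k


# Hypothetical normalized depth histograms (4 bins, low depth -> high depth).
reference = [
    [0.0, 0.1, 0.3, 0.6],
    [0.0, 0.0, 0.4, 0.6],
    [0.1, 0.1, 0.2, 0.6],
]
h_normal_like = [0.0, 0.1, 0.4, 0.5]
h_anomalous_like = [0.6, 0.3, 0.1, 0.0]

print(knn_score(h_normal_like, reference))
print(knn_score(h_anomalous_like, reference))  # larger: far from the reference set
```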
The Executive_Summary (a 6-page, paper-style summary) gives a more detailed overview of the thesis, including the experiments and the results obtained.