Isolation Forest
Synopsis
This operator finds outliers using an Isolation Forest.
Description
Isolation Forests are anomaly detection algorithms. Unlike ordinary Random Forests, they do not need any labels to work. Their purpose is to detect whether data points are similar to the training data or not.
To do this, an isolation tree picks a random attribute and performs a random split on it. The data is sent to the two child nodes, and the same strategy is repeated until either the chosen attribute is constant or there are not enough examples left in the node.
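The following Java sketch illustrates this tree-growing strategy under stated assumptions: attributes are numeric and each example is a double array. The class, method, and field names are purely illustrative and are not part of the operator's implementation.

```java
import java.util.List;
import java.util.Random;
import java.util.stream.Collectors;

// Hypothetical sketch of growing one isolation tree; not the operator's actual code.
class IsolationTreeSketch {
    static final Random RNG = new Random();

    static Node grow(List<double[]> examples, int maxLeafSize) {
        // Stopping criterion 1: not enough examples left in the node.
        if (examples.size() <= maxLeafSize) {
            return Node.leaf(examples.size());
        }
        // Pick a random attribute.
        int attribute = RNG.nextInt(examples.get(0).length);
        double min = examples.stream().mapToDouble(e -> e[attribute]).min().getAsDouble();
        double max = examples.stream().mapToDouble(e -> e[attribute]).max().getAsDouble();
        // Stopping criterion 2: the chosen attribute is constant in this node.
        if (min == max) {
            return Node.leaf(examples.size());
        }
        // Perform a random split between the attribute's minimum and maximum.
        double split = min + RNG.nextDouble() * (max - min);
        List<double[]> left = examples.stream()
                .filter(e -> e[attribute] <= split).collect(Collectors.toList());
        List<double[]> right = examples.stream()
                .filter(e -> e[attribute] > split).collect(Collectors.toList());
        return Node.inner(attribute, split, grow(left, maxLeafSize), grow(right, maxLeafSize));
    }

    // Simple tree node: either a leaf (with its size) or an inner node with a split.
    static class Node {
        int attribute; double split; Node left, right; int size; boolean isLeaf;
        static Node leaf(int size) { Node n = new Node(); n.isLeaf = true; n.size = size; return n; }
        static Node inner(int a, double s, Node l, Node r) {
            Node n = new Node(); n.attribute = a; n.split = s; n.left = l; n.right = r; return n;
        }
    }
}
```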
The central idea of an Isolation Forest is that "normal" data points have long paths down a tree, while anomalous data points tend to have short average paths across the forest.
There are currently two ways to determine the anomaly score. If you choose average_path, you receive the average path length until a given example reaches a leaf. If you choose normalized_score, this score is normalized as score = 2^(-l/c), where l is the average path length and c is the theoretical depth of a tree.
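As an illustration of the two scoring options, the sketch below computes the average path length of an example across a forest and applies the 2^(-l/c) normalization. It reuses the hypothetical Node type from the sketch above and takes the theoretical depth c as a parameter; none of these names belong to the operator's actual API.

```java
import java.util.List;

// Hypothetical sketch of the two scoring options described above.
class IsolationScoreSketch {

    // Path length of one example through one tree (number of edges until a leaf).
    static double pathLength(double[] example, IsolationTreeSketch.Node node, int depth) {
        if (node.isLeaf) {
            return depth;
        }
        return example[node.attribute] <= node.split
                ? pathLength(example, node.left, depth + 1)
                : pathLength(example, node.right, depth + 1);
    }

    // average_path: average path length over all trees in the forest.
    static double averagePath(double[] example, List<IsolationTreeSketch.Node> forest) {
        return forest.stream()
                .mapToDouble(tree -> pathLength(example, tree, 0))
                .average().getAsDouble();
    }

    // normalized_score: score = 2^(-l/c), where l is the average path length
    // and c is the theoretical depth of a tree.
    static double normalizedScore(double averagePathLength, double theoreticalDepth) {
        return Math.pow(2, -averagePathLength / theoreticalDepth);
    }
}
```

With this normalization, short average paths (easily isolated examples) yield scores close to 1, while long average paths yield scores closer to 0.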
For more details, please have a look at the original paper: https://cs.nju.edu.cn/zhouzh/zhouzh.files/publication/icdm08b.pdf
Input
example set
The input ExampleSet.
Output
example set
The resulting ExampleSet with the anomaly scores.
model
The Isolation Forest model, which can be applied to a different data set.
Parameters
Number of trees
The number of trees in the forest.
Max leaf size
Maximal number of examples in a leaf. Used as a stopping criterion.
Score calculation
The option used to calculate the anomaly score (average_path or normalized_score).