Isolation Forest
Synopsis
This operator finds outliers using an Isolation Forest.
Description
Isolation Forests are anomaly detection algorithms. Unlike ordinary Random Forests, they do not need any labels to work. Their purpose is to detect whether data points are similar to the training data or not.
To do this, an isolation tree picks a random attribute and performs a random split on it. The data is sent to the two child nodes, and the same strategy is repeated until either the chosen attribute is constant or there are not enough examples left in the node.
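The following Java sketch illustrates this tree-growing strategy under stated assumptions: attributes are numeric and each example is a double array. The class, method, and field names are purely illustrative and are not part of the operator's implementation.

```java
import java.util.List;
import java.util.Random;
import java.util.stream.Collectors;

// Hypothetical sketch of growing one isolation tree; not the operator's actual code.
class IsolationTreeSketch {
    static final Random RNG = new Random();

    static Node grow(List<double[]> examples, int maxLeafSize) {
        // Stopping criterion 1: not enough examples left in the node.
        if (examples.size() <= maxLeafSize) {
            return Node.leaf(examples.size());
        }
        // Pick a random attribute.
        int attribute = RNG.nextInt(examples.get(0).length);
        double min = examples.stream().mapToDouble(e -> e[attribute]).min().getAsDouble();
        double max = examples.stream().mapToDouble(e -> e[attribute]).max().getAsDouble();
        // Stopping criterion 2: the chosen attribute is constant in this node.
        if (min == max) {
            return Node.leaf(examples.size());
        }
        // Perform a random split between the attribute's minimum and maximum.
        double split = min + RNG.nextDouble() * (max - min);
        List<double[]> left = examples.stream()
                .filter(e -> e[attribute] <= split).collect(Collectors.toList());
        List<double[]> right = examples.stream()
                .filter(e -> e[attribute] > split).collect(Collectors.toList());
        return Node.inner(attribute, split, grow(left, maxLeafSize), grow(right, maxLeafSize));
    }

    // Simple tree node: either a leaf (with its size) or an inner node with a split.
    static class Node {
        int attribute; double split; Node left, right; int size; boolean isLeaf;
        static Node leaf(int size) { Node n = new Node(); n.isLeaf = true; n.size = size; return n; }
        static Node inner(int a, double s, Node l, Node r) {
            Node n = new Node(); n.attribute = a; n.split = s; n.left = l; n.right = r; return n;
        }
    }
}
```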
The central idea of an Isolation Forest is that "normal" data points have long paths down a tree, while anomalous data points tend to have short average paths across the forest.
There are currently two ways to determine the anomaly score. If you choose average_path, you receive the average path length until a given example reaches a leaf. If you choose normalized_score, this score is normalized as score = 2^(-l/c), where l is the average path length and c is the theoretical depth of a tree.
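As an illustration of the two scoring options, the sketch below computes the average path length of an example across a forest and applies the 2^(-l/c) normalization. It reuses the hypothetical Node type from the sketch above and takes the theoretical depth c as a parameter; none of these names belong to the operator's actual API.

```java
import java.util.List;

// Hypothetical sketch of the two scoring options described above.
class IsolationScoreSketch {

    // Path length of one example through one tree (number of edges until a leaf).
    static double pathLength(double[] example, IsolationTreeSketch.Node node, int depth) {
        if (node.isLeaf) {
            return depth;
        }
        return example[node.attribute] <= node.split
                ? pathLength(example, node.left, depth + 1)
                : pathLength(example, node.right, depth + 1);
    }

    // average_path: average path length over all trees in the forest.
    static double averagePath(double[] example, List<IsolationTreeSketch.Node> forest) {
        return forest.stream()
                .mapToDouble(tree -> pathLength(example, tree, 0))
                .average().getAsDouble();
    }

    // normalized_score: score = 2^(-l/c), where l is the average path length
    // and c is the theoretical depth of a tree.
    static double normalizedScore(double averagePathLength, double theoreticalDepth) {
        return Math.pow(2, -averagePathLength / theoreticalDepth);
    }
}
```

With this normalization, short average paths (easily isolated examples) yield scores close to 1, while long average paths yield scores closer to 0.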
For more details, please have a look at the original paper: https://cs.nju.edu.cn/zhouzh/zhouzh.files/publication/icdm08b.pdf
Input
example set
The input ExampleSet.
Output
example set
The resulting ExampleSet with the anomaly scores.
model
The Isolation Forest model, which can be applied to a different data set.
Parameters
Number of trees
The number of trees in the forest.
Max leaf size
Maximal number of examples in a leaf. Used as a stopping criterion.
Score calculation
The option used to calculate the anomaly score (average_path or normalized_score).