Detect Outlier (LOF)
Synopsis
This operator identifies outliers in the given ExampleSet based on local outlier factors (LOF). The LOF is based on a concept of a local density, where locality is given by the
knearest neighbors, whose distance is used to estimate the density. By comparing the local density of an object to the local densities of its neighbors, one can identify regions of similar density, and points that have a substantially lower density than their neighbors. These are considered to be outliers
Description
这个操作符执行LOF离群值搜索。LOF outliers or outliers with a local outlier factor per object are density based outliers according to Breunig, Kriegel, et al. As indicated by the name, the local outlier factor is based on a concept of a local density, where locality is given byknearest neighbors, whose distance is used to estimate the density. By comparing the local density of an object to the local densities of its neighbors, one can identify regions of similar density, and points that have a substantially lower density than their neighbors. These are considered to be outliers. The local density is estimated by the typical distance at which a point can be 'reached' from its neighbors. The definition of 'reachability distance' used in LOF is an additional measure to produce more stable results within clusters.
The approach to find the outliers is based on measuring the density of objects and its relation to each other (referred to as local reachability density). Based on the average ratio of the local reachability density of an object and its k-nearest neighbors (i.e. the objects in its k-distance neighborhood), a local outlier factor (LOF) is computed. The approach takes a parameterMinPts(actually specifying the 'k') and it uses the maximum LOFs for objects in aMinPtsrange (lower bound and upper bound toMinPts).
This operator supports cosine, inverted cosine, angle and squared distance in addition to the usual euclidian distance which can be specified by thedistance functionparameter. In the first step, the objects are grouped into containers. For each object, using a radius screening of all other objects, all the available distances between that object and another object (or group of objects) on the same radius given by the distance are associated with a container. That container then has the distance information as well as the list of objects within that distance (usually only a few) and the information about how many objects are in the container.
In the second step, three things are done:
- The containers for each object are counted in ascending order according to the cardinality of the object list within the container (= that distance) to find the k-distances for each object and the objects in that k-distance (all objects in all the subsequent containers with a smaller distance).
- Using this information, the local reachability densities are computed by using the maximum of the actual distance and the k-distance for each object pair (object and objects in k-distance) and averaging it by the cardinality of the k-neighborhood and then taking the reciprocal value.
- The LOF is computed for eachMinPtsvalue in the range (actually for all up to upper bound) by averaging the ratio between the MinPts-local reachability-density of all objects in the k-neighborhood and the object itself. The maximum LOF in theMinPtsrange is passed as final LOF to each object.
An outlier is an example that is numerically distant from the rest of the examples of the ExampleSet. An outlying example is one that appears to deviate markedly from other examples of the ExampleSet. Outliers are often (not always) indicative of measurement error. In this case such examples should be discarded.
Input
example set input
This input port expects an ExampleSet. It is the output of the Generate Data operator in the attached Example Process. The output of other operators can also be used as input.
Output
example set output
A new attribute 'outlier' is added to the given ExampleSet which is then delivered through this output port.
original
The ExampleSet that was given as input is passed without changing to the output through this port. This is usually used to reuse the same ExampleSet in further operators or to view the ExampleSet in the Results Workspace.
Parameters
Minimal points lower bound
This parameter specifies the lower bound forMinPtsfor the Outlier test.
Minimal points upper bound
This parameter specifies the upper bound forMinPtsfor the Outlier test.
Distance function
This parameter specifies the distance function that will be used for calculating the distance between two objects.