
Detect Outlier (Distances)

Synopsis

This operator identifies n outliers in the given ExampleSet based on the distance to their k nearest neighbors. The variables n and k can be specified through parameters.

Description

This operator performs outlier search according to the outlier detection approach recommended by Ramaswamy, Rastogi and Shim in "Efficient Algorithms for Mining Outliers from Large Data Sets". Their paper proposes a formulation for distance-based outliers that is based on the distance of a point from its k-th nearest neighbor. Each point is ranked by its distance to its k-th nearest neighbor, and the top n points in this ranking are declared to be outliers. The values of k and n can be specified by the number of neighbors and number of outliers parameters respectively. This search builds on the simple and intuitive distance-based definition of outliers by Knorr and Ng, which in simple words is: 'A point p in a data set is an outlier with respect to the parameters k and d if no more than k points in the data set are at a distance of d or less from p'.
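The ranking described above can be sketched in a few lines of Python. This is a minimal brute-force illustration, not the operator's actual implementation: the function names `kth_neighbor_distance` and `top_n_outliers` are hypothetical, the data is made up, and Euclidean distance is assumed.

```python
import math

def kth_neighbor_distance(points, i, k):
    """Euclidean distance from points[i] to its k-th nearest neighbor."""
    dists = sorted(
        math.dist(points[i], q) for j, q in enumerate(points) if j != i
    )
    return dists[k - 1]

def top_n_outliers(points, k, n):
    """Indices of the n points with the largest k-th neighbor distance."""
    ranked = sorted(
        range(len(points)),
        key=lambda i: kth_neighbor_distance(points, i, k),
        reverse=True,
    )
    return set(ranked[:n])

# Hypothetical data: a tight cluster of four points plus two distant points.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (5.0, 5.0), (-4.0, 6.0)]
outliers = top_n_outliers(points, k=2, n=2)
```

Here the two distant points (indices 4 and 5) rank highest, because their second-nearest neighbor is far away while every cluster point has two neighbors close by. The brute-force search compares every pair of points; the cited paper is concerned with doing this efficiently on large data sets, which this sketch does not attempt.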

This operator adds a new boolean attribute named 'outlier' to the given ExampleSet. If the value of this attribute is true, the example is an outlier; if it is false, the example is not. Exactly n examples will have the value true in the 'outlier' attribute (where n is the value specified in the number of outliers parameter). Different distance functions are supported by this operator. The desired distance function can be selected by the distance function parameter.
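As a rough illustration of the resulting data, the following Python sketch adds a boolean 'outlier' field to a list of records. It assumes each example already has a score (its distance to the k-th nearest neighbor). The helper `flag_outliers`, the attribute name `att1`, and the scores are hypothetical and chosen only for the example.

```python
def flag_outliers(examples, scores, n):
    """Return copies of the examples with a boolean 'outlier' attribute added.

    The n examples with the highest scores get outlier=True, all others False.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    top = set(order[:n])
    return [{**ex, "outlier": i in top} for i, ex in enumerate(examples)]

examples = [{"att1": 0.0}, {"att1": 0.2}, {"att1": 9.5}, {"att1": 0.1}]
scores = [0.2, 0.2, 9.3, 0.1]  # hypothetical k-th neighbor distances
flagged = flag_outliers(examples, scores, n=1)
```

The input records are left untouched and new records are returned, which loosely mirrors the operator delivering both the modified ExampleSet and the original one.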

An outlier is an example that is numerically distant from the rest of the examples of the ExampleSet, that is, one that appears to deviate markedly from the other examples. Outliers often, though not always, indicate measurement error; in that case, such examples should be discarded.

Input

example set input

This input port expects an ExampleSet. It is the output of the Generate Data operator in the attached Example Process. The output of other operators can also be used as input.

Output

example set output

A new boolean attribute 'outlier' is added to the given ExampleSet and the ExampleSet is delivered through this output port.

original

The ExampleSet that was given as input is passed through this port to the output unchanged. This is usually used to reuse the same ExampleSet in further operators or to view the ExampleSet in the Results Workspace.

Parameters

Number of neighbors

This parameter specifies the value of k, that is, which k-th nearest neighbor is analyzed. The minimum and maximum values for this parameter are 1 and 1 million respectively.

Number of outliers

This parameter specifies n, the number of top-ranked outliers to look for. The resulting ExampleSet will have n examples that are considered outliers. The minimum and maximum values for this parameter are 2 and 1 million respectively.

Distance function

This parameter specifies the distance function that will be used for calculating the distance between two examples.
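To illustrate why the choice of distance function matters, here is a small Python sketch in which the distance is passed in as a function. The two functions shown, Euclidean distance and cosine distance, are examples chosen for illustration; this page does not list the operator's available options, so they should not be read as its actual choices. All names and data are hypothetical.

```python
import math

def euclidean(a, b):
    """Straight-line distance between two points."""
    return math.dist(a, b)

def cosine_distance(a, b):
    """1 minus the cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return 1.0 - dot / (math.hypot(*a) * math.hypot(*b))

def kth_neighbor_distance(points, i, k, distance):
    """Distance from points[i] to its k-th nearest neighbor under `distance`."""
    dists = sorted(
        distance(points[i], q) for j, q in enumerate(points) if j != i
    )
    return dists[k - 1]

# Three points along the same direction and one pointing elsewhere.
points = [(1.0, 1.0), (2.0, 2.0), (3.0, 3.0), (1.0, -1.0)]
d_euclid = kth_neighbor_distance(points, 3, 1, euclidean)
d_cosine = kth_neighbor_distance(points, 3, 1, cosine_distance)
```

Under cosine distance the first three points are practically identical, since they point in the same direction, so only the last point stands out. Under Euclidean distance the points are spread out, and the ranking depends on their actual positions.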