Detect Outlier (COF)

Synopsis

This operator identifies outliers in the given ExampleSet based on the Class Outlier Factors (COF).

Description

The main concept of the ECODB (Enhanced Class Outlier - Distance Based) algorithm is to rank each instance in the ExampleSet given the parameters N (top N class outliers) and K (the number of nearest neighbors). The rank of each instance is found using the formula:

COF = PCL(T,K) - norm(Deviation(T)) + norm(KDist(T))

  • PCL(T,K) is the Probability of the Class Label of the instance T with respect to the class labels of its K nearest neighbors.
  • norm(Deviation(T)) and norm(KDist(T)) are the normalized values of Deviation(T) and KDist(T) respectively, and their values fall in the range [0, 1].
  • Deviation(T) is how much the instance T deviates from instances of the same class. It is computed by summing the distances between the instance T and every instance belonging to the same class.
  • KDist(T) is the sum of the distances between the instance T and its K nearest neighbors.
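The formula and its four terms can be sketched in NumPy as follows. This is only an illustration, not the operator's implementation: the function names are hypothetical, Euclidean distance is assumed as the measure, min-max scaling is assumed for norm(), and PCL(T,K) is assumed to be the share of the K nearest neighbors (excluding T itself) that carry T's label.

```python
import numpy as np

def min_max(values):
    """Scale values to [0, 1]; assumed form of norm()."""
    span = values.max() - values.min()
    if span == 0:
        return np.zeros_like(values)
    return (values - values.min()) / span

def class_outlier_factors(X, y, k):
    """COF = PCL(T,K) - norm(Deviation(T)) + norm(KDist(T)) per row."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)

    # Pairwise Euclidean distances (assumed distance measure).
    dist = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1))

    # Indices of the k nearest neighbors, excluding the instance itself.
    masked = dist.copy()
    np.fill_diagonal(masked, np.inf)
    neighbors = np.argsort(masked, axis=1)[:, :k]

    # PCL(T,K): fraction of the k neighbors that share T's class label.
    pcl = (y[neighbors] == y[:, None]).mean(axis=1)

    # Deviation(T): sum of distances to every instance of the same class.
    same_class = y[:, None] == y[None, :]
    deviation = (dist * same_class).sum(axis=1)

    # KDist(T): sum of distances to the k nearest neighbors.
    kdist = np.take_along_axis(dist, neighbors, axis=1).sum(axis=1)

    return pcl - min_max(deviation) + min_max(kdist)

# Toy data: the last point is labelled "b" but sits inside the "a" cluster.
X = [[0, 0], [0, 1], [1, 0], [1, 1],
     [10, 10], [10, 11], [11, 10], [0.5, 0.5]]
y = ["a", "a", "a", "a", "b", "b", "b", "b"]
cof = class_outlier_factors(X, y, k=3)
```

In this toy set the mislabelled last point receives the smallest COF value, which is consistent with reading low values as strong class outliers.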

This operator adds a new boolean attribute named 'outlier' to the given ExampleSet. If the value of this attribute is true, the example is an outlier; otherwise it is not. A second special attribute, 'COF Factor', is also added to the ExampleSet. It measures the degree to which an example is a class outlier.

An outlier is an example that is numerically distant from the rest of the examples of the ExampleSet, i.e. one that appears to deviate markedly from them. Outliers are often (but not always) indicative of measurement error. When that is the case, such examples should be discarded.

Input

example set input

This input port expects an ExampleSet. It is the output of the Generate Data operator in the attached Example Process. The output of other operators can also be used as input.

Output

example set output

A new boolean attribute 'outlier' and a real attribute 'COF Factor' are added to the given ExampleSet, and the ExampleSet is delivered through this output port.

original

The ExampleSet that was given as input is passed through to this output port unchanged. This is usually used to reuse the same ExampleSet in further operators or to view the ExampleSet in the Results Workspace.

Parameters

Number of neighbors

This parameter specifies the value of k, the number of nearest neighbors to be analyzed. The minimum and maximum values for this parameter are 1 and 1 million respectively.

Number of class outliers

This parameter specifies n, the number of top class outliers to look for. The resulting ExampleSet will have n examples that are considered outliers. The minimum and maximum values for this parameter are 2 and 1 million respectively.
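As a rough sketch of how a top-n selection could work, the snippet below marks the n examples with the lowest scores as outliers. The function name is hypothetical, and the ranking direction (lower COF means a stronger class outlier) is an assumption that this page does not state explicitly.

```python
import numpy as np

def flag_top_n_outliers(cof, n):
    """Return a boolean array that is True for the n lowest COF values."""
    cof = np.asarray(cof, dtype=float)
    # Ascending order; assumes lower COF means more outlying.
    order = np.argsort(cof, kind="stable")
    flags = np.zeros(cof.shape[0], dtype=bool)
    flags[order[:n]] = True
    return flags

# Hypothetical scores for four examples.
flags = flag_top_n_outliers([0.7, -1.0, 1.2, 0.1], n=2)
```

Exactly n entries of the result are true, matching the statement that n examples are reported as outliers.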

Measure types

This parameter selects the type of measure used for measuring the distance between points. The following options are available: mixed measures, nominal measures, numerical measures and Bregman divergences.

Mixed measure

This parameter is available when the measure type parameter is set to 'mixed measures'. The only available option is the 'Mixed Euclidean Distance'.
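To give an intuition for a mixed distance, the sketch below combines squared differences for numerical attributes with a 0/1 mismatch penalty for nominal attributes. This is an assumed definition for illustration only; the page does not spell out how 'Mixed Euclidean Distance' is computed, and the function name is hypothetical.

```python
import math

def mixed_euclidean(a, b):
    """Euclidean-style distance over numerical and nominal values."""
    total = 0.0
    for u, v in zip(a, b):
        if isinstance(u, (int, float)) and isinstance(v, (int, float)):
            # Numerical attribute: squared difference.
            total += (u - v) ** 2
        else:
            # Nominal attribute: 0 if equal, 1 otherwise (assumption).
            total += 0.0 if u == v else 1.0
    return math.sqrt(total)

same_colour = mixed_euclidean([0.0, "red"], [3.0, "red"])
other_colour = mixed_euclidean([1.0, "red"], [4.0, "blue"])
```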

Nominal measure

This parameter is available when the measure type parameter is set to 'nominal measures'. This option cannot be applied if the input ExampleSet has numerical attributes. In this case the 'numerical measure' option should be selected.

Numerical measure

This parameter is available when the measure type parameter is set to 'numerical measures'. This option cannot be applied if the input ExampleSet has nominal attributes. In this case the 'nominal measure' option should be selected.

Divergence

This parameter is available when the measure type parameter is set to 'Bregman divergences'.

Kernel type

This parameter is only available when the numerical measure parameter is set to 'Kernel Euclidean Distance'. The type of the kernel function is selected through this parameter. The following kernel types are supported:

  • dot: The dot kernel is defined by k(x,y) = x*y, i.e. it is the inner product of x and y.
  • radial: The radial kernel is defined by exp(-g ||x-y||^2), where g is the gamma that is specified by the kernel gamma parameter. The adjustable parameter gamma plays a major role in the performance of the kernel, and should be carefully tuned to the problem at hand.
  • polynomial: The polynomial kernel is defined by k(x,y) = (x*y+1)^d, where d is the degree of the polynomial, specified by the kernel degree parameter. Polynomial kernels are well suited for problems where all the training data is normalized.
  • neural: The neural kernel is defined by a two-layered neural net tanh(a x*y+b), where a is alpha and b is the intercept constant. These parameters can be adjusted using the kernel a and kernel b parameters. A common value for alpha is 1/N, where N is the data dimension. Note that not all choices of a and b lead to a valid kernel function.
  • sigmoid: This is the sigmoid kernel. Please note that the sigmoid kernel is not valid under some parameters.
  • anova: This is the anova kernel. It has adjustable parameters gamma and degree.
  • epachnenikov: The Epanechnikov kernel is the function (3/4)(1-u^2) for u between -1 and 1 and zero for u outside that range. It has two adjustable parameters, kernel sigma1 and kernel degree.
  • gaussian_combination: This is the gaussian combination kernel. It has adjustable parameters kernel sigma1, kernel sigma2 and kernel sigma3.
  • multiquadric: The multiquadric kernel is defined by the square root of ||x-y||^2 + c^2. It has adjustable parameters kernel sigma1 and kernel sigma shift.
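The kernels that are given in closed form above can be written down directly. The sketch below covers dot, radial, polynomial and neural, and then derives a distance from a kernel using the standard feature-space identity d(x,y) = sqrt(k(x,x) - 2 k(x,y) + k(y,y)). That this identity is what 'Kernel Euclidean Distance' computes is an assumption, and all function names are hypothetical.

```python
import numpy as np

def dot_kernel(x, y):
    """k(x,y) = x*y, the inner product."""
    return float(np.dot(x, y))

def radial_kernel(x, y, gamma=1.0):
    """k(x,y) = exp(-gamma * ||x-y||^2)."""
    diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return float(np.exp(-gamma * np.dot(diff, diff)))

def polynomial_kernel(x, y, degree=2):
    """k(x,y) = (x*y + 1)^degree."""
    return float((np.dot(x, y) + 1.0) ** degree)

def neural_kernel(x, y, a=1.0, b=0.0):
    """k(x,y) = tanh(a * x*y + b); not a valid kernel for every a, b."""
    return float(np.tanh(a * np.dot(x, y) + b))

def kernel_distance(kernel, x, y):
    """Distance induced by a kernel in its feature space (assumed form)."""
    squared = kernel(x, x) - 2.0 * kernel(x, y) + kernel(y, y)
    # Guard against tiny negative values caused by rounding.
    return float(np.sqrt(max(squared, 0.0)))

x = np.array([1.0, 2.0])
y = np.array([4.0, 6.0])
d_dot = kernel_distance(dot_kernel, x, y)
d_rbf = kernel_distance(lambda u, v: radial_kernel(u, v, gamma=0.5), x, y)
```

With the dot kernel the induced distance reduces to the ordinary Euclidean distance, which is a convenient sanity check for the identity.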

Kernel gamma

This is the SVM kernel parameter gamma. This parameter is only available when the numerical measure parameter is set to 'Kernel Euclidean Distance' and the kernel type parameter is set to radial or anova.

Kernel sigma1

This is the SVM kernel parameter sigma1. This parameter is only available when the numerical measure parameter is set to 'Kernel Euclidean Distance' and the kernel type parameter is set to epachnenikov, gaussian combination or multiquadric.

Kernel sigma2

This is the SVM kernel parameter sigma2. This parameter is only available when the numerical measure parameter is set to 'Kernel Euclidean Distance' and the kernel type parameter is set to gaussian combination.

Kernel sigma3

This is the SVM kernel parameter sigma3. This parameter is only available when the numerical measure parameter is set to 'Kernel Euclidean Distance' and the kernel type parameter is set to gaussian combination.

Kernel shift

This is the SVM kernel parameter shift. This parameter is only available when the numerical measure parameter is set to 'Kernel Euclidean Distance' and the kernel type parameter is set to multiquadric.

Kernel degree

This is the SVM kernel parameter degree. This parameter is only available when the numerical measure parameter is set to 'Kernel Euclidean Distance' and the kernel type parameter is set to polynomial, anova or epachnenikov.

Kernel a

This is the SVM kernel parameter a. This parameter is only available when the numerical measure parameter is set to 'Kernel Euclidean Distance' and the kernel type parameter is set to neural.

Kernel b

This is the SVM kernel parameter b. This parameter is only available when the numerical measure parameter is set to 'Kernel Euclidean Distance' and the kernel type parameter is set to neural.