Detect Outlier (Clustering)
Synopsis
This operator allows you to use cluster-based methods for anomaly detection. It currently supports CBLOF, CMGOS and LDCOF.
Description
CBLOF (Cluster-Based Local Outlier Factor): Calculates the outlier score based on the cluster-based local outlier factor proposed by He et al. [2003]. CBLOF takes as input the data set and the cluster model that was generated by a clustering algorithm. It categorizes the clusters into small and large clusters using the parameters alpha and beta. The anomaly score is then calculated based on the size of the cluster the point belongs to as well as the distance to the nearest large cluster centroid. By default, the CBLOF score is weighted by the cluster sizes, as proposed in the original publication. Since this might lead to unexpected behavior (outliers close to small clusters are not found), the weighting can be disabled, in which case outlier scores are computed solely from the distance to the cluster center.
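The scoring described above can be sketched as follows. This is a minimal illustration, not the operator's implementation; the function name `cblof_scores` and the exact large/small split loop are assumptions based on the description (large clusters cover at least an alpha fraction of the data, or the size ratio to the next smaller cluster exceeds beta).

```python
import numpy as np

def cblof_scores(X, labels, centroids, alpha=0.9, beta=5.0, use_weights=True):
    """Sketch of CBLOF scoring (He et al., 2003); hypothetical helper."""
    n = len(X)
    sizes = np.bincount(labels, minlength=len(centroids))
    order = np.argsort(sizes)[::-1]  # clusters sorted by size, descending
    # Mark clusters as "large" until they cover >= alpha of the data,
    # or until the size ratio to the next cluster exceeds beta.
    large, covered = set(), 0
    for i, c in enumerate(order):
        large.add(c)
        covered += sizes[c]
        nxt = order[i + 1] if i + 1 < len(order) else None
        if covered >= alpha * n or (nxt is not None and sizes[c] / max(sizes[nxt], 1) >= beta):
            break
    scores = np.empty(n)
    for idx, x in enumerate(X):
        c = labels[idx]
        if c in large:
            d = np.linalg.norm(x - centroids[c])
        else:
            # points of small clusters use the nearest *large* centroid
            d = min(np.linalg.norm(x - centroids[l]) for l in large)
        scores[idx] = sizes[c] * d if use_weights else d
    return scores
```

With `use_weights=False` the score degenerates to the plain distance to the relevant centroid, which is the unweighted variant mentioned above.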
CMGOS (Clustering-based Multivariate Gaussian Outlier Score): Calculates the outlier score based on a clustering result. The outlier score of an instance depends on how likely its distance to the cluster center is. This algorithm takes as input a clustered data set and a cluster model containing its centroids. An outlier score is then calculated on the basis of the centroid and the multivariate Gaussian of the cluster. Therefore, a covariance matrix of the multivariate Gaussian of each cluster is computed. Since covariance matrices are sensitive to outliers, different robust estimators exist, and this operator offers different strategies: (1) Compute the covariance matrix, remove outliers according to the expected percentage and recompute the covariance matrix. Basically this can be understood as a multivariate Grubbs' test. To cope with the challenge of non-invertible matrices, two sub-strategies have been implemented: (1a) Reduction, which reduces the number of dimensions by selecting for each cluster only dimensions that have at least two different values, and (1b) Regularization, which regularizes the covariance matrix with lambda (c.f. Friedman, J.H. (1989): Regularized Discriminant Analysis). (2) A robust covariance estimation according to the Minimum Covariance Determinant (MCD) by Rousseeuw and Van Driessen, as described in "A Fast Algorithm for the Minimum Covariance Determinant Estimator", 1999. Although fastMCD was implemented, this algorithm is comparably slow.
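Strategy (1) above, remove-and-recompute, can be sketched as follows. This is a simplified illustration under assumed names (`cmgos_scores` is hypothetical): it scores each instance by the squared Mahalanobis distance to its own cluster centroid, removes the expected outlier fraction, and recomputes the covariance. It uses a pseudo-inverse as a shortcut around singular matrices rather than the operator's reduction/regularization/MCD strategies.

```python
import numpy as np

def cmgos_scores(X, labels, centroids, normal_prob=0.975):
    """Sketch of CMGOS strategy (1): multivariate-Grubbs-style re-estimation."""
    scores = np.empty(len(X))
    for c, mu in enumerate(centroids):
        pts = X[labels == c]
        cov = np.cov(pts, rowvar=False)
        inv = np.linalg.pinv(cov)  # pinv sidesteps non-invertible matrices
        # squared Mahalanobis distance of each point to the centroid
        d2 = np.einsum('ij,jk,ik->i', pts - mu, inv, pts - mu)
        # drop the expected outlier fraction, then recompute the covariance
        keep = pts[d2 <= np.quantile(d2, normal_prob)]
        inv = np.linalg.pinv(np.cov(keep, rowvar=False))
        scores[labels == c] = np.einsum('ij,jk,ik->i', pts - mu, inv, pts - mu)
    return scores
```

The final score is still computed for all points, including the removed ones; only the covariance estimate ignores them.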
LDCOF (Local Density Cluster-Based Outlier Factor): This is a local density based anomaly detection algorithm. The anomaly score is set to the distance to the nearest large cluster divided by the average cluster distance of that large cluster. The intuition behind this is that small clusters are considered outlying; their points are therefore assigned to the nearest large cluster, which becomes their local neighborhood. The division into large and small clusters can either be done similarly to what was implemented in the CBLOF paper (He et al., 2003) or in a manner similar to what was proposed by Moh'd Belal Al-Zoubi (2009). This is determined by the parameter "divide clusters like cblof".
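The LDCOF normalization can be sketched as follows, using the gamma-based split (a cluster counts as small when its size is below gamma times the average cluster size). The function name `ldcof_scores` is an assumption for illustration; the CBLOF-style alpha/beta split is omitted.

```python
import numpy as np

def ldcof_scores(X, labels, centroids, gamma=0.3):
    """Sketch of LDCOF: distance to the nearest large cluster's centroid,
    divided by that cluster's average member-to-centroid distance."""
    sizes = np.bincount(labels, minlength=len(centroids))
    large = [c for c in range(len(centroids)) if sizes[c] >= gamma * sizes.mean()]
    # average distance of each large cluster's members to its own centroid
    avg = {c: np.mean(np.linalg.norm(X[labels == c] - centroids[c], axis=1))
           for c in large}
    scores = np.empty(len(X))
    for i, x in enumerate(X):
        c = labels[i]
        if c in large:
            scores[i] = np.linalg.norm(x - centroids[c]) / avg[c]
        else:
            # small-cluster points are scored against the nearest large cluster
            d, l = min((np.linalg.norm(x - centroids[l]), l) for l in large)
            scores[i] = d / avg[l]
    return scores
```

A score around 1 thus means "as far from the centroid as a typical member of the local large cluster", and scores well above 1 indicate anomalies.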
Input
exa
The example set you want to run the algorithm on.
mod
A cluster model, e.g. from k-means clustering.
Output
exa
The scored example set.
mod
An anomaly model which can be used to apply this model on new data.
clu
The initial clustering model passed through.
Parameters
Algorithm
Defines which algorithm to use. Currently CBLOF, CMGOS or LDCOF.
Alpha
Specifies the percentage of normal data.
Beta
The minimum ratio between a normal and an anomalous cluster.
Use cluster size as weighting factor
Uses the cluster size as a weighting factor, as proposed in the original publication.
Divide clusters like cblof
If set to true, alpha and beta are used to divide the clusters (as in CBLOF) instead of gamma.
Gamma (ldcof)
The ratio between the maximum size of small clusters and the average cluster size. Small clusters are removed.
Lambda
Lambda for regularization (see Friedman). A lambda of 0.0 means QDA (each cluster has its own covariance matrix) and a lambda of 1.0 means LDA (a global covariance matrix).
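The QDA/LDA interpolation can be written as a simple convex blend. This is a simplified sketch of the Friedman-style regularization (the helper name `regularized_cov` is an assumption, and normalization details of the original paper are omitted):

```python
import numpy as np

def regularized_cov(cluster_cov, global_cov, lam):
    """Blend the per-cluster and global covariance matrices.

    lam = 0.0 keeps the cluster's own covariance (QDA-like),
    lam = 1.0 uses only the global covariance (LDA-like).
    """
    return (1.0 - lam) * cluster_cov + lam * global_cov
```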
Covariance estimation
The algorithm to estimate the covariance matrices. Reduction is the simplest method, whereas the other two are more complex. Details can be found in the papers (see operator description).
H (non-outlier instances)
This parameter specifies the number of samples (non-outliers) used for the fastMCD/MCD computation. If set to -1, it is computed automatically according to the 'probability for normal class'. Friedman et al. recommend using 75% of the examples as a good estimate. The upper bound is the number of examples and the lower bound is (number of examples + dimensions + 1)/2. Values exceeding these limits are replaced by the limit.
Number of subsets
Defines the number of subsets used in fastMCD. Friedman recommends using at most 5 subsets.
Threshold for fastmcd
If the number of examples in the dataset exceeds the threshold, fastMCD will be applied instead of MCD (complete search). Not recommended to be higher than 600 due to computational issues.
Iterations
Number of iterations for computing the MCD. 100-500 might be a good choice.
Number of threads
The number of threads for the computation.
Times to remove outlier
The number of times outliers should be removed for the minimum covariance determinant computation.
Probability for normal class
This is the expected probability of normal data instances. Usually a value between 0.95 and 1.0 is sensible.
Limit computations
Limits the number of instances used to calculate the covariance matrix. Should be used for very large clusters. Instances are sampled randomly.
Maximum
Maximum number of instances for the covariance matrix calculation.
Parallelize evaluation process
Specifies that the evaluation process should be performed in parallel.
Gamma
The ratio between the maximum size of small clusters and the average cluster size. Small clusters are removed.