Detect Outlier (Clustering)
Synopsis
This operator allows you to use cluster-based methods for anomaly detection. It currently supports CBLOF, CMGOS and LDCOF.
Description
CBLOF (Cluster-Based Local Outlier Factor): Calculates the outlier score based on the cluster-based local outlier factor proposed by He et al. [2003]. CBLOF takes as input the data set and the cluster model that was generated by a clustering algorithm. It categorizes the clusters into small and large clusters using the parameters alpha and beta. The anomaly score is then calculated based on the size of the cluster the point belongs to as well as the distance to the nearest large cluster centroid. By default, the CBLOF score is weighted by the cluster sizes, as proposed in the original publication. Since this might lead to unexpected behavior (outliers close to small clusters are not found), the weighting can be disabled, in which case outlier scores are computed solely from the distance to the cluster center.
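The scoring described above can be sketched as follows. This is a minimal illustration, not the operator's implementation; the function name `cblof_scores` and the exact large/small split loop are assumptions based on the description (large clusters cover at least an alpha fraction of the data, or the size ratio to the next smaller cluster exceeds beta).

```python
import numpy as np

def cblof_scores(X, labels, centroids, alpha=0.9, beta=5.0, use_weights=True):
    """Sketch of CBLOF scoring (He et al., 2003); hypothetical helper."""
    n = len(X)
    sizes = np.bincount(labels, minlength=len(centroids))
    order = np.argsort(sizes)[::-1]  # clusters sorted by size, descending
    # Mark clusters as "large" until they cover >= alpha of the data,
    # or until the size ratio to the next cluster exceeds beta.
    large, covered = set(), 0
    for i, c in enumerate(order):
        large.add(c)
        covered += sizes[c]
        nxt = order[i + 1] if i + 1 < len(order) else None
        if covered >= alpha * n or (nxt is not None and sizes[c] / max(sizes[nxt], 1) >= beta):
            break
    scores = np.empty(n)
    for idx, x in enumerate(X):
        c = labels[idx]
        if c in large:
            d = np.linalg.norm(x - centroids[c])
        else:
            # points of small clusters use the nearest *large* centroid
            d = min(np.linalg.norm(x - centroids[l]) for l in large)
        scores[idx] = sizes[c] * d if use_weights else d
    return scores
```

With `use_weights=False` the score degenerates to the plain distance to the relevant centroid, which is the unweighted variant mentioned above.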
CMGOS (Clustering-based Multivariate Gaussian Outlier Score): Calculates the outlier score based on a clustering result. The outlier score of an instance depends on how likely its distance to the cluster center is. This algorithm takes as input a clustered data set and a cluster model containing its centroids. An outlier score is then calculated on the basis of the centroid and the multivariate Gaussian of the cluster. Therefore, a covariance matrix of the multivariate Gaussian of each cluster is computed. Since covariance matrices are sensitive to outliers, different robust estimators exist, and this operator offers different strategies: (1) Compute the covariance matrix, remove outliers according to the expected percentage and recompute the covariance matrix. Basically this can be understood as a multivariate Grubbs' test. To cope with the challenge of non-invertible matrices, two sub-strategies have been implemented: (1a) Reduction, which reduces the number of dimensions by selecting for each cluster only dimensions that have at least two different values, and (1b) Regularization, which regularizes the covariance matrix with lambda (c.f. Friedman, J.H. (1989): Regularized Discriminant Analysis). (2) A robust covariance estimation according to the Minimum Covariance Determinant (MCD) by Rousseeuw and Van Driessen, as described in "A Fast Algorithm for the Minimum Covariance Determinant Estimator", 1999. Although fastMCD was implemented, this algorithm is comparably slow.
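Strategy (1) above, remove-and-recompute, can be sketched as follows. This is a simplified illustration under assumed names (`cmgos_scores` is hypothetical): it scores each instance by the squared Mahalanobis distance to its own cluster centroid, removes the expected outlier fraction, and recomputes the covariance. It uses a pseudo-inverse as a shortcut around singular matrices rather than the operator's reduction/regularization/MCD strategies.

```python
import numpy as np

def cmgos_scores(X, labels, centroids, normal_prob=0.975):
    """Sketch of CMGOS strategy (1): multivariate-Grubbs-style re-estimation."""
    scores = np.empty(len(X))
    for c, mu in enumerate(centroids):
        pts = X[labels == c]
        cov = np.cov(pts, rowvar=False)
        inv = np.linalg.pinv(cov)  # pinv sidesteps non-invertible matrices
        # squared Mahalanobis distance of each point to the centroid
        d2 = np.einsum('ij,jk,ik->i', pts - mu, inv, pts - mu)
        # drop the expected outlier fraction, then recompute the covariance
        keep = pts[d2 <= np.quantile(d2, normal_prob)]
        inv = np.linalg.pinv(np.cov(keep, rowvar=False))
        scores[labels == c] = np.einsum('ij,jk,ik->i', pts - mu, inv, pts - mu)
    return scores
```

The final score is still computed for all points, including the removed ones; only the covariance estimate ignores them.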
LDCOF (Local Density Cluster-Based Outlier Factor): This is a local density based anomaly detection algorithm. The anomaly score is set to the distance to the nearest large cluster divided by the average cluster distance of that large cluster. The intuition behind this is that small clusters are considered outlying; their points are therefore assigned to the nearest large cluster, which becomes their local neighborhood. The division into large and small clusters can either be done similarly to what was implemented in the CBLOF paper (He et al., 2003) or in a manner similar to what was proposed by Moh'd Belal Al-Zoubi (2009). This is determined by the parameter "divide clusters like cblof".
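The LDCOF normalization can be sketched as follows, using the gamma-based split (a cluster counts as small when its size is below gamma times the average cluster size). The function name `ldcof_scores` is an assumption for illustration; the CBLOF-style alpha/beta split is omitted.

```python
import numpy as np

def ldcof_scores(X, labels, centroids, gamma=0.3):
    """Sketch of LDCOF: distance to the nearest large cluster's centroid,
    divided by that cluster's average member-to-centroid distance."""
    sizes = np.bincount(labels, minlength=len(centroids))
    large = [c for c in range(len(centroids)) if sizes[c] >= gamma * sizes.mean()]
    # average distance of each large cluster's members to its own centroid
    avg = {c: np.mean(np.linalg.norm(X[labels == c] - centroids[c], axis=1))
           for c in large}
    scores = np.empty(len(X))
    for i, x in enumerate(X):
        c = labels[i]
        if c in large:
            scores[i] = np.linalg.norm(x - centroids[c]) / avg[c]
        else:
            # small-cluster points are scored against the nearest large cluster
            d, l = min((np.linalg.norm(x - centroids[l]), l) for l in large)
            scores[i] = d / avg[l]
    return scores
```

A score around 1 thus means "as far from the centroid as a typical member of the local large cluster", and scores well above 1 indicate anomalies.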
Input
exa
The example set you want to run the algorithm on.
mod
A cluster model, e.g. from k-means clustering.
Output
exa
The scored example set.
mod
An anomaly model which can be used to apply this model on new data.
clu
The initial clustering model passed through.
Parameters
Algorithm
Defines which algorithm to use. Currently CBLOF, CMGOS or LDCOF.
Alpha
Specifies the percentage of normal data.
Beta
The minimum ratio between a normal and an anomalous cluster.
Use cluster size as weighting factor
Uses the cluster size as a weighting factor, as proposed in the original publication.
Divide clusters like cblof
If set to true, alpha and beta are used to divide the clusters (as in CBLOF) instead of gamma.
Gamma (ldcof)
The ratio between the maximum size of small clusters and the average cluster size. Small clusters are removed.
Lambda
Lambda for regularization (see Friedman). A lambda of 0.0 means QDA (each cluster has its own covariance matrix) and a lambda of 1.0 means LDA (a global covariance matrix).
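The QDA/LDA interpolation can be written as a simple convex blend. This is a simplified sketch of the Friedman-style regularization (the helper name `regularized_cov` is an assumption, and normalization details of the original paper are omitted):

```python
import numpy as np

def regularized_cov(cluster_cov, global_cov, lam):
    """Blend the per-cluster and global covariance matrices.

    lam = 0.0 keeps the cluster's own covariance (QDA-like),
    lam = 1.0 uses only the global covariance (LDA-like).
    """
    return (1.0 - lam) * cluster_cov + lam * global_cov
```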
Covariance estimation
The algorithm to estimate the covariance matrices. Reduction is the simplest method, whereas the other two are more complex. Details can be found in the papers (see operator description).
H (non-outlier instances)
This parameter specifies the number of samples (non-outliers) used for the fastMCD/MCD computation. If set to -1, it is computed automatically according to the 'probability for normal class'. Friedman et al. recommend using 75% of the examples as a good estimate. The upper bound is the number of examples and the lower bound is (number of examples + dimensions + 1)/2. Values exceeding these limits are replaced by the limit.
Number of subsets
Defines the number of subsets used in fastMCD. Friedman recommends using at most 5 subsets.
Threshold for fastmcd
If the number of examples in the dataset exceeds the threshold, fastMCD will be applied instead of MCD (complete search). Not recommended to be higher than 600 due to computational issues.
Iterations
Number of iterations for computing the MCD. 100-500 might be a good choice.
Number of threads
The number of threads for the computation.
Times to remove outlier
The number of times outliers should be removed for the minimum covariance determinant computation.
Probability for normal class
This is the expected probability of normal data instances. Usually a value between 0.95 and 1.0 is sensible.
Limit computations
Limits the number of instances used to calculate the covariance matrix. Should be used for very large clusters. Instances are sampled randomly.
Maximum
Maximum number of instances for the covariance matrix calculation.
Parallelize evaluation process
Specifies that the evaluation process should be performed in parallel.
Gamma
The ratio between the maximum size of small clusters and the average cluster size. Small clusters are removed.