Expectation Maximization Clustering

Synopsis

This operator performs clustering using the Expectation Maximization algorithm. Clustering is concerned with grouping together objects that are similar to each other and dissimilar to the objects belonging to other clusters. The Expectation Maximization algorithm extends this basic approach by computing probabilities of cluster membership rather than fixed assignments.

Description

The general purpose of clustering is to detect clusters in examples and to assign those examples to the clusters. A typical application for this type of analysis is a marketing research study in which a number of consumer-behavior variables are measured for a large sample of respondents. The purpose of the study is to detect 'market segments', i.e., groups of respondents that are more similar to each other (to all other members of the same cluster) than to respondents that belong to other clusters. In addition to identifying such clusters, it is usually just as important to determine how the clusters differ, i.e., to determine the specific variables or dimensions that vary and how they vary between members of different clusters.

The EM (expectation maximization) technique is similar to the K-Means technique. The basic operation of K-Means clustering algorithms is relatively simple: given a fixed number of k clusters, assign observations to those clusters so that the means across clusters (for all variables) are as different from each other as possible. The EM algorithm extends this basic approach to clustering in an important way:

Instead of assigning examples to clusters to maximize the differences in means for continuous variables, the EM clustering algorithm computes probabilities of cluster memberships based on one or more probability distributions. The goal of the clustering algorithm then is to maximize the overall probability or likelihood of the data, given the (final) clusters.

Expectation Maximization algorithm

The basic approach and logic of this clustering method is as follows. Suppose you measure a single continuous variable in a large sample of observations. Further, suppose that the sample consists of two clusters of observations with different means (and perhaps different standard deviations); within each cluster, the values of the continuous variable follow a normal distribution. The goal of EM clustering is to estimate the means and standard deviations for each cluster so as to maximize the likelihood of the observed data (distribution). Put another way, the EM algorithm attempts to approximate the observed distribution of values by a mixture of different distributions, one per cluster. The results of EM clustering are different from those computed by k-means clustering. The latter assigns observations to clusters to maximize the distances between clusters. The EM algorithm does not compute actual assignments of observations to clusters, but classification probabilities. In other words, each observation belongs to each cluster with a certain probability. As a final result you can usually still review an actual assignment of observations to clusters, based on the largest classification probability.
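The single-variable, two-cluster case described above can be sketched in a few lines of plain Python. This is a minimal illustration of the general EM idea, not the operator's actual implementation; the function names, the initialization at the data extremes, the fixed number of steps, and the synthetic data are all assumptions made for the example.

```python
import math
import random


def normal_pdf(x, mean, std):
    """Density of the normal distribution N(mean, std^2) at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def em_two_gaussians(data, steps=100):
    """Fit a two-cluster, one-variable Gaussian mixture with plain EM."""
    # Crude initialization (an assumption): start at the extremes of the data.
    means = [min(data), max(data)]
    stds = [1.0, 1.0]
    weights = [0.5, 0.5]
    resp = []
    for _ in range(steps):
        # E-step: probability that each observation belongs to each cluster.
        resp = []
        for x in data:
            p = [weights[k] * normal_pdf(x, means[k], stds[k]) for k in range(2)]
            total = sum(p)
            resp.append([pk / total for pk in p])
        # M-step: re-estimate weight, mean and standard deviation per cluster.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            weights[k] = nk / len(data)
            means[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            var = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, data)) / nk
            stds[k] = max(math.sqrt(var), 1e-6)  # guard against a collapsing cluster
    return weights, means, stds, resp


# Synthetic sample: two clusters with different means and standard deviations.
rng = random.Random(0)
data = [rng.gauss(0.0, 1.0) for _ in range(300)]
data += [rng.gauss(6.0, 1.5) for _ in range(300)]

weights, means, stds, resp = em_two_gaussians(data)
print("estimated means:", [round(m, 2) for m in means])
print("estimated stds: ", [round(s, 2) for s in stds])
print("probabilities for first observation:", [round(p, 3) for p in resp[0]])
```

Each row of `resp` holds the classification probabilities of one observation and sums to 1; taking the index of the largest entry gives the final cluster assignment mentioned above.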

Differentiation

k-Means

The K-Means operator performs clustering using the k-means algorithm. k-means clustering is an exclusive clustering algorithm, i.e., each object is assigned to precisely one of a set of clusters. Objects in one cluster are similar to each other. The similarity between objects is based on a measure of the distance between them. The K-Means operator assigns observations to clusters to maximize the distances between clusters. The Expectation Maximization Clustering operator, on the other hand, computes classification probabilities.
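The difference between the two kinds of result can be illustrated with a small sketch. The mixture parameters below are hypothetical values chosen for the example (not output of either operator), and "nearest mean" stands in for the k-means assignment rule on a single variable.

```python
import math


def normal_pdf(x, mean, std):
    """Density of the normal distribution N(mean, std^2) at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def cluster_probabilities(x, weights, means, stds):
    """EM-style result: one membership probability per cluster."""
    p = [w * normal_pdf(x, m, s) for w, m, s in zip(weights, means, stds)]
    total = sum(p)
    return [pk / total for pk in p]


def nearest_mean(x, means):
    """k-means-style result: index of the single closest cluster mean."""
    return min(range(len(means)), key=lambda k: abs(x - means[k]))


# Hypothetical clusters: a narrow one around 0 and a wide one around 6.
weights = [0.5, 0.5]
means = [0.0, 6.0]
stds = [1.0, 3.0]

x = 2.5
probs = cluster_probabilities(x, weights, means, stds)
em_assignment = max(range(len(probs)), key=lambda k: probs[k])
kmeans_assignment = nearest_mean(x, means)

print("probabilities:", [round(p, 3) for p in probs])
print("largest-probability cluster:", em_assignment)
print("nearest-mean cluster:", kmeans_assignment)
```

For this observation the two rules disagree: the value 2.5 is closer to the mean of the narrow cluster, but it is more probable under the wide cluster because that cluster's standard deviation is larger. A distance-only assignment cannot express this, whereas the probabilities can.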

Input

example set

The input port expects an ExampleSet. It is the output of the Retrieve operator in the attached Example Process. The output of other operators can also be used as input.

Output

cluster model

This port delivers the cluster model which has information regarding the clustering performed. It has information about cluster probabilities and cluster means.

clustered set

The ExampleSet that was given as input is passed with minor changes to the output through this port. An attribute with id role is added to the input ExampleSet to distinguish examples. An attribute with cluster role may also be added, depending on the state of the add cluster attribute parameter. If the show probabilities parameter is set to true, one probability column is added for each cluster.

Parameters

K

This parameter specifies the number of clusters to form. There is no hard and fast rule for choosing the number of clusters. Generally, it is preferable to have a small number of clusters with the examples distributed around them in a balanced way, without being too widely scattered.

Add cluster attribute

If enabled, a new attribute with cluster role is generated directly in this operator; otherwise this operator does not add the cluster attribute. In the latter case, you have to use the Apply Model operator to generate the cluster attribute.

Add as label

If true, the cluster id is stored in an attribute with the label role instead of the cluster role (see the add cluster attribute parameter).

Remove unlabeled

If set to true, unlabeled examples are deleted.

Max runs

This parameter specifies the maximal number of runs of this operator to be performed with random initialization.

Max optimization steps

This parameter specifies the maximal number of iterations performed for one run of this operator.

Quality

This parameter specifies the quality that must be fulfilled before the algorithm stops (i.e., the algorithm stops once the increase in the log-likelihood falls below this value).
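How the max runs, max optimization steps and quality parameters could interact is sketched below for a single variable. This is an assumption about a typical EM implementation, not a description of the operator's internals: the function names, the default values, the random choice of starting means, and the rule of keeping the run with the highest log-likelihood are all hypothetical.

```python
import math
import random


def normal_pdf(x, mean, std):
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def log_likelihood(data, weights, means, stds):
    total = 0.0
    for x in data:
        p = sum(w * normal_pdf(x, m, s) for w, m, s in zip(weights, means, stds))
        total += math.log(max(p, 1e-300))  # guard against log(0)
    return total


def em_step(data, weights, means, stds):
    """One E-step plus M-step; returns updated weights, means and stds."""
    k = len(means)
    sum_r, sum_x, sum_xx = [0.0] * k, [0.0] * k, [0.0] * k
    for x in data:
        p = [w * normal_pdf(x, m, s) for w, m, s in zip(weights, means, stds)]
        norm = max(sum(p), 1e-300)
        for j in range(k):
            r = p[j] / norm
            sum_r[j] += r
            sum_x[j] += r * x
            sum_xx[j] += r * x * x
    new_w, new_m, new_s = [], [], []
    for j in range(k):
        nj = max(sum_r[j], 1e-12)
        mean = sum_x[j] / nj
        var = max(sum_xx[j] / nj - mean * mean, 1e-12)
        new_w.append(nj / len(data))
        new_m.append(mean)
        new_s.append(math.sqrt(var))
    return new_w, new_m, new_s


def fit(data, k, max_runs=5, max_optimization_steps=100, quality=1e-10, seed=1):
    rng = random.Random(seed)
    center = sum(data) / len(data)
    spread = math.sqrt(sum((x - center) ** 2 for x in data) / len(data))
    best = None
    for _ in range(max_runs):  # each run starts from a new random initialization
        weights = [1.0 / k] * k
        means = rng.sample(data, k)
        stds = [spread] * k
        ll = log_likelihood(data, weights, means, stds)
        for _step in range(max_optimization_steps):
            weights, means, stds = em_step(data, weights, means, stds)
            new_ll = log_likelihood(data, weights, means, stds)
            improvement = new_ll - ll
            ll = new_ll
            if improvement < quality:  # log-likelihood no longer rising enough
                break
        if best is None or ll > best[0]:
            best = (ll, weights, means, stds)
    return best


rng = random.Random(0)
data = [rng.gauss(0.0, 1.0) for _ in range(300)]
data += [rng.gauss(6.0, 1.5) for _ in range(300)]
best_ll, best_weights, best_means, best_stds = fit(data, k=2)
print("best log-likelihood:", round(best_ll, 2))
print("means:", sorted(round(m, 2) for m in best_means))
```

In this sketch a smaller quality value lets each run iterate longer before stopping, while max optimization steps bounds the cost of a run that converges slowly.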

Use local random seed

This parameter indicates if a local random seed should be used for randomization.

Local random seed

This parameter specifies the local random seed. This parameter is only available if the use local random seed parameter is set to true.

Show probabilities

This parameter indicates if the probabilities for every cluster should be inserted with every example in the ExampleSet.

Initial distribution

This parameter indicates the initial distribution of the centroids.

Correlated attributes

This parameter should be set to true if the ExampleSet contains correlated attributes.
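Why correlation between attributes can matter for a probability-based clustering is illustrated below for two attributes. One common way for a Gaussian mixture to account for correlation is to use a full covariance matrix per cluster instead of a diagonal one; whether this operator does exactly that is an assumption, and the function name and covariance values are hypothetical.

```python
import math


def gaussian_2d_pdf(point, mean, cov):
    """Density of a two-attribute normal distribution with covariance matrix cov."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    dx, dy = point[0] - mean[0], point[1] - mean[1]
    # Quadratic form (x - mean)^T * inverse(cov) * (x - mean) for a 2x2 matrix.
    q = (d * dx * dx - (b + c) * dx * dy + a * dy * dy) / det
    return math.exp(-0.5 * q) / (2.0 * math.pi * math.sqrt(det))


mean = (0.0, 0.0)
full_cov = [[1.0, 0.9], [0.9, 1.0]]  # attributes strongly correlated
diag_cov = [[1.0, 0.0], [0.0, 1.0]]  # correlation ignored

along = (1.5, 1.5)     # lies along the direction of correlation
against = (1.5, -1.5)  # same distance from the mean, but against it

full_along = gaussian_2d_pdf(along, mean, full_cov)
full_against = gaussian_2d_pdf(against, mean, full_cov)
diag_along = gaussian_2d_pdf(along, mean, diag_cov)
diag_against = gaussian_2d_pdf(against, mean, diag_cov)

print("full covariance:    ", full_along, full_against)
print("diagonal covariance:", diag_along, diag_against)
```

With the diagonal covariance both points receive the same density because only their distance from the mean counts. With the full covariance the point that follows the correlation is far more probable than the one that contradicts it, which changes the cluster probabilities computed in the E-step.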