K-Medoids

Synopsis

This operator performs clustering using the k-medoids algorithm. Clustering is concerned with grouping objects together that are similar to each other and dissimilar to the objects belonging to other clusters. Clustering is a technique for extracting information from unlabeled data. k-medoids clustering is an exclusive clustering algorithm, i.e. each object is assigned to precisely one of a set of clusters.

Description

This operator performs clustering using the k-medoids algorithm. K-medoids clustering is an exclusive clustering algorithm i.e. each object is assigned to precisely one of a set of clusters. Objects in one cluster are similar to each other. The similarity between objects is based on a measure of the distance between them.

Clustering is concerned with grouping together objects that are similar to each other and dissimilar to the objects belonging to other clusters. It is a technique for extracting information from unlabeled data and can be very useful in many different scenarios e.g. in a marketing application we may be interested in finding clusters of customers with similar buying behavior.

Here is a simple explanation of how the k-medoids algorithm works. First of all we need to introduce the notion of the center of a cluster. In the k-means algorithm this center is generally called the centroid: assuming that we are using Euclidean distance or something similar as a measure, the centroid of a cluster is the point for which each attribute value is the average of the values of the corresponding attribute for all the points in the cluster. Such a centroid will frequently be an imaginary point, not part of the cluster itself, which we can take to mark its center. In the k-medoids algorithm, by contrast, the center of a cluster (its medoid) will always be one of the points in the cluster. This is the major difference between the k-means and k-medoids algorithms. For more information about the k-means algorithm please see the k-Means operator.

Differentiation

k-Means

In the k-medoids algorithm the center of a cluster (its medoid) will always be one of the points in the cluster. This is the major difference between the k-means and k-medoids algorithms. In the k-means algorithm the centroid of a cluster will frequently be an imaginary point, not part of the cluster itself, which we can take to mark its center.
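The difference can be illustrated with a few lines of Python. This is only a sketch under stated assumptions: the helper names mean_point and medoid are hypothetical, Euclidean distance is assumed, and the medoid is taken to be the cluster member with the smallest total distance to the other members (the usual definition, not necessarily the exact rule this operator applies).

```python
import math

def mean_point(points):
    """Attribute-wise average of the points (the k-means style centroid)."""
    n = len(points)
    dims = len(points[0])
    return tuple(sum(p[i] for p in points) / n for i in range(dims))

def medoid(points):
    """Member of `points` with the smallest total Euclidean distance to all members."""
    return min(points, key=lambda c: sum(math.dist(c, p) for p in points))

# Illustrative data: three nearby points and one far away.
cluster = [(0.0, 0.0), (2.0, 0.0), (1.0, 1.0), (5.0, 5.0)]

print(mean_point(cluster))  # (2.0, 1.5), not a member of the cluster
print(medoid(cluster))      # (1.0, 1.0), an actual member of the cluster
```

The averaged point is pulled towards the distant example and does not coincide with any example, whereas the medoid is by construction one of the examples.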

Input

example set input

The input port expects an ExampleSet. It is the output of the Retrieve operator in the attached Example Process. The output of other operators can also be used as input.

Output

cluster model

This port delivers the cluster model. It has information regarding the clustering performed. It tells which examples are part of which cluster. It also has information regarding centroids of each cluster.

clustered set

The ExampleSet that was given as input is passed with minor changes to the output through this port. An attribute with id role is added to the input ExampleSet to distinguish examples. An attribute with cluster role may also be added depending on the state of the add cluster attribute parameter.

Parameters

Add cluster attribute

If enabled, a new attribute with cluster role is generated directly in this operator; otherwise this operator does not add the cluster attribute. In the latter case you have to use the Apply Model operator to generate the cluster attribute.

Add as label

If true, the cluster id is stored in an attribute with the label role instead of the cluster role (see the add cluster attribute parameter).

Remove unlabeled

If set to true, unlabeled examples are deleted.

K

This parameter specifies the number of clusters to form. There is no hard and fast rule for choosing the number of clusters. Generally, it is preferred to have a small number of clusters with examples distributed around them in a balanced way, without being too scattered.

Max runs

This parameter specifies the maximal number of runs of k-medoids with random initialization that are performed.

Max optimization steps

This parameter specifies the maximal number of iterations performed for one run of k-medoids.
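To make the roles of k, max runs and max optimization steps concrete, here is a minimal Python sketch. It is an assumption-laden illustration, not this operator's implementation: the function name k_medoids and its arguments are hypothetical, Euclidean distance is assumed, and a simple alternating assign/update heuristic is used. Each run starts from k randomly chosen points, performs at most max_optimization_steps iterations, and the run with the lowest total distance is kept.

```python
import math
import random

def total_cost(points, medoids):
    """Sum of distances from each point to its nearest medoid."""
    return sum(min(math.dist(p, m) for m in medoids) for p in points)

def k_medoids(points, k, max_runs=10, max_optimization_steps=100, seed=None):
    rng = random.Random(seed)  # a fixed seed makes the result reproducible
    best_medoids, best_cost = None, math.inf
    for _ in range(max_runs):
        medoids = rng.sample(points, k)  # random initialization of this run
        for _ in range(max_optimization_steps):
            # Assignment step: attach every point to its nearest medoid.
            clusters = [[] for _ in range(k)]
            for p in points:
                nearest = min(range(k), key=lambda i: math.dist(p, medoids[i]))
                clusters[nearest].append(p)
            # Update step: the new medoid is the member with the smallest
            # total distance to its cluster (empty clusters keep their medoid).
            new_medoids = [
                min(c, key=lambda cand: sum(math.dist(cand, q) for q in c))
                if c else medoids[i]
                for i, c in enumerate(clusters)
            ]
            if new_medoids == medoids:  # converged before the step limit
                break
            medoids = new_medoids
        cost = total_cost(points, medoids)
        if cost < best_cost:  # keep the best run seen so far
            best_medoids, best_cost = medoids, cost
    return best_medoids, best_cost

points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
medoids, cost = k_medoids(points, k=2, max_runs=50, seed=1)
print(medoids, cost)
```

A single run can get stuck in a poor local optimum when both initial medoids fall into the same group of points, which is why repeating the procedure with several random initializations is useful.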

Use local random seed

Indicates if a local random seed should be used for randomization. Randomization may be used for selecting k different points at the start of the algorithm as potential centroids.

Local random seed

This parameter specifies the local random seed. This parameter is only available if the use local random seed parameter is set to true.

Measure types

This parameter is used for selecting the type of measure to be used for measuring the distance between points. The following options are available: mixed measures, nominal measures, numerical measures and Bregman divergences.

Mixed measure

This parameter is available when the measure type parameter is set to 'mixed measures'. The only available option is the 'Mixed Euclidean Distance'.

Nominal measure

This parameter is available when the measure type parameter is set to 'nominal measures'. This option cannot be applied if the input ExampleSet has numerical attributes. In this case the 'numerical measure' option should be selected.

Numerical measure

This parameter is available when the measure type parameter is set to 'numerical measures'. This option cannot be applied if the input ExampleSet has nominal attributes. If the input ExampleSet has nominal attributes the 'nominal measure' option should be selected.

Divergence

This parameter is available when the measure type parameter is set to 'bregman divergences'.

Kernel type

This parameter is only available when the numerical measure parameter is set to 'Kernel Euclidean Distance'. The type of the kernel function is selected through this parameter. The following kernel types are supported:

dot: The dot kernel is defined by k(x,y)=x*y, i.e. it is the inner product of x and y.
radial: The radial kernel is defined by exp(-g ||x-y||^2) where g is the gamma that is specified by the kernel gamma parameter. The adjustable parameter gamma plays a major role in the performance of the kernel, and should be carefully tuned to the problem at hand.
polynomial: The polynomial kernel is defined by k(x,y)=(x*y+1)^d where d is the degree of the polynomial and it is specified by the kernel degree parameter. Polynomial kernels are well suited for problems where all the training data is normalized.
neural: The neural kernel is defined by a two layered neural net tanh(a x*y+b) where a is alpha and b is the intercept constant. These parameters can be adjusted using the kernel a and kernel b parameters. A common value for alpha is 1/N, where N is the data dimension. Note that not all choices of a and b lead to a valid kernel function.
sigmoid: This is the sigmoid kernel. Please note that the sigmoid kernel is not valid under some parameters.
anova: This is the anova kernel. It has adjustable parameters gamma and degree.
epachnenikov: The Epanechnikov kernel is the function (3/4)(1-u^2) for u between -1 and 1 and zero for u outside that range. It has two adjustable parameters kernel sigma1 and kernel degree.
gaussian_combination: This is the gaussian combination kernel. It has adjustable parameters kernel sigma1, kernel sigma2 and kernel sigma3.
multiquadric: The multiquadric kernel is defined by the square root of ||x-y||^2 + c^2. It has adjustable parameters kernel sigma1 and kernel shift.
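The kernel formulas that are given explicitly above (dot, radial, polynomial and neural) can be written down directly in Python. The following is a minimal sketch, not the operator's code: all function names are hypothetical, and kernel_distance uses the standard kernel-induced distance sqrt(k(x,x) - 2k(x,y) + k(y,y)). That 'Kernel Euclidean Distance' is computed exactly this way is an assumption.

```python
import math

def dot(x, y):
    """Inner product of two equally long vectors."""
    return sum(a * b for a, b in zip(x, y))

def dot_kernel(x, y):
    return dot(x, y)                          # k(x,y) = x*y

def radial_kernel(x, y, gamma=1.0):
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)         # exp(-g ||x-y||^2)

def polynomial_kernel(x, y, degree=2):
    return (dot(x, y) + 1) ** degree          # (x*y + 1)^d

def neural_kernel(x, y, a=1.0, b=0.0):
    return math.tanh(a * dot(x, y) + b)       # tanh(a x*y + b)

def kernel_distance(kernel, x, y):
    """Distance induced by a kernel (assumed form, see the note above)."""
    sq = kernel(x, x) - 2 * kernel(x, y) + kernel(y, y)
    return math.sqrt(max(sq, 0.0))            # guard against rounding below zero

x, y = (1, 2), (3, 4)
print(dot_kernel(x, y))                       # 11
print(polynomial_kernel(x, y, degree=2))      # 144
print(radial_kernel(x, y, gamma=0.5))         # exp(-4)
print(kernel_distance(dot_kernel, x, y))      # sqrt(8), the plain Euclidean distance
```

With the dot kernel the induced distance reduces to the ordinary Euclidean distance, which gives a quick sanity check for the other kernels.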

Kernel gamma

This is the SVM kernel parameter gamma. This parameter is only available when the numerical measure parameter is set to 'Kernel Euclidean Distance' and the kernel type parameter is set to radial or anova.

Kernel sigma1

This is the SVM kernel parameter sigma1. This parameter is only available when the numerical measure parameter is set to 'Kernel Euclidean Distance' and the kernel type parameter is set to epachnenikov, gaussian combination or multiquadric.

Kernel sigma2

This is the SVM kernel parameter sigma2. This parameter is only available when the numerical measure parameter is set to 'Kernel Euclidean Distance' and the kernel type parameter is set to gaussian combination.

Kernel sigma3

This is the SVM kernel parameter sigma3. This parameter is only available when the numerical measure parameter is set to 'Kernel Euclidean Distance' and the kernel type parameter is set to gaussian combination.

Kernel shift

This is the SVM kernel parameter shift. This parameter is only available when the numerical measure parameter is set to 'Kernel Euclidean Distance' and the kernel type parameter is set to multiquadric.

Kernel degree

This is the SVM kernel parameter degree. This parameter is only available when the numerical measure parameter is set to 'Kernel Euclidean Distance' and the kernel type parameter is set to polynomial, anova or epachnenikov.

Kernel a

This is the SVM kernel parameter a. This parameter is only available when the numerical measure parameter is set to 'Kernel Euclidean Distance' and the kernel type parameter is set to neural.

Kernel b

This is the SVM kernel parameter b. This parameter is only available when the numerical measure parameter is set to 'Kernel Euclidean Distance' and the kernel type parameter is set to neural.