Skip to main content

DBSCAN

Synopsis

This operator performs clustering with DBSCAN. DBSCAN (for density-based spatial clustering of applications with noise) is a density-based clustering algorithm because it finds a number of clusters starting from the estimated density distribution of corresponding nodes.

Description

DBSCAN's definition of a cluster is based on the notion of density reachability. Basically, a pointqis directly density-reachable from a pointpif it is not farther away than a given distance epsilon (i.e. it is part of its epsilon-neighborhood) and ifpis surrounded by sufficiently many points such that one may considerpandqbe part of a cluster.qis called density-reachable (note the distinction from "directly density-reachable") frompif there is a sequencep(1),…,p(n)of points withp(1) = pandp(n) = qwhere eachp(i+1)is directly density-reachable fromp(i).

Note that the relation of density-reachable is not symmetric.qmight lie on the edge of a cluster, having insufficiently many neighbors to count as dense itself. This would halt the process of finding a path that stops with the first non-dense point. By contrast, starting the process withqwould lead top(though the process would halt there,pbeing the first non-dense point). Due to this asymmetry, the notion of density-connected is introduced: two pointspandqare density-connected if there is a pointosuch that bothpandqare density-reachable fromo. Density-connectedness is symmetric.

A cluster, which is a subset of the points of the data set, satisfies two properties:

  1. All points within the cluster are mutually density-connected.
  2. If a point is density-connected to any point of the cluster, it is part of the cluster as well.

DBSCAN requires two parameters: epsilon and the minimum number of points required to form a cluster (minPts). epsilon and minPts can be specified through theepsilonandmin pointsparameters respectively. DBSCAN starts with an arbitrary starting point that has not been visited. This point's epsilon-neighborhood is retrieved, and if it contains sufficiently many points, a cluster is started. Otherwise, the point is labeled as noise. Note that this point might later be found in a sufficiently sized epsilon-environment of a different point and hence be made part of a cluster.

If a point is found to be a dense part of a cluster, its epsilon-neighborhood is also part of that cluster. Hence, all points that are found within the epsilon-neighborhood are added, as is their own epsilon-neighborhood when they are also dense. This process continues until the density-connected cluster is completely found. Then, a new unvisited point is retrieved and processed, leading to the discovery of a further cluster or noise.

If no id attribute is present, this operator will create one. The 'Cluster 0' assigned by DBSCAN operator corresponds to points that are labeled as noise. These are the points that have less thanmin pointspoints in their epsilon-neighborhood.

Clustering is concerned with grouping together objects that are similar to each other and dissimilar to the objects belonging to other clusters. It is a technique for extracting information from unlabeled data and can be very useful in many different scenarios e.g. in a marketing application we may be interested in finding clusters of customers with similar buying behavior.

Input

example set

This input port expects an ExampleSet. It is output of the Retrieve operator in the attached Example Process.

Output

集群model

This port delivers the cluster model. It has information regarding the clustering performed. It tells which examples are part of which cluster.

集群ed set

The ExampleSet that was given as input is passed with minor changes to the output through this port. An attribute withidrole is added to the input ExampleSet to distinguish examples. An attribute with集群role may also be added depending on the state of theadd cluster attributeparameter.

Parameters

Epsilon

This parameter specifies the epsilon parameter of the DBSCAN algorithm. epsilon specifies the size of the neighborhood.

Min points

This parameter specifies the minimal number of points forming a cluster.

Add cluster attribute

如果这个参数设置为true, a new attribute with集群role is generated in the resultant ExampleSet, otherwise this operator does not add the集群attribute. In the latter case you have to use the Apply Model operator to generate the集群attribute.

Add as label

如果这个参数设置为true, the cluster id is stored in an attribute with thelabelrole instead of集群role (seeadd cluster attributeparameter).

Remove unlabeled

如果这个参数设置为true, unlabeled examples are deleted from the ExampleSet.

Measure types

This parameter is used for selecting the type of measure to be used for measuring the distance between points.The following options are available:mixed measures,nominal measures,numerical measuresandBregman divergences.

Mixed measure

This parameter is available when themeasure typeparameter is set to 'mixed measures'. The only available option is the 'Mixed Euclidean Distance'

Nominal measure

This parameter is available when themeasure typeparameter is set to 'nominal measures'. This option cannot be applied if the input ExampleSet has numerical attributes. In this case the 'numerical measure' option should be selected.

Numerical measure

This parameter is available when themeasure typeparameter is set to 'numerical measures'. This option cannot be applied if the input ExampleSet has nominal attributes. If the input ExampleSet has nominal attributes the 'nominal measure' option should be selected.

散度

This parameter is available when themeasure typeparameter is set to 'bregman divergences'.

Kernel type

This parameter is only available when thenumerical measureparameter is set to 'Kernel Euclidean Distance'. The type of the kernel function is selected through this parameter. Following kernel types are supported:

  • dot: The dot kernel is defined byk(x,y)=x*yi.e.it is inner product ofxandy.
  • radial: The radial kernel is defined byexp(-g ||x-y||^2)wheregis thegammathat is specified by thekernel gammaparameter. The adjustable parametergammaplays a major role in the performance of the kernel, and should be carefully tuned to the problem at hand.
  • polynomial:定义的多项式内核k(x,y)=(x*y+1)^dwheredis the degree of the polynomial and it is specified by thekernel degreeparameter. The Polynomial kernels are well suited for problems where all the training data is normalized.
  • neural: The neural kernel is defined by a two layered neural nettanh(a x*y+b)whereaisalphaandbis theintercept constant. These parameters can be adjusted using thekernel aandkernel bparameters. A common value foralphais 1/N, where N is the data dimension. Note that not all choices ofaandblead to a valid kernel function.
  • sigmoid: This is the sigmoid kernel. Please note that thesigmoidkernel is not valid under some parameters.
  • anova: This is the anova kernel. It has adjustable parametersgammaanddegree.
  • epachnenikov: The Epanechnikov kernel is this function(3/4)(1-u2)forubetween -1 and 1 and zero foruoutside that range. It has two adjustable parameterskernel sigma1andkernel degree.
  • gaussian_combination: This is the gaussian combination kernel. It has adjustable parameterskernel sigma1, kernel sigma2andkernel sigma3.
  • multiquadric: The multiquadric kernel is defined by the square root of||x-y||^2 + c^2. It has adjustable parameterskernel sigma1andkernel sigma shift.

Kernel gamma

This is the SVM kernel parameter gamma. This parameter is only available when thenumerical measureparameter is set to 'Kernel Euclidean Distance' and thekernel typeparameter is set toradialoranova.

Kernel sigma1

This is the SVM kernel parameter sigma1. This parameter is only available when thenumerical measureparameter is set to 'Kernel Euclidean Distance' and thekernel typeparameter is set toepachnenikov,gaussian combinationormultiquadric.

Kernel sigma2

This is the SVM kernel parameter sigma2. This parameter is only available when thenumerical measureparameter is set to 'Kernel Euclidean Distance' and thekernel typeparameter is set togaussian combination.

Kernel sigma3

This is the SVM kernel parameter sigma3. This parameter is only available when thenumerical measureparameter is set to 'Kernel Euclidean Distance' and thekernel typeparameter is set togaussian combination.

Kernel shift

This is the SVM kernel parameter shift. This parameter is only available when thenumerical measureparameter is set to 'Kernel Euclidean Distance' and thekernel typeparameter is set tomultiquadric.

Kernel degree

This is the SVM kernel parameter degree. This parameter is only available when thenumerical measureparameter is set to 'Kernel Euclidean Distance' and thekernel typeparameter is set topolynomial,anovaorepachnenikov.

Kernel a

This is the SVM kernel parameter a. This parameter is only available when thenumerical measureparameter is set to 'Kernel Euclidean Distance' and thekernel typeparameter is set toneural.

Kernel b

This is the SVM kernel parameter b. This parameter is only available when thenumerical measureparameter is set to 'Kernel Euclidean Distance' and thekernel typeparameter is set toneural.