"Evaluating Cluster Performance"
Hi Everybody,
I'm ploughing my nose through the more basic clustering literature and I wonder that there is no basic criteria to evalute the overall quality of a clustering.
So far, the criteria I came across demand either a lot of calculation (silhouette coefficient) or are not a very good help in the decision on the number of cluster if the number of data points is reasonably high (e.g.Akaike,Schwarz Information Criterion, MDL). (Actually I don't want to reduce 10.000 to 200 Clusters but to a number I can still interpret). The problem far me is that these criteria didn't seem to punish additional cluster severly enough in order to propose the solution with more clusters.
Currently I just guess I didn't hit the right literature so far, has anybody a clue good reference.
During the last days the idea came to my mind to compare the performance of a clustering of the real data with a given number of clusters (k) with the result that would be achieved with the same number of clusters on an artificial set.
In order to be reasonable candidate for a the final number of clusters k should perform significantly better on the real data set than on the white noise one. This lead me to following formula, which yields the minimum share of the dataset's variancesthat should be explained by a given number of clusters, if the white noise is generated by uniform distribution.
s(d,k) = 1 - ((SUM(rn(d)^2))^0.5/(k*D^0.5))^(2/D)
s:= explained minimum share of the dataset's variance
= number of data dimension
k:= number of clusters
d element of D
r(d):= range of dimension d
rm(d):= maximum range of all considered dimension
rn(d):= r(d)/rm(d) [standardized range of the dimension]
In the end s(d,k) or its derivation could be used as very conservative knock out criteria to evaluate the quality of the clustering on real data, since their benchmarks should always be exceeded (at least in the case of kmeans).
In this context I have to questions:
A) is my idea correct?
Does anbody know the generalization to a normal distribution?
Thank you in advance for any comment.
Best regards
Norbert
I'm ploughing my nose through the more basic clustering literature and I wonder that there is no basic criteria to evalute the overall quality of a clustering.
So far, the criteria I came across demand either a lot of calculation (silhouette coefficient) or are not a very good help in the decision on the number of cluster if the number of data points is reasonably high (e.g.Akaike,Schwarz Information Criterion, MDL). (Actually I don't want to reduce 10.000 to 200 Clusters but to a number I can still interpret). The problem far me is that these criteria didn't seem to punish additional cluster severly enough in order to propose the solution with more clusters.
Currently I just guess I didn't hit the right literature so far, has anybody a clue good reference.
During the last days the idea came to my mind to compare the performance of a clustering of the real data with a given number of clusters (k) with the result that would be achieved with the same number of clusters on an artificial set.
In order to be reasonable candidate for a the final number of clusters k should perform significantly better on the real data set than on the white noise one. This lead me to following formula, which yields the minimum share of the dataset's variancesthat should be explained by a given number of clusters, if the white noise is generated by uniform distribution.
s(d,k) = 1 - ((SUM(rn(d)^2))^0.5/(k*D^0.5))^(2/D)
s:= explained minimum share of the dataset's variance
= number of data dimension
k:= number of clusters
d element of D
r(d):= range of dimension d
rm(d):= maximum range of all considered dimension
rn(d):= r(d)/rm(d) [standardized range of the dimension]
In the end s(d,k) or its derivation could be used as very conservative knock out criteria to evaluate the quality of the clustering on real data, since their benchmarks should always be exceeded (at least in the case of kmeans).
In this context I have to questions:
A) is my idea correct?
Does anbody know the generalization to a normal distribution?
Thank you in advance for any comment.
Best regards
Norbert
Tagged:
0
Answers
Did you have a look into the DBindex? This criterion should take the number of clusters at least a bit into account. Another idea is a multi-objective optimization of K and some criterion where K should be minimized. You can have a look into my 2005/2006 feature-selection/construction-for-clustering-literature if you are interested in those approaches.
欢呼,
Ingo
Thank you for your instant reply and your hints.
Yes,s(d,k)increases with an increasing number of clusters.
我这背后的思想s criterion is thats(d,k)depicts the increase in the clustering quality which would be seen in white noise data.
Therefore one could use the development ofs(d,k)and compare it with the observed development of the clustering quality in the real data set. One then could reject additional clusters if they improve the quality only by a fraction which would be observed in a white noise setting.
Looking at DBIndex which of the 18 options have you implemented in RM?
(see (F. Azuaje, "A cluster validity framework for genome expression data,"
Bioinfomatics, vol. 18, pp. 319-320, 2002)
Best regards
Norbert
About the DB-Index: here is the relevant code (I hope at least) from the class CentroidBasedEvaluator: Maybe this helps.
欢呼,
Ingo