technical question about the combined use of clustering and classification
Hi everyone! I am new to RapidMiner and have run into a problem regarding the combined use of clustering and classification.
Basically, I want to cluster my initial dataset with k-means and then build models to perform classification and evaluate their performance for EACH of the clusters. I know how to use the operators to perform cluster analysis and classification separately, but I have no idea how to combine them. I tried many approaches, such as placing the k-means operator before or within the Cross Validation operator, but I still fail to either run the process successfully or get the performance result for each cluster. Can anyone help?
Any response would be greatly appreciated.
Thank you!
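For readers following along outside RapidMiner, the workflow asked about here (cluster first, then train and evaluate a classifier separately inside each cluster) can be sketched in plain Python. This is only an illustration under stated assumptions, not a RapidMiner process: the toy data, the helper names `kmeans` and `loo_1nn_accuracy`, the simplistic "first k points" initialisation, and the choice of a 1-nearest-neighbour classifier with leave-one-out validation are all made up for the example.

```python
from collections import defaultdict

def dist2(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(p, centroids):
    """Index of the centroid closest to point p."""
    return min(range(len(centroids)), key=lambda c: dist2(p, centroids[c]))

def kmeans(points, k, iters=20):
    """Very small k-means; centroids start at the first k points (assumption)."""
    centroids = list(points[:k])
    for _ in range(iters):
        assign = [nearest(p, centroids) for p in points]
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return [nearest(p, centroids) for p in points]

def loo_1nn_accuracy(rows):
    """Leave-one-out accuracy of a 1-nearest-neighbour classifier on (point, label) rows."""
    if len(rows) < 2:
        return None  # not enough examples to validate
    correct = 0
    for i, (p, y) in enumerate(rows):
        others = rows[:i] + rows[i + 1:]
        _, pred = min(others, key=lambda r: dist2(p, r[0]))
        correct += pred == y
    return correct / len(rows)

# Hypothetical toy data: (features, label)
data = [
    ((1.0, 1.0), "yes"), ((1.1, 1.0), "yes"), ((1.0, 2.0), "no"), ((1.1, 2.0), "no"),
    ((8.0, 1.0), "no"), ((8.1, 1.0), "no"), ((8.0, 2.0), "yes"), ((8.1, 2.0), "yes"),
]

# Step 1: cluster on the features only (the label is not used here)
clusters = kmeans([p for p, _ in data], k=2)

# Step 2: split the labelled examples by cluster
by_cluster = defaultdict(list)
for row, c in zip(data, clusters):
    by_cluster[c].append(row)

# Step 3: validate one classifier per cluster
per_cluster_accuracy = {c: loo_1nn_accuracy(rows) for c, rows in sorted(by_cluster.items())}
print(per_cluster_accuracy)  # {0: 1.0, 1: 1.0} on this toy data
```

The point of the sketch is the order of the steps: the clustering runs once on the whole dataset, and the validation is then repeated once per cluster on that cluster's examples only.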
Answers
Are you using one of the performance operators dedicated to clustering (a priori, the Cluster Distance Performance operator for k-Means)?
Regards,
Lionel
Thank you for your reply.
And yes, I tried "Cluster Distance Performance" in my process, but I found that it only evaluates the clustering itself (e.g. it reports the Davies-Bouldin index of the clustering), while the result I want is the classification performance (say, accuracy) within each cluster. Am I misunderstanding those operators?
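To make the distinction in this post concrete, here is a small Python sketch that computes both kinds of number on hypothetical toy data. The Davies-Bouldin index is computed from the geometry of the clusters alone and never looks at the label, whereas per-cluster accuracy needs a label and a prediction for every example. The function names `davies_bouldin` and `accuracy_per_cluster` and all data values are assumptions made for illustration; this is the textbook definition of the index, not necessarily what the RapidMiner operator reports.

```python
from collections import defaultdict
from math import dist

def centroid(points):
    """Component-wise mean of a list of points."""
    return tuple(sum(col) / len(points) for col in zip(*points))

def davies_bouldin(points, clusters):
    """Davies-Bouldin index: uses only point positions, no labels (needs >= 2 clusters)."""
    groups = defaultdict(list)
    for p, c in zip(points, clusters):
        groups[c].append(p)
    cents = {c: centroid(ps) for c, ps in groups.items()}
    # Average distance of each cluster's members to its centroid
    scatter = {c: sum(dist(p, cents[c]) for p in ps) / len(ps) for c, ps in groups.items()}
    worst = [
        max((scatter[i] + scatter[j]) / dist(cents[i], cents[j]) for j in groups if j != i)
        for i in groups
    ]
    return sum(worst) / len(worst)

def accuracy_per_cluster(clusters, labels, predictions):
    """Fraction of correct predictions, reported separately for each cluster."""
    hits, totals = defaultdict(int), defaultdict(int)
    for c, y, y_hat in zip(clusters, labels, predictions):
        totals[c] += 1
        hits[c] += y == y_hat
    return {c: hits[c] / totals[c] for c in totals}

# Hypothetical toy data
points = [(0, 0), (0, 2), (10, 0), (10, 2)]
clusters = [0, 0, 1, 1]
labels = ["a", "b", "a", "a"]
predictions = ["a", "a", "a", "a"]  # output of some classifier (made up)

db_index = davies_bouldin(points, clusters)
acc = accuracy_per_cluster(clusters, labels, predictions)
print(db_index)  # about 0.2
print(acc)       # {0: 0.5, 1: 1.0}
```

A clustering can score well on the first measure and still contain clusters in which a classifier performs poorly, which is why the two numbers answer different questions.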
Thanks!
I think you have to generate a "prediction" attribute from your clustering results in order to map the resulting clusters to the classes of your label.
EDIT:
I'm using the Iris dataset. To be more precise about the methodology, I'm clustering the examples and then labelling each cluster with the majority label of the labelled examples in that cluster.
You can see what I mean by opening and running the process in the attached file.
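The majority-label idea described above can also be written down in a few lines of Python. This is a rough sketch rather than the attached process: the cluster ids and labels below are invented toy values (not the output of k-means on the real Iris data), and `majority_label_map` is a hypothetical helper name.

```python
from collections import Counter, defaultdict

def majority_label_map(clusters, labels):
    """Map each cluster id to the most frequent label among its examples.

    Ties are broken in favour of the label seen first in that cluster.
    """
    counts = defaultdict(Counter)
    for c, y in zip(clusters, labels):
        counts[c][y] += 1
    return {c: cnt.most_common(1)[0][0] for c, cnt in counts.items()}

# Invented toy values: cluster 0 is mixed, clusters 1 and 2 are pure
clusters = [0, 0, 0, 0, 1, 1, 2, 2, 2]
labels = (
    ["Iris-versicolor"] * 3 + ["Iris-virginica"]
    + ["Iris-setosa"] * 2
    + ["Iris-virginica"] * 3
)

mapping = majority_label_map(clusters, labels)

# Every example inherits the label of its cluster as its "prediction"
predictions = [mapping[c] for c in clusters]
accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)

print(mapping)   # {0: 'Iris-versicolor', 1: 'Iris-setosa', 2: 'Iris-virginica'}
print(accuracy)  # 8 of 9 correct
```

Note that a mixed cluster still receives a single label under this scheme, so the minority examples in that cluster simply count as misclassified.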
Hope this helps,
Regards,
Lionel
Big thanks for your explanation and example!
But I came up with two questions regarding your provided process:
1. In the training section of the Cross Validation operator, the process uses only a clustering operator to train the model. I am wondering why we don't need a classification model (e.g. a decision tree or neural net) there, since the whole dataset contains a label attribute and should therefore be treated as a supervised learning problem. (As I imagined it, to do classification in each of the clusters I would need both a clustering operator and a classification model.)
2. In the testing section of the Cross Validation operator, you use Generate Attributes to assign a label to each cluster. Does that mean that, instead of assigning the label with a classification model, we should assign it manually? (I found some inconsistency there: cluster 0 contains both Iris-versicolor and Iris-virginica, but you assign cluster 0 only to Iris-versicolor.)
Thank you so much!
Belle
To answer your question:
In conclusion, I can say that I learn new things every day in RapidMiner...
Thanks for sharing this operator, Brian!
Regards,
Lionel