Test & Validation Data - Unsupervised
Dear all,
I would like to run an unsupervised analysis on a data set, which may later be followed by a supervised analysis.
I think I do not need to split my data into training and test sets for the unsupervised part (clustering, association or regression).
What do you think?
Best Answer
jacobcybulski Member, University Professor Posts: 391 Unicorn

In many non-RM environments, a typical approach to clustering is to build a k-means clustering and then use it to create a classifier, such as k-NN, that assigns cluster values to new examples. It is also common practice to train a classification model on your cluster labels and then check the accuracy of that classification.

However, this approach is not pure, because clustering and classification pursue different objectives, especially if they use different methods (e.g. density-based clustering and a decision tree classifier). So in theory you can cluster your entire data set, create a classifier based on those "labels", and even use the classifier to predict cluster membership and assess its performance.

My advice would be to cluster your data, use the performance measures specific to the clustering system, and then keep the clustering model generated in the process so that it can be applied to new data (in exactly the same way as your classifiers).

Jacob
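To make the advice above concrete, here is a minimal pure-Python sketch of that workflow: cluster the whole data set, score it with a clustering-specific measure, then reuse the stored model on new examples. It is only an illustration under assumptions. The toy data, the helper names (`fit_kmeans`, `nearest`, `within_cluster_ss`) and the choice of within-cluster sum of squares as the quality measure are all hypothetical and not taken from the thread or from any particular tool.

```python
import math
import random


def nearest(point, centroids):
    """Index of the centroid closest to `point` (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))


def fit_kmeans(points, k, iterations=50, seed=0):
    """Plain Lloyd-style k-means; returns the list of k centroids."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # pick k distinct points as starting centroids
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(p, centroids)].append(p)
        # Move each centroid to the mean of its group (keep it if the group is empty).
        updated = [
            tuple(sum(axis) / len(g) for axis in zip(*g)) if g else c
            for g, c in zip(groups, centroids)
        ]
        if updated == centroids:  # assignments no longer change the centroids
            break
        centroids = updated
    return centroids


def within_cluster_ss(points, centroids):
    """Clustering-specific measure: sum of squared distances to the assigned centroid."""
    return sum(math.dist(p, centroids[nearest(p, centroids)]) ** 2 for p in points)


# Hypothetical toy data: two well-separated groups in 2-D.
data = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.8, 8.1)]

# Cluster the entire data set; no training/test split is made here.
centroids = fit_kmeans(data, k=2)
quality = within_cluster_ss(data, centroids)

# Apply the stored clustering model to new, unseen examples.
new_examples = [(1.1, 0.9), (7.9, 8.3)]
new_labels = [nearest(p, centroids) for p in new_examples]
```

Assigning a new example to its nearest centroid plays the role of the "classifier" step here, so the clustering model is applied to new data in the same way a trained classifier would be.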
Answers