Dynamically determine number of clusters k-means

namachoco99 Member Posts: 3 Contributor I
edited December 2018 in Help

I have a CSV file containing approximately a million records and 3 features that will be used to determine which cluster each record belongs to. I want to cluster these records using the k-means algorithm (with Euclidean distance), and I'll use the Davies-Bouldin Index (DBI) to find the optimal number of clusters.

Is there any way to automate finding the optimal number of clusters by looping through the process, with the number of clusters k incrementing on each iteration? I'm new to RapidMiner, so I'm not yet sure how to implement this as a process (XML).

Thanks for any help and suggestion that will be given!

Answers

  • MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,419 RM Data Scientist

    Hello,

    there are two things you can do.

    1. Use the X-Means operator. It runs k-means but internally uses heuristics (I think based on DB?) to determine k.

    2. Put a loop around the clustering and run the algorithm with several values of k. You can then pick the best k. I did this in a blog post on Hearthstone a year ago (://www.turtlecreekpls.com/creative-use-hearthstone-cluster-analysis/); it also includes some Python scripts for charting, which you might not need.
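For readers who want to see the logic of option 2 outside RapidMiner, here is a minimal pure-Python sketch of looping over k and scoring each run with the Davies-Bouldin Index. It is not the RapidMiner process: the function names (`kmeans`, `davies_bouldin`, `scan_k`), the farthest-first initialisation, and the synthetic three-feature data are all assumptions made for illustration, and a plain-Python k-means like this would be far too slow for millions of rows.

```python
import math
import random

def assign(points, centroids):
    """Index of the nearest centroid (Euclidean distance) for every point."""
    return [min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
            for p in points]

def kmeans(points, k, max_iter=100):
    """Lloyd's k-means with a deterministic farthest-first start (sketch assumption)."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(max(points, key=lambda p: min(math.dist(p, c) for c in centroids)))
    for _ in range(max_iter):
        labels = assign(points, centroids)
        updated = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            # Keep the old centroid if a cluster ends up empty.
            updated.append(tuple(sum(x) / len(members) for x in zip(*members))
                           if members else centroids[j])
        if updated == centroids:
            break
        centroids = updated
    return centroids, assign(points, centroids)

def davies_bouldin(points, centroids, labels):
    """Mean over clusters of the worst (scatter_i + scatter_j) / centroid distance."""
    k = len(centroids)
    scatter = []
    for j in range(k):
        members = [p for p, lab in zip(points, labels) if lab == j]
        scatter.append(sum(math.dist(p, centroids[j]) for p in members) / len(members)
                       if members else 0.0)
    worst = [max((scatter[i] + scatter[j]) / math.dist(centroids[i], centroids[j])
                 for j in range(k) if j != i)
             for i in range(k)]
    return sum(worst) / k

def scan_k(points, k_values):
    """Run k-means once per k and return {k: DBI}."""
    results = {}
    for k in k_values:
        centroids, labels = kmeans(points, k)
        results[k] = davies_bouldin(points, centroids, labels)
    return results

# Synthetic stand-in data: three blobs with 3 features each (illustration only).
rng = random.Random(42)
centers = [(0, 0, 0), (5, 5, 5), (10, 0, 5)]
points = [tuple(rng.gauss(c, 0.5) for c in center) for center in centers for _ in range(40)]

results = scan_k(points, range(2, 7))
best_k = min(results, key=results.get)  # lower DBI is better
for k, dbi in sorted(results.items()):
    print(f"k={k}  DBI={dbi:.3f}")
print("lowest DBI at k =", best_k)
```

The k with the lowest DBI is taken as the candidate; in practice you would still look at the whole curve rather than trust a single minimum.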

    Best,

    Martin

    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany
  • Telcontar120 Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,635 Unicorn

    I have had good results using the X-Means operator. It generally finds a sensible value of k in my experience.

    Brian T.
    Lindon Ventures
    Data Science Consulting from Certified RapidMiner Experts
  • namachoco99 Member Posts: 3 Contributor I

    This one looks promising! Thanks for the suggestion!

    I have two questions though:

    1. Will this output some sort of table that displays the DBI value for each number of clusters k? I need to store all the results and create a graph from those values.

    2. Additionally, do you have any idea how long this runs? I found similar code elsewhere in this forum, but it takes somewhere between 6 and 12 hours per iteration, and my goal is to cover a range of 2-100 clusters, if possible.

    Currently, my dataset is composed of about 1.9 million records with only 4 columns (a column label and 3 other columns which will be used for clustering, all normalized already).

    Thanks again!

  • Telcontar120 Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,635 Unicorn

    At this thread (http://community.www.turtlecreekpls.com/t5/RapidMiner-Studio-Forum/How-to-reuse-preprocessing-results-in-a-range-of-k-means/m-p/40191) there is an example of a process using a loop to set parameters. If you run the sample process on the sonar data (as supplied) you can see that the output is a collection, where each element corresponds to one of the clusters at the different k-values you have supplied. If you want to do other things like calculate the DBI and store that for each output, then you'll need to add the appropriate operators from YY's process inside the loop. After that you can have other operators to pull all that data together and append it into a single dataset where you can graph the results.
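The "pull all that data together" step can be pictured as building one row per k. The sketch below is a plain-Python illustration of that idea, not RapidMiner operators: `dbi_table` and `write_csv` are hypothetical helper names, and the DBI numbers are placeholders rather than measured results.

```python
import csv
import io

def dbi_table(results):
    """Turn {k: dbi} into rows sorted by k, ready for export or plotting."""
    return [{"k": k, "dbi": dbi} for k, dbi in sorted(results.items())]

def write_csv(rows, handle):
    """Write the rows as CSV with a header line."""
    writer = csv.DictWriter(handle, fieldnames=["k", "dbi"])
    writer.writeheader()
    writer.writerows(rows)

# Placeholder values purely for illustration, NOT real measurements.
example_results = {4: 0.95, 2: 0.47, 3: 0.23}

rows = dbi_table(example_results)
buffer = io.StringIO()  # swap for open("dbi_by_k.csv", "w", newline="") to write a file
write_csv(rows, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

A table in this shape (one column for k, one for DBI) is what any charting tool needs to draw the DBI-versus-k curve.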

    As far as runtime is concerned, that can vary significantly depending on the quality of the hardware you are running. 4 attributes is not a lot for clustering but 1.9MM records is, so I am not surprised to hear it is taking a while. You might consider taking a smaller sample and then doing your k-optimization on that dataset so you only have to apply the single selected k-value to the entire dataset once.
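A rough sketch of the sampling idea, again in plain Python rather than RapidMiner: draw a random sample, choose k and fit centroids on the sample only, then assign every record of the full dataset to its nearest centroid in one pass. The helper names, the 50% fraction, the toy records, and the centroid values are assumptions for illustration; the centroids stand in for whatever k-means would return on the sample.

```python
import math
import random

def draw_sample(points, fraction, seed=0):
    """Random sample without replacement; at least one point is returned."""
    rng = random.Random(seed)
    size = max(1, int(len(points) * fraction))
    return rng.sample(points, size)

def assign_all(points, centroids):
    """Label each point with the index of its nearest centroid (Euclidean)."""
    return [min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
            for p in points]

# Toy stand-in for the full dataset (3 normalized features per record).
full_data = [(0.0, 0.1, 0.0), (0.2, 0.0, 0.1), (5.0, 5.1, 4.9), (5.2, 4.8, 5.0)]

# Step 1: the k search and the k-means fit would run on this sample only.
sample = draw_sample(full_data, fraction=0.5)

# Step 2: hypothetical centroids, as if produced by k-means on the sample.
centroids = [(0.1, 0.05, 0.05), (5.1, 4.95, 4.95)]

# Step 3: a single cheap pass over the full data.
labels = assign_all(full_data, centroids)
print(labels)
```

The assignment pass is linear in the number of records, so the expensive repeated clustering is confined to the sample.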

  • bhupendra_patil Administrator, Employee, Member Posts: 168 RM Data Scientist

    You can also use the "Log" operator to see the results of every iteration.

    Details are in this document:

    http://community.www.turtlecreekpls.com/t5/RapidMiner-Studio-Knowledge-Base/Capture-intermediate-results-during-optimization/ta-p/32083

    See the example below:

    <operator activated="true" class="k_means" compatibility="7.6.000" expanded="true" height="82" name="Clustering" width="90" x="112" y="34">

    <description align="left" color="green" colored="true" height="173" resized="true" width="626" x="309" y="166">The Davies-Bouldin Index evaluates intra-cluster similarity and inter-cluster differences. If you consider these to be good criteria, go for the Davies-Bouldin Index. The Silhouette Index measures the distance between each data point, the centroid of the cluster it was assigned to, and the closest centroid belonging to another cluster. If you consider this a good criterion, go for the Silhouette Index.<br><br>How can we say that a clustering quality measure is good? Available from: https://www.researchgate.net/post/How_can_we_say_that_a_clustering_quality_measure_is_good.</description>
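For reference, the two criteria described in that note can be written out as follows. This is the usual textbook form of the Davies-Bouldin Index and the centroid-based ("simplified") silhouette that matches the wording above; the symbols are chosen here for illustration, and RapidMiner's operators may differ in details such as sign or scaling.

```latex
% Clusters C_1, ..., C_k with centroids c_1, ..., c_k (notation assumed here).
% Average scatter of cluster i:
S_i = \frac{1}{|C_i|} \sum_{x \in C_i} \lVert x - c_i \rVert

% Davies-Bouldin Index (lower is better):
\mathrm{DB} = \frac{1}{k} \sum_{i=1}^{k} \max_{j \neq i}
              \frac{S_i + S_j}{\lVert c_i - c_j \rVert}

% Centroid-based silhouette of a point x assigned to cluster i (higher is better):
a(x) = \lVert x - c_i \rVert, \qquad
b(x) = \min_{j \neq i} \lVert x - c_j \rVert, \qquad
s(x) = \frac{b(x) - a(x)}{\max\{a(x),\, b(x)\}}
```

The first rewards compact clusters whose centroids are far apart; the second rewards points that sit much closer to their own centroid than to any other.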