Replace missing values with average in each cluster

painfuloverpainfulover MemberPosts:1Newbie
Hello,
I'm new to Rapidminer and I would like to replace missing values based on clustering, which means I have used k-means on columns which have no missing values and divide the original exampleset into 5 clusters. Now I would like know how to replace each row's missing values by the averages of the cluster it belongs to instead of the averages of whole attributes. I can only find the way to do the latter by the operator [replace missing values].
Thank you very much.

Best Answers

  • BalazsBaranyBalazsBarany Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified ExpertPosts:949Unicorn
    Solution Accepted
    Hi!

    This process is a bit involved. You get a "cluster model" from the clustering operator that you can apply to the data with missing values. However, you need to choose an operator that can work with missing values itself. Then you would aggregate the clustered original data (the non-missing data), grouping by the cluster to get the averages. You can join the result with the missing values and use e. g. Loop Attributes to fill in the missing values using Generate Attributes with a formula likeif(missing(%{attr}), eval("average(" + %{attr} + ")"), %{attr}).

    It is much easier to use the Impute Missing Values operator that automates the selection of missing values, building a model for predicting them (you can select the model type) and putting the predicted values into the missing cells. There is an example process in the operator help that shows you how to use it, with k-NN as the example learning algorithm.

    Regards,
    Balázs
    painfulover
  • MartinLiebigMartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University ProfessorPosts:3,404RM Data Scientist
    Solution Accepted
    Hi,
    I would just use Group into Collection and create a collection (list) of example sets, where each example set only contains one cluster. You can then use Loop Collection and in there use Replace Missing to replace it with the respective means. Afterwards you just Append the resulting collection again.

    Cheers,
    Martin
    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany
    painfulover
    Sign InorRegisterto comment.