How to cluster missing values in one cluster?

LeMarc (Member; Posts: 72; Contributor II)
edited April 2020 in Help
Hello,

I would like to get 2 clusters from a data set: one with the examples that have missing values and the other with the examples that don't have any missing values. As most clustering algorithms do not allow missing values in the data set, those missing values could be replaced by e.g. "0". However, after that I'm still clueless about what exactly to do to get all the missing values into one cluster.
Can anyone help?

Thank you!

Answers

  • MartinLiebig (Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor; Posts: 3,381; RM Data Scientist)
    Hi,
    why can't you just use a Filter Examples operator with "missing_attributes" as the filter? That should do the trick.

    Cheers,
    Martin
    - Head of Data Science Services at RapidMiner -
    Dortmund, Germany
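
For readers working outside RapidMiner, the filtering idea in the reply above can be sketched in plain Python. This is only a minimal illustration, not the Filter Examples operator itself; the toy `rows` data and the helper `has_missing` are hypothetical names made up for the example.

```python
import math


def has_missing(row):
    """Return True if any value in the row is None or NaN."""
    return any(
        v is None or (isinstance(v, float) and math.isnan(v)) for v in row
    )


# Hypothetical toy data: each row is one example with three numeric attributes.
rows = [
    [1.0, 2.0, 3.0],
    [4.0, None, 6.0],
    [7.0, 8.0, float("nan")],
    [0.5, 1.5, 2.5],
]

# Group 1 = at least one missing value, group 0 = complete example.
labels = [1 if has_missing(r) else 0 for r in rows]
with_missing = [r for r, lab in zip(rows, labels) if lab == 1]
complete = [r for r, lab in zip(rows, labels) if lab == 0]

print(labels)  # [0, 1, 1, 0]
```

The two lists play the role of the two "clusters" asked for in the question, without running any clustering algorithm.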
  • LeMarc (Member; Posts: 72; Contributor II)
    Yes, that would be easy. It's just that my task is to cluster all the missing values without using the filter operator. :smile:
  • MartinLiebig (Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor; Posts: 3,381; RM Data Scientist)
    Hi,
    well, you can set them to -100000 and then just cluster on it without normalizing. Should do the same trick.

    ~Martin
    - Head of Data Science Services at RapidMiner -
    Dortmund, Germany
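
As a rough sketch of the sentinel trick from the reply above: replace each missing value with -100000, skip normalization, and run k-means with k = 2. The code below is a small plain-Python version of Lloyd's algorithm with a deterministic start, written only for illustration; RapidMiner's k-Means operator may initialise and iterate differently. The toy data and the helper names (`fill_missing`, `two_means`) are assumptions, and the example deliberately keeps all missing values in the same attribute. When missing values are spread over different attributes, the filled rows can end up far from each other as well, so a single shared cluster is not guaranteed.

```python
import math

SENTINEL = -100000.0  # replacement value suggested in the thread


def fill_missing(row, sentinel=SENTINEL):
    """Replace None/NaN entries with the sentinel value."""
    return [
        sentinel if v is None or (isinstance(v, float) and math.isnan(v)) else v
        for v in row
    ]


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def two_means(points, iterations=20):
    """Lloyd's algorithm for k = 2 with a deterministic initialisation."""
    # Start from the first point and the point farthest away from it.
    c0 = points[0]
    c1 = max(points, key=lambda p: sq_dist(p, c0))
    centroids = [list(c0), list(c1)]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest centroid wins.
        labels = [
            0 if sq_dist(p, centroids[0]) <= sq_dist(p, centroids[1]) else 1
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for k in (0, 1):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Hypothetical toy data; missing values occur only in the first attribute.
rows = [[1.0, 2.0], [None, 2.5], [3.0, 1.0], [None, 0.5], [2.0, 3.0]]
points = [fill_missing(r) for r in rows]  # no normalization on purpose
labels = two_means(points)
print(labels)  # [0, 1, 0, 1, 0]
```

Normalizing first would shrink the gap created by the sentinel, which is why the reply suggests clustering on the raw values.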
  • LeMarc (Member; Posts: 72; Contributor II)
    Thank you @mschmitz, will try it!
  • jacobcybulski (Member, University Professor; Posts: 391; Unicorn)
    edited April 2020
    I am not sure that replacing missing values with big numbers will cluster these examples together. However, I feel that you can train an SVM classifier with a radial kernel to separate them from the rest (the intuition is that they would all be far from the centre of your data).
  • LeMarc (Member; Posts: 72; Contributor II)
    @jacobcybulski Thank you for your input! I'm going to try your suggestion!
  • jacobcybulski (Member, University Professor; Posts: 391; Unicorn)
    @LeMarc, I am not sure if you have much experience with SVMs. If not, do not get discouraged if the initial results are very poor. You will need to run some optimisation of the SVM kernel hyper-parameters. The radial kernel may work here; if not, try the anova kernel, which is more sophisticated and is commonly optimised on kernel.gamma, kernel.degree and C.
    Jacob
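
A hedged sketch of the SVM suggestion above, using scikit-learn rather than RapidMiner. An SVM is a supervised classifier, so it needs labels; here the label is simply "has a missing value", derived from the data before the gaps are filled. The synthetic data, the fill value of 1000 and the parameter grid are all assumptions chosen for illustration. scikit-learn's `SVC` has no built-in anova kernel, so only the radial (RBF) kernel is shown, tuned over `C` and `gamma`.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Synthetic data (assumption): 60 examples, 2 numeric attributes.
X = rng.normal(size=(60, 2))

# Knock out one value in each of 20 distinct rows to simulate missing data.
missing_rows = rng.choice(60, size=20, replace=False)
missing_cols = rng.integers(0, 2, size=20)
X[missing_rows, missing_cols] = np.nan

# Label: 1 if the example has any missing value, else 0.
y = np.isnan(X).any(axis=1).astype(int)

# Push incomplete examples far from the data centre with a big number
# (1000 is an arbitrary choice for this sketch).
X_filled = np.where(np.isnan(X), 1000.0, X)

# Small grid search over the RBF kernel hyper-parameters C and gamma.
param_grid = {"C": [0.1, 1, 10], "gamma": [1e-6, 1e-4, 1e-2]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=3)
search.fit(X_filled, y)

print(search.best_params_, search.best_score_)
```

Because `gamma` scales squared distances, its useful range depends heavily on how large the fill value is, which is one reason the first results can look poor before tuning.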
  • Telcontar120 (Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member; Posts: 1,635; Unicorn)
    The most common clustering algorithms are not really designed to handle missing values. So you may be able to "trick" these algorithms into creating a cluster by using artificially high or low values but a better approach would be to use a different method altogether, one designed to actually separate cases and that can more directly handle missing values. Several of the earlier posts have recommended some of these approaches.
    Brian T.
    Lindon Ventures
    Data Science Consulting from Certified RapidMiner Experts
  • jacobcybulski (Member, University Professor; Posts: 391; Unicorn)
    @Telcontar120, indeed, I think this is a discovery project on how to handle missing values differently. The idea with the SVM was that if you replace missing values with some big numbers (at least for numerical attributes), then all examples with missing values will be pushed far from the data centre, in which case an SVM with a radial kernel may help isolate them.
  • LeMarc (Member; Posts: 72; Contributor II)
    Thank you all! @jacobcybulski and @Telcontar120, thank you for your input, and thanks for explaining how the SVM idea works! Yes indeed, this is a discovery project of mine!