Remove attributes with missing values exceeding a given threshold (percentage)

f_laperna, Member, Posts: 13, Contributor II
edited September 2019 in Help

Hello, I'm new to RapidMiner. I want to do something very simple, but I'm stuck. Given a dataset with many attributes, I want to remove the columns in which more than a given percentage of values are missing (because I would not be able to use fixed values or infer their values). I tried the Remove Useless Attributes operator, but I still have columns with almost 90% missing values, so it didn't work as I wanted. Can you help me achieve this? It should be trivial; I remember that in KNIME there was a specific option in the filter node to specify the percentage threshold.

Thank you!


Answers

  • FBT, Member, Posts: 106, Unicorn

    There are probably a few different ways of doing it, but the easiest I can come up with is using the "Remove Useless Attributes" operator. Please take a look at the example process below (just copy it and paste it into your XML panel, then click the green checkmark):

    [Example process XML not preserved]

  • mortiz, Member, Posts: 20, Maven
    I have the same question. If I have 100 attributes and 20 of them are missing 60% of their values, how can I easily scrub them out? The Remove Useless Attributes operator doesn't seem to help with this.
  • lionelderkrikor, Moderator, RapidMiner Certified Analyst, Member, Posts: 1,195, Unicorn
    Hi @mortiz,

    It is very easy with TURBO PREP:

    - Open your dataset with Turbo Prep
    - Click on CLEANSE
    - Click on REMOVE LOW QUALITY
    - Set the Max missing (%)
    - Click on COMMIT CLEANSE

    Hope this helps,

    Regards,

    Lionel
  • lionelderkrikor, Moderator, RapidMiner Certified Analyst, Member, Posts: 1,195, Unicorn
    Hi again @mortiz,

    If you don't have access to TURBO PREP, your task can easily be performed by a very simple Python script.
    To execute this process, you will need to:

    - Install Python on your computer.
    - Install the Python Scripting extension from the Marketplace.
    - Set the maximum percentage of missing values allowed in an attribute (to do this, set the threshold called thr in the Set Macros operator).
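The core of such a script can be sketched as follows. This is a minimal illustration, not the process posted in this thread: it assumes pandas, the helper name `drop_sparse_columns` and the constant `THR` are made up for the example, and the threshold is hardcoded here, whereas in the process described above it would come from the `thr` macro. `rm_main` follows the entry-point convention of the Execute Python operator, which hands the script a pandas DataFrame.

```python
import pandas as pd


def drop_sparse_columns(df: pd.DataFrame, max_missing_pct: float) -> pd.DataFrame:
    """Return a copy of df without the columns whose share of
    missing values (in percent, 0-100) exceeds max_missing_pct."""
    # Per-column percentage of missing values.
    missing_pct = df.isna().mean() * 100
    # Keep only the columns at or below the threshold.
    keep = missing_pct[missing_pct <= max_missing_pct].index
    return df.loc[:, keep].copy()


# Hypothetical value for illustration; stands in for the `thr` macro.
THR = 50.0


def rm_main(data):
    # Receives the input example set as a DataFrame and returns the filtered one.
    return drop_sparse_columns(data, THR)
```

Outside RapidMiner, the helper can be tried directly on any DataFrame, for example `drop_sparse_columns(df, 60)` to drop every column with more than 60% missing values.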

    The process:

    [Process XML not preserved]

    Regards,

    Lionel



  • MartinLiebig, Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor, Posts: 3,381, RM Data Scientist
    Hi @Moritz, @lionelderkrikor,

    There is an operator in Toolbox called Select Attributes (Missings), or something like that, which does the trick.

    BR
    Martin
    - Head of Data Science Services at RapidMiner -
    Dortmund, Germany
  • mortiz, Member, Posts: 20, Maven
    Thank you for the help. The Turbo Prep option seemed to help.