Feature Selection Part 1: Feature Weighting

MartinLiebig · Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor · Posts: 3,417 · RM Data Scientist
edited November 2018 in Knowledge Base

In some use cases you may be interested in figuring out which attributes are important for predicting a given label. This attribute importance can be a result in itself, because it can tell you why someone or something behaves in a certain way. In this article we will discuss common techniques to find these feature weights.

Filter Methods

One of the most common ways to determine attribute importance is to use a statistical measure. Frequently used measures are Correlation, Gini Index, and Information Gain. In RapidMiner you can calculate these values using the Weight by operators.

[Image: Feature Weight1.png]

The resulting object is a weight vector. The weight vector is the central object for all feature weighting operations. If we have a look at it, it looks like this:

[Image: Feature Weight2.png]

There are two operators that are important for working with weight objects. Weights to Data converts this table into an example set, which can then be exported to Excel or a database.

Select by Weights allows you to select attributes using these weights. For example, you can keep only attributes with weights higher than 0.1, or take the top k.
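For readers who want to reproduce the idea outside of RapidMiner, here is a minimal Python sketch (using pandas, with a made-up churn data set) of what Weight by Correlation followed by Select by Weights computes:

    # Filter-based feature weighting: absolute correlation with the label,
    # then attribute selection by threshold or top k.
    import pandas as pd

    # hypothetical example set with a binary label "Churn"
    df = pd.DataFrame({
        "Age":    [23, 45, 31, 52, 36, 28],
        "Income": [30e3, 80e3, 45e3, 95e3, 52e3, 38e3],
        "Churn":  [0, 1, 0, 1, 1, 0],
    })

    label = "Churn"
    features = [c for c in df.columns if c != label]

    # the analogue of RapidMiner's weight vector
    weights = df[features].corrwith(df[label]).abs().sort_values(ascending=False)
    print(weights)

    # "Select by Weights": weight > 0.1, or the k top attributes
    selected = weights[weights > 0.1].index.tolist()
    top_k = weights.nlargest(1).index.tolist()
    print(selected, top_k)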

Including Non Linear Attributes and Combinations

The filter methods above have the problem that they do not capture non-linearities. One technique to overcome this is to generate non-linear transformations and combinations of the existing attributes. The operator Generate Function Set can be used to generate attributes like pow(Age,2) or sqrt(Age), and combinations of these; a small sketch follows below. This operator is usually combined with Rename by Construction to get readable names.

[Image: Feature Weight3.png]
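As an illustration, this is roughly what such a generated function set looks like in Python (the column name Age and the data are hypothetical):

    # Generating non-linear transformations and combinations of an attribute,
    # mimicking the output of Generate Function Set.
    import numpy as np
    import pandas as pd

    df = pd.DataFrame({"Age": [23, 45, 31, 52]})

    df["pow(Age,2)"] = df["Age"] ** 2
    df["sqrt(Age)"] = np.sqrt(df["Age"])
    # ... and combinations between the generated attributes:
    df["pow(Age,2)*sqrt(Age)"] = df["pow(Age,2)"] * df["sqrt(Age)"]
    print(df)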

Handling Dependencies

Another known issue with the filter methods is dependencies between the variables. If your data set contains Age, 2xAge, 3xAge and 4xAge, all of them might get a high feature weight even though they carry the same information. A technique which overcomes this issue is MRMR (minimum redundancy, maximum relevance). MRMR is included in RapidMiner's Feature Selection Extension; a toy version of the idea is sketched below.
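MRMR is usually defined via mutual information; the following Python sketch substitutes absolute correlations for both the relevance and the redundancy term, so treat it purely as an illustration of the greedy scheme, not as the extension's actual implementation:

    # Toy greedy mRMR: pick attributes that are relevant to the label but
    # not redundant with the attributes already selected.
    import pandas as pd

    def mrmr(df, label, k):
        relevance = df.drop(columns=label).corrwith(df[label]).abs()
        corr = df.drop(columns=label).corr().abs()
        selected, remaining = [], list(relevance.index)
        while remaining and len(selected) < k:
            def score(f):
                redundancy = corr.loc[f, selected].mean() if selected else 0.0
                return relevance[f] - redundancy
            best = max(remaining, key=score)
            selected.append(best)
            remaining.remove(best)
        return selected

    # 2xAge is just as correlated with the label as Age,
    # but mRMR penalizes picking both.
    df = pd.DataFrame({"Age":    [23, 45, 31, 52, 36],
                       "2xAge":  [46, 90, 62, 104, 72],
                       "Income": [30, 80, 45, 95, 52],
                       "Churn":  [0, 1, 0, 1, 1]})
    print(mrmr(df, "Churn", k=2))  # e.g. ['Age', 'Income']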

Model Based Feature Weights

Another way to obtain feature weights is to use a model. Some models are able to provide a weight vector by themselves. These values tell you how important an attribute was for the learner itself. The concrete calculation scheme differs between learners. Weight vectors are provided by these operators:

  • Linear Regression
  • Generalized Linear Model
  • Gradient Boosted Trees
  • Support Vector Machine (only with linear kernel)
  • Logistic Regression
  • Logistic Regression (SVM)

It is generally advisable to tune the parameters (and the choice) of these operators for maximal prediction accuracy before taking the weights.
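To show what coefficient-based weights look like in practice, here is a scikit-learn sketch (the breast cancer data set is just an example); the attributes are standardized first so that coefficient magnitudes are comparable:

    # Model-based weights: absolute standardized coefficients of a
    # logistic regression, read as a weight vector.
    import pandas as pd
    from sklearn.datasets import load_breast_cancer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    X, y = load_breast_cancer(return_X_y=True, as_frame=True)

    model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=5000))
    model.fit(X, y)

    coefs = model.named_steps["logisticregression"].coef_[0]
    weights = pd.Series(abs(coefs), index=X.columns).sort_values(ascending=False)
    print(weights.head(10))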

A special case is the Random Forest operator. A Random Forest model can be fed into the Weight by Tree Importance operator to get a feature weight vector.

[Image: Feature Weight4.png]
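For comparison, the scikit-learn analogue of tree-based importances (again with the example data set from above):

    # Impurity-based feature importances of a random forest.
    import pandas as pd
    from sklearn.datasets import load_breast_cancer
    from sklearn.ensemble import RandomForestClassifier

    X, y = load_breast_cancer(return_X_y=True, as_frame=True)
    forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

    weights = pd.Series(forest.feature_importances_, index=X.columns)
    print(weights.sort_values(ascending=False).head(10))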

Feature Selection Methods

Besides feature weighting you can also use feature selection techniques. The difference is that you only get a weight vector containing a 1 if an attribute is in the set of chosen attributes and a 0 if it is not. The most common techniques for this are wrapper methods, namely Forward Selection and Backward Elimination.
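A minimal forward selection loop might look like this in Python (scikit-learn; note that sklearn also ships a SequentialFeatureSelector which does the same thing more conveniently):

    # Forward selection: greedily add the attribute that improves
    # cross-validated accuracy the most, stop when nothing improves.
    from sklearn.datasets import load_breast_cancer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    X, y = load_breast_cancer(return_X_y=True, as_frame=True)
    model = LogisticRegression(max_iter=5000)

    selected, remaining, best_score = [], list(X.columns), 0.0
    while remaining and len(selected) < 5:
        scores = {f: cross_val_score(model, X[selected + [f]], y, cv=5).mean()
                  for f in remaining}
        f, s = max(scores.items(), key=lambda kv: kv[1])
        if s <= best_score:
            break
        selected.append(f)
        remaining.remove(f)
        best_score = s

    print(selected, round(best_score, 3))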

Polynominal Classification Problems and Clustering

In polynominal classification problems it is often useful to do this in a one-vs-all fashion. This answers the question "what makes group A different from all the others?" A variation of this method is to apply it to cluster labels to get cluster descriptions.
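The cluster-description variant can be sketched like this (k-means on the Iris data, with correlation weights against a one-vs-all cluster indicator):

    # Cluster descriptions via one-vs-all feature weighting.
    import pandas as pd
    from sklearn.cluster import KMeans
    from sklearn.datasets import load_iris

    X = load_iris(as_frame=True).data
    clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

    for c in sorted(set(clusters)):
        # binary label: "this cluster" vs. all the others
        one_vs_all = pd.Series((clusters == c).astype(int), index=X.index)
        weights = X.corrwith(one_vs_all).abs().sort_values(ascending=False)
        print(f"cluster {c}: most descriptive attribute = {weights.index[0]}")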

Evolutionary Feature Generation

Another sophisticated approach to incorporate non-linearities and also interaction terms (e.g. to find a dependency like sqrt(age - weight)) is to use an evolutionary feature generation approach. The operators in RapidMiner are Yagga and Yagga2. Please have a look at the operator info for more details.
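To make the generate-score-select idea concrete, here is a toy evolutionary loop in Python. It is not Yagga itself, just an illustration: random transformations are generated, scored by correlation with the label, and only the fittest attributes survive each generation:

    # Toy evolutionary feature generation: mutate, score, select.
    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(0)
    df = pd.DataFrame({"age": rng.uniform(20, 60, 200),
                       "weight": rng.uniform(50, 100, 200)})
    # hidden target relationship: sqrt(|age - weight|) plus noise
    df["label"] = np.sqrt(np.abs(df["age"] - df["weight"])) + rng.normal(0, 0.1, 200)

    unary = {"sqrt": lambda s: np.sqrt(np.abs(s)), "sq": lambda s: s ** 2}
    binary = {"+": lambda a, b: a + b, "-": lambda a, b: a - b}

    population = {c: df[c] for c in ["age", "weight"]}
    for generation in range(40):
        names = list(population)
        if rng.random() < 0.5:                      # unary mutation
            op, p = rng.choice(list(unary)), rng.choice(names)
            population[f"{op}({p})"] = unary[op](population[p])
        else:                                       # binary mutation
            op = rng.choice(list(binary))
            a, b = rng.choice(names, 2)
            population[f"({a}{op}{b})"] = binary[op](population[a], population[b])
        # fitness: absolute correlation with the label
        scores = {}
        for n, s in population.items():
            c = np.corrcoef(s, df["label"])[0, 1]
            scores[n] = abs(c) if np.isfinite(c) else 0.0
        keep = sorted(scores, key=scores.get, reverse=True)[:10]
        population = {n: population[n] for n in keep}

    print(keep[0])  # ideally something like sqrt((age-weight))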

Prescriptive Analytics

Instead of generating a general feature weight you can also find individual dependencies for a single example. In this case you would vary the variables of one example and check the influence on the model's prediction. A common use case is to check whether it is worth calling a customer by comparing the individual scoring result with and without a call.
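A minimal what-if sketch in Python (the attribute name "called" and the data are entirely hypothetical):

    # Per-example what-if analysis: score one customer with and without
    # a call and compare the predicted churn probabilities.
    import pandas as pd
    from sklearn.linear_model import LogisticRegression

    df = pd.DataFrame({"Age":    [23, 45, 31, 52, 36, 28],
                       "called": [0, 1, 1, 0, 1, 0],
                       "Churn":  [1, 0, 0, 1, 0, 1]})
    model = LogisticRegression().fit(df[["Age", "called"]], df["Churn"])

    customer = pd.DataFrame({"Age": [40], "called": [0]})
    p_no_call = model.predict_proba(customer)[0, 1]
    p_call = model.predict_proba(customer.assign(called=1))[0, 1]
    print(f"churn risk without call: {p_no_call:.2f}, with call: {p_call:.2f}")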

~Martin

- Sr. Director Data Solutions, Altair RapidMiner -
Dortmund, Germany

Comments

  • Telcontar120 · Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member · Posts: 1,635 · Unicorn

    Great article @mschmitz! A question that has come up several times in my training classes is whether there is any operator to provide attribute importance for a multivariate scoring model. As you mention above, some modeling algorithms output weights directly, but many popular ones do not (e.g. decision tree, neural net, naive bayes). In those cases, I usually tell the students that they can take one of the other approaches discussed in this article (such as the "weight by" operators discussed in the first section, which speak to independent weight but don't address weight in the context of a specific model).

    In theory, another approach is to take the list of all model attributes and remove them one at a time from the final model to see the resulting deterioration in model performance, and rank them accordingly (the attribute that leads to the greatest decrease in performance gets the highest weight, and all other attributes' weights are scaled to that); a sketch of this idea follows below. This can of course be done manually (and even with loops to cut down on repetitive operations), but it would be nice if RapidMiner added an operator to do this automatically for any model and output the resulting table as a set of weights. Just an idea for a future product enhancement!
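    A Python sketch of this "remove one attribute at a time" (drop-column) idea, with an example data set:

        # Drop-column importance: retrain without each attribute in turn and
        # rank attributes by the drop in cross-validated performance.
        import pandas as pd
        from sklearn.datasets import load_breast_cancer
        from sklearn.linear_model import LogisticRegression
        from sklearn.model_selection import cross_val_score

        X, y = load_breast_cancer(return_X_y=True, as_frame=True)
        model = LogisticRegression(max_iter=5000)

        baseline = cross_val_score(model, X, y, cv=5).mean()
        drops = {f: baseline - cross_val_score(model, X.drop(columns=f), y, cv=5).mean()
                 for f in X.columns}

        weights = pd.Series(drops).sort_values(ascending=False)
        weights /= weights.max()   # scale so the top attribute gets weight 1
        print(weights.head(10))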

    Brian T.
    Lindon Ventures
    Data Science Consulting from Certified RapidMiner Experts
  • MartinLiebig · Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor · Posts: 3,417 · RM Data Scientist

    Thanks @Telcontar120,

    I agree that there are some more methods, which will be added to this article over time. Your proposal is totally fine; I often call these n-1 models. The whole behaviour is pretty similar to Backward Elimination.


    As another point: Breiman et al. propose not to delete an attribute, but only to "noise" it during testing by shuffling its values. I think this technique has some nice advantages over removing attributes (e.g. better handling of collinearities); see the sketch below. I have a process for this somewhere... need to search :)
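    The shuffling approach Breiman describes is nowadays known as permutation importance, and scikit-learn ships it directly (example data set assumed):

        # Permutation importance: shuffle one attribute at a time on held-out
        # data and measure the drop in model performance.
        import pandas as pd
        from sklearn.datasets import load_breast_cancer
        from sklearn.ensemble import RandomForestClassifier
        from sklearn.inspection import permutation_importance
        from sklearn.model_selection import train_test_split

        X, y = load_breast_cancer(return_X_y=True, as_frame=True)
        X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

        forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
        result = permutation_importance(forest, X_te, y_te, n_repeats=10, random_state=0)

        weights = pd.Series(result.importances_mean, index=X.columns)
        print(weights.sort_values(ascending=False).head(10))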

    ~Martin

    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany
  • Telcontar120 · Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member · Posts: 1,635 · Unicorn

    Thanks @mschmitz, of course I realize there is not really a simple, singular way of answering the question "what is the variable importance in my model?", so multiple ways of answering it may be appropriate. Nevertheless it is a very common question in my experience, so it would be nice if RapidMiner had a convenient and easy-to-explain approach to offer its users. As you call it, the "n-1" approach is quite similar to Backward Elimination, and in fact, if the intermediate output from the existing Backward Elimination operator could be made available, then that might be a very easy way for RapidMiner to create an operator that would provide one perspective on that question. And of course if you find a relevant process and post it here, I would be quite grateful :-)

    Cheers,

    Brian

    Brian T.
    Lindon Ventures
    Data Science Consulting from Certified RapidMiner Experts
  • kypexin · Moderator, RapidMiner Certified Analyst, Member · Posts: 291 · Unicorn

    Hi Martin,

    Regarding this part:

    "The operator Generate Function Set can be used to generate things like pow(Age,2) or sqrt(Age) and combination between these."

    Do I get it right that if we want to overcome non-linearity using this method, we should also discard the original attributes after the transformation (uncheck 'Keep all')? Maybe you have an example of the whole process?

    Thanks!

  • abbasi_samira · Member · Posts: 9 · Contributor I

    hi,

    please help me: I want to weight my data by class. What do I use?

    thanks
  • MoWei · Member · Posts: 18 · Maven
    Hello everyone,

    first of all, many thanks for the great article @mschmitz.
    I have a general question. Basically I learned that feature selection approaches can be sorted into the three areas "Filter", "Wrapper" and "Embedded". I'm just wondering into which area model-based feature weighting falls. Maybe someone can help me and explain it briefly. My first thought would be to call it an embedded method, because the selection of the subset is directly tied to the learning algorithm. Is that correct?

    Another short question which also fits here: I often use the multi-objective optimization, which was perfectly explained by @IngoRM in his blog about feature selection (link). Besides forward selection and backward elimination, this is a classic wrapper approach, isn't it?

    Many thanks for your help.

    Best regards

    Moritz
  • Uche · Member · Posts: 3 · Newbie
    Hi everyone,

    Thanks for the very detailed write-up, @mschmitz!
    I would like to know if these methods are implemented to execute in a parallelized manner, especially because some of them do not scale well with increasing data sizes.


    Thank you,

