Predicting whether a product is a beverage using a CSV file

luiz_vidal Member Posts: 14 Contributor II
edited December 2018 in Help

Hi guys,

I am quite new to RapidMiner and here is my "problem".

I want to build a process. I have 2 columns in a CSV file (Desc - Description, and Bebidas - 0 or 1), and I want to predict from the description whether a product is a beverage ("bebida" in Portuguese). Here is what I have so far:

[Image: process.JPG, "My process"]

After this transformation I added a Random Forest learner, but somehow I can't tell which column is the prediction column. I also tried Naive Bayes. The choice of algorithm itself isn't the issue; after processing the documents I need a way to transform them back into data so I can use them for the prediction. Can someone help me do this the right way? I'm kind of stuck. Thanks in advance.
The XML of my process follows below.

<operator activated="true" class="process" compatibility="8.0.001" expanded="true" name="Process">
[...]
@\[\\\]_`{|}~]"/>
[...]

Best Answer

  • Thomas_Ott RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,761 Unicorn
    Solution Accepted

    Your process is not quite what I'm used to when building text processing in RapidMiner. I don't understand what the Replace operator is doing. Is it supposed to help the tokenization? If so, you can select 'specify parameters' and paste it in there.

    Rearranging it, I would do something like this.

    <operator activated="true" class="process" compatibility="7.6.003" expanded="true" name="Process">
    [...]
    @\[\\\]_`{|}~]"/>
    [...]

    You can save yourself one Generate Attributes entry by using this operator to lower the case of your text.

    In the training phase, a model is built on the current training data set (90% of the data by default, repeated 10 times).

    <operator activated="true" class="apply_model" compatibility="7.6.003" expanded="true" height="82" name="Apply Model" width="90" x="45" y="34">
    [...]
    <description align="left" color="blue" colored="true" height="103" resized="true" width="315" x="38" y="137">The model created in the Training step is applied to the current test set (10%).<br/>The performance is evaluated and sent to the operator results.</description>
    [...]

    A cross-validation evaluating a decision tree model.

    Of course, swap out the decision tree algorithm for the one you want. This process passes the label with classes '0' and '1' to the Text Processing and then trains on it using a Cross Validation. You might get horrible accuracy on the first pass, but adjusting the pruning, changing the algorithm, and optimizing parameters all help.
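The idea of keeping the label attached while the text is turned into features can be sketched outside RapidMiner. The following is a minimal plain-Python sketch, not the RapidMiner process itself: only the column names Desc and Bebidas come from the thread, while the sample rows and the function name `load_examples` are made up for illustration.

```python
import csv
import io
from collections import Counter

# Hypothetical sample rows; only the column names Desc and Bebidas come from the thread.
SAMPLE_CSV = """Desc,Bebidas
coca-cola 300ml,1
broomstick,0
mineral water 1l,1
"""

def load_examples(handle):
    """Return (token_counts, label) pairs so the label stays attached to each document."""
    examples = []
    for row in csv.DictReader(handle):
        tokens = row["Desc"].lower().split()
        examples.append((Counter(tokens), row["Bebidas"]))
    return examples

examples = load_examples(io.StringIO(SAMPLE_CSV))

# Vocabulary = every distinct token seen in any description.
vocabulary = sorted({tok for counts, _ in examples for tok in counts})

# One row per document: word counts in vocabulary order, label kept as the last element.
table = [[counts[tok] for tok in vocabulary] + [label] for counts, label in examples]
```

Each row of `table` is a word-count vector with the original label as its last element, which is the shape a learner needs: features derived from the text, plus the column to predict.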


Answers

  • kayman Member Posts: 662 Unicorn

    Did you make a label? The value you want to predict needs to be defined as the label using the Set Role operator. Otherwise the system has no clue which attribute to predict.

  • luiz_vidal Member Posts: 14 Contributor II

    Yes, I did set a label; the process even suggests it as a "fix". The problem is that I want to predict the field "Bebida", and this field doesn't come along after the Process Documents operator. I have the description field (which can be 'AAAAA BBBBB CCCCC'), and I perform some cleansing that transforms it into 'AAAA BBBB', for example. Then I transform it into documents, tokenize, and pass it through a stop-word filter. After it comes out of Process Documents, I want to predict whether 'AAAA BBBB' is a yes or a no. That's it.

  • luiz_vidal Member Posts: 14 Contributor II

    Thomas, thanks for your reply.

    Well, the Replace step cleans up the description column a bit. The problem is that users are usually allowed to type anything into this field. For a coke (coca-cola here in Brazil), for example, they might enter coke 300ml, coca-cola, coke1l, coke pack, coke@, coke.fanta and so on. This first Replace just removes the special characters to make things easier for the tokenizer. After the tokenizer I also remove stop words (ml for milliliter, a, g, l, etc.), so when Process Documents finishes I have a cleaner product description to classify as a beverage or not (0 - no, 1 - yes).

    In your experience, should I generate a binominal field with yes and no instead of 0 and 1?
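The cleansing described here (strip special characters, tokenize, drop unit-like stop words) can be sketched in plain Python. This is only an illustration under assumptions: the stop-word set is taken from the units mentioned in the post, the function name `clean_description` is hypothetical, and splitting digits from letters and dropping bare numbers are extra steps chosen for the sketch, not something the RapidMiner process is known to do.

```python
import re

# Assumed stop-word list, based on the units mentioned in the post (ml, a, g, l).
STOP_WORDS = {"ml", "a", "g", "l"}

def clean_description(text):
    """Lowercase, strip special characters, split digits from letters, drop stop words."""
    text = text.lower()
    # Replace anything that is not a letter or digit with a space: "coke.fanta" -> "coke fanta".
    text = re.sub(r"[\W_]+", " ", text)
    # Separate letters from digits so units become their own tokens: "coke1l" -> "coke 1 l".
    text = re.sub(r"([^\W\d_])(\d)", r"\1 \2", text)
    text = re.sub(r"(\d)([^\W\d_])", r"\1 \2", text)
    # Assumption for this sketch: bare numbers carry no signal and are dropped with the stop words.
    return [tok for tok in text.split() if tok not in STOP_WORDS and not tok.isdigit()]

variants = ["coke 300ml", "coca-cola", "coke1l", "coke pack", "coke@", "coke.fanta"]
cleaned = {variant: clean_description(variant) for variant in variants}
```

With these rules, variants such as `coke 300ml`, `coke1l` and `coke@` all reduce to the single token `coke`, which is the kind of normalization the post is aiming for.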

  • luiz_vidal Member Posts: 14 Contributor II

    Thomas,

    I imported your XML, and it is exactly what I was trying to do but couldn't.

    The funny thing is that the classifier now reaches 100% accuracy, which I don't think is a good sign. Am I right?

    My data is a simple product description column (for example coke, broomstick, water, and so on) that needs to be classified as Yes or No. Which algorithms would be best suited to this? Also, if I provided a subset of my data, would you be able to help me find out what I'm doing wrong in my process, or help define the right sequence of operators to classify correctly? As I said, I imagine 100% doesn't sound right.

  • Thomas_Ott RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,761 Unicorn

    My preference is to use 'yes' or 'no' instead of 1 and 0, but that's me. You can change that in your Generate Attributes by putting in "yes" and "no".

  • luiz_vidal Member Posts: 14 Contributor II

    Thomas,

    I have another question, now regarding my data.

    The algorithm runs fine and the accuracy is 100% with a decision tree. The data is unbalanced, but that is its nature: across a wide range of products, around 5-10% will be beverages, while the others might be clothes, food, etc.

    What would be the best way to split the data to get a more reliable accuracy estimate?
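A side note on the 5-10% figure: with that class ratio, accuracy alone says little about how well beverages are found. The minimal plain-Python sketch below uses made-up labels (10 beverages among 100 products) and a hypothetical helper `summarize` to compare accuracy with precision and recall for a "model" that answers "no" for every product.

```python
def summarize(actual, predicted, positive="1"):
    """Compute accuracy, precision and recall for one positive class."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == positive and p == positive for a, p in pairs)
    fp = sum(a != positive and p == positive for a, p in pairs)
    fn = sum(a == positive and p != positive for a, p in pairs)
    tn = sum(a != positive and p != positive for a, p in pairs)
    return {
        "accuracy": (tp + tn) / len(pairs),
        # Guard against division by zero when nothing is predicted (or actually) positive.
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }

# Hypothetical labels: 10 beverages ("1") among 100 products.
actual = ["1"] * 10 + ["0"] * 90

# A trivial baseline that never predicts "beverage".
always_no = ["0"] * 100

baseline = summarize(actual, always_no)
```

Here `baseline["accuracy"]` is 0.9 even though `baseline["recall"]` is 0.0, so on data like this it is worth looking at precision and recall for the beverage class, not only accuracy.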

  • Thomas_Ott RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,761 Unicorn

    Well, 100% means you're overfitting if you're using a Decision Tree. I just slapped that in there as an example.

    What you need to do is balance the data better. You can either try the SMOTE operator in the Operator Toolbox, or use some macros to extract the number of classes you have and pass those macros to a Sample operator. Note that the right thing to do is to put the Sample operator (or SMOTE) inside the training side of the Cross Validation, not outside.
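Why the balancing belongs inside the training side can be sketched in plain Python. The sketch below is an illustration under assumptions, not the RapidMiner operators: it uses simple random oversampling (duplicating minority examples) rather than SMOTE, and the function names `stratified_folds`, `oversample` and `balanced_cv_splits` are hypothetical. The key property is that each test fold keeps the original class ratio and never contains a copy of a training example.

```python
import random

def stratified_folds(labels, k, rng):
    """Split indices into k folds, keeping the class ratio roughly equal in each fold."""
    folds = [[] for _ in range(k)]
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal each class out round-robin so every fold gets its share.
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)
    return folds

def oversample(indices, labels, rng):
    """Randomly duplicate minority-class indices until every class matches the largest one."""
    by_class = {}
    for idx in indices:
        by_class.setdefault(labels[idx], []).append(idx)
    target = max(len(group) for group in by_class.values())
    balanced = []
    for group in by_class.values():
        balanced.extend(group)
        balanced.extend(rng.choices(group, k=target - len(group)))
    return balanced

def balanced_cv_splits(labels, k=10, seed=0):
    """Yield (train_indices, test_indices); only the training side is rebalanced."""
    rng = random.Random(seed)
    folds = stratified_folds(labels, k, rng)
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield oversample(train, labels, rng), test

# Hypothetical labels: 10 beverages ("1") among 100 products.
labels = ["1"] * 10 + ["0"] * 90
splits = list(balanced_cv_splits(labels, k=5))
```

If the oversampling were done before splitting, duplicates of the same beverage could land in both the training and the test side, which would inflate the measured performance.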
