Binary text classification - Help in process needed.

thiemothiemo MemberPosts:8Contributor I
edited November 2018 inHelp

Hey guys,

We want to do a binary classification on a text data set with a class distribution of 80% negative and 20% positive. To get the most reliable performance estimate, we want to use 10-fold cross-validation.

If we model this in RapidMiner, it doesn't work, because the process doesn't output any performance metrics (precision, recall, etc.):

[Screenshots: Bildschirmfoto 2016-12-01 um 12.14.37.png, Bildschirmfoto 2016-12-01 um 12.15.34.png]

We found a workaround that works, but it doesn't make sense from an ML perspective: if we first split the data into training and test sets and then run 10-fold cross-validation, we get results. But the training/test split should be part of the cross-validation itself (9 training folds, 1 test fold, 10 iterations). So right now the only way to get this working is to FIRST split into test and training sets and THEN use X-Validation. Did we model it the right way, or did we miss anything?
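
To make the scheme we expect concrete, here is a minimal Python sketch (not RapidMiner code) of stratified 10-fold splitting. The function names and the 80/20 toy label list are assumptions made for illustration only. It shows that every example ends up in the test fold exactly once, so no separate outer training/test split should be needed.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k=10, seed=0):
    """Assign every example index to exactly one of k folds, class by class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the indices of each class round-robin so every fold
        # keeps roughly the original class ratio.
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)
    return folds


def train_test_splits(labels, k=10, seed=0):
    """Yield (train_indices, test_indices) once per fold."""
    folds = stratified_folds(labels, k, seed)
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# Hypothetical data set: 80 negative and 20 positive examples.
labels = ["neg"] * 80 + ["pos"] * 20
splits = list(train_test_splits(labels, k=10))
for train, test in splits:
    positives = sum(labels[i] == "pos" for i in test)
    print(len(train), len(test), positives)
```

With these toy labels, each iteration trains on 90 examples and tests on 10, two of which are positive.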

[Screenshots: Bildschirmfoto 2016-12-01 um 12.14.37.png, Bildschirmfoto 2016-12-01 um 12.15.01.png, Bildschirmfoto 2016-12-01 um 12.15.34.png]

If you need any more information to help us, just comment.

Thank you very much in advance.

Best regards!

Answers

  • Thomas_OttThomas_Ott RapidMiner Certified Analyst, RapidMiner Certified Expert, MemberPosts:1,761Unicorn

    OK, silly question, but did you set a label role in your data set?

  • Telcontar120Telcontar120 Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, MemberPosts:1,635Unicorn

    This sounds like a strange problem, but it's very hard to troubleshoot from a screenshot of a process--can you post the process itself for review? You can export it from the file menu and attach it as a file.

    Thanks,

    Brian T.
    Lindon Ventures
    Data Science Consulting from Certified RapidMiner Experts
  • thiemothiemo MemberPosts:8Contributor I

    Hey T-Bone,

    Yes, I set a label role ;)

    Regards,

  • thiemothiemo MemberPosts:8Contributor I

    Hey Brian,

    thank you for your answer.

    Here is the process that gives me results but makes no sense ;)

    It would be great if you could help me. If you need any more information, I am happy to provide it ;)

    Best regards,
    Thiemo

  • Thomas_OttThomas_Ott RapidMiner Certified Analyst, RapidMiner Certified Expert, MemberPosts:1,761Unicorn

    I would double-check your process; something doesn't appear to be correct, because I can easily extract the P/Rs and the confusion matrix.

    See the sample XML below. This process takes Tweets, does a bit of processing up front, and generates a random label. The Process Documents from Data operator then processes them to TF-IDF (you can select Binary Occurrences) and spits out the confusion matrix.

    [Process XML not preserved; only the text of its operator annotations survives:]

    In the training phase, a model is built on the current training data set. (90 % of data by default, 10 times)

    The model created in the Training step is applied to the current test set (10 %). The performance is evaluated and sent to the operator results.

    A cross-validation evaluating a decision tree model.

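The TF-IDF and binary-occurrence weightings mentioned above can be sketched in plain Python. This is only one common TF-IDF variant (relative term frequency times the natural log of inverse document frequency) and is not claimed to match RapidMiner's exact formula or normalization; the function names and the three sample documents are made up for illustration.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per document (whitespace tokenization)."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


def binary_occurrences(documents):
    """Return one {term: 1} dict per document: only presence is recorded."""
    return [{term: 1 for term in set(doc.lower().split())} for doc in documents]


# Hypothetical mini corpus.
docs = ["good movie", "bad movie", "good good plot"]
vectors = tf_idf(docs)
binary = binary_occurrences(docs)
print(vectors[2])
print(binary[2])
```

A term that occurs in every document gets weight 0 under this variant, since log(1) = 0.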
  • Telcontar120Telcontar120 Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, MemberPosts:1,635Unicorn

    Hi @thiemo,

    I took your original process, and modified it only by inputting a simple toy example set using the identical Excel format (since I don't have your original dataset). Then I removed your outer split validation, and ran it again only using the cross-validation that you had as an inner operator. And it works fine! Here's the modified process. So if you are having problems, I suspect it must be something strange related to your original dataset. There's nothing that appears to be wrong with the process or with the cross-validation operator. Sorry I couldn't be more definitive.

    Brian T.
  • Telcontar120Telcontar120 Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, MemberPosts:1,635Unicorn

    And here's the Excel file I used as input in case you are interested.

    Brian T.
  • thiemothiemo MemberPosts:8Contributor I

    Hey Brian,

    thank you very much for your solution. I downloaded the process and the Excel file and tried it, and it works perfectly, but I do not get the performance parameters such as accuracy, recall, precision, and AUC.

    How can I use this process and receive those 4 parameters?

    Regards,


    Thiemo

  • Telcontar120Telcontar120 Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, MemberPosts:1,635Unicorn

    Hi @thiemo,

    I'm not sure what you mean--those performance metrics are all available in the performance tab output from the process when it runs. See the attached screenshot. This is part of the output for the process I supplied with no changes. Of course, the values are useless with my test examples since there are only 10 of them, but you can see that AUC, accuracy, precision, and recall are all available. If you run it on a larger dataset then they should all be there.
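
For reference, here is a rough Python sketch of how these four metrics are commonly defined. It is not RapidMiner's implementation; the function names, the 0.5 threshold, and the ten labels and confidence scores are invented for illustration.

```python
def confusion_counts(y_true, y_pred, positive):
    """Count true/false positives and negatives for one positive class."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1
        elif pred == positive:
            fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def precision_recall_accuracy(tp, fp, fn, tn):
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    accuracy = (tp + tn) / total if total else 0.0
    return precision, recall, accuracy


def auc(y_true, scores, positive):
    """Chance that a random positive outscores a random negative (ties = 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    if not pos or not neg:
        return float("nan")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical true labels and model confidences for the positive class.
y_true = ["pos", "neg", "neg", "pos", "neg", "neg", "neg", "pos", "neg", "neg"]
scores = [0.9, 0.2, 0.6, 0.4, 0.1, 0.3, 0.7, 0.8, 0.2, 0.5]
y_pred = ["pos" if s >= 0.5 else "neg" for s in scores]

tp, fp, fn, tn = confusion_counts(y_true, y_pred, "pos")
precision, recall, accuracy = precision_recall_accuracy(tp, fp, fn, tn)
auc_value = auc(y_true, scores, "pos")
print(tp, fp, fn, tn, precision, recall, accuracy, auc_value)
```

Precision, recall, and accuracy depend on the chosen threshold, while AUC is computed from the raw scores.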

    [Screenshot: performance output.PNG]

    Brian T.
  • thiemothiemo MemberPosts:8Contributor I

    Hi Brian,

    thanks again for the quick answer.

    However, if I take the process you uploaded and use your Excel file, I get a result but not the statistical parameters such as precision and recall.

    [Screenshot: Bildschirmfoto 2016-12-03 um 15.33.04.png]

    Did you do anything special while importing the data? I just set the type of need data to binominal. What can I do to get the precision and recall for the data?

    Thank you and best regards,

    Thiemo

  • Thomas_OttThomas_Ott RapidMiner Certified Analyst, RapidMiner Certified Expert, MemberPosts:1,761Unicorn

    What you see in the Statistics tab is just some basic descriptive statistics of your data set. There will be no P/R or confusion matrix because you haven't done any modeling yet. This view is similar to a summary or head command in Python/R.

    You need to attach a Cross Validation operator with a machine learning algorithm embedded, plus a Performance operator, to generate the P/Rs and the confusion matrix.

  • thiemothiemo MemberPosts:8Contributor I

    Hi T-Bone,

    thank you for the answer.

    This was exactly my initial problem. If I add another Cross Validation operator with a Performance operator around the actual process, then it makes no sense anymore, right?

    Regards,

    Thiemo

  • Thomas_OttThomas_Ott RapidMiner Certified Analyst, RapidMiner Certified Expert, MemberPosts:1,761Unicorn

    From that point in your process (where you show the Statistics tab), connect a Cross Validation operator (insert your algo in the Training side, and an Apply Model and a Performance operator in the Testing side), THEN connect the "Per" port on the Cross Validation operator to the Results port. This will output the P/Rs etc. for you.
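
As a rough analogy outside RapidMiner, the same train / apply model / performance loop can be sketched in plain Python. Everything here is an assumption for illustration: the word-vote learner is a stand-in for whatever algorithm sits in the Training side, the folds are simple round-robin rather than stratified, and the ten texts are invented and trivially separable, so the resulting numbers mean nothing.

```python
from collections import Counter


def train(texts, labels, positive="pos", negative="neg"):
    """Toy learner: count in how many documents of each class a word occurs."""
    doc_counts = Counter(labels)
    word_counts = {positive: Counter(), negative: Counter()}
    for text, label in zip(texts, labels):
        word_counts[label].update(set(text.lower().split()))
    return doc_counts, word_counts


def apply_model(model, text, positive="pos", negative="neg"):
    """Predict the class whose documents share more words with the text."""
    doc_counts, word_counts = model
    score = 0.0
    for word in set(text.lower().split()):
        pos_rate = word_counts[positive][word] / max(doc_counts[positive], 1)
        neg_rate = word_counts[negative][word] / max(doc_counts[negative], 1)
        score += pos_rate - neg_rate
    return positive if score > 0 else negative


def cross_validate(texts, labels, k, positive="pos"):
    """Train on k-1 folds, test on the remaining one, pool the confusion counts."""
    tp = fp = fn = tn = 0
    for fold in range(k):
        train_idx = [i for i in range(len(texts)) if i % k != fold]
        test_idx = [i for i in range(len(texts)) if i % k == fold]
        model = train([texts[i] for i in train_idx], [labels[i] for i in train_idx])
        for i in test_idx:
            predicted = apply_model(model, texts[i])
            if predicted == positive and labels[i] == positive:
                tp += 1
            elif predicted == positive:
                fp += 1
            elif labels[i] == positive:
                fn += 1
            else:
                tn += 1
    return {
        "confusion": (tp, fp, fn, tn),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "accuracy": (tp + tn) / len(texts) if texts else 0.0,
    }


# Hypothetical toy corpus, alternating positive and negative examples.
texts = ["great product love it", "bad service hate it",
         "great support love this", "bad quality hate this",
         "great value", "bad value", "love it great", "hate it bad",
         "great great love", "bad bad hate"]
labels = ["pos", "neg"] * 5
result = cross_validate(texts, labels, k=5)
print(result)
```

The key point is that the performance numbers come out of the cross-validation loop itself, which corresponds to wiring its performance output to the results.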
