"Text mining from Excel file and Split validation"

federico_schiro (Member; Posts: 6; Contributor II)
edited June 2019 in Help

Hi.

Thanks to my teacher I've entered the fantastic world of RapidMiner. I love it, even though I'm still a newbie.

I'm trying to build a text classification model starting from an Excel file with two columns:
        Column 1           Column 2
Row 1   attribute (text)   label (binominal: simply 0 for a negative review and 1 for a positive review)

Up till now we have only worked with positive reviews stored as .txt files in one folder and negative reviews stored as .txt files in another folder, and we defined the two folders as the positive class and the negative class.

I've tried to proceed like this: Read Excel - Process Documents (Tokenize, remove stopwords and transform cases) - Validation (training with SVM + Apply Model and Performance)

I used nominal to numerical to avoid SVM capacity problems, but as a result I get only the root mean squared error in the Performance vector.

I was looking for the accuracy of my model instead... Sorry for the bad question, I hope somebody can help.
Can I use a .txt file as an alternative? See the attached file.
Thanks a lot in advance.


Answers

  • lionelderkrikor (Moderator, RapidMiner Certified Analyst, Member; Posts: 1,195; Unicorn)

    Hi @federico_schiro,

    Have you tried using the Performance (Classification) or Performance (Binominal Classification) operator as your Performance operator?

    Regards,

    Lionel

  • federico_schiro (Member; Posts: 6; Contributor II)

    Thanks a lot, I got what I was looking for. Wow.
    Do you think I'm likely to get better accuracy if I work with more reviews as my corpus? I have 2000 more reviews (from Amazon and IMDb).

    With the Yelp reviews I get 56%:

    PerformanceVector:
    accuracy: 56.00%
    ConfusionMatrix:
    True: 0 1
    0: 41 29
    1: 59 71

  • federico_schiro (Member; Posts: 6; Contributor II)

    Thanks a lot. It works.

    I have a question regarding the accuracy. I got 56% here. Is it possible to raise it by adding more reviews to my corpus?
    Hopefully the other reviews won't make it worse. Do you think it makes sense to work with 2000 more reviews from 2 different platforms, or would that make things worse?
    Thanks a lot again.

    PerformanceVector:
    accuracy: 56.00%
    ConfusionMatrix:
    True: 0 1
    0: 41 29
    1: 59 71
    AUC: 0.614 (positive class: 1)
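
For readers who want to check how the 56.00% follows from the confusion matrix, here is a minimal Python sketch. It assumes the rows of the printed matrix are predicted classes and the columns (under "True:") are true classes; the function names are illustrative only and are not part of RapidMiner.

```python
# Confusion matrix copied from the PerformanceVector above.
# Assumption: outer key = predicted class, inner key = true class.
matrix = {
    0: {0: 41, 1: 29},  # predicted 0: 41 truly 0, 29 truly 1
    1: {0: 59, 1: 71},  # predicted 1: 59 truly 0, 71 truly 1
}


def accuracy(m):
    """Share of all examples whose predicted class equals the true class."""
    correct = sum(m[c][c] for c in m)
    total = sum(sum(row.values()) for row in m.values())
    return correct / total


def recall(m, cls):
    """Of all examples that truly belong to cls, the share predicted as cls."""
    return m[cls][cls] / sum(m[pred][cls] for pred in m)


def precision(m, cls):
    """Of all examples predicted as cls, the share that truly belong to cls."""
    return m[cls][cls] / sum(m[cls].values())


print(f"accuracy: {accuracy(matrix):.2%}")
print(f"recall (class 0): {recall(matrix, 0):.2%}")
print(f"recall (class 1): {recall(matrix, 1):.2%}")
print(f"precision (class 1): {precision(matrix, 1):.2%}")
```

Under that row/column assumption, the per-class recall shows where the errors sit: most of the mistakes are true 0 reviews predicted as 1.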

  • Thomas_Ott (RapidMiner Certified Analyst, RapidMiner Certified Expert, Member; Posts: 1,761; Unicorn)

    There are lots of ways to possibly improve your classification results. Some that could help right off the bat are pruning, n-grams, and filtering out words with few characters. You might want to review how you tokenize the words too. If you have lots of numbers in the corpus, the default tokenization parameter of 'non letters' will wipe those out.

    Next you can use another algorithm, like Linear SVM or Deep Learning. I would use them in conjunction with a Cross Validation, not a Split Validation.
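
Two of the suggestions above, filtering short tokens and adding n-grams, can be illustrated outside RapidMiner with a small Python sketch. The function names, the `min_chars` and `max_length` parameters, and the underscore used to join n-grams are assumptions chosen for this example, not RapidMiner operator settings.

```python
def filter_short_tokens(tokens, min_chars=3):
    """Drop tokens shorter than min_chars characters."""
    return [t for t in tokens if len(t) >= min_chars]


def add_ngrams(tokens, max_length=2):
    """Return the original tokens plus all n-grams up to max_length tokens long."""
    terms = list(tokens)
    for n in range(2, max_length + 1):
        for i in range(len(tokens) - n + 1):
            terms.append("_".join(tokens[i:i + n]))
    return terms


tokens = "the food was not good at all".split()
kept = filter_short_tokens(tokens)
print(kept)
print(add_ngrams(kept))
```

The bigram `not_good` is the kind of term that single-word features miss in sentiment data, which is one reason n-grams can help with reviews.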

  • federico_schiro (Member; Posts: 6; Contributor II)

    Fantastic. Thanks a lot. You people are very supportive.

    I forgot about stemming. Using a Stem (Porter) operator, I got 3% more accuracy.

    Do you think 3%-30% pruning is OK, or can I change it to get better results?
    What are the options with Tokenize? I've left it at the default, "non letters".

  • Thomas_Ott (RapidMiner Certified Analyst, RapidMiner Certified Expert, Member; Posts: 1,761; Unicorn)

    So text processing is almost as much an art form as it is analytics; it will require some thinking from the domain expert. I don't know what corpus you're trying to classify, but sometimes 3%/30% pruning is right, and other times 5%/80% is good. The short answer is that it depends.

    Of course, if you used an Optimize Parameters operator, you could tune the actual pruning percentages to find the optimal values for the best performance measure.
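
To make the percentage pruning concrete, here is a hedged Python sketch of pruning by document frequency. It assumes that "3/30" means dropping terms that occur in fewer than 3% or more than 30% of the documents, with both boundaries treated as inclusive; the function and parameter names are made up for this example and RapidMiner's exact boundary handling may differ.

```python
def prune_by_document_frequency(docs, below_percent=3.0, above_percent=30.0):
    """Keep terms whose document frequency lies within [below_percent, above_percent].

    docs is a list of token lists, one per document.
    """
    n_docs = len(docs)
    doc_freq = {}
    for tokens in docs:
        for term in set(tokens):  # count each term once per document
            doc_freq[term] = doc_freq.get(term, 0) + 1
    return {
        term
        for term, count in doc_freq.items()
        if below_percent <= 100.0 * count / n_docs <= above_percent
    }


# Toy corpus of five already-tokenized reviews.
docs = [
    ["great", "food"],
    ["great", "service"],
    ["great", "food", "slow"],
    ["great", "bad"],
    ["great", "food"],
]

# "great" is in 100% of documents, "food" in 60%, the rest in 20% each.
print(prune_by_document_frequency(docs, below_percent=30.0, above_percent=80.0))
```

Trying several (below, above) pairs in a loop and scoring each with a validation run is the manual equivalent of what a parameter optimization would automate.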

    With respect to tokenization, I talk about that in my video here: https://www.youtube.com/watch?v=ia2iV5Ws3zo. I do a lot of Twitter mining, so a hashtag like #datascience would be obliterated using the non-letters parameter, whereas with specify characters I could just split on ".,![]"
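
The difference between the two tokenization modes can be mimicked in plain Python. This is only a sketch: the regular expressions are an approximation of the behaviour described above, the function names are hypothetical, and splitting on whitespace in the second function is an added assumption.

```python
import re


def tokenize_non_letters(text):
    """Split on every run of non-letter characters (ASCII letters only here)."""
    return [t for t in re.split(r"[^A-Za-z]+", text) if t]


def tokenize_specify_characters(text, characters=".,![]"):
    """Split only on the listed characters (plus whitespace, by assumption)."""
    pattern = "[" + re.escape(characters) + r"\s]+"
    return [t for t in re.split(pattern, text) if t]


tweet = "Loving #datascience in 2019!"
print(tokenize_non_letters(tweet))
print(tokenize_specify_characters(tweet))
```

The first call loses both the `#` and the number, while the second keeps `#datascience` and `2019` as tokens.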

  • federico_schiro (Member; Posts: 6; Contributor II)

    Another question :)

    So far I still haven't fully understood what the blue curve here (ROC threshold) represents.
    I get what the red one expresses, but what about the blue one?
    (I know, my accuracy isn't that great, that's why my red curve looks like that, "cringe")

    Thanks!

  • MartinLiebig (Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor; Posts: 3,368; RM Data Scientist)

    Hi,

    it represents a confidence threshold. The ROC curve is calculated like this:

    Take a confidence threshold of 0.99 and calculate TPR/FPR for it -> data point

    Take a confidence threshold of 0.98 and calculate TPR/FPR for it -> data point

    (and so on for each lower threshold)

    The red curve shows the TPR/FPR values. The blue curve shows the corresponding thresholds needed to get these values.
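
The threshold sweep described above can be sketched in a few lines of Python. The confidences and labels below are invented toy data, and `roc_points` is a hypothetical helper written for this illustration; it is not how RapidMiner computes the plot internally.

```python
def roc_points(confidences, labels, thresholds):
    """For each threshold, predict class 1 when confidence >= threshold
    and return (threshold, TPR, FPR)."""
    positives = sum(1 for y in labels if y == 1)
    negatives = len(labels) - positives
    points = []
    for t in thresholds:
        tp = sum(1 for c, y in zip(confidences, labels) if c >= t and y == 1)
        fp = sum(1 for c, y in zip(confidences, labels) if c >= t and y == 0)
        points.append((t, tp / positives, fp / negatives))
    return points


# Toy data: model confidence for class 1, and the true label of each example.
confidences = [0.95, 0.90, 0.70, 0.60, 0.40, 0.20]
labels = [1, 1, 0, 1, 0, 0]

for t, tpr, fpr in roc_points(confidences, labels, [0.99, 0.80, 0.50, 0.10]):
    print(f"threshold={t:.2f}  TPR={tpr:.2f}  FPR={fpr:.2f}")
```

Each (FPR, TPR) pair is one point on the ROC curve, and the threshold that produced it is what the second curve plots. Lowering the threshold can only add predicted positives, so TPR and FPR both rise or stay the same as the threshold drops.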

    Best,

    Martin

    - Head of Data Science Services at RapidMiner -
    Dortmund, Germany
  • federico_schiro (Member; Posts: 6; Contributor II)

    Hi Martin, and thanks for the answer.

    I can understand what the ROC curve is, but it's the threshold curve (blue) that confuses me.

    I watched several videos about it and thought of it like this: when the threshold is high, I have a higher TPR (because I "accept" only high predicted probabilities, so it's easier to get the prediction right), whereas when the threshold is low (for instance a predicted probability < 0.5) I see a higher FPR.

    Also, I tried TF-IDF without pruning, and my accuracy skyrocketed!

  • MartinLiebig (Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor; Posts: 3,368; RM Data Scientist)

    Hi,

    The threshold itself does not tell you much about your performance. Read it like this: if you want to get a given TPR/FPR value on the red curve, you need to use the threshold shown by the blue curve at that point.

    Does this make more sense?

    Best,

    Martin
