Text mining classification with multiple classes

marijn_nbr Member Posts: 2 Contributor I
edited December 2019 in Help

Hi,

I am relatively new to data science and therefore I have some questions:

I’m working on a multi-class text classification problem for a study assignment. The aim is to build a model that predicts the ‘score’ attribute of textual product reviews. The possible ‘score’ values (classes) are 1, 2, 3, 4 or 5, so it works like a star rating. My dataset contains 6 features:

  • ReviewerID, ReviewerName, ReviewText, Score, Summary and the length of the review text.
  • There are 5000 reviews (rows) in my dataset and a few missing values (ReviewerName)
    • 3000 reviews are 5-star reviews, 1000 are 4-star reviews, and the remaining 1000 are 1-, 2- or 3-star reviews, so the classes are imbalanced.
  • I've uploaded the dataset

I have tried various classification methods (kNN, naïve Bayes, logistic regression, SVM), but I cannot get my model's accuracy above 62%. I don’t know whether this is a good accuracy or not. A random guess gives 20%, but I have the feeling there are things I can do to build a more accurate model. If I try to rebalance the dataset, the accuracy drops to at most 40%.
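As a quick sanity check on the 20% figure, the sketch below compares a uniform random guess with always predicting the most frequent class, using the class counts given in the list above. The post does not say how the remaining 1000 reviews split across 1, 2 and 3 stars, so the grouping under one key is an assumption made only for illustration.

```python
from collections import Counter

# Class counts as described in the post. The 1-, 2- and 3-star reviews are
# kept under one key because their individual counts are not given
# (an assumption for this sketch only).
class_counts = Counter({"5": 3000, "4": 1000, "1-3": 1000})

total = sum(class_counts.values())
majority_class, majority_count = class_counts.most_common(1)[0]

uniform_guess = 1 / 5                        # five possible star ratings
majority_baseline = majority_count / total   # always predict the largest class

print(f"uniform random guess: {uniform_guess:.0%}")
print(f"always predicting {majority_class} stars: {majority_baseline:.0%}")
```

With these counts, always predicting 5 stars already scores 60%, so 62% is only slightly above that baseline. Per-class recall or a confusion matrix would probably say more than accuracy alone here.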

The process is: Read CSV (using quotes) -> Numerical to Polynominal -> Set Role (‘score’ as label) -> Nominal to Text -> Select Attributes (ReviewerID is left out) -> Split Data (70%/30%) -> Process Documents (tokenize, stem, filter stop words, transform cases, generate n-grams (2)) -> Cross Validation (10-fold, kNN) -> Performance
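For readers who want to see what the Process Documents step does to a single review, here is a minimal plain-Python sketch of tokenizing, lower-casing, stop-word filtering and bigram generation. It is not RapidMiner's implementation: the stop-word list, the `preprocess` name and the `_` joiner for n-grams are assumptions, and stemming is left out to keep the example dependency-free.

```python
import re

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"the", "a", "an", "is", "it", "and", "but", "i", "this"}

def preprocess(text, max_n=2):
    """Tokenize, lower-case, drop stop words, then add n-grams up to max_n."""
    # Lower-case first, then keep runs of letters as tokens.
    tokens = re.findall(r"[a-z]+", text.lower())
    # Drop stop words.
    tokens = [t for t in tokens if t not in STOP_WORDS]
    # Append n-grams of neighbouring tokens, joined with "_" (assumed joiner).
    terms = list(tokens)
    for n in range(2, max_n + 1):
        terms += ["_".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return terms

print(preprocess("The sound is great, but the cable broke."))
```

The sketch lower-cases before filtering stop words. If a stop-word filter is case-sensitive, doing it the other way round could let capitalised stop words such as "The" slip through.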

I don’t know whether I am missing steps in my process, making mistakes, or whether 62% accuracy is simply the maximum. I hope that someone can help me out or give me some tips!

Thanks!

Greetings Marijn

Answers

  • Thomas_Ott RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,761 Unicorn

    Please post your process XML; use the option to paste it in.

  • kayman Member Posts: 662 Unicorn

    62% is not that bad, especially when using review ratings as the main label.

    There are a couple of 'traps' when working with review ratings. Having some experience myself with Amazon review ratings, here are some of my observations:

    Culture plays a role: I'm not sure how your dataset is balanced, but with European data, for instance, it is very obvious that the further south you go (France, Spain, Portugal etc.), the more likely people are to give a 5 even if they are not perfectly happy, whereas the further north you go (Netherlands, Germany etc.), the more people tend to consider a 3 already a high score, as perfection doesn't exist. It's a bit of a black-and-white picture, but the differences are clear. A 5 in Spain can be like a 4 in Belgium and a 3 in Germany.

    Ambiguity is king: people say 'feature A is great but feature B sucks, but that's OK since I don't use it anyway', so the score is still high. This happens quite a lot and has an impact on your results, since algorithms tend to give such a review a neutral score as the negative compensates for the positive.

    Multi-topic: a bit related to the above, people tend to go through the complete feature list, leading again to 'flat scores'.

    How we tackled this: we used the ratings to do a first grouping, combining 4 and 5 (mainly positive), 3 as neutral, and 1 and 2 as negative. This should already give better results than the 5-point scale, which will never work reliably.
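    The grouping described above can be sketched as a simple lookup. This is only an illustration in Python; the `STAR_TO_SENTIMENT` and `collapse` names and the example ratings are made up.

```python
# Mapping from star rating to a coarser sentiment class, as described above.
STAR_TO_SENTIMENT = {
    1: "negative",
    2: "negative",
    3: "neutral",
    4: "positive",
    5: "positive",
}

def collapse(score):
    """Map a 1-5 star rating to negative / neutral / positive."""
    return STAR_TO_SENTIMENT[int(score)]

scores = [5, 4, 3, 1, 2, 5]  # made-up example ratings
labels = [collapse(s) for s in scores]
print(labels)
```

    Keep in mind that with the class counts from the question, 4000 of the 5000 reviews would end up 'positive', so always predicting 'positive' already gives 80% accuracy. The 3-class result is best judged against that number.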

    Next we worked in 2 flows: first topic analysis to get rid of all the small talk, then sentiment analysis on the topics per review. Since topics can have different weights, this also has an impact on the overall happiness associated with a review. Simply put, in a headphone review, for instance, the sentiment towards the sound will be more important than the sentiment towards the packaging material.
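    One way to read the topic weighting is as a weighted average of per-topic sentiment. The sketch below assumes sentiment scores in [-1, 1] and uses invented topics, weights and scores; it is not the actual flow we used, just the idea in a few lines of Python.

```python
def weighted_sentiment(topic_sentiments, topic_weights):
    """Weighted average of per-topic sentiment scores in [-1, 1]."""
    # Only the topics that actually occur in the review contribute.
    total_weight = sum(topic_weights[t] for t in topic_sentiments)
    if total_weight == 0:
        return 0.0
    weighted_sum = sum(s * topic_weights[t] for t, s in topic_sentiments.items())
    return weighted_sum / total_weight

# Hypothetical topic weights and per-topic sentiment for one headphone review.
weights = {"sound": 0.6, "comfort": 0.3, "packaging": 0.1}
review = {"sound": 0.8, "packaging": -0.9}

overall = weighted_sentiment(review, weights)
print(round(overall, 3))
```

    Here the very negative packaging comment only pulls the overall score down a little, because sound carries six times the weight.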

    Hope this helps a bit, but the best advice is to start by bringing your 5 labels down to 3.

  • marijn_nbr Member Posts: 2 Contributor I

    Hi guys,

    Thanks for your replies, they are very helpful! Here is my process XML:

    [Process XML not preserved. Surviving fragments: a list with key "data_set_meta_data_information", the parameter k = 15, a Declare Missing Value operator (compatibility 7.6.003), and the following annotations.]

    - Split words
    - Remove Stop Words and put everything to lower-case
    - n-grams transformation
    - The labels are numerical here, this is why I am doing Numerical to Polynominal
    - 'score' is set to be the label
    - I need to transform the cell with the text from nominal to text.
    - SPLIT Training and Testing
    - Removing rows with missing values (reviewerName) does not improve accuracy
    - Removing outliers has a negative impact on accuracy



    One question: which operator can I use to reduce the number of classes (1, 2, 3, 4 and 5) to 3 classes, where:

    - 1 and 2 are 'Negative'

    - 3 is 'Neutral'

    - 4 and 5 are 'Positive'

    Greetings Marijn

  • lionelderkrikor Moderator, RapidMiner Certified Analyst, Member Posts: 1,195 Unicorn

    Hi @marijn_nbr,

    You can use the Discretize (Discretize by User Specification) operator to reduce the number of classes of your label from 5 to 3.
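    To illustrate the idea outside RapidMiner: binning by user specification amounts to assigning each numerical value to the first named class whose upper limit it does not exceed. The Python sketch below is an approximation of that behaviour, not the operator's code; the class names, limits and the `discretize` function are assumptions.

```python
from bisect import bisect_left

# (class name, upper limit) pairs in ascending order. A value falls into the
# first class whose upper limit is >= the value. Names and limits are
# assumptions chosen to match the 3-class scheme discussed in this thread.
CLASSES = [("Negative", 2), ("Neutral", 3), ("Positive", 5)]
UPPER_LIMITS = [limit for _, limit in CLASSES]

def discretize(score):
    """Return the class name for a numerical score."""
    idx = bisect_left(UPPER_LIMITS, score)
    if idx == len(CLASSES):
        raise ValueError(f"score {score} is above the highest upper limit")
    return CLASSES[idx][0]

print([discretize(s) for s in (1, 2, 3, 4, 5)])
```

    Since this kind of binning works on numerical values, it presumably has to happen while 'score' is still numerical, i.e. before it is converted to a nominal label.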

    Here is the process with this new operator inserted:

    [Process XML not preserved. Surviving fragments: the parameter k = 15, a Declare Missing Value operator (compatibility 8.0.001), and the same annotations as in Marijn's process.]



    As predicted by @kayman, the accuracy of your model is significantly better with this transformation.

    Regards,

    Lionel
