Text mining classification with multiple classes
Hi,
I am relatively new to data science and therefore I have some questions:
I’m working on a text mining multi-class classification problem for a study assignment. The aim of my assignment is to build a model that predicts the ‘score’ attribute of textual reviews of products. The possible ‘score’ attribute values (classes) are 1,2,3,4 or 5, so it is like a star rating of reviews. My dataset contains 6 features:
- ReviewerID, ReviewerName, ReviewText, Score, Summary and the length of my textual review.
- There are 5000 reviews (rows) in my dataset and a few missing values (ReviewerName)
- 3000 reviews are 5 star reviews, 1000 reviews are 4 star reviews and the rest are 1, 2 or 3 star reviews. The classes are imbalanced.
- I've uploaded the dataset
I have used various classification methods (kNN, naïve Bayes, logistic regression and SVM) but I cannot seem to achieve an accuracy higher than 62%. I don’t know if this is a good accuracy or not; a random guess gives 20%, but I have the idea that there are things I can do to make a more accurate model. If I try to rebalance the dataset, the accuracy drops to at most 40%.
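For reference, the class counts listed above allow a quick check of the majority-class baseline (always predicting 5 stars). This is only a sketch: the 1, 2 and 3 star reviews are lumped into one bucket as an assumption, because their individual counts are not given.

```python
# Class counts as stated in the post; the 1-3 star reviews are lumped
# together because their individual counts are not given (assumption).
class_counts = {"5": 3000, "4": 1000, "1-3": 1000}

total = sum(class_counts.values())

# Accuracy of a "model" that always predicts the most frequent class.
majority_baseline = max(class_counts.values()) / total

# Accuracy of uniform random guessing over 5 classes.
uniform_baseline = 1 / 5

print(f"majority-class baseline: {majority_baseline:.0%}")
print(f"uniform random baseline: {uniform_baseline:.0%}")
```

With these counts, always predicting 5 stars is already right for 3000 of 5000 reviews, so that baseline may be a more useful comparison for the 62% than the 20% random guess.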
The process is: Read CSV (using quotes) -> numerical to polynomial -> set role (‘score’ as label) -> nominal to text -> select attributes (reviewer ID is left out) -> split data (70%/30%) -> process documents (tokenize, stem, filter stop words, transform cases, generate n-grams (2)) -> cross validation (10 fold, kNN) -> performance
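To make the document-processing step concrete, here is a rough plain-Python sketch of what tokenizing, case transformation, stop-word filtering and bigram generation do to a single review. It is not the RapidMiner operator: stemming is left out, the stop-word list is a tiny made-up one, lowercasing is done first so that the lower-case stop list matches, and the function name `preprocess` is hypothetical.

```python
import re

# Tiny illustrative stop-word list (assumption; real lists are much longer).
STOP_WORDS = {"the", "a", "an", "is", "it", "and", "but", "i", "this"}

def preprocess(text, n=2):
    """Lowercase, tokenize on letters, drop stop words, add n-grams up to n."""
    tokens = re.findall(r"[a-z]+", text.lower())
    tokens = [t for t in tokens if t not in STOP_WORDS]
    features = list(tokens)  # unigrams first
    for size in range(2, n + 1):
        for i in range(len(tokens) - size + 1):
            # Join neighbouring tokens with "_" to form one n-gram feature.
            features.append("_".join(tokens[i:i + size]))
    return features

example = preprocess("The sound is great but the cable broke")
print(example)
```

Note that the bigrams are built after the stop words are removed, so "great" and "cable" end up as neighbours even though they were not adjacent in the original sentence.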
I don’t know if I am missing steps in my process, if I am making mistakes, or if 62% accuracy is simply the maximum. I hope that someone can help me out or give me tips!
Thanks!
Greetings Marijn
Answers
Please post your XML, use the > option to paste it in.
62% is not that bad, especially when using review ratings as the main label.
There are a couple of 'traps' when looking at review ratings. Having some experience myself with Amazon review ratings, here are some of my observations:
Culture plays a role: not sure how your dataset is balanced, but when using European data it is for instance very obvious that the more southern you go (France, Spain, Portugal etc.) the higher the likelihood that people will give a 5 even if they are not perfectly happy, whereas the more northern you go (Netherlands, Germany etc.) the more people tend to consider a 3 already a high score, as perfection doesn't exist. It is a bit of a black and white picture, but the differences are clear. A 5 in Spain can be like a 4 in Belgium and a 3 in Germany.
Ambiguity is king: people say "feature a is great but feature b sucks, but that's OK since I don't use it anyway", so the score is still high. This happens quite a lot and has an impact on your results, since algorithms tend to give such a review a neutral score as the negative compensates for the positive.
Multitopic: a bit related to the above, where people tend to go through the complete feature list, leading again to 'flat scores'.
How we tackled this: we used the ratings to do a first clustering, combining 4 and 5 (mainly positive), 3 as neutral, and 1 and 2 as negative. This should already give better results than the 5-point scale, since that will never work reliably.
Next we worked in 2 flows: first topic analysis to get rid of all the small talk, then sentiment analysis on the topics in each review. Since topics can have different weights, this also has an impact on the overall happiness associated with a review. Simply put, when analysing a headphone review, for instance, the sentiment towards the sound will be more important than the sentiment towards the packaging material.
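To illustrate the weighting idea, here is a minimal sketch of a weighted average over per-topic sentiment. The topic names, sentiment values and weights are all made up for the headphone example, and `weighted_sentiment` is a hypothetical helper, not the flow we actually built.

```python
def weighted_sentiment(topic_sentiment, topic_weight):
    """Weighted average of per-topic sentiment scores in [-1, 1]."""
    total_weight = sum(topic_weight[t] for t in topic_sentiment)
    weighted_sum = sum(s * topic_weight[t] for t, s in topic_sentiment.items())
    return weighted_sum / total_weight

# Hypothetical headphone review: happy about sound, unhappy about packaging.
sentiment = {"sound": 0.8, "packaging": -0.6}
weights = {"sound": 3.0, "packaging": 1.0}  # sound matters more (assumed)

overall = weighted_sentiment(sentiment, weights)
unweighted = sum(sentiment.values()) / len(sentiment)
print(f"weighted: {overall:.2f}, unweighted: {unweighted:.2f}")
```

Without weights the positive and negative topics nearly cancel out into a 'flat' score; with the sound weighted more heavily the review comes out clearly positive.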
Hope this helps a bit, but best advice is already to bring down your 5 labels to 3.
Hi guys,
Thanks for your replies, they are very helpful! Here is my process xml:
One question: which operator can I use to reduce the number of classes (1,2,3,4 and 5) to 3 classes, where:
- 1 and 2 are 'Negative'
- 3 is 'Neutral'
- 4 and 5 are 'Positive'
Greetings Marijn
Hi @marijn_nbr,
You can use the Discretize (Discretize by User Specification) operator to reduce the number of classes of your label from 5 to 3.
Here is the process with this new operator inserted.
As predicted by @kayman, the accuracy of your model is significantly better with this transformation.
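For readers working outside RapidMiner, the same 5-to-3 mapping can be sketched in plain Python. This is not the RapidMiner process itself, and the function name `score_to_sentiment` is hypothetical.

```python
def score_to_sentiment(score):
    """Map a 1-5 star score to one of three sentiment classes."""
    if score not in (1, 2, 3, 4, 5):
        raise ValueError(f"unexpected score: {score!r}")
    if score <= 2:
        return "Negative"
    if score == 3:
        return "Neutral"
    return "Positive"

scores = [5, 4, 3, 2, 1]
labels = [score_to_sentiment(s) for s in scores]
print(labels)
```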
Regards,
Lionel