Another query on document classification and assigning weights to keywords
Hi,
Thanks for the response earlier.
I have a couple more questions on document classification, although they are unrelated to what I asked last time.
+ I am developing a Naive Bayes model on historical data (labeled with 4 categories) to classify documents. I have a pretty skewed sample (2 of the categories dominate). Is it important to have the data balanced (i.e. 25% per category)? I ask because the accuracy of my model is only 70%, even though I feel it should be around 80%-85%, as the data I am analyzing is pretty descriptive and of good quality.
+ Based on your experience, how important is filtering stopwords to building a classification model? Currently, I have used only the English stopwords. Depending on your response, I may have to build my own dictionary to filter out additional stopwords.
+ How can I assign weights to certain keywords in RapidMiner? I think this would help me improve the accuracy of the model.
+ As an alternative, is it possible to classify documents purely based on keywords for each category in an input file without actually building a model for classification (KNN, Naive Bayes)?
Thanks.
Regards,
Sharath
Furthermore, accuracy as a measure is highly dependent on class balance. If you have unbalanced data, accuracy becomes hard to interpret.
I personally think that filtering stopwords is not that important, because most stop words are thrown out by TF-IDF or feature selection anyway.
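To illustrate the point about accuracy on unbalanced data, here is a minimal Python sketch (not a RapidMiner process). The label counts and the degenerate "model" are made up for illustration; the idea is only that a classifier which never predicts the two rare categories can still reach a high accuracy, while per-class recall shows the problem.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Fraction of examples whose predicted label matches the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

def per_class_recall(y_true, y_pred):
    """For each true class, the fraction of its examples predicted correctly."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {label: hits[label] / totals[label] for label in totals}

# Hypothetical distribution: two of the four categories dominate.
y_true = ["A"] * 45 + ["B"] * 40 + ["C"] * 10 + ["D"] * 5
# Hypothetical degenerate model that never predicts the rare categories C and D.
y_pred = ["A"] * 45 + ["B"] * 40 + ["A"] * 10 + ["B"] * 5

acc = accuracy(y_true, y_pred)
recall = per_class_recall(y_true, y_pred)
# Balanced accuracy: the unweighted mean of the per-class recalls.
balanced_acc = sum(recall.values()) / len(recall)

print(f"accuracy: {acc:.2f}")
print(f"per-class recall: {recall}")
print(f"balanced accuracy: {balanced_acc:.2f}")
```

In this made-up case the accuracy looks good even though two categories are never recognised, so it is worth looking at per-class figures next to the overall number.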
So you would simply count keyword occurrences? Yes, that is possible. I built a process like this somewhere here in the forum.
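The counting idea can be sketched in a few lines of Python. This is not the forum process mentioned above, only an illustration: the categories, the keyword lists and the function name `classify_by_keywords` are hypothetical stand-ins for whatever is in the input file.

```python
import re
from collections import Counter

# Hypothetical keyword lists, standing in for the per-category input file.
CATEGORY_KEYWORDS = {
    "billing": {"invoice", "payment", "refund"},
    "support": {"error", "crash", "password"},
}

def classify_by_keywords(text, category_keywords, default="unknown"):
    """Assign the category whose keywords occur most often in the text."""
    # Lowercase the text and count every alphabetic token.
    tokens = Counter(re.findall(r"[a-z]+", text.lower()))
    # Score each category by the total number of keyword occurrences.
    scores = {
        category: sum(tokens[word] for word in keywords)
        for category, keywords in category_keywords.items()
    }
    best = max(scores, key=scores.get)
    # If no keyword matched at all, fall back to the default label.
    return best if scores[best] > 0 else default

label = classify_by_keywords(
    "The payment failed and I need a refund for this invoice.",
    CATEGORY_KEYWORDS,
)
print(label)
```

Note that ties are resolved in favour of the category listed first, and that no training data is used, so the result is only as good as the keyword lists.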
Btw: Have you tried a linear SVM?
Dortmund, Germany
I think you missed replying to my query on assigning weights. I would appreciate it if you could respond to that one as well.
Thanks.
Regards,
Sharath
The answer is basically that you cannot add weights for attributes, only for examples. The reason is that most models choose their weights on their own. Think about a linear regression: there you do not want to change the coefficients (~weights) yourself.
The only thing you can do is duplicate attributes.
Best,
Martin
Dortmund, Germany
Could you please elaborate on what you mean by adding weights to examples and not attributes, with reference to my case above?
Also, when you say duplicating attributes, do you mean duplicating certain mails (in my case) that are very descriptive and have a lot of keywords before building a model?
Thanks again.
Sharath
Add another column with Generate Attributes and set its role to weight. Then all learners that can handle weights will use them.
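As a rough idea of what such a weight column could contain, here is a small Python sketch rather than a RapidMiner process. The inverse-class-frequency scheme and the function name `inverse_frequency_weights` are assumptions for illustration; any other per-example value would work the same way once the column has the weight role.

```python
from collections import Counter

def inverse_frequency_weights(labels):
    """One weight per example: n / (k * class_count), so rare classes weigh more."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return [n / (k * counts[label]) for label in labels]

# Hypothetical skewed label column.
labels = ["A"] * 6 + ["B"] * 3 + ["C"] * 1
weights = inverse_frequency_weights(labels)

# Summing the weights per class shows that every class now
# contributes the same total weight to a weight-aware learner.
weighted_totals = Counter()
for label, weight in zip(labels, weights):
    weighted_totals[label] += weight

for label in sorted(weighted_totals):
    print(label, round(weighted_totals[label], 3))
```

With this scheme the single "C" example counts as much as all six "A" examples together, which is one way to compensate for a skewed sample.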
Dortmund, Germany