Tokenization vs N-grams
HeikoeWin786, in Help
Hello guys,
I am doing sentiment analysis in RapidMiner. While building the word vector, I found that there are two approaches: tokenization (by non-letters) and generating n-grams. I am not sure what the main difference between these two operators is, or what their best use cases are. Can someone explain how the two work differently in RapidMiner? For sentiment analysis, which approach would you suggest: tokenization or n-grams?
Thanks and regards,
Heikoe
Best Answer
kayman (Member, Posts: 662, Unicorn)
N-grams are sequences of successive tokens (words, in this case), so the two are related: you tokenize first, then build n-grams from the resulting tokens. Adding n-grams generally doesn't hurt an NLP workflow, so use them if your workflow can handle the larger feature set. You then have both your single tokens (words) and the n-grams available for training.
Bi-grams will do fine for sentiment; anything longer typically doesn't add much value.
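To make the relationship concrete, here is a minimal Python sketch of the idea, not RapidMiner's actual implementation. The function names `tokenize_non_letters` and `add_ngrams` are made up for illustration, the underscore separator is an assumption about how combined terms are written, and lowercasing is done by hand here because it is a separate step from tokenizing.

```python
import re

def tokenize_non_letters(text):
    """Split on any run of non-letter characters and drop empty tokens."""
    return [t for t in re.split(r"[^A-Za-z]+", text) if t]

def add_ngrams(tokens, max_n=2, sep="_"):
    """Return the original tokens plus every n-gram of length 2..max_n.

    The separator is an illustrative assumption, not a confirmed detail
    of any particular tool.
    """
    result = list(tokens)
    for n in range(2, max_n + 1):
        for i in range(len(tokens) - n + 1):
            result.append(sep.join(tokens[i:i + n]))
    return result

# Lowercase first (a separate pre-processing step), then tokenize.
tokens = tokenize_non_letters("The movie was not good.".lower())
features = add_ngrams(tokens, max_n=2)

print(tokens)
print(features)
```

With single tokens only, "not" and "good" are separate features, and "good" on its own looks positive. The bi-gram "not_good" keeps the negation attached, which is why bi-grams tend to help for sentiment.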
Answers
Thanks for the clarification.
So this means we use bi-grams as part of data pre-processing?
I.e. inside the Process Documents from Data operator, we add bi-gram generation as a pre-processing step together with Tokenize, Stem (Porter), and so on?
Thanks and regards,
Heikoe