Tokenization vs N-grams

HeikoeWin786 · August 2020

Hello guys,

I am doing sentiment analysis in RapidMiner. While creating word vectors, I found that there are two approaches: Tokenize (by non-letters) and Generate n-Grams. I am not sure what the main difference between these two operators is, or what their best use cases are. Can someone explain how the two work differently in RapidMiner? For sentiment analysis, which approach would you suggest: tokenization or n-grams?

Thanks and regards,
Heikoe

kayman · August 2020

n-grams are sequences of successive tokens (words, in this case), so the two are related. Using n-grams never hurts an NLP workflow, so just use them if your workflow can handle it. You then have both your single tokens (words) and the n-grams available for training.

Bi-grams will do fine for sentiment; anything longer typically doesn't add much value.
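
To make the relationship concrete, here is a rough sketch in plain Python (not RapidMiner itself). It splits text on non-letters and then builds bi-grams from the resulting tokens, keeping the single tokens as well. The function names `tokenize_non_letters` and `generate_ngrams` are made up for illustration, and joining the words with an underscore is an assumption about how the combined terms are written.

```python
import re


def tokenize_non_letters(text):
    """Split text on runs of non-letter characters and lowercase the result."""
    # re.split can yield empty strings at the edges, so filter them out.
    return [t for t in re.split(r"[^A-Za-z]+", text.lower()) if t]


def generate_ngrams(tokens, max_n=2):
    """Return the single tokens plus all n-grams up to length max_n."""
    terms = list(tokens)  # keep the unigrams
    for n in range(2, max_n + 1):
        # Slide a window of n successive tokens over the token list.
        for i in range(len(tokens) - n + 1):
            terms.append("_".join(tokens[i:i + n]))
    return terms


tokens = tokenize_non_letters("The movie was not good.")
terms = generate_ngrams(tokens, max_n=2)
print(tokens)
print(terms)
```

With single tokens only, "not" and "good" are separate features; the bi-gram "not_good" keeps the negation attached to the word it modifies, which is why bi-grams tend to help for sentiment.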

HeikoeWin786 · August 2020

@kayman

Thanks for the clarification.
So that means we use bi-grams as part of data pre-processing?
That is, inside the Process Documents from Data operator, we add bi-grams as a pre-processing step together with Tokenize, Stem (Porter), etc.?

Thanks and regards,
Heikoe

Altair RapidMiner Community