"Text Mining - Document Similarity/Clustering"

rahi84 · August 2015

Hello All,

I am trying to perform document similarity/clustering in RapidMiner on a survey text field and have been having problems so far. The data is saved in an Excel file (.xlsx), and I need to process the documents so that the case is lowered and the words are tokenized, stemmed, and filtered for stopwords. Could you please run me through the nodes that I need to apply to the data so that I can perform document similarity and clustering? I have watched 'el chief' tutorials on YouTube, but unfortunately that hasn't worked out. I have tried the following nodes (in order) and I get a blank output:

1. Read Excel
2. Data to Documents
3. Process Documents (+ Tokenize, Filter Stopwords (English), Transform Cases, Stem (Porter))
4. Data Similarity
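For readers who want to see what the four steps above amount to outside RapidMiner, here is a minimal plain-Python sketch. It is not the RapidMiner implementation: the stopword list is a tiny illustrative one, `crude_stem` is a simple suffix stripper standing in for Porter stemming, and the sample survey answers are made up.

```python
import math
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption, not RapidMiner's English list).
STOPWORDS = {"the", "a", "an", "is", "was", "and", "of", "to", "in", "it"}


def crude_stem(token):
    """Strip a few common suffixes. A stand-in for Porter stemming, not the real algorithm."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Lowercase, tokenize on letters, drop stopwords, then stem."""
    tokens = re.findall(r"[a-z]+", text.lower())
    tokens = [t for t in tokens if t not in STOPWORDS]
    return [crude_stem(t) for t in tokens]


def cosine_similarity(tokens_a, tokens_b):
    """Cosine similarity between two term-frequency vectors."""
    va, vb = Counter(tokens_a), Counter(tokens_b)
    dot = sum(va[t] * vb[t] for t in va)
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document has no similarity to anything
    return dot / (norm_a * norm_b)


# Made-up survey answers, for illustration only.
docs = [
    "The service was slow and the staff ignored us",
    "Slow service, staff ignoring customers",
    "Great food and friendly atmosphere",
]
token_lists = [preprocess(d) for d in docs]
sim_01 = cosine_similarity(token_lists[0], token_lists[1])
sim_02 = cosine_similarity(token_lists[0], token_lists[2])
print(sim_01, sim_02)
```

The first two answers share most of their stemmed tokens and score high, while the third shares none with the first and scores zero.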

MartinLiebig · August 2015

Hi,

Is your text attribute of type text or nominal? It needs to be of type text in order to use Data to Documents. Further, I would recommend using Cross Distances instead of Data to Similarity.

Attached is a sample process.

Best,
Martin

Simply generate some test data

<operator activated="true" class="text:tokenize" compatibility="6.4.001" expanded="true" height="60" name="Tokenize" width="90" x="45" y="30"/>

rahi84 · August 2015

Thank you, I've solved this. The issue was that the data was not of type text. The Nominal to Text node fixed that.

MartinLiebig · August 2015

This sounds pretty reasonable.

Could you post the XML of your process? Then I could check for the mistake much more easily.

Cheers,
Martin

rahi84 · August 2015

Hi Martin,

I have 'blacked out' the directory for privacy.

Please see below the XML code:

<operator activated="true" class="text:tokenize" compatibility="6.4.001" expanded="true" height="60" name="Tokenize" width="90" x="45" y="30"/>

mehak · February 2017

Hello... Please help me cluster words with similar meanings in a document. Please help me, it's really urgent.

Telcontar120 · February 2017

There are a number of different ways that you might approach that, but if you have a relatively short list of synonymous words/tokens, then you can use the "Replace Token" operator inside the "Process Documents" operator. It allows you to map a set of related tokens to a single token that represents the set. You can create as many entries as you want.

If you need something more complicated, there is a synonym-finding operator in the WordNet extension, which is available for free in the RapidMiner Marketplace.
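As a rough illustration of the token-mapping idea described above, the following Python sketch rewrites each token in a synonym set to one representative token. It only mimics the concept and is not how the RapidMiner operator is implemented; the `SYNONYMS` map, the `replace_tokens` helper, and the sample tokens are all hypothetical.

```python
# Hypothetical synonym map: each key is rewritten to its representative token.
SYNONYMS = {
    "automobile": "car",
    "vehicle": "car",
    "inexpensive": "cheap",
    "affordable": "cheap",
}


def replace_tokens(tokens, mapping):
    """Return the tokens with every mapped token replaced; others pass through unchanged."""
    return [mapping.get(token, token) for token in tokens]


tokens = ["affordable", "automobile", "with", "inexpensive", "parts"]
normalized = replace_tokens(tokens, SYNONYMS)
print(normalized)
```

After this step, documents that use different words from the same synonym set share the same token, so a later similarity or clustering step treats them as matching terms.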

mehak · February 2017

Thank you so much for your response. Can you please tell me how to cluster all of them?
