"Text Mining - Document Similarity/Clustering"
Hello All,
I am trying to perform document similarity/clustering in RapidMiner on a survey text field and have been having problems so far. The data is saved in an Excel file (.xlsx), and I need to process the documents so that the case is lowered and the words are tokenized, stemmed, and filtered for stopwords. Could you please walk me through the nodes that I need to apply to the data so that I can perform document similarity and clustering? I have watched 'el chief' tutorials on YouTube, and unfortunately it hasn't worked out. I have tried the following nodes (in order) and I get a blank output:
1. Read Excel
2. Data to Documents
3. Process Documents (+ Tokenize, Filter Stopwords (English), Transform Cases, Stem (Porter))
4. Data Similarity
Best Answers
MartinLiebig (Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor; Posts: 3,438; RM Data Scientist)
Hi,
Is your text attribute of type text or nominal? It needs to be of type text in order to use Data to Documents. Further, I would recommend using Cross Distances instead of Data to Similarity.
Attached is a sample process.
Best,
Martin
Simply generate some test data
<operator activated="true" class="text:tokenize" compatibility="6.4.001" expanded="true" height="60" name="Tokenize" width="90" x="45" y="30"/>
- Sr. Director Data Solutions, Altair RapidMiner -
Dortmund, Germany
rahi84 (Member; Posts: 3; Contributor I)
Thank you, I've solved this. The issue was that the data was not of type text. The Nominal to Text node fixed that.
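For readers who want to see what the preprocessing and similarity steps in this thread compute, here is a rough sketch in plain Python, outside RapidMiner. It is not the attached sample process and not how RapidMiner implements its operators. The stopword list and the sample survey responses are made up for illustration, `naive_stem` is a crude suffix stripper standing in for the Porter stemmer, and the TF-IDF weighting with cosine similarity is one common choice, assumed here rather than taken from the thread.

```python
import math
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption, not RapidMiner's English list).
STOPWORDS = {"a", "an", "and", "the", "is", "was", "were", "of", "to", "in", "it", "on", "for", "with"}


def naive_stem(token):
    """Crude suffix stripping; a stand-in for the Porter stemmer, not an implementation of it."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Lowercase, tokenize on runs of letters, drop stopwords, then stem."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [naive_stem(t) for t in tokens if t not in STOPWORDS]


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document, weight = TF * log(N / DF)."""
    tokenized = [preprocess(d) for d in docs]
    n_docs = len(tokenized)
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append(
            {t: (c / len(tokens)) * math.log(n_docs / doc_freq[t]) for t, c in counts.items()}
        )
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical survey responses, one document per row of the text column.
responses = [
    "The staff were friendly and helpful",
    "Friendly staff, very helpful service",
    "Parking was expensive and crowded",
]
vectors = tfidf_vectors(responses)
similarity = [[cosine(a, b) for b in vectors] for a in vectors]
```

In this sketch the first two responses share stemmed terms and so get a positive similarity, while the third shares none with the first and scores zero. An empty similarity matrix like the blank output described in the question is what you would expect if no text ever reaches the tokenizer.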
Answers
Could you post the XML of your process? Then I could check for the mistake much more easily.
Cheers,
Martin
Dortmund, Germany
I have 'blacked out' the directory for privacy.
Please see below the XML code:
<operator activated="true" class="text:tokenize" compatibility="6.4.001" expanded="true" height="60" name="Tokenize" width="90" x="45" y="30"/>
Hello... Please help me with how to cluster words with similar meanings in a document. Please help me; it's really urgent.
There are a number of different ways that you might approach that, but if you have a relatively short list of synonymous words/tokens, then you can use the "Replace Token" operator inside the "Process Documents" operator. It allows you to map a set of related tokens to a single token that represents the set. You can create as many entries as you want.
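The token mapping described above can be sketched in plain Python as follows. This only illustrates the idea of collapsing a set of related tokens into one representative token. It uses exact string matching, which is an assumption and may differ from how the RapidMiner operator matches tokens, and the synonym map and sample tokens are hypothetical.

```python
# Hypothetical synonym map: each key is replaced by the token that represents its set.
SYNONYMS = {
    "cheap": "inexpensive",
    "affordable": "inexpensive",
    "pricey": "expensive",
    "costly": "expensive",
}


def replace_tokens(tokens, mapping=SYNONYMS):
    """Map each token to its representative; tokens not in the map pass through unchanged."""
    return [mapping.get(token, token) for token in tokens]


tokens = ["parking", "was", "pricey", "but", "lunch", "was", "cheap"]
normalized = replace_tokens(tokens)
```

After this step, documents that use different words from the same set share a token, so they look more alike to any similarity or clustering step that follows.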
If you need something more complicated, there is a synonym-finding operator in the WordNet extension, which is available for free in the RapidMiner Marketplace.
Lindon Ventures
Data Science Consulting from Certified RapidMiner Experts
Thank you so much for your response. Can you please tell me how to make clusters of all of them?