Text Mining - Word Similarity
Hey,
I want to find the similarity between words used in a collection of articles, that is, which words are used together more often than others. Software packages such as Automap and WordStat can do that, but the first doesn't handle non-English letters (which is important in my case) and the latter is expensive!
I'm trying RM now and noticed that it has a document similarity operator, but nothing at the word level. I gave association rules a shot, but the rules it found didn't make much sense for my articles, e.g. also --> able with probability 0.75.
So I've decided to construct my own similarity model as below:
Process Documents from files ==> Wordlist to Data ==> Data to Similarity ==> Similarity to Data ==> Write Excel
The resulting table includes the similarities between words as I wanted, but there is double counting. For example, the similarity between words #1068 and #963 appears twice, like this:
FIRST_ID SECOND_ID DISTANCE
963 1068 103
1068 963 103
This makes my result table twice as large as it should be, and it complicates the visualisations.
I couldn't find a thread about this double counting in the forum, so I could use some help.
Thank you
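For readers hitting the same mirrored rows: a symmetric measure gives every pair in both orders, so one way to halve the table is to post-process the export outside RapidMiner. The following is only a minimal sketch in plain Python, assuming the rows have already been read as (FIRST_ID, SECOND_ID, DISTANCE) tuples; the function name `drop_mirrored_pairs` and the third sample row are made up for illustration.

```python
def drop_mirrored_pairs(rows):
    """Keep the first row seen for each unordered (first_id, second_id) pair."""
    seen = set()
    unique = []
    for first_id, second_id, distance in rows:
        # Order the two ids so that (963, 1068) and (1068, 963) share one key.
        key = (min(first_id, second_id), max(first_id, second_id))
        if key in seen:
            continue
        seen.add(key)
        unique.append((key[0], key[1], distance))
    return unique


# The first two rows are from the post; the third is a hypothetical extra row.
rows = [(963, 1068, 103), (1068, 963, 103), (12, 963, 87)]
deduped = drop_mirrored_pairs(rows)
for row in deduped:
    print(*row)
```

This relies on the measure being symmetric, as in the example above where both rows carry the same value; for an asymmetric measure the two rows would differ and should both be kept.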
Answers
Well, actually my intention is to find word co-occurrences within a collection of documents. Has anyone done such a project in RapidMiner?
In Process Documents, did you remove stopwords with the Filter Stopwords operator? That will most likely remove frequent words such as "also", "and", "I" etc. and thus clean up your association rules a bit.
Furthermore, to use FP-Growth and Association Rules you will most likely want the "Binary Term Occurrences" mode for word vector creation in Process Documents, since frequent itemset mining works on present/absent items rather than on counts.
Best regards,
Marius
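To illustrate what stopword filtering plus binary occurrences amounts to for co-occurrence counting, here is a rough sketch in plain Python. It is not what RapidMiner does internally; the function name `cooccurrence_counts`, the sample documents and the tiny stopword list are all assumptions made for the example. Each document contributes at most one count per word pair, and whitespace tokenisation keeps non-English letters intact.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(documents, stopwords=frozenset()):
    """Count in how many documents each unordered word pair appears together."""
    counts = Counter()
    for text in documents:
        # A set gives binary occurrences: a word counts once per document.
        words = {w for w in text.lower().split() if w not in stopwords}
        # Sorting fixes the order inside each pair, so no pair is counted twice.
        for pair in combinations(sorted(words), 2):
            counts[pair] += 1
    return counts


# Hypothetical sample data, for illustration only.
docs = [
    "text mining finds word patterns",
    "word similarity in text mining",
    "also able to mine text",
]
stopwords = {"in", "to", "also", "able"}
counts = cooccurrence_counts(docs, stopwords)
for pair, n in counts.most_common(3):
    print(pair, n)
```

Because each pair is stored in sorted order, the result has no mirrored entries, and pairs such as also/able disappear once both words are on the stopword list.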