"Clustering and similarity of the text documents"

zacev · August 2016

Hello,

I have recently been working with methods for extracting key phrases from text. Now I would like to solve another problem: clustering documents and measuring the similarity between them.

It goes like this: suppose we have some security documents from various sources. I would like to examine these documents and cluster them. Sometimes several sources publish a document about the same topic, device, or problem. The goal is to find these 'overlapping' documents and put them in one cluster. The published documents have the following features: the structure may be changed and some words may be added, but the key phrases are the same, mainly a number that identifies a report, or other key phrases that appear repeatedly. Any suggestions about the model? I've tried several clustering parameters and metrics, but the results have not been good. An approach based on the frequency of common words would fail because of the specific structure of the documents. Thanks in advance for any suggestions.
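To make the idea of "same key phrases, different wording" concrete, here is a minimal Python sketch. It compares documents by the overlap (Jaccard similarity) of their identifier-like tokens instead of by overall word frequency. The regular expression and the sample texts are assumptions for illustration only; the real reports may use a different identifier scheme.

```python
import re

# Hypothetical pattern: identifier-like tokens such as "CVE-2016-1234" or "RX-5000".
# Adjust this to whatever report numbers or device names the real documents use.
ID_PATTERN = re.compile(r"\b[A-Z]{2,}(?:-\d+)+\b")

def extract_ids(text):
    """Return the set of identifier-like tokens found in the text."""
    return set(ID_PATTERN.findall(text))

def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 when both sets are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

# Made-up sample documents (not from the original thread).
doc_a = "Advisory CVE-2016-1234 affects the RX-5000 router firmware."
doc_b = "Vendor note: router RX-5000 is vulnerable, see CVE-2016-1234 for details."
doc_c = "Unrelated bulletin about CVE-2015-9999 in a mail server."

sim_ab = jaccard(extract_ids(doc_a), extract_ids(doc_b))  # same identifiers, different wording
sim_ac = jaccard(extract_ids(doc_a), extract_ids(doc_c))  # no identifiers in common
print(sim_ab, sim_ac)
```

Because only the identifiers are compared, changes in structure or added filler words do not affect the score, which is the property the frequency-based approach lacks.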

MartinLiebig · August 2016

Dear Zacev,

As a first question: is it possible to turn this into a supervised problem by using annotated data? That would make life much easier.

~Martin

zacev · August 2016

Would you like me to provide samples of the documents I am working with, or the process? I'm not sure if I understood correctly.

zacev · August 2016





<macros/>

<operator activated="true" class="process" compatibility="7.2.000" expanded="true" name="Process">

<operator activated="true" class="text:process_document_from_file" compatibility="7.2.000" expanded="true" height="82" name="Process Documents from Files" width="90" x="380" y="136">

<operator activated="true" class="text:tokenize" compatibility="7.2.000" expanded="true" height="68" name="Tokenize" width="90" x="112" y="187">

<operator activated="true" class="text:filter_stopwords_english" compatibility="7.2.000" expanded="true" height="68" name="Filter Stopwords (English)" width="90" x="246" y="187"/>
<operator activated="true" class="text:transform_cases" compatibility="7.2.000" expanded="true" height="68" name="Transform Cases" width="90" x="380" y="187"/>
<operator activated="true" class="text:generate_n_grams_terms" compatibility="7.2.000" expanded="true" height="68" name="Generate n-Grams (Terms)" width="90" x="514" y="187"/>

<operator activated="true" class="select_attributes" compatibility="7.2.000" expanded="true" height="82" name="Select Attributes" width="90" x="447" y="340">

<operator activated="true" class="fast_k_means" compatibility="7.2.000" expanded="true" height="82" name="Clustering (2)" width="90" x="581" y="442">

I have uploaded the full process. So far I have taken 6 documents from three different sources. The clustering successfully put these documents into 3 different clusters, so that all the documents from one source belong to the same cluster. Now, as I wrote, I would like the documents to be clustered by keywords or ID numbers instead: if two documents concern the same device name, they should be put together, no matter which source they come from.
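The goal described here, putting documents together whenever they mention the same device name or ID number regardless of source, can be sketched outside RapidMiner as follows. This is only an illustration under stated assumptions: the identifier pattern, the file names, and the sample texts are hypothetical, and the grouping uses a simple union-find over shared identifiers instead of k-means.

```python
import re
from collections import defaultdict

# Hypothetical identifier pattern (e.g. "CVE-2016-1234", "RX-5000"); adapt to the real data.
ID_PATTERN = re.compile(r"\b[A-Z]{2,}(?:-\d+)+\b")

def group_by_shared_ids(docs):
    """docs maps a document name to its text.
    Returns sorted groups of names; documents that share at least one
    identifier (directly or through a chain of documents) end up together."""
    parent = {name: name for name in docs}

    def find(x):
        # Follow parent links to the group representative, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    first_seen = {}  # identifier -> first document in which it appeared
    for name, text in docs.items():
        for ident in ID_PATTERN.findall(text):
            if ident in first_seen:
                # Merge this document's group with the group of the earlier document.
                parent[find(name)] = find(first_seen[ident])
            else:
                first_seen[ident] = name

    groups = defaultdict(list)
    for name in docs:
        groups[find(name)].append(name)
    return sorted(sorted(members) for members in groups.values())

# Made-up documents from three hypothetical sources.
docs = {
    "source1_a.txt": "Report CVE-2016-1234 on device RX-5000.",
    "source2_a.txt": "The RX-5000 issue is described here.",
    "source2_b.txt": "Bulletin CVE-2015-9999 for a mail server.",
    "source3_a.txt": "Follow-up to CVE-2015-9999.",
}
clusters = group_by_shared_ids(docs)
print(clusters)
```

One design consequence to be aware of: grouping is transitive, so a document that mentions two unrelated identifiers will merge their groups. If that is undesirable, a similarity threshold on the shared identifiers would be a stricter alternative.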
