"Clustering and similarity of the text documents"

zacev Member Posts: 6 Contributor II
edited June 2019 in Help

Hello,

I have recently been working with methods for extracting key phrases from text. Now I would like to solve another problem: clustering documents and measuring the similarity between them.

It goes like this: suppose we have some security documents from various sources. I would like to examine these documents and cluster them. Sometimes several sources publish a document about the same topic/device/problem. The goal is to find these 'overlapping' documents and put them in one cluster. The published documents have the following features: the structure may be changed and some words may be added, but the key phrases are the same, mainly a number that identifies a report, or other key phrases that appear repeatedly. Any suggestions about the model? I've tried several clustering parameters and metrics, but the results are not good. An approach based on the frequency of common words would fail because of the specific structure of the documents. Thanks in advance for any suggestions.
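To make the idea concrete, here is a minimal sketch of comparing documents by their key-phrase sets rather than by word frequencies. It assumes the key phrases have already been extracted; the document names and phrases below are made up for illustration, and Jaccard similarity is only one possible measure.

```python
def jaccard(a, b):
    """Jaccard similarity of two collections: |a & b| / |a | b|."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


# Hypothetical key phrases per document (illustrative values only).
doc_phrases = {
    "source_a_report": {"report-1234", "router x200", "buffer overflow"},
    "source_b_report": {"report-1234", "router x200", "firmware update"},
    "source_c_report": {"report-9999", "camera z5", "default password"},
}


def pairwise_similarity(phrases_by_doc):
    """Return {(doc1, doc2): similarity} for every unordered pair of documents."""
    names = sorted(phrases_by_doc)
    sims = {}
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            sims[(first, second)] = jaccard(
                phrases_by_doc[first], phrases_by_doc[second]
            )
    return sims


similarities = pairwise_similarity(doc_phrases)
for pair, score in similarities.items():
    print(pair, round(score, 2))
```

Because only the set of key phrases matters here, reordering the document or adding filler words would not change the score.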

Answers

  • MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,439 RM Data Scientist

    Dear Zacev,

    As a first question: is it possible to make this a supervised problem by having annotated data? That would make life way easier.

    ~Martin

    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany
  • zacev Member Posts: 6 Contributor II

    Would you like me to provide samples of the documents I am working with, or the process? I'm not sure if I understood correctly.

  • zacev Member Posts: 6 Contributor II




    <macros/>

    <operator activated="true" class="process" compatibility="7.2.000" expanded="true" name="Process">

    <operator activated="true" class="text:process_document_from_file" compatibility="7.2.000" expanded="true" height="82" name="Process Documents from Files" width="90" x="380" y="136">

    <operator activated="true" class="text:tokenize" compatibility="7.2.000" expanded="true" height="68" name="Tokenize" width="90" x="112" y="187">

    <operator activated="true" class="text:filter_stopwords_english" compatibility="7.2.000" expanded="true" height="68" name="Filter Stopwords (English)" width="90" x="246" y="187"/>
    <operator activated="true" class="text:transform_cases" compatibility="7.2.000" expanded="true" height="68" name="Transform Cases" width="90" x="380" y="187"/>
    <operator activated="true" class="text:generate_n_grams_terms" compatibility="7.2.000" expanded="true" height="68" name="Generate n-Grams (Terms)" width="90" x="514" y="187"/>

    <operator activated="true" class="select_attributes" compatibility="7.2.000" expanded="true" height="82" name="Select Attributes" width="90" x="447" y="340">

    <operator activated="true" class="fast_k_means" compatibility="7.2.000" expanded="true" height="82" name="Clustering (2)" width="90" x="581" y="442">

    I have uploaded the full process. So far I have taken 6 documents from three different sources. The clustering successfully put these documents into 3 different clusters, so that all the documents from one source belong to the same cluster. Now, as I wrote, I would like the documents to be clustered by keywords or ID numbers instead: if two documents concern the same device name, they should be put in the same cluster, regardless of which source they come from.
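As a rough illustration of the behaviour I am after (outside RapidMiner), here is a small Python sketch that groups documents whenever they share an identifier. The identifier pattern, the document names and the sample texts are assumptions for illustration only; real report numbers or device names would need their own pattern or a hand-made dictionary.

```python
import re
from collections import defaultdict

# Assumed identifier shape: uppercase letters, a hyphen, then digit groups,
# e.g. "ABC-2016-0042". This is a hypothetical pattern, not a real standard.
ID_PATTERN = re.compile(r"\b[A-Z]{2,}-\d+(?:-\d+)*\b")


def extract_ids(text):
    """Return the set of identifier-like tokens found in the text."""
    return set(ID_PATTERN.findall(text))


def group_by_shared_ids(docs):
    """Group document names so that documents sharing any identifier end up together."""
    parent = {name: name for name in docs}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    first_doc_with_id = {}
    for name, text in docs.items():
        for ident in extract_ids(text):
            if ident in first_doc_with_id:
                # Merge this document's group with the group that already has the id.
                parent[find(name)] = find(first_doc_with_id[ident])
            else:
                first_doc_with_id[ident] = name

    groups = defaultdict(set)
    for name in docs:
        groups[find(name)].add(name)
    return sorted(sorted(group) for group in groups.values())


# Made-up sample documents from three hypothetical sources.
docs = {
    "a1": "Advisory ABC-2016-0042 affects the X200 router.",
    "b1": "Vendor note: see ABC-2016-0042 for details.",
    "c1": "Advisory XYZ-77 affects the Z5 camera.",
}
groups = group_by_shared_ids(docs)
print(groups)
```

Unlike k-means on word vectors, this ignores the writing style of each source entirely, so two reports land together only because they mention the same identifier.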
