"Text Mining - Document Similarity/Clustering"

rahi84 Member, Posts: 3, Contributor I
edited June 2019 in Help
Hello All,

I am trying to perform document similarity/clustering in RapidMiner on a survey text field and have had problems so far. The data is saved in an Excel file (.xlsx), and I need to process the documents so that the case is lowered and the words are tokenized, stemmed, and filtered for stopwords. Could you please walk me through the nodes I need to apply to the data so that I can perform document similarity and clustering? I have watched the 'el chief' tutorials on YouTube, but unfortunately that hasn't worked out. I have tried the following nodes (in order) and I get a blank output:

1. Read Excel
2. Data to Documents
3. Process Documents (+ Tokenize, Filter Stopwords (English), Transform Cases, Stem (Porter))
4. Data Similarity

Best Answers

  • MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor, Posts: 3,438, RM Data Scientist
    Solution Accepted
    Hi,

    Is your text attribute of type text or nominal? It needs to be of type text in order to use Data to Documents. Further, I would recommend using Cross Distances instead of Data to Similarity.

    Attached is a sample process.

    Best,
    Martin

    Simply generate some test data

    <operator activated="true" class="text:tokenize" compatibility="6.4.001" expanded="true" height="60" name="Tokenize" width="90" x="45" y="30"/>

    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany
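
For readers who want to see what the preprocessing and similarity steps discussed above compute, here is a minimal sketch in plain Python rather than RapidMiner. Several things are assumptions that the thread does not confirm: the stopword list is a tiny illustrative one, `crude_stem` is a simple suffix stripper standing in for the Porter stemmer, the weighting is a basic TF-IDF, and cosine similarity is just one measure that could be chosen. The sample survey answers are made up.

```python
import math
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption, not RapidMiner's English list).
STOPWORDS = {"the", "a", "an", "is", "was", "and", "of", "to", "in", "it"}


def crude_stem(token):
    """Strip a few common suffixes. A stand-in for Porter stemming, not the real algorithm."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Lowercase, tokenize on runs of letters, drop stopwords, then stem."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [crude_stem(t) for t in tokens if t not in STOPWORDS]


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document, with weight = tf * log(N / df)."""
    tokenized = [preprocess(d) for d in docs]
    n_docs = len(tokenized)
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({t: c * math.log(n_docs / doc_freq[t]) for t, c in counts.items()})
    return vectors


def cosine(u, v):
    """Cosine similarity of two sparse vectors; 0.0 if either vector is all zeros."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Made-up survey answers, one document per row of the text column.
survey_answers = [
    "The delivery was late and the packaging was damaged.",
    "Late delivery, damaged packaging.",
    "Friendly staff and helpful support.",
]
vectors = tfidf_vectors(survey_answers)
# Pairwise similarity matrix: similarity[i][j] compares document i with document j.
similarity = [[cosine(a, b) for b in vectors] for a in vectors]
```

In this sketch the first two answers reduce to the same set of stemmed tokens, so they score as near-identical, while the third shares no terms with them. A term that appears in every document gets a weight of zero, which is why an all-zero vector is handled explicitly in `cosine`.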
  • rahi84 Member, Posts: 3, Contributor I
    Solution Accepted
    Thank you, I've solved this. The issue was that the data was not of type text. The Nominal to Text node fixed that.

Answers

  • MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor, Posts: 3,438, RM Data Scientist
    This sounds pretty reasonable.

    Could you post the XML of your process? Then I could check for the mistake much more easily.

    Cheers,
    Martin
  • rahi84 Member, Posts: 3, Contributor I
    Hi Martin,

    I have 'blacked out' the directory for privacy.

    Please see below the XML code:

    <operator activated="true" class="text:tokenize" compatibility="6.4.001" expanded="true" height="60" name="Tokenize" width="90" x="45" y="30"/>

  • mehak Member, Posts: 6, Contributor I

    Hello, please help me cluster words with similar meanings in a document. It's really urgent.

  • Telcontar120 Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member, Posts: 1,635, Unicorn

    There are a number of different ways you might approach that, but if you have a relatively short list of synonymous words/tokens, you can use the "Replace Tokens" operator inside the "Process Documents" operator. It lets you map a set of related tokens to a single token that represents the set. You can create as many entries as you want.

    If you need something more complicated, there is a synonym-finding operator in the WordNet extension, which is available for free in the RapidMiner Marketplace.

    Brian T.
    Lindon Ventures
    Data Science Consulting from Certified RapidMiner Experts
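
The token-mapping idea described above can be sketched in plain Python as follows. This is only an illustration of the concept: the `SYNONYM_MAP` entries and the `replace_tokens` helper are hypothetical, and the exact-match lookup used here is an assumption that may differ from how the RapidMiner operator matches tokens.

```python
# Hypothetical synonym map: each key is rewritten to the representative token on the right.
SYNONYM_MAP = {
    "cheap": "inexpensive",
    "affordable": "inexpensive",
    "quick": "fast",
    "rapid": "fast",
}


def replace_tokens(tokens, mapping):
    """Return a new token list with every mapped token replaced by its representative."""
    return [mapping.get(token, token) for token in tokens]


# Example: tokens from one already-tokenized, lowercased document.
tokens = ["quick", "delivery", "and", "affordable", "price"]
normalized = replace_tokens(tokens, SYNONYM_MAP)
```

After this step, documents that use different words for the same idea share the representative token, so any term-based similarity or clustering computed afterwards treats them as matching on that term.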
  • mehak Member, Posts: 6, Contributor I

    Thank you so much for your response. Can you please tell me how to cluster all of them?
