text mining - all records placed in one cluster

Diana_Wegner Member Posts: 4 Contributor I
edited December 2019 in Help
I have a follow-up question to my previous text mining issue. K-means, like many of the other clustering modules, places all 3000 of my records in one cluster. I've tried different parameters with no luck. Do you have any hints for resolving this issue? Random Clustering is the only one that generates the requested number of clusters; however, the results are not what I expected.

Here is the xml... THANKS!!

<?xml version="1.0" encoding="UTF-8" standalone="no"?>



Output:
//Local Repository/Result 1 Process Document Cluster
//Local Repository/Result 2 clustering
//Local Repository/Result 3



<operator activated="true" class="process" compatibility="5.3.013" expanded="true" name="Process">

<operator activated="true" class="read_csv" compatibility="5.3.013" expanded="true" height="60" name="Read CSV" width="90" x="45" y="255">












<operator activated="true" class="text:process_document_from_data" compatibility="5.3.002" expanded="true" height="76" name="Process Documents from Data" width="90" x="246" y="255">








<operator activated="true" class="text:tokenize" compatibility="5.3.002" expanded="true" height="60" name="Tokenize (2)" width="90" x="45" y="30"/>
<operator activated="true" class="text:transform_cases" compatibility="5.3.002" expanded="true" height="60" name="Transform Cases (2)" width="90" x="179" y="30"/>
<operator activated="true" class="text:filter_stopwords_english" compatibility="5.3.002" expanded="true" height="60" name="Filter Stopwords (2)" width="90" x="313" y="30"/>
<operator activated="true" class="text:filter_by_length" compatibility="5.3.002" expanded="true" height="60" name="Filter Tokens (by Length)" width="90" x="447" y="75">






<connect from_op="Filter Stopwords (2)" from_port="document" to_op="Filter Tokens (by Length)" to_port="document"/>






<operator activated="true" class="dbscan" compatibility="5.3.013" expanded="true" height="76" name="Clustering" width="90" x="246" y="30"/>
<operator activated="true" class="write_as_text" compatibility="5.3.013" expanded="true" height="76" name="Write as Text" width="90" x="380" y="30">














Answers

  • Rene Member Posts: 24 Maven
    Hi Diana,

    I'm no expert, but anyway:

    The process shows a dbscan clustering, not k-means. What parameters did you try?
    Looks like either min points is too high or epsilon is too low (= all noise), or
    epsilon is too high (= to dbscan it seems that all documents are near enough to fit
    into one single cluster).
    You can use the 'data to similarity' operator to calculate the similarities between
    each pair of documents - this might give you a feel for your measure and the right epsilon.

    Try a lower epsilon (let's say 0.1 or 0.01). What do you get now?
  • awchisholm RapidMiner Certified Expert, Member Posts: 458 Unicorn
    Hello

    There's an example process that iterates over different parameters for DBScan here

    http://rapidminernotes.blogspot.co.uk/2010/12/counting-clusters.html.

    regards

    Andrew
  • Diana_Wegner Member Posts: 4 Contributor I
    Thank you very much Andrew and Rene. I'll try your suggestions.

    I tested a number of the models and should have switched back to kmeans before I copied the code.
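
The advice in this thread (look at the pairwise distances, then try several epsilon values and count the clusters) can be tried outside RapidMiner as well. Below is a minimal sketch in plain Python, not the RapidMiner operators themselves: the toy documents, the function names (`tf_vector`, `cosine_distance`, `dbscan`, `sweep`) and the choice of cosine distance on raw term counts are all assumptions made for illustration.

```python
import math
import re
from collections import Counter

# Hypothetical toy documents standing in for the 3000 real records.
DOCS = [
    "engine noise at high speed",
    "engine noise at low speed",
    "loud engine noise when idling",
    "brake pedal feels soft",
    "soft brake pedal after service",
    "radio display is blank",
]

def tf_vector(text):
    """Lowercased token counts (rough stand-in for Tokenize + Transform Cases)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine_distance(a, b):
    """1 - cosine similarity of two term-count vectors (assumed measure)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return 1.0 - dot / norm if norm else 1.0

def dbscan(dist, eps, min_pts):
    """Plain DBSCAN over a precomputed distance matrix; label -1 means noise."""
    n = len(dist)
    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        neighbors = [j for j in range(n) if dist[i][j] <= eps]
        if len(neighbors) < min_pts:  # the point counts as its own neighbor
            labels[i] = -1
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # former noise becomes a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = [k for k in range(n) if dist[j][k] <= eps]
            if len(j_neighbors) >= min_pts:  # core point: keep expanding
                queue.extend(j_neighbors)
    return labels

def sweep(docs, eps_values, min_pts=2):
    """Return (eps, number of clusters, number of noise points) per epsilon."""
    vecs = [tf_vector(d) for d in docs]
    dist = [[cosine_distance(a, b) for b in vecs] for a in vecs]
    results = []
    for eps in eps_values:
        labels = dbscan(dist, eps, min_pts)
        results.append((eps, len(set(labels) - {-1}), labels.count(-1)))
    return results

for eps, n_clusters, n_noise in sweep(DOCS, [0.1, 0.25, 0.4, 0.7, 1.0]):
    print(f"eps={eps}: clusters={n_clusters}, noise={n_noise}")
```

On this toy data the smallest epsilon leaves every document as noise, while an epsilon of 1.0 merges everything into a single cluster; the useful settings lie in between. The same kind of sweep over the real data should show where the cluster count starts to collapse.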