text mining - all records placed in one cluster
Diana_Wegner
Member Posts: 4 Contributor I
I have a follow-up question to my previous text mining issue. K-means, like many of the other clustering modules, places all 3000 of my records in one cluster. I've tried different parameters with no luck. Do you have any hints to resolve this issue? Random Clustering is the only one that generates the number of clusters requested, but the results are not what I expected.
Here is the xml... THANKS!!
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<output>
//Local Repository/Result 1 Process Document Cluster
//Local Repository/Result 2 clustering
//Local Repository/Result 3
<operator activated="true" class="process" compatibility="5.3.013" expanded="true" name="Process">
<operator activated="true" class="read_csv" compatibility="5.3.013" expanded="true" height="60" name="Read CSV" width="90" x="45" y="255">
<operator activated="true" class="text:process_document_from_data" compatibility="5.3.002" expanded="true" height="76" name="Process Documents from Data" width="90" x="246" y="255">
<operator activated="true" class="text:tokenize" compatibility="5.3.002" expanded="true" height="60" name="Tokenize (2)" width="90" x="45" y="30"/>
<operator activated="true" class="text:transform_cases" compatibility="5.3.002" expanded="true" height="60" name="Transform Cases (2)" width="90" x="179" y="30"/>
<operator activated="true" class="text:filter_stopwords_english" compatibility="5.3.002" expanded="true" height="60" name="Filter Stopwords (2)" width="90" x="313" y="30"/>
<operator activated="true" class="text:filter_by_length" compatibility="5.3.002" expanded="true" height="60" name="Filter Tokens (by Length)" width="90" x="447" y="75">
<connect from_op="Filter Stopwords (2)" from_port="document" to_op="Filter Tokens (by Length)" to_port="document"/>
<operator activated="true" class="dbscan" compatibility="5.3.013" expanded="true" height="76" name="Clustering" width="90" x="246" y="30"/>
<operator activated="true" class="write_as_text" compatibility="5.3.013" expanded="true" height="76" name="Write as Text" width="90" x="380" y="30">
Answers
I'm no expert, but anyway:
The process shows a DBSCAN clustering, not k-means. What parameters did you try?
Looks like either min points is too high or epsilon is too low (= all noise), or epsilon
is too high (= to DBSCAN it seems that all documents are near enough to fit into one single cluster).
You can use the 'Data to Similarity' operator to calculate the similarities between
each pair of documents. This might give you a feel for your measure and the right epsilon.
Try a lower epsilon (let's say 0.1 or 0.01). What do you get now?
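To illustrate the idea of getting a feel for the measure outside RapidMiner, here is a minimal Python sketch. It is not the 'data to similarity' operator itself. It assumes plain term-frequency vectors and cosine distance, and the four toy documents are made up. Use whichever measure your clustering operator is actually configured with.

```python
import math
from collections import Counter
from itertools import combinations

# Hypothetical toy documents standing in for the real records.
docs = [
    "cheap flights to paris",
    "cheap flights to rome",
    "python clustering tutorial",
    "python text clustering guide",
]

def term_vector(text):
    """Term-frequency vector as a Counter of lowercase tokens."""
    return Counter(text.lower().split())

def cosine_distance(a, b):
    """1 - cosine similarity between two sparse term vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # treat an empty document as maximally distant
    return 1.0 - dot / (norm_a * norm_b)

vectors = [term_vector(d) for d in docs]

# All pairwise distances, sorted, so the spread is easy to read off.
distances = sorted(
    cosine_distance(vectors[i], vectors[j])
    for i, j in combinations(range(len(vectors)), 2)
)

print("min    :", round(distances[0], 3))
print("median :", round(distances[len(distances) // 2], 3))
print("max    :", round(distances[-1], 3))
```

If epsilon sits above most of these distances, nearly every document is a neighbour of every other one and a single cluster is the likely result. An epsilon below the smallest distance leaves every point as noise.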
There's an example process that iterates over different parameters for DBSCAN here:
http://rapidminernotes.blogspot.co.uk/2010/12/counting-clusters.html
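The same kind of parameter sweep can be sketched in plain Python. This is only an illustration, not the linked process: the DBSCAN below is a small textbook-style implementation using Euclidean distance, the 2-D points are made-up stand-ins for document vectors, and noise is labelled -1 here.

```python
import math

# Hypothetical data: two tight groups and one outlier.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
          (5.0, 5.0), (5.1, 5.0), (5.0, 5.1),
          (10.0, 0.0)]

def region_query(data, i, eps):
    """Indices of all points within eps of point i (including i itself)."""
    return [j for j, q in enumerate(data) if math.dist(data[i], q) <= eps]

def dbscan(data, eps, min_points):
    """Plain DBSCAN; returns one label per point, with -1 meaning noise."""
    labels = [None] * len(data)
    cluster = -1
    for i in range(len(data)):
        if labels[i] is not None:
            continue
        neighbours = region_query(data, i, eps)
        if len(neighbours) < min_points:
            labels[i] = -1  # provisional noise; may become a border point later
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in neighbours if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # former noise becomes a border point
            if labels[j] is not None:
                continue  # already assigned, do not expand again
            labels[j] = cluster
            more = region_query(data, j, eps)
            if len(more) >= min_points:
                queue.extend(more)  # j is a core point, keep growing
    return labels

def summarize(labels):
    """Return (number of clusters, number of noise points)."""
    clusters = len({label for label in labels if label != -1})
    return clusters, labels.count(-1)

# Sweep epsilon from far too small to far too large.
for eps in (0.01, 0.2, 20.0):
    clusters, noise = summarize(dbscan(points, eps, min_points=3))
    print(f"eps={eps}: {clusters} cluster(s), {noise} noise point(s)")
```

On this toy data the smallest epsilon leaves everything as noise, the middle one separates the two groups, and the largest one merges all points into a single cluster, which is the symptom described in the question.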
regards
Andrew
I tested a number of the models and should have switched back to k-means before I copied the code.