text mining - all records placed in one cluster

Diana_Wegner · October 2013

I have a follow-up question to my previous text mining issue. K-means, like many of the other clustering modules, places all of my 3000 records in one cluster. I've tried different parameters with no luck. Do you have any hints to resolve this issue? Random Clustering is the only one that generates the number of clusters requested; however, the results are not what I expected.

Here is the xml... THANKS!!

<?xml version="1.0" encoding="UTF-8" standalone="no"?>

Output:
//Local Repository/Result 1 Process Document Cluster
//Local Repository/Result 2 clustering
//Local Repository/Result 3

<operator activated="true" class="process" compatibility="5.3.013" expanded="true" name="Process">

<operator activated="true" class="read_csv" compatibility="5.3.013" expanded="true" height="60" name="Read CSV" width="90" x="45" y="255">

<operator activated="true" class="text:process_document_from_data" compatibility="5.3.002" expanded="true" height="76" name="Process Documents from Data" width="90" x="246" y="255">

<operator activated="true" class="text:tokenize" compatibility="5.3.002" expanded="true" height="60" name="Tokenize (2)" width="90" x="45" y="30"/>
<operator activated="true" class="text:transform_cases" compatibility="5.3.002" expanded="true" height="60" name="Transform Cases (2)" width="90" x="179" y="30"/>
<operator activated="true" class="text:filter_stopwords_english" compatibility="5.3.002" expanded="true" height="60" name="Filter Stopwords (2)" width="90" x="313" y="30"/>
<operator activated="true" class="text:filter_by_length" compatibility="5.3.002" expanded="true" height="60" name="Filter Tokens (by Length)" width="90" x="447" y="75">

<connect from_op="Filter Stopwords (2)" from_port="document" to_op="Filter Tokens (by Length)" to_port="document"/>

<operator activated="true" class="dbscan" compatibility="5.3.013" expanded="true" height="76" name="Clustering" width="90" x="246" y="30"/>
<operator activated="true" class="write_as_text" compatibility="5.3.013" expanded="true" height="76" name="Write as Text" width="90" x="380" y="30">

Rene · October 2013

Hi Diana,

I'm no expert, but anyway:

The process shows a dbscan clustering, not k-means. What parameters did you try?
Looks like minpoints and/or epsilon don't fit your data: with minpoints too high or
epsilon too low, everything ends up as noise; with epsilon too high, dbscan considers
all documents near enough to fit into one single cluster.
You can use the 'data to similarity' operator to calculate the similarities between
each pair of documents - this might give you a feel for your measure and the right epsilon.

Try a lower epsilon (let's say 0.1 or 0.01). What do you get now?
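
To get a feel for what "near enough" means before picking an epsilon, it can help to look at the spread of pairwise distances. Here is a rough sketch of that idea outside RapidMiner, in plain Python. The toy documents, the raw term counts and the cosine distance are all assumptions for illustration; your process may use a different word vector and a different measure.

```python
import math
from collections import Counter
from itertools import combinations

def tf_vector(text):
    """Lower-cased whitespace tokens -> term counts (a stand-in for the real word vector)."""
    return Counter(text.lower().split())

def cosine_distance(a, b):
    """1 - cosine similarity; 0 means same direction, 1 means no shared terms."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return 1.0 - dot / norm if norm else 1.0

def distance_summary(docs):
    """Min, median and max of all pairwise distances between the documents."""
    vecs = [tf_vector(d) for d in docs]
    dists = sorted(cosine_distance(a, b) for a, b in combinations(vecs, 2))
    return {"min": dists[0], "median": dists[len(dists) // 2], "max": dists[-1]}

# Hypothetical toy documents, not the poster's data.
docs = [
    "printer driver crash",
    "printer driver error",
    "invoice payment overdue",
    "invoice payment reminder",
]
print(distance_summary(docs))
```

If epsilon is larger than almost every pairwise distance, all documents are density-reachable from each other and land in one cluster; if it is smaller than almost all of them, nothing has enough neighbours and everything is noise.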

awchisholm · October 2013

Hello

There's an example process that iterates over different parameters for DBScan here

http://rapidminernotes.blogspot.co.uk/2010/12/counting-clusters.html
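
The general idea of iterating over parameters and counting the resulting clusters can be sketched like this. This is not the process from the link, only a minimal plain-Python illustration: the tiny DBSCAN implementation, the Euclidean distance, the 2-D toy points and the epsilon values are all assumptions.

```python
import math

def region(points, i, eps):
    """Indices of all points within eps of point i (including i itself)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

def dbscan(points, eps, min_points):
    """Return one label per point: -1 for noise, 0..k-1 for clusters."""
    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbours = region(points, i, eps)
        if len(neighbours) < min_points:
            labels[i] = -1          # provisional noise, may become a border point later
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in neighbours if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reached from a core point becomes a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            more = region(points, j, eps)
            if len(more) >= min_points:
                queue.extend(more)   # j is a core point, so keep expanding from it
    return labels

def count_clusters(labels):
    """Number of distinct clusters, ignoring noise."""
    return len({label for label in labels if label != -1})

# Hypothetical toy data: two tight groups far apart.
points = [(0.0, 0.0), (0.0, 0.1), (0.1, 0.0), (5.0, 5.0), (5.0, 5.1), (5.1, 5.0)]
for eps in (0.01, 0.2, 10.0):
    labels = dbscan(points, eps, min_points=3)
    print(f"eps={eps}: clusters={count_clusters(labels)}, noise={labels.count(-1)}")
```

On this toy data the smallest epsilon leaves every point as noise, the middle one finds the two groups, and the largest merges everything into a single cluster, which is the pattern to look for when sweeping epsilon on real documents.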

regards

Andrew

Diana_Wegner · October 2013

Thank you very much Andrew and Rene. I'll try your suggestions.

I tested a number of the models and should have switched back to kmeans before I copied the code.


Altair RapidMiner Community
