Mini Batch K-means in RapidMiner

morita · October 2018

Hi
I have a huge dataset (4000000 records) of text data and I want to do clustering.

Because of memory problems and the time complexity of text pre-processing, I want to read small batches from the database and, after pre-processing, use mini-batch K-means to cluster the data. But I wonder how to use mini-batch clustering in RapidMiner.
Thanks in advance for your answers.

BalazsBarany · October 2018

Hi,

there are different Loop operators in RapidMiner.

You can easily implement this batching behaviour by using a loop with a numeric counter and selecting data from your database with LIMIT n OFFSET (i - 1) * n.

n would be your preferred batch size, and i the current iteration number, starting at 1. Usually you need to calculate the offset yourself outside of the statement, e.g. with Generate Macro. Not all databases support the LIMIT ... OFFSET syntax, but most have the functionality under a different name.
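
Outside RapidMiner, the LIMIT/OFFSET batching described above can be sketched in plain Python. This is only an illustration: the in-memory SQLite database, the table name docs, its columns id and body, and the helper iter_batches are all made up for the example and are not part of any RapidMiner process. The ORDER BY is there because paging with OFFSET is only stable when the row order is fixed.

```python
import sqlite3


def iter_batches(conn, table, batch_size):
    """Yield successive lists of rows, each at most batch_size long."""
    i = 1  # iteration number, starting at 1 as in the post
    while True:
        offset = (i - 1) * batch_size  # offset computed outside the statement
        rows = conn.execute(
            # Table name is interpolated only because it is a trusted,
            # hypothetical constant in this sketch.
            f"SELECT id, body FROM {table} ORDER BY id LIMIT ? OFFSET ?",
            (batch_size, offset),
        ).fetchall()
        if not rows:
            break
        yield rows
        i += 1


# Hypothetical in-memory table standing in for the real database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs (id INTEGER PRIMARY KEY, body TEXT)")
conn.executemany(
    "INSERT INTO docs (body) VALUES (?)",
    [(f"text {k}",) for k in range(10)],
)

# Ten rows read in batches of four: the last batch is shorter.
batch_sizes = [len(batch) for batch in iter_batches(conn, "docs", 4)]
```

In a RapidMiner process the same offset arithmetic would live in the macro that feeds the query, with the loop counter playing the role of i.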

Regards,

Balázs

morita · October 2018

Hi, thanks for your answer.

The mini-batch K-means algorithm takes a small batch of the dataset in each iteration. It assigns each data point in the batch to a cluster based on the previous locations of the cluster centroids, and then updates the centroid locations using the new points from the batch.
How could I build a process like this?
The Loop operator creates new clusters for the current batch in each iteration and doesn't assign new points to the previous clusters.
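
For reference, the update described above can be sketched in a few lines of plain Python. This is a minimal illustration, not a RapidMiner operator: the function names nearest and partial_fit, the toy data, and the initial centroids are assumptions made for the example. It uses the common per-centroid learning rate 1/count, so each centroid ends up as the running mean of the points assigned to it so far.

```python
def nearest(point, centroids):
    """Index of the centroid closest to point (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda j: sum((p - c) ** 2 for p, c in zip(point, centroids[j])),
    )


def partial_fit(centroids, counts, batch):
    """Update centroids in place from one batch; return the batch's labels."""
    # Assign every point using the centroids from the previous batches.
    labels = [nearest(x, centroids) for x in batch]
    # Then move each centroid towards its newly assigned points.
    for x, j in zip(batch, labels):
        counts[j] += 1
        eta = 1.0 / counts[j]  # per-centroid learning rate
        centroids[j] = [(1 - eta) * c + eta * v for c, v in zip(centroids[j], x)]
    return labels


# Toy data: two batches of 2-D points and two hypothetical starting centroids.
centroids = [[0.0, 0.0], [10.0, 10.0]]
counts = [0, 0]
batches = [
    [[1.0, 1.0], [9.0, 9.0]],
    [[1.0, 3.0], [11.0, 9.0]],
]
all_labels = [partial_fit(centroids, counts, b) for b in batches]
```

The key point is that centroids and counts survive between batches, which is exactly the state a per-iteration clustering inside a loop throws away. scikit-learn's MiniBatchKMeans exposes the same pattern through its partial_fit method, if running the clustering outside RapidMiner is an option.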

@BalazsBarany

BalazsBarany · October 2018

Hi,

for this algorithm you'd need an operator to remember the cluster centroids from the previous clustering and a clustering operator that can take these as its input. Extract Cluster Prototypes does something like this for the first step, but I don't know a way to push these into a new clustering.

Regards,

Balázs

JEdward · October 2018

I was actually trying to work on a cluster model that I wanted to update with new data, and rather than running the whole thing again I planned to use the centroids to update it. (Limited resources on a Hadoop cluster mean I can only cluster 1,000,000 records at a time.)

This is what I considered, which sounds similar to mini-batch. I'm about to test it, so maybe you guys could have a look?

The idea was to weight the centroids generated from Extract Cluster Prototypes by simply duplicating them. In my head I figured that would bias it towards that value for centroids, but not necessarily force the cluster to accept them as final-final.

[RapidMiner process XML garbled in extraction; only these operator names and annotations survive:]
Annotation: Stupid way to 'weight the centroids'
Operator: Append (compatibility 9.0.002)
Annotation: Add a bit of noise... not sure why, but it feels good.
Annotation: 100 records as a batch
Operator: Append (2) (compatibility 9.0.002)
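
The duplication idea in this post can be sketched outside RapidMiner roughly as follows. This is a guess at the intent, not a reconstruction of the posted process: the functions kmeans and update_with_batch, the copies parameter, and the toy data are all hypothetical, and the noise step mentioned in the annotations is left out. Old centroids are appended to the new batch several times and plain k-means is re-run, starting from the old centroids.

```python
def kmeans(points, init, iterations=10):
    """Plain Lloyd's k-means starting from the given initial centroids."""
    centroids = [list(c) for c in init]
    for _ in range(iterations):
        groups = [[] for _ in centroids]
        for x in points:
            j = min(
                range(len(centroids)),
                key=lambda j: sum((a - b) ** 2 for a, b in zip(x, centroids[j])),
            )
            groups[j].append(x)
        # Recompute each centroid as the mean of its group;
        # keep the old centroid if a cluster received no points.
        centroids = [
            [sum(col) / len(g) for col in zip(*g)] if g else c
            for g, c in zip(groups, centroids)
        ]
    return centroids


def update_with_batch(prev_centroids, batch, copies):
    """Re-cluster a new batch together with `copies` duplicates of each old centroid."""
    weighted = [c for c in prev_centroids for _ in range(copies)]
    return kmeans(weighted + batch, init=prev_centroids)


prev = [[0.0, 0.0], [10.0, 10.0]]   # centroids from an earlier run
batch = [[2.0, 0.0], [12.0, 10.0]]  # newly arrived records
light = update_with_batch(prev, batch, copies=1)
heavy = update_with_batch(prev, batch, copies=3)
```

With more copies the centroids move less towards the new batch, which is the bias the post is after. Setting the number of copies close to the number of records each centroid already represents would make the result resemble a weighted mean of old and new data, but that choice is an assumption beyond what the post states.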
