Process of X-means cluster with text data

Joanneyu · November 2019

Hi all,

I want to run X-means clustering on text data, but I am very new to RapidMiner. I followed several different tutorials and ended up with this process.
My data looks like the Excel sheet on the left-hand side: I have only one column, containing several single words.

It would be great if someone could confirm whether the process is right or wrong. I want to use X-means because I want to see what the ideal number of clusters is. I am using TF-IDF, and inside "Process Documents from Data" there are tokenize, transform cases, stopwords, and stem (Porter). For "X-Means", I set k min to 10 and k max to 60, with cosine similarity.
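For readers who want to see what the preprocessing and TF-IDF steps described above amount to, here is a minimal sketch in plain Python. It is not RapidMiner's implementation: the function names, the tiny stopword list, and the sample documents are all assumptions made for illustration, the TF-IDF formula may differ from the one RapidMiner uses, and Porter stemming is left out to keep the example short.

```python
import math
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption, not RapidMiner's list).
STOPWORDS = {"the", "a", "an", "and", "of", "to", "is", "in"}


def preprocess(text):
    """Tokenize on non-letters, lowercase, and drop stopwords."""
    tokens = re.split(r"[^A-Za-z]+", text)
    return [t.lower() for t in tokens if t and t.lower() not in STOPWORDS]


def tfidf_vectors(docs):
    """Return one L2-normalized {term: weight} dict per document."""
    tokenized = [preprocess(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vec = {t: (c / len(tokens)) * math.log(n_docs / df[t])
               for t, c in tf.items()}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        vectors.append({t: w / norm for t, w in vec.items()} if norm > 0 else {})
    return vectors


def cosine(u, v):
    """Cosine similarity of two L2-normalized sparse vectors."""
    return sum(w * v.get(t, 0.0) for t, w in u.items())


# Hypothetical documents, only for demonstration.
docs = [
    "Clustering text data",
    "Text clustering with data mining",
    "The weather is sunny",
]
vecs = tfidf_vectors(docs)
print(cosine(vecs[0], vecs[1]))  # documents that share terms
print(cosine(vecs[0], vecs[2]))  # documents with no terms in common
```

Documents that share no terms after preprocessing get a cosine similarity of exactly zero, which is worth keeping in mind when each row holds very little text.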

Image: https://scontent-vie1-1.xx.fbcdn.net/v/t1.15752-9/78434949_511896169396269_9040932357380505600_n.png?_nc_cat=105&_nc_ohc=Bq_oDVPohJ8AQmHGL4LyWeSnf7WThCILRs2SAg_gzQnYoAz0LZFuyn3OQ&_nc_ht=scontent-vie1-1.xx&oh=bb521ef990fa3fecfc27b7a1ef7d1aa3&oe=5E47935E

However, the results look odd to me because cluster 0 contains almost all the data. Also, I expected the results to tell me the ideal number of clusters. Or did I make a mistake in the process?
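One way to read such a result, sketched here under assumptions: as far as I understand X-means, the number of clusters it considers ideal is simply the number of distinct clusters in the output, somewhere between k min and k max. The helper below is hypothetical and works on an exported list of cluster labels; the labels themselves are made up to mimic one dominant cluster.

```python
from collections import Counter


def summarize_clusters(labels):
    """Return (number of clusters found, share of items in the largest one)."""
    sizes = Counter(labels)
    largest = max(sizes.values())
    return len(sizes), largest / len(labels)


# Hypothetical labels standing in for the cluster column of the result:
# one dominant cluster plus nine small ones.
labels = ["cluster_0"] * 910 + [
    f"cluster_{i}" for i in range(1, 10) for _ in range(10)
]
k_found, largest_share = summarize_clusters(labels)
print(k_found, largest_share)
```

If the number of clusters found equals k min, that may be a hint that the lower bound is doing the choosing rather than the data.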

Image: https://scontent-vie1-1.xx.fbcdn.net/v/t1.15752-9/78890556_548710235706695_8703660597239087104_n.png?_nc_cat=108&_nc_ohc=sSX7LJoUVJEAQl_jIsaEQUdFdxNngKBn23v0CCf7Eg_kTHwdMh9WOLOcQ&_nc_ht=scontent-vie1-1.xx&oh=599ae7c52a6bdc38cf667e3faa07932d&oe=5E4AF338

Thank you in advance!!!

sgenzer · December 2019

Hi @Joanneyu, there's nothing that I can see wrong with your process (although I must say using Auto Model is MUCH easier than what you're trying to do here with operators). Having one cluster with almost all the items is not unusual per se; it could be a very homogeneous group, or you're not creating enough/the right features to find differences in your texts.
I would try again with Auto Model.

Scott
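To illustrate the point about features, here is a small sketch assuming (as the question suggests) that each row holds a single word. In that case every document vector has exactly one non-zero entry, so the cosine similarity between rows with different words is zero and a distance-based clusterer has little to work with. The rows and the `cosine` helper are hypothetical, and raw term counts stand in for TF-IDF weights, which only rescale the single non-zero entry.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity of two sparse {term: weight} vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Hypothetical single-word rows, one word per example.
rows = ["apple", "apples", "banana", "apple"]
vectors = [Counter([row.lower()]) for row in rows]

print(cosine(vectors[0], vectors[3]))  # same word
print(cosine(vectors[0], vectors[2]))  # different words
print(cosine(vectors[0], vectors[1]))  # differ only by suffix, no stemming here
```

Stemming can merge variants such as "apple" and "apples", but unrelated words remain equally dissimilar, so richer features (longer texts, n-grams, or word embeddings) may be needed before clusters become meaningful.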

RapidMiner Community
