Text clustering and labeling
amir_askary_sha
Hi,
I'm using RapidMiner for text clustering (k-means) and then labeling the clusters. We usually have around 2000 documents, and the texts are in German. The texts are short (title and short description of news items or articles), and so far RapidMiner is working nicely! In the text processing phase I use term frequency vectors instead of the more commonly used TF-IDF, as I feel term frequency works better in our case.
I now have some questions.
- How can we label the clusters nicely? Like human readable titles.
- After running k-means, how can I see the most relevant document in a cluster? (As a first try, I simply want to use the title of this document as the title of the whole cluster.)
- I have trained a classification model (k-NN) to first put the documents into some known groups (politics, sport, etc.) and then run the main clustering process on the documents in each group, to achieve a nicer two-level clustering. But I don't know how to connect the result of the classification process to the clustering process so that the whole process is automated (instead of running the classification, then marking the documents of each group manually, and then running the clustering on those documents).
Thank you in advance for your help.
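For the second question, one common heuristic is to take the document whose vector lies closest to its cluster centroid and use its title as the cluster title. The Python sketch below only illustrates that idea outside RapidMiner; it is not a RapidMiner feature described in this thread. It assumes raw term-frequency vectors built by whitespace tokenization and Euclidean distance, and the function names (`tf_vector`, `centroid`, `representative_titles`) are made up for the example.

```python
import math
from collections import Counter, defaultdict


def tf_vector(text):
    """Raw term-frequency vector (assumes simple whitespace tokenization)."""
    return Counter(text.lower().split())


def centroid(vectors):
    """Mean of a list of sparse term-frequency vectors."""
    total = Counter()
    for vector in vectors:
        total.update(vector)
    return {term: count / len(vectors) for term, count in total.items()}


def distance(a, b):
    """Euclidean distance between two sparse vectors."""
    terms = set(a) | set(b)
    return math.sqrt(sum((a.get(t, 0.0) - b.get(t, 0.0)) ** 2 for t in terms))


def representative_titles(titles, clusters):
    """Return {cluster: title of the document closest to the cluster centroid}."""
    grouped = defaultdict(list)
    for title, cluster in zip(titles, clusters):
        grouped[cluster].append(title)
    result = {}
    for cluster, members in grouped.items():
        vectors = [tf_vector(title) for title in members]
        center = centroid(vectors)
        best = min(range(len(members)), key=lambda i: distance(vectors[i], center))
        result[cluster] = members[best]
    return result


# Toy data: titles plus the cluster id each one was assigned to.
titles = [
    "bundestag wahl",
    "bundestag wahl koalition",
    "koalition wahl streit",
    "bayern pokal",
]
clusters = [0, 0, 0, 1]
result = representative_titles(titles, clusters)
```

With real data, `clusters` would be the cluster column produced by the clustering step, and the vectors should be built the same way as the ones used for clustering.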
Answers
I once did something myself using the top 3 keywords provided for each cluster to generate friendlier names, but I can't recall anymore how I did it in detail.
It was something like this: set the cluster as the label role -> process the documents again, but only tokenize (no need for vectors, just the tokens as a wordlist) -> wordlist to exampleset -> loop over the attributes and get the top 3 for each cluster -> aggregate (concatenate), so you get your cluster header and one example containing the top 3. Then loop over the attributes again, store the value in a macro, and use it to rename the attribute.
Probably not the most convenient way to do it, but it did the trick for me at the time. If your keywords are properly defined (so no stopwords, and only relevant ones), it can already be a bit easier than using the default, hard-to-read cluster label.
I don't think there is another way to create some default friendly labels, given that your clusters depend heavily on the data provided and are therefore pretty volatile by nature.
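To make the top-3-keywords idea concrete, here is a minimal Python sketch of the same logic outside RapidMiner. It is not the original process, just an illustration: the tiny stopword set and the function names (`tokenize`, `label_clusters`) are assumptions made up for the example, and a real German stopword list would be much longer.

```python
from collections import Counter, defaultdict

# Hypothetical, deliberately tiny stopword list for the example.
STOPWORDS = {"der", "die", "das", "und", "in", "im", "mit", "ein", "eine"}


def tokenize(text):
    """Split on whitespace, strip punctuation, lowercase, drop stopwords."""
    tokens = (word.strip(".,;:!?\"'()").lower() for word in text.split())
    return [token for token in tokens if token and token not in STOPWORDS]


def label_clusters(texts, clusters, top_n=3):
    """Return {cluster: 'kw1 / kw2 / kw3'} built from the most frequent tokens."""
    counts = defaultdict(Counter)
    for text, cluster in zip(texts, clusters):
        counts[cluster].update(tokenize(text))
    return {
        cluster: " / ".join(word for word, _ in counter.most_common(top_n))
        for cluster, counter in counts.items()
    }


# Toy data: each text with the cluster it was assigned to.
texts = [
    "Bundestag beschließt Haushalt",
    "Haushalt im Bundestag: Debatte",
    "Bayern gewinnt Pokal",
    "Pokal: Bayern im Finale",
]
clusters = ["cluster_0", "cluster_0", "cluster_1", "cluster_1"]
labels = label_clusters(texts, clusters)
```

The resulting dictionary maps each default cluster name to a keyword-based label, which corresponds to the final rename step described above.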
Thanks for your answer. Can you please elaborate on the steps? I don't understand exactly what to do. Wordlist to exampleset? Rename the attribute?
Basically, when you have done your clustering you will have your text and the predicted cluster. The basic idea is to reuse this to get the top words for a given cluster.
Maybe this can get you started. I did not find the time to get it fully working, but it's close to final. You only need to provide your original text and the cluster as label, and you get an exampleset in return with a proposed new label. The only thing missing is the final loop logic to rename your actual attributes.
Hi Keyman,
thank you very much for your help & time.
It doesn't work as it is. You can see my process as XML here; I've added the clustering part as a sub-process as needed.
In the last step, Append, it says wrong process type, and the result of the text processing part is an empty table.
@kayman Hi Keyman, I can't get it to work. I get some errors. Could you please read my previous post and help me get it running? Thanks
Hi, it's a bit hard to spot since there is no data to work with. Did you try to follow the data to see where it goes missing, for example by setting a breakpoint on each major step?
Otherwise, feel free to send me a real data sample so I can try to reproduce it and see where the issue might be.
@kayman
I think it's not about the data, but anyway, here are 100 sample documents that I use, in XML format:
And here is again my full process (including the part that you posted):
You were very close :-)
One more thing you need to do: in the Set Role operator, also change the role of the text attribute to regular, and then it will work. If you don't, the text attribute is treated as a special attribute and is ignored by your Nominal to Text operator.
Then you get something like this :