"Aggregate samples by cluster name and create average sample per cluster"
Hello Rapidminers,
I was wondering how to simply display the results of a clustering. In particular, I would love to see the average sample of each cluster and display all of them in the same window. I have found several ways to do that, but none is satisfactory:
- a) Use the "Aggregate" operator with GroupBy="cluster" and add all 100 attributes, one by one, to the "aggregation attributes" parameter (I can't really do this!). I haven't found any wildcard here to say that I want all real-valued attributes to be averaged.
- b) Use the "Multiply" operator as many times as needed (one per cluster). In each branch, filter on the "cluster" attribute so that each branch contains the subset of the sample set corresponding to "cluster_0", "cluster_1", and so on. Finally, transpose the sample set and use the "Generate Aggregation" operator so that a new attribute is created, being the average of all others. Since the transpose operator has been used, this new attribute is actually the new sample (the average of the samples in that cluster).
> Issue: now I have x different sample sets (one for each cluster), and it seems that there is no operator to put all of them together in a new sample set.
Is there an easy way to solve this problem, which is really just an average of all rows belonging to each cluster group? Maybe with the R plugin?
Any help would be very much appreciated
Cheers
Sylvain
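For readers who want to see the computation being asked for, here is a minimal sketch in plain Python, outside RapidMiner. It groups rows by their cluster label and averages every numeric attribute, without listing the attributes by hand. The function name `cluster_means`, the attribute names `a1` and `a2`, and the toy data are assumptions made for illustration; they do not come from the original post or from any RapidMiner API.

```python
from collections import defaultdict
from statistics import fmean


def cluster_means(rows, cluster_key="cluster"):
    """Return {cluster label: {attribute: mean}} over all numeric attributes."""
    grouped = defaultdict(lambda: defaultdict(list))
    for row in rows:
        label = row[cluster_key]
        for name, value in row.items():
            # Skip the grouping column and anything that is not a real number.
            is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
            if name != cluster_key and is_number:
                grouped[label][name].append(value)
    return {
        label: {name: fmean(values) for name, values in columns.items()}
        for label, columns in grouped.items()
    }


# Hypothetical toy data: three samples, two clusters, two real-valued attributes.
samples = [
    {"cluster": "cluster_0", "a1": 1.0, "a2": 10.0},
    {"cluster": "cluster_0", "a1": 3.0, "a2": 20.0},
    {"cluster": "cluster_1", "a1": 5.0, "a2": 30.0},
]

means = cluster_means(samples)
for label in sorted(means):
    print(label, means[label])
```

Each entry of `means` is the "average sample" of one cluster, so all of them can be displayed together as a single table with one row per cluster.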
Answers
I have thought of another way: maybe I can use a "Script" operator to generate the correct inputs (the list of all real-valued attributes) and then pass them to the "Aggregate" operator. I came up with the following Java code to use in the "Script" operator, but I still have to dig into the Java sources to understand how to correctly pass the parameters and trigger the "Aggregate" operator from there.
Do you think this has a chance of working? Thanks in advance for your help.
Best regards
Sylvain
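The idea of generating the attribute list automatically can be sketched independently of RapidMiner. The plain-Python example below collects the names of all real-valued attributes and pairs each with an aggregation function name. The `(attribute, function)` pair format, the helper names, and the sample data are hypothetical; how the "Aggregate" operator actually expects its parameters is not confirmed here.

```python
def numeric_attributes(rows, exclude=("cluster",)):
    """Names of attributes that hold a real number in every row."""
    def is_number(value):
        return isinstance(value, (int, float)) and not isinstance(value, bool)

    candidates = [name for name in rows[0] if name not in exclude]
    return [
        name for name in candidates
        if all(is_number(row.get(name)) for row in rows)
    ]


def aggregation_spec(rows, function="average"):
    """Build one (attribute, function) pair per numeric attribute."""
    return [(name, function) for name in numeric_attributes(rows)]


# Hypothetical data: "id" is nominal and must not be averaged.
samples = [
    {"cluster": "cluster_0", "id": "s1", "a1": 1.0, "a2": 10},
    {"cluster": "cluster_1", "id": "s2", "a1": 5.0, "a2": 30},
]

spec = aggregation_spec(samples)
print(spec)
```

With 100 attributes, the same call would yield 100 pairs, which is the list that would otherwise have to be entered one by one.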
If it's a centroid-based clustering, this is already shown in the result screen.
Otherwise, I must admit, you have a problem. I wasn't aware that this isn't possible. Please file a feature request for it in the bug tracker.
Greetings,
Sebastian