How do I split up scored data into 20 equally sized segments?
simon_philipose
Member | Posts: 3 | Learner I
Hi there. I'm still only a few days into using RapidMiner and wasn't sure if or how I could do the following:
I created a logistic regression model for direct mail marketing and have scored new data with it. What I want to do now is split the scored data into 20 groups by descending confidence(responder) value, so that group A holds the 1/20th of records most likely to respond, group B the next most likely 1/20th, and so on.
Your help is much appreciated.
-Simon
Best Answer
rfuentealba | Moderator, RapidMiner Certified Analyst, Member, University Professor | Posts: 568 | Unicorn
Hi @simon_philipose, and welcome to the community!
Well, I have a few things for you today. I wouldn't want to be rude telling you how to manage your business, but before working on a solution, I wanted to give you some small advice:
Old Man's Advice:
First of all, please note that what you are asking is counterintuitive from a business perspective. Normally you would not want equal-size bins: if you have, e.g., 100 examples divided into 10 groups and 40 of them have 0.1 confidence, groups 7, 8, 9 and 10 would all be the same thing. If you apply a certain kind of rule system to these groups but then nail it with your next campaign (and that happens!), you will have to change your rule system and fine-tune it on every mail campaign.
If I were you, and having worked with e-mail marketing systems in the past, I would have discretized with one of the options that are already available in RapidMiner.
Now, Counterintuitive is my codename, hence I decided to see if I could solve your issue. Here it is.
I separated your problem in two subprocesses and an operator:
- Rank
- Bin
- Clean
This is how the overall process works.
In the Rank subprocess, I used the Sort operator to sort by what would be your Confidence Value, then used Generate ID to add a number, and then Set Role so that the generated ID no longer has the ID role, because we are going to do math with it. This is how it looks.
In the Bin subprocess, I used the Extract Macro operator to extract the number of examples into a variable named MaxID, then used Generate Attributes to introduce a new attribute with a small calculation, which is 100 / MaxID * id
However, since MaxID is a scalar value coming from the Extract Macro operator, I need to put it inside %{} and then eval() it, because macros are usually nominal or text (can't remember which one, doesn't matter for this explanation):
100 / eval(%{MaxID}) * id
Finally, I used the Discretize by Binning operator to generate 20 bins based on the Group_Model attribute. That overwrites the value stored there with the range. You can discretize by user specification too, if you want to change the names of range1, range2, range3... or use the Replace operator to change range to Group or whatever you want. As usual, I wouldn't want to take the fun of exploring RapidMiner away from you. This is how it looks:
And the Clean stage is only a Select Attributes operator, meant to remove the ID we generated at the beginning.
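For readers who want to see the arithmetic outside RapidMiner, the Rank, Bin and Clean stages can be approximated in plain Python. This is only a sketch: the function name `rank_bin_clean` is made up for illustration, and the bin edges are assumed to sit at multiples of 100 / 20 on the rescaled rank, whereas the actual Discretize by Binning operator derives its edges from the observed minimum and maximum, so boundary rows may land differently.

```python
def rank_bin_clean(confidences, n_bins=20):
    """Return (confidence, range label) pairs, highest confidence first."""
    # Rank: sort descending and number the rows starting at 1.
    ranked = sorted(confidences, reverse=True)
    max_id = len(ranked)
    result = []
    for row_id, conf in enumerate(ranked, start=1):
        # Bin: the rescaled rank 100 / max_id * row_id falls into bin
        # ceil(row_id * n_bins / max_id); integer arithmetic is used here
        # to avoid floating-point surprises at the bin edges (assumption).
        group = (row_id * n_bins + max_id - 1) // max_id
        # Clean: the helper row_id is simply not copied into the output.
        result.append((conf, f"range{group}"))
    return result


# Hypothetical scores: 40 rows, so each of the 20 ranges receives 2 rows.
example = rank_bin_clean([i / 40 for i in range(40)])
```

With this labelling, `range1` holds the highest-confidence rows and `range20` the lowest.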
Please find the process attached.
Hope it helps,
Rodrigo.
Answers
You can first use the Sort operator to sort the confidence values in descending order, followed by the Split Data operator.
In the Split Data operator's Parameters window, add partitions with a ratio of 1/20 each.
Hope this helps.
Cheers,
Pavithra
Hi Pavithra,
Thank you for your response. So I ran into a few problems with using the Split Data operator.
1. It splits the dataset into multiple datasets. What I need is one data set but with a field called Model_Group with a value of A, B, C, D, etc. depending on the confidence values.
2. It appears the maximum number of data sets I can split into is 8, by putting .125 in the partitions ratio field 8 times. I can't do 10, much less 20 different splits.
I would do the following:
Sort - by confidence
Generate ID - to get an index
Use Generate attributes with id%10 to get your Model_Group
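A rough Python sketch of the same Sort, Generate ID, Generate Attributes idea, kept in a single data set with a Model_Group letter per row. The function name `assign_model_groups` and the A to Z labels are assumptions for illustration only. Note that the sketch uses integer division on the rank rather than a modulo, because a modulo would cycle through the groups row by row instead of keeping neighbouring ranks together.

```python
import string


def assign_model_groups(scores, n_groups=20):
    """Return one letter per input row; 'A' marks the highest scores."""
    if not 1 <= n_groups <= 26:
        raise ValueError("this sketch labels groups with single letters A-Z")
    # Sort: row positions ordered from highest to lowest confidence.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    groups = [None] * len(scores)
    for rank, row in enumerate(order):  # Generate ID: rank starts at 0
        # Generate Attributes: integer division keeps neighbouring ranks
        # in the same group, giving n_groups near-equal contiguous slices.
        groups[row] = string.ascii_uppercase[rank * n_groups // len(scores)]
    return groups


# Hypothetical scores, two groups: the two highest rows get 'A'.
small = assign_model_groups([0.1, 0.9, 0.5, 0.7], n_groups=2)
```

The result stays aligned with the original row order, so it can be attached as a new column without reshuffling the data.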
Best,
Martin
Dortmund, Germany
If you copy your score attribute first, Discretize by Frequency should be able to do this directly for your score attribute by selecting that attribute and setting the number of bins to 20. This will create exactly the bins you are looking for, although if there are a large number of ties this can sometimes cause problems for the Discretize operators. (The reason you copy the score first is Discretize will replace your selected attribute with a new attribute, so if you still want to have the raw score, you will need two copies of it, one which is binned and one which is not).
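Ties are worth checking before relying on any equal-frequency split. The sketch below is plain Python, not RapidMiner, and uses made-up scores in which 40 of 100 rows share the same confidence; `rank_bins` is a hypothetical helper that bins purely by rank. How the Discretize operators themselves resolve such ties is not shown here.

```python
def rank_bins(scores, n_bins):
    """Equal-size bins by descending rank: (score, bin number) pairs."""
    ranked = sorted(scores, reverse=True)
    return [(s, rank * n_bins // len(ranked) + 1) for rank, s in enumerate(ranked)]


# Hypothetical data: 60 distinct scores above 0.1, then 40 rows tied at 0.1.
scores = [0.2 + i / 100 for i in range(60)] + [0.1] * 40
binned = rank_bins(scores, n_bins=10)

# Which bins did the tied rows end up in?
bins_for_tied_score = sorted({b for s, b in binned if s == 0.1})
print(bins_for_tied_score)  # prints [7, 8, 9, 10]
```

Rows with an identical score are spread over four different bins, so the bin label alone says nothing about how those rows differ.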
Lindon Ventures
Data Science Consulting from Certified RapidMiner Experts