Logistic regression on large data sets: RapidMiner vs. SAS
Hi!
Background: I am a consultant working with a customer to replace SAS with RapidMiner Studio (not Server). Most of the analysts work on developing marketing scorecards (logistic regression/decision trees).
I have read most of the informative blog posts, but please excuse me for re-posting some niggling questions:
1. SAS vs. RapidMiner: Predictions from the two tools will not match because the underlying techniques differ. How does the customer validate historical predictions going forward (model development in SAS, but validation in RapidMiner)?
2. Prediction vs. Explanation: My customer uses the beta coefficients and odds ratios to derive insights. In RapidMiner, how will they read and interpret the weights of the explanatory variables?
3. Small vs. Large Data Set: The customer currently has 1 million records and 3,000 attributes, which are analysed on a Dell Inspiron 5000 series laptop with 8 GB of RAM. The customer is not keen on the sampling/extrapolation route, nor do they want to upgrade to the Server version at this stage of the transition (SAS to RapidMiner). What are the alternatives?
a. Pre-processing: What would the loop/macro design look like to run stepwise logistic regression?
b. Radoop/Stream Database: Is this an option they can adopt to run logistic regression?
Answers
1) My comment above might explain the differences between SAS and RM.
2) See the Weka operator. You get both beta coefficients and odds ratios.
3) I don't see why this might be a problem.
4) You can do forward/backward variable selection.
5) No clue.
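On point 2, the two quantities are directly related: an odds ratio is the exponential of the corresponding beta coefficient, so either one can be derived from the other whichever tool reports it. A minimal Python sketch, where the attribute names and coefficient values are made up for illustration and are not taken from any SAS or RapidMiner output:

```python
import math

# Hypothetical coefficients, as one might copy them from a model's output table.
coefficients = {
    "intercept": -1.20,
    "age": 0.05,
    "has_credit_card": 0.693,
}

def odds_ratios(betas):
    """Return exp(beta) for every attribute except the intercept."""
    return {name: math.exp(b) for name, b in betas.items() if name != "intercept"}

def predict_probability(betas, row):
    """Logistic model: p = 1 / (1 + exp(-(intercept + sum(beta_i * x_i))))."""
    z = betas["intercept"] + sum(betas[name] * value for name, value in row.items())
    return 1.0 / (1.0 + math.exp(-z))

ratios = odds_ratios(coefficients)
for name, ratio in ratios.items():
    print(f"{name}: odds ratio = {ratio:.3f}")
```

An odds ratio close to 2 for `has_credit_card` would read as "the odds of the target roughly double when this attribute goes from 0 to 1, with the other attributes held fixed".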
5) In my opinion, Radoop wouldn't be necessary for such a small data set; 1,000,000 records is really laptop-sized these days. Radoop really becomes useful once your client has invested in Hadoop infrastructure for storing the data across multiple servers, and it sounds like they aren't at that point yet. If they don't have a cluster, they don't need Radoop.
I'll also add a few extra comments on the first few:
1) You can bring scored data into RapidMiner from other tools and mark one attribute as the label and another as the prediction. This means that all of RapidMiner's evaluation methods can be used (for example, T-Tests).
3) I recently used W-Logistic on a 2-million-record data set on my 16 GB laptop, so you should be fine. If you do run into problems, let us know, because there are always ways around them.
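To make point 1 a bit more concrete: once the historical SAS scores and the new scores sit next to the known label for the same records, the comparison itself is simple arithmetic. Here is a rough Python sketch of that idea, done outside either tool. The six records, the score values and the 0.5 cut-off are all invented for illustration, and the helper functions are hypothetical, not part of any RapidMiner or SAS API:

```python
def accuracy(labels, scores, threshold=0.5):
    """Share of records where the thresholded score agrees with the 0/1 label."""
    hits = sum((s >= threshold) == bool(y) for y, s in zip(labels, scores))
    return hits / len(labels)

def auc(labels, scores):
    """Chance that a random positive outscores a random negative (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def max_abs_difference(a, b):
    """Largest per-record gap between two score columns."""
    return max(abs(x - y) for x, y in zip(a, b))

# Made-up example: true label, exported SAS score, new score for the same records.
labels = [1, 0, 1, 1, 0, 0]
sas_scores = [0.81, 0.22, 0.64, 0.47, 0.35, 0.10]
rm_scores = [0.79, 0.25, 0.66, 0.52, 0.33, 0.12]

print("SAS accuracy:", accuracy(labels, sas_scores))
print("RM accuracy: ", accuracy(labels, rm_scores))
print("SAS AUC:", auc(labels, sas_scores), "RM AUC:", auc(labels, rm_scores))
print("Largest score gap:", max_abs_difference(sas_scores, rm_scores))
```

The point of looking at the largest score gap as well as accuracy and AUC is that two models can rank customers identically while still disagreeing near the cut-off, which matters for scorecards.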
RapidMiner seems very flexible.
It is wonderful to see that you are trying to replace SAS! If you need any help, you can also contact me directly at mschmitz at rapidminer dot com. I think this is something our professional services team should support.
Best,
Martin
Dortmund, Germany