Auto Model and overfitting
I've been experimenting with Auto Model for Prediction and am generally happy with the concept and results.
In the Auto Model process the sampling is set to 80/20. Is this sufficient to control potential overfitting? I am getting performance ranging from about 60% accuracy for Naive Bayes to 87% accuracy for GBT. I have fewer than 1000 rows of data and 20 attributes for each data set. GBT is generating about 20 trees. (I would potentially be operationalising with hundreds of data sets and a dedicated model per data set.)
Answers
Hello @dgarrard - I think it is always prudent to be on alert for overfitting, regardless of whether you are using Auto Model or the "normal" RapidMiner methods. We all know that some models, such as neural networks, are prone to overfitting and should be used with caution, particularly on small data sets.
My personal opinion is that the 80/20 split is widely used and is, in general, a reasonable split ratio and should be sufficient to avoid overfitting if used in conjunction with methods such as cross-validation (which is default in Auto Model).
In the end, I always look at results with skepticism irrespective of the tool used until I actually inspect them to see how my "fit" looks on unseen data.
Hope that helps.
Scott
Thank you for the quick reply Scott. I'll try to get some testing done in the next couple weeks while my Auto-Model trial is still available!
David
Hi,
I just have a couple of minutes, so I thought I would weigh in a bit on this topic. This is in fact really important to me. So here we go...
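I personally think it is important to note that there are actually two somewhat unrelated questions here:
1. Is there overfitting in my model?
2. Is your test data sufficient to give you a good estimate of what is really relevant, namely the test performance?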
Ok, on question 1: "Is there overfitting in my model?". The simple answer is: "Yes, always."
Overfitting always happens, independent of the training / test split. It happens for all models and all data sets, sometimes more, sometimes less. I sometimes think we are too focused on the term "overfitting" while what you really want to know is how well the model will perform in the future. Yes, there will be some overfitting (which is why the performance on the training data is practically always somewhat higher than on the test data). But the training performance (and as a consequence the measurement of the degree of overfitting) is kind of irrelevant, at least if what you want is to evaluate the model correctly for expected future performance, which I guess is what the whole validation should be about in the first place.
Here is an example to make this a bit clearer. If your test data shows an accuracy of 80%, it will be about 80% also in the future. It does not really matter if the training accuracy was 81% (i.e. hardly any overfitting happened) or 90% (i.e. a lot of overfitting happened). The model will roughly perform with 80% accuracy irrespective of the amount of overfitting and the training error. And this is what really counts in my opinion.
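To make the example above concrete, here is a minimal Python sketch. It is not RapidMiner code: the toy data, the 1-nearest-neighbour "model" and all function names are assumptions chosen purely for illustration. The model memorizes the training rows, so its training accuracy is perfect, while the accuracy on the held-out 20% is the number that says something about future data.

```python
import random


def make_data(n, rng, noise=0.2):
    """Hypothetical toy data: the label is x > 0.5, flipped with probability `noise`."""
    rows = []
    for _ in range(n):
        x = rng.random()
        label = x > 0.5
        if rng.random() < noise:
            label = not label
        rows.append((x, label))
    return rows


def predict_1nn(train, x):
    """Return the label of the closest training row (this memorizes the training set)."""
    return min(train, key=lambda row: abs(row[0] - x))[1]


def accuracy(train, rows):
    """Fraction of `rows` whose label matches the 1-NN prediction built from `train`."""
    hits = sum(predict_1nn(train, x) == y for x, y in rows)
    return hits / len(rows)


rng = random.Random(42)
data = make_data(1000, rng)
train, test = data[:800], data[800:]  # 80/20 split

train_acc = accuracy(train, train)  # evaluated on rows the model has already seen
test_acc = accuracy(train, test)    # evaluated on unseen rows

print(f"training accuracy: {train_acc:.2f}")
print(f"test accuracy:     {test_acc:.2f}")
print(f"gap (degree of overfitting): {train_acc - test_acc:.2f}")
```

The gap between the two numbers describes how much overfitting happened, but only the test accuracy is an estimate of what to expect later.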
So why is the training / test split ratio then important at all? It actually is somewhat connected to the overfitting question discussed above. If you provide more training data, most models will generalize better (better test accuracy, less overfitting, or both). Even if you ignore measuring the degree of overfitting, you still want to get better test accuracies...
So should we just go with as much training data as possible then? Well, here is the catch. The split really has an impact on the second question: "is your test data sufficient to give you a good estimation of what is really relevant, namely the test performance?"
If there is not enough data in your test set, you may run into the issue that the test data sample is particularly easy (or hard), and your estimate of future performance might be far off from reality. To reduce this risk, I practically always use cross-validation, at least before I take a model into production, simply because cross-validation reduces the selection bias for the test set.
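As a rough illustration of why the size of the test set matters, here is a small Python sketch (again not RapidMiner code; the function names and the normal-approximation interval are my assumptions, and the 87% accuracy and 1000 rows are taken from the original question). It compares how uncertain an accuracy estimate is when it rests on a single 200-row hold-out versus on all 1000 rows, each tested once across 5 folds. Treating pooled cross-validation predictions as independent is itself a simplification, so read the numbers as indicative only.

```python
import math
import random


def accuracy_interval(accuracy, n_test, z=1.96):
    """Normal-approximation interval (about 95%) for an accuracy measured on n_test rows."""
    se = math.sqrt(accuracy * (1.0 - accuracy) / n_test)
    return max(0.0, accuracy - z * se), min(1.0, accuracy + z * se)


def kfold_indices(n_rows, k, seed=0):
    """Yield (train_idx, test_idx) pairs; every row lands in exactly one test fold."""
    idx = list(range(n_rows))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i, test_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx


# Single 80/20 hold-out on 1000 rows: only 200 rows back the estimate.
holdout = accuracy_interval(0.87, 200)

# 5-fold cross-validation: all 1000 rows are tested once across the folds.
n_tested = sum(len(test_idx) for _, test_idx in kfold_indices(1000, 5))
pooled = accuracy_interval(0.87, n_tested)

print(f"hold-out (200 rows):    {holdout[0]:.3f} to {holdout[1]:.3f}")
print(f"5-fold CV ({n_tested} rows): {pooled[0]:.3f} to {pooled[1]:.3f}")
```

With only 200 test rows the interval is roughly plus or minus 5 percentage points around 87%, which is why a single small hold-out can look particularly easy or hard by chance.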
So here is what I consider best practice:
If you want to read more on the whole validation topic, I wrote a 26-page white paper on this some time ago :-) Here is the link:
//www.turtlecreekpls.com/resource/correct-model-validation/
Sorry for the long email. You very likely knew all of this already in which case just ignore this please. I just wanted to take the chance to write some thoughts down and make them publicly available ;-)
Best and happy modeling,
Ingo
Hi, this is very helpful, thank you. But I do have a follow-up question: is Auto Model showing a test set accuracy or a training set accuracy in the results view? I ran a GBT in Auto Model on 4500 rows of data with 15 features and received an "accuracy" of 90% and an f-measure of 84%, but when I applied the model to new unseen data (which I purposely held out from the training and cross-validation process), the accuracy dropped below 50%. So I am not sure whether I am running the validation process incorrectly or perhaps not understanding what the CV results are telling me, as I had expected Auto Model to produce an accuracy that reflects how well the model will perform in the future. Thanks much.
Hi,
sorry for the delay, I missed this one here. It shows the testing error of course. If you read my correct validation opus linked above, you will see that we would NEVER care about training errors in the first place ;-)
Such a drop can be caused either by a (significant) change in data distributions between the training and validation sets or, what I personally find more likely given the size of the drop, by not applying exactly the same data preparation to your validation set. More about this in the other thread here:
https://community.www.turtlecreekpls.com/t5/RapidMiner-Auto-Model-Turbo-Prep/Is-auto-model-showing-test-or-train-error/m-p/50902/highlight/false#M117
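To illustrate what "exactly the same data preparation" means, here is a hedged Python sketch. It is not what Auto Model does internally and not RapidMiner code; the attribute values, the z-score normalization and the function names are all hypothetical. The point is only that preparation parameters are learned once on the training data and then reused unchanged on new data.

```python
import statistics


def fit_scaler(column):
    """Learn mean and standard deviation from a column (use the TRAINING data only)."""
    return statistics.mean(column), statistics.pstdev(column)


def apply_scaler(column, params):
    """Z-score a column with previously learned parameters."""
    mean, std = params
    return [(value - mean) / std for value in column]


train_values = [10.0, 12.0, 14.0, 16.0, 18.0]  # hypothetical training attribute
new_values = [30.0, 32.0, 34.0]                # hypothetical unseen data, shifted range

# Consistent: reuse the parameters learned on the training data.
params = fit_scaler(train_values)
consistent = apply_scaler(new_values, params)

# Inconsistent: refitting on the new data re-centres it and hides the shift,
# so the model sees values that mean something different than during training.
inconsistent = apply_scaler(new_values, fit_scaler(new_values))

print("consistent:  ", [round(v, 2) for v in consistent])
print("inconsistent:", [round(v, 2) for v in inconsistent])
```

If the two preparations differ like this, the model is effectively scored on differently encoded inputs, and a large accuracy drop is not surprising.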
Hope this helps,
Ingo