"What does cross-validation do with models on each subset?"
Cross-validation is a technique primarily for performance estimation: it allows the training set to also be used as an independent test set. Cross-validation can also be used to prevent overfitting by stopping training when the performance on the left-out set begins to suffer.
How does the XValidation operator work in RapidMiner with respect to the models? Is a new, independent model trained for each subset? Or is it assumed that models used with the XValidation operator allow incremental training, so that each new iteration updates the same model?
If the former, then the resulting model is not trained on the whole dataset, but only on the training data of one of the XVal iterations, i.e. n-1 subsets.
If the latter, then the model is retrained on n-1 duplicates of every datapoint. To see this, consider a 3-fold cross-validation:
              subset 1   subset 2   subset 3
iteration 1:  test       train      train
iteration 2:  train      test       train
iteration 3:  train      train      test
So the model would see subset 1 twice, subset 2 twice and subset 3 twice.
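To double-check the counting in the table, here is a minimal sketch in plain Python. It has nothing to do with RapidMiner itself; the helper names kfold_roles and training_counts are made up for illustration. It only rebuilds the train/test layout and counts how often each subset lands in the training role.

```python
def kfold_roles(n_folds):
    """Return, per iteration, the role ('test' or 'train') of each subset."""
    return [
        ["test" if subset == iteration else "train" for subset in range(n_folds)]
        for iteration in range(n_folds)
    ]


def training_counts(n_folds):
    """Count in how many iterations each subset is used for training."""
    roles = kfold_roles(n_folds)
    return [sum(row[s] == "train" for row in roles) for s in range(n_folds)]


# Reproduce the 3-fold layout from the table.
for i, row in enumerate(kfold_roles(3), start=1):
    print(f"iteration {i}:", *row)

# Each subset is in the training role n-1 = 2 times.
print(training_counts(3))  # [2, 2, 2]
```

With an incrementally updated model, those counts would be the number of times each datapoint is fed to the same model.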
Finally, I haven't seen any documentation that XValidation is used to prevent overfitting. Can someone confirm?
Thanks,
Gary
Answers
First of all, let me point you to this very old thread: http://rapid-i.com/rapidforum/index.php/topic,62.0.html
Second, XValidation does exactly what Wikipedia describes for k-fold cross-validation.
Third: yes, I think you mean the right thing. "Stopping" does not mean stopping within the iterations of a single cross-validation run; it means stopping further optimization once the cross-validation of your latest classification model shows bad performance. Cross-validation helps you prevent overfitting in the sense that it gives you a reliable estimate of a classifier's performance.
For more information and a really detailed discussion of cross-validation, check out the link to the thesis I posted in the thread linked above.
Hope this was helpful,
Steffen
For other readers: the thread confirms that the first suggestion in my original post is correct. A new, independent model is trained for each fold of the cross-validation.
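For anyone who wants to see what "independent model per fold" looks like, here is a rough sketch in plain Python. It is not RapidMiner's implementation; MeanModel and cross_validate are hypothetical names, and the "learner" is just a toy that predicts the mean of its training targets. The point is only that a fresh model object is created in every iteration and nothing is carried over between folds.

```python
class MeanModel:
    """Toy regressor (hypothetical): predicts the mean of its training targets."""

    def __init__(self):
        self.mean = None
        self.n_seen = 0

    def fit(self, targets):
        self.mean = sum(targets) / len(targets)
        self.n_seen = len(targets)
        return self

    def predict(self):
        return self.mean


def cross_validate(targets, n_folds):
    """Train a fresh model per fold; return the models and per-fold mean absolute errors."""
    # Simple interleaved split into n_folds subsets.
    folds = [targets[i::n_folds] for i in range(n_folds)]
    models, errors = [], []
    for test_idx in range(n_folds):
        train = [y for j, fold in enumerate(folds) if j != test_idx for y in fold]
        model = MeanModel().fit(train)  # new object each iteration, no shared state
        test = folds[test_idx]
        errors.append(sum(abs(y - model.predict()) for y in test) / len(test))
        models.append(model)
    return models, errors


targets = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
models, errors = cross_validate(targets, n_folds=3)
print([m.n_seen for m in models])   # each model saw only n-1 subsets: [4, 4, 4]
print(sum(errors) / len(errors))    # averaged performance estimate: 1.5
```

Each of the three models sees only four of the six datapoints, and the averaged error is the performance estimate.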
Thanks,
Gary