"What does cross-validation do with models on each subset?"

DrGary (Member, Posts: 8, Contributor II)
edited May 2019 in Help

Cross-validation is a technique primarily for performance estimation: it allows the training set to also be used as an independent test set. Cross-validation can also be used to prevent overfitting by stopping training when the performance on the left-out set begins to suffer.

How does the XValidation operator work in RapidMiner with respect to the models? Is a new, independent model trained for each subset? Or is it assumed that models used with the XValidation operator allow incremental training, so that each new iteration updates the same model?

If the former, then the resulting model is not trained on the whole dataset, but only on the training data of one XVal iteration, i.e. n-1 of the n subsets.

If the latter, then the model is retrained on n-1 duplicates of every datapoint. To see this, consider a 3-fold cross-validation:

             subset 1  subset 2  subset 3
iteration 1: test      train     train
iteration 2: train     test      train
iteration 3: train     train     test


So the model would see subset 1 twice, subset 2 twice and subset 3 twice.
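
To make the counting argument above concrete, here is a minimal Python sketch that rebuilds the schedule for any number of folds and counts how often each subset lands in a training set. The function name `fold_schedule` is made up for illustration; this is not RapidMiner code and says nothing about how XValidation is implemented.

```python
def fold_schedule(k):
    """For each of the k iterations, list the role of every subset."""
    return [["test" if s == i else "train" for s in range(k)] for i in range(k)]


k = 3
schedule = fold_schedule(k)
for i, roles in enumerate(schedule, start=1):
    print(f"iteration {i}: " + " ".join(roles))

# How many iterations use each subset for training (k - 1 for every subset).
train_counts = [sum(row[s] == "train" for row in schedule) for s in range(k)]
print(train_counts)
```

Each subset is used for training in k-1 iterations and for testing in exactly one. Whether that amounts to "seeing the data twice" depends on whether the same model object is carried across iterations, which is the question asked above.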

Finally, I haven't seen any documentation that XValidation is used to prevent overfitting. Can someone confirm?

Thanks,
Gary

Answers

  • steffen (Member, Posts: 347, Maven)
    Hello DrGary

    First of all, let me point you to this very old thread: http://rapid-i.com/rapidforum/index.php/topic,62.0.html

    Second, XValidation does exactly what Wikipedia tells us regarding k-fold cross-validation:

    K-fold cross-validation

    In K-fold cross-validation, the original sample is partitioned into K subsamples. Of the K subsamples, a single subsample is retained as the validation data for testing the model, and the remaining K − 1 subsamples are used as training data. The cross-validation process is then repeated K times (the folds), with each of the K subsamples used exactly once as the validation data. The K results from the folds then can be averaged (or otherwise combined) to produce a single estimation. The advantage of this method over repeated random sub-sampling is that all observations are used for both training and validation, and each observation is used for validation exactly once. 10-fold cross-validation is commonly used [5].

    In stratified K-fold cross-validation, the folds are selected so that the mean response value is approximately equal in all the folds. In the case of a dichotomous classification, this means that each fold contains roughly the same proportions of the two types of class labels.
    Third:
    DrGary wrote:

    Cross-validation can also be used to prevent overfitting by stopping training when the performance on the left-out set begins to suffer.
    Yes, I think you mean the right thing. Here, "stopping" does not mean stopping within the iterations of a single cross-validation run; it means stopping further optimization once the cross-validation of your latest classification model has shown bad performance. Cross-validation helps you prevent overfitting in the sense that it gives you a reliable estimate of a classifier's performance.

    For more information and really detailed thinking about cross-validation, check out the link to the thesis I posted in the thread linked above.

    Hope this was helpful,

    Steffen
  • DrGary (Member, Posts: 8, Contributor II)
    Steffen, thanks. The first link was helpful.

    For other readers: the thread confirms that the first suggestion in my original post is correct. The models are independent for each fold of the cross-validation.

    Thanks,
    Gary
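
For readers who want to see the "independent model per fold" behaviour spelled out, below is a minimal plain-Python sketch of k-fold cross-validation as described in the Wikipedia excerpt quoted above. It is an illustration under stated assumptions, not RapidMiner's implementation: `MajorityClassifier` and `k_fold_cross_validate` are hypothetical names, the toy data is invented, and the folds are interleaved rather than shuffled or stratified. Fitting one final model on all the data afterwards is a common convention and is shown here only as an assumption.

```python
from collections import Counter


class MajorityClassifier:
    """Toy model: always predicts the most frequent training label."""

    def fit(self, X, y):
        self.label_ = Counter(y).most_common(1)[0][0]
        return self

    def predict(self, X):
        return [self.label_ for _ in X]


def k_fold_cross_validate(X, y, k, make_model):
    """Return one accuracy score per fold, training a fresh model each time."""
    n = len(X)
    folds = [list(range(i, n, k)) for i in range(k)]  # interleaved, not shuffled
    scores = []
    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [j for j in range(n) if j not in held_out]
        model = make_model()  # new model: nothing carries over between folds
        model.fit([X[j] for j in train_idx], [y[j] for j in train_idx])
        preds = model.predict([X[j] for j in test_idx])
        correct = sum(p == y[j] for p, j in zip(preds, test_idx))
        scores.append(correct / len(test_idx))
    return scores


# Invented toy data: 9 examples, labels "a" or "b".
X = list(range(9))
y = ["a", "a", "b", "a", "a", "a", "b", "a", "a"]

scores = k_fold_cross_validate(X, y, k=3, make_model=MajorityClassifier)
estimate = sum(scores) / len(scores)  # averaged performance estimate

# Assumed convention: the model you keep is fit once on all the data.
final_model = MajorityClassifier().fit(X, y)
print(scores, estimate)
```

The fold models exist only to produce the performance estimate and are thrown away afterwards, so no datapoint is fed repeatedly into a single model.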