What about the n models generated in cross-validation? Shouldn't we take the average of all models (linear regression)?

binaytamrakar Member Posts: 5 Contributor I
edited November 2018 in Help

I have a question regarding cross validation in Linear regression model.

From my understanding, in cross validation we split the data into (say) 10 folds, train the model on 9 folds, and use the remaining fold for testing. We repeat this process until every fold has been used for testing exactly once.
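
Just to make the procedure I mean concrete outside of RapidMiner, here is a rough Python/scikit-learn sketch of that 10-fold loop (illustrative only, with made-up data, not my actual process):

    # 10-fold cross-validation for linear regression: train on 9 folds, test on the held-out fold.
    import numpy as np
    from sklearn.datasets import make_regression
    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import mean_squared_error
    from sklearn.model_selection import KFold

    X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

    fold_errors = []
    for train_idx, test_idx in KFold(n_splits=10, shuffle=True, random_state=0).split(X):
        model = LinearRegression().fit(X[train_idx], y[train_idx])  # a (slightly) different model each time
        pred = model.predict(X[test_idx])                           # tested on the one fold it has not seen
        fold_errors.append(mean_squared_error(y[test_idx], pred))

    print("average MSE over the 10 folds:", np.mean(fold_errors))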

When we train the model on 9 folds, should we not get a different model each time (maybe only slightly different from the model we get when using the whole dataset)? I know that we take the average of all n performances, and I can see that clearly when I use the operator "Write as Text".

But what about the model? Shouldn't the resulting model also be the average of all n models? I see that the resulting model is the same as the model created from the whole dataset before cross-validation. If we keep the overall model even after cross-validation (rather than averaging the n models), then what is the point of calculating the average performance of n different models (they are trained on different data and are supposed to differ, right?)

I apologize if my question is not clear or too funny.

Thanks for reading, though!

Best Answer

  • IngoRM Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, Community Manager, RMResearcher, Member, University Professor Posts: 1,751 RM Founder
    Solution Accepted

    Hi,

    This is not a funny question at all - I would even go so far as to say it is probably one of the most frequently asked questions in machine learning I have heard in my life :smileyhappy:

    Let me get straight to the point here: Cross-validation is not about model building at all. It is a common scheme to estimate (not calculate! - subtle but important difference) how well a given model will work on unseen data. So the fact that we deliver a model at the end (for convenience reasons) might lead you to the conclusion that it actually is about model building as well - but this is just not the case.

    Ok, here is why this validation is an approximation of an estimation for a given model only: typically you want to use as much data as possible, since labeled data is expensive and in most cases the learning curves show you that more data leads to better models. So you build your model on the complete data set since you hope this is the best model you can get. Brilliant! This is the given model from above. You could now gamble and use this model in practice, hoping for the best. Or you want to know in advance if this model is really good before you use it in practice. I prefer the latter approach ;-)
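
    If it helps, here is a minimal sketch of that distinction in Python/scikit-learn (not RapidMiner, and with made-up data - it is only meant to illustrate the idea):

        from sklearn.datasets import make_regression
        from sklearn.linear_model import LinearRegression
        from sklearn.model_selection import cross_val_score

        X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

        # The "given" model: trained once on the complete data set. This is what you use in practice.
        final_model = LinearRegression().fit(X, y)

        # Cross-validation does not touch or change final_model; it only estimates
        # how well a model built this way is likely to perform on unseen data.
        scores = cross_val_score(LinearRegression(), X, y, cv=10, scoring="r2")
        print("estimated R^2 on unseen data:", scores.mean())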

    So only now (actually kind of after you built the model on all data) you are of course also interested in learning how well this model works in practice on unseen data. Well, the closest estimate you could get is a so-called leave-one-out validation, where you use all but one data point for training and the one you left out for testing. You repeat this for all data points. This way, the models you build are "closest" to the one you are actually interested in (since only one example is missing), but unfortunately this approach is not feasible for most real-world scenarios since you would need to build 1,000,000 models for a data set with 1,000,000 examples.
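
    The leave-one-out idea, sketched the same way (note that the number of trained models equals the number of examples, which is exactly why it does not scale):

        from sklearn.datasets import make_regression
        from sklearn.linear_model import LinearRegression
        from sklearn.model_selection import LeaveOneOut, cross_val_score

        X, y = make_regression(n_samples=100, n_features=5, noise=10.0, random_state=0)

        loo = LeaveOneOut()
        # One model per example: 100 fits here, 1,000,000 fits for 1,000,000 examples.
        scores = cross_val_score(LinearRegression(), X, y,
                                 cv=loo, scoring="neg_mean_squared_error")
        print("models trained:", loo.get_n_splits(X))
        print("LOO estimate of MSE:", -scores.mean())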

    Here is where cross-validation enters the stage. It is just a more feasible approximation of something which was already only an estimation to begin with (since we omitted one example even in the LOO case). But this is still better than nothing. The important thing is: it is a performance estimation for the original model (built on all data), and not a tool for model selection. If anything, you could misuse a cross-validation as a tool for example selection, but I won't go into that discussion now.

    And besides this: you might have an idea of how to average 10 linear regression models - but what do we do with 10 neural networks with different optimized network structures? Or 10 different decision trees? How would you average those? In general, this problem cannot be solved anyway.
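
    Just to show why the averaging question even seems plausible for linear regression: you could average the coefficients of the 10 fold models, as in the hypothetical sketch below, but there is no analogous operation for 10 trees or 10 differently structured networks - and, as said, it is not what you should do anyway.

        import numpy as np
        from sklearn.datasets import make_regression
        from sklearn.linear_model import LinearRegression
        from sklearn.model_selection import KFold

        X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

        coefs, intercepts = [], []
        for train_idx, _ in KFold(n_splits=10, shuffle=True, random_state=0).split(X):
            m = LinearRegression().fit(X[train_idx], y[train_idx])
            coefs.append(m.coef_)
            intercepts.append(m.intercept_)

        # Averaging is only well-defined here because all 10 models share the same parameter shape.
        avg_coef = np.mean(coefs, axis=0)
        avg_intercept = np.mean(intercepts)
        print("averaged coefficients:", avg_coef, "averaged intercept:", avg_intercept)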

    You might enjoy reading this older discussion where I spend more time discussing the different options besides averaging: http://community.www.turtlecreekpls.com/t5/RapidMiner-Studio/Interpretation-of-X-Validation/m-p/9204

    The net is: none of them is a good idea and you should do the right thing, which is to build one model on as much data as you can and use cross-validation to estimate how well this model will perform on new data.

    Hope that clarifies this,

    Ingo


Answers

  • MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,404 RM Data Scientist

    Of course I second everything Ingo said. But I would like to add one more punch line:

    (Cross-)Validation is not about validating a model but about validating the method to generate a model.

    Best,

    Martin

    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany
  • binaytamrakar Member Posts: 5 Contributor I

    Dear Ingo,

    Thank you so much, Ingo. This is probably the best explanation I have ever had. Now it all makes sense to me.
    You really made my day!

    Binay

  • binaytamrakar Member Posts: 5 Contributor I

    Thanks Schmitz,

    Yes, that totally makes sense to me now.

    Binay
