"Regression problem with cross-validation"

dramhampton Member, Posts: 9, Contributor II
edited May 2019 in Help
Hi all

I have a concern about the output from cross-validation with regression. The CV operator should break the data into (say) 10 folds and sequentially use each 10% of the data as a test set for a model built on the other 90%, in order to measure performance - but the model it reports out should be built on all of the data, and the predictions it outputs should be made with that full-data model.

This means that if you have a single attribute used as a predictor and plot the predicted value against it, you should get a straight line.
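
As a point of reference, here is a minimal scikit-learn sketch of that expectation (an analogy only, not the RapidMiner process shown further down; the data, model, and variable names are made up for illustration): cross-validation estimates performance with ten fold models, while the reported model is trained on all of the data and its predictions form a straight line against the single predictor.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(150, 1))            # a single predictor attribute
y = 2.0 * X[:, 0] + rng.normal(0, 0.5, 150)      # numeric label

# 10-fold cross-validation: ten models, each trained on 90% and tested on the held-out 10%
rmse_per_fold = -cross_val_score(LinearRegression(), X, y,
                                 cv=10, scoring="neg_root_mean_squared_error")
print("CV RMSE per fold:", rmse_per_fold.round(3))

# the model that gets reported/delivered: trained on ALL of the data
final_model = LinearRegression().fit(X, y)
predictions = final_model.predict(X)             # plotted against X, this is a straight line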

However, I get a jerky line. This is specific to Cross Validation; if I try the same exercise with Split Validation, it works fine.

Am I misunderstanding the way CV works or...?

To make it easier to see the problem I have adapted the Iris dataset to illustrate it, with this process:

<connect from_op="Performance" from_port="performance" to_port="performance 1"/>
<connect from_op="Performance" from_port="example set" to_port="test set results"/>
<connect from_op="Cross Validation" from_port="test result set" to_port="result 3"/>


Many thanks for your help

David


Answers

  • sgenzer Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator, Posts: 2,959
    Hi David -

    Yes of course you should get a straight line plotting predicted(a4) vs a2, which I get when I run your process. Where do you see a jerky line?




    Scott
  • dramhampton Member, Posts: 9, Contributor II
    Oops, I forgot to mention something! I added an additional Apply Model operator after Cross Validation to show what you should get, and that produces a straight line. Now disable this second Apply Model and you will see the direct output from CV. Many thanks Scott!
  • Telcontar120 Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member, Posts: 1,635, Unicorn
    Yes @sgenzer I think this would be a very helpful KB article. This question does come up a lot!
    Brian T.
    Lindon Ventures
    Data Science Consulting from Certified RapidMiner Experts
  • dramhampton Member, Posts: 9, Contributor II
    Many thanks Scott. That's cracked it. The workaround to insert a new Apply Model operator will work well and I will be able to explain to people why it is needed. Very helpful!
    DH
  • sgenzer Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator, Posts: 2,959
    Great. Glad that helped. I'd like to use this article for other purposes, so please provide suggestions if something is not clear. Same of course for everyone else... @Telcontar120 :wink:
  • Telcontar120 Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member, Posts: 1,635, Unicorn
    @sgenzer this looks great to me... I think that color shading on the "tes" output results really clarifies things.
    Of course one might suggest that having another output for the true scored output from the final cross validation model would be a nice enhancement to the cross-validation operator, but that's another discussion!
    Brian T.
    Lindon Ventures
    Data Science Consulting from Certified RapidMiner Experts
  • MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor, Posts: 3,404, RM Data Scientist
    How so? There is no way to apply the final model on the training data.
    BR,
    Martin
    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany
  • Telcontar120 Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member, Posts: 1,635, Unicorn
    @mschmitz what do you mean? It's mechanically possible, in the sense that you can accomplish the same thing simply by outputting or storing the final model from cross-validation and then applying that model to the full dataset used as the cross-validation input (just as noted earlier in the thread). So I am not sure what you mean by "there is no way to apply the final model on the training data". We could debate whether this is a useful thing to have or not, but I think it is definitely a possible thing to produce (a rough scripted sketch of this follows below).

    Brian T.
    Lindon Ventures
    Data Science Consulting from Certified RapidMiner Experts
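
To make the point above concrete, here is a rough scikit-learn sketch (an analogy only, not RapidMiner's internals; the data and model are invented for illustration): pooled out-of-fold predictions, roughly what the tes port returns, come from ten different fold models, whereas storing the final model and applying it to the full input gives one consistent set of scores.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(1)
X = rng.uniform(0, 10, size=(150, 1))
y = 2.0 * X[:, 0] + rng.normal(0, 0.5, 150)

# pooled out-of-fold predictions: each value comes from one of ten fold models
oof_scores = cross_val_predict(LinearRegression(), X, y, cv=10)

# the stored final model applied to the full input: one model, one consistent score set
final_model = LinearRegression().fit(X, y)
full_scores = final_model.predict(X)

print("max |out-of-fold - final model|:", np.abs(oof_scores - full_scores).max().round(4))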
  • sgenzer Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator, Posts: 2,959
    All good points - it's always the same challenge of how much to bundle into one operator. Do you build in Apply Model to show the model applied to the entire data set, or leave it as is? I would advocate for the latter. But a better question is why we port the testing output at all. Does it serve any purpose? And yet, if the purpose of Cross Validation is purely to find a true estimate of performance, why do we port the model at all? But then you get into a world which does NOT seem "fast and simple"...



    You could even ask (and I think it's a legitimate question) why the Apply Model needs to be inserted manually on the Testing side of Cross Validation. Is there ever a situation when you do NOT? Wisdom of Crowds shows that people insert it 100% of the time :smiley:



    Call me crazy, but I have a hunch that @RalfKlinkenberg and @IngoRM grappled with these questions a long time ago and likely have good reasons for setting it up this way. Not saying it cannot be changed... just giving these guys the benefit of the doubt that there is a good rationale for doing it the way it's done here.

    Great discussion this morning!

    Scott

  • MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor, Posts: 3,404, RM Data Scientist
    Exactly that is statistically not sound. You cannot trust scores which are the result of this; you may have overtrained results.
    BR,
    Martin
    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany
  • Telcontar120 Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member, Posts: 1,635, Unicorn
    @mschmitz I completely agree with your point about overfitting, as you probably already know from our many earlier discussions about this topic :smile: If the main purpose of the output were to assess performance, then it would not be nearly as useful as the cross-validation performance output, which already comes out of the operator.

    However, there are other reasons to want to review the scores on the entire input set. For example, if you want to look at score distributions and measure potential score drift over time, you typically start with the baseline scores from the original development sample as a comparison point for later samples (a rough sketch of such a comparison follows after this post). Or, in the case of another recent thread, the user wanted to confirm the threshold value that was being applied. In fact, I recall an earlier bug in one of the learners (logistic regression, perhaps) that was only caught because of a similar analysis of scores on the full population.

    @sgenzer I also agree that this is not at all an urgent issue, but just because it has been handled one way in RapidMiner in the past doesn't necessarily mean that it could not use improvement. There are lots of things that have changed in RapidMiner over the years, and it is always worth discussing the merits of any specific idea for future changes.
    Brian T.
    Lindon Ventures
    Data Science Consulting from Certified RapidMiner Experts
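
As an illustration of the drift-monitoring use case mentioned above, here is a rough sketch (entirely illustrative and not a RapidMiner feature; the Population Stability Index and the 0.25 rule of thumb are general industry conventions, and the score samples are simulated): compare the score distribution of a later sample against the baseline scores from the development sample.

import numpy as np

def psi(baseline, current, bins=10):
    """Population Stability Index between two score samples; bin edges come from the baseline."""
    edges = np.quantile(baseline, np.linspace(0, 1, bins + 1))
    b = np.clip(np.histogram(baseline, edges)[0] / len(baseline), 1e-6, None)
    # clip the later sample into the baseline range so every score falls into a bin
    c = np.clip(np.histogram(np.clip(current, edges[0], edges[-1]), edges)[0] / len(current), 1e-6, None)
    return float(np.sum((c - b) * np.log(c / b)))

rng = np.random.default_rng(2)
baseline_scores = rng.normal(0.40, 0.10, 5000)   # scores from the original development sample
later_scores = rng.normal(0.45, 0.12, 5000)      # scores from a later period
print("PSI:", round(psi(baseline_scores, later_scores), 3))   # values above ~0.25 suggest drift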
  • MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor, Posts: 3,404, RM Data Scientist
    @Telcontar120 but where is the problem with the tes port? That gives you a fair estimate of these distributions.

    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany
  • Telcontar120 Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member, Posts: 1,635, Unicorn
    @mschmitz they may provide a fair estimate, but they are not actually generated using the same model. So from a compliance perspective they may not be sufficient. There are many regulated industries in the US where this would not be an acceptable starting point for model performance tracking.
    Brian T.
    Lindon Ventures
    Data Science Consulting from Certified RapidMiner Experts