"Regression problem with cross-validation"
dramhampton
MemberPosts:9Contributor II
Hi all
I have a concern about the output from cross-validation with regression. The CV operator should break the data into (say) 10 segments and sequentially use each 10% of the data as a test set for a model built on the other 90% to measure performance - but when it reports out its model, that model should be built on all the data, and the reported predictions should be made with that all-data model.
This means that if you have a single attribute used as a predictor and plot the predicted value against it, you should get a straight line.
However, I get a jerky line. This is specific to CV; if I try the same exercise with split validation it works fine.
Am I misunderstanding the way CV works or...?
To make it easier to see the problem, I have adapted the Iris dataset to illustrate it, with this process:
<连接from_op = "表演ance" from_port="performance" to_port="performance 1"/>
<连接from_op = "表演ance" from_port="example set" to_port="test set results"/>
<连接from_op = from_port =“交叉验证测试t result set" to_port="result 3"/>
Many thanks for your help
David
Best Answers
sgenzer Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,959 Community Manager
Hi David -
OK, I understand. This is a common misunderstanding. I'm going to explain briefly here, and because this question comes up a LOT, I'm going to write a KB article as well.
In short, the "tes" output is the appended result of each Apply Model run inside the cross-validation (one per fold), NOT the application of the final model to the whole set.
Give me an hour or so to write this KB so you can see what I'm getting at.
Scott
sgenzer Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,959 Community Manager
OK, please look at this... sorry about the formatting:
https://community.www.turtlecreekpls.com/discussion/55112/cross-validation-and-its-outputs-in-rm-studio
Scott
Answers
Yes of course you should get a straight line plotting predicted(a4) vs a2, which I get when I run your process. Where do you see a jerky line?
Scott
DH
Of course, one might suggest that an additional output carrying the true scored output from the final cross-validation model would be a nice enhancement to the Cross Validation operator, but that's another discussion!
You could even ask (and I think it's a legitimate question) why the Apply Model needs to be inserted manually on the Testing side of Cross Validation. Is there ever a situation when you do NOT? Wisdom of Crowds shows that people insert it 100% of the time.
Call me crazy, but I have a hunch that @RalfKlinkenberg and @IngoRM grappled with these questions a long time ago and likely have good reasons for setting it up this way. Not saying it cannot be changed... just giving these guys the benefit of the doubt that there is a good rationale for doing it the way it's done here.
Great discussion this morning!
Scott
However, there are other reasons to want to review the scores on the entire input set. For example, if you want to look at score distributions and measure potential score drift over time, you typically start with the baseline scores from the original development sample as a comparison point for later samples. Or, as in another recent thread, the user wanted to confirm the threshold value that was being applied. In fact, I recall an earlier bug in one of the learners (logistic regression, perhaps) where there was a problem with this, and it was only caught because of a similar analysis of scores on the full population.
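To sketch the kind of drift check I mean (a toy Python example, not taken from this process; the score arrays here are made up), you would keep the scores from the development sample as a baseline and compare the score distribution of a later sample against it, for example with a two-sample Kolmogorov-Smirnov test:

import numpy as np
from scipy.stats import ks_2samp

# Hypothetical score arrays; in practice these would come from applying the
# final (all-data) model to the development sample and to a later sample.
baseline_scores = np.random.default_rng(1).normal(0.40, 0.10, size=1000)
later_scores = np.random.default_rng(2).normal(0.45, 0.10, size=800)

# Two-sample KS test: a small p-value suggests the score distributions differ,
# i.e. possible score drift relative to the development baseline.
result = ks_2samp(baseline_scores, later_scores)
print(f"KS statistic = {result.statistic:.3f}, p-value = {result.pvalue:.3g}")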
@sgenzer I also agree that this is not at all an urgent issue, but simply because it has been handled one way in the past in RapidMiner doesn't necessarily mean that it could not use improvement. Lots of things have changed in RapidMiner over the years, and it is always worth a discussion on the merits of any specific idea for future changes.