"[SOLVED] Question: Cross-Validation"
Hello everyone,
While learning RapidMiner, I came across X-Validation (which is a useful thing!), but how exactly does it work?
Let's assume we have a data set of 100 examples, want to build a decision tree, and the number of validations is 10.
There are (at least) 2 possibilities:
a) The output model is the decision tree built on all 100 examples, but for the performance a tree is always trained on 90 examples and tested on the other 10 (so that tree might always be different from the actual output tree!)
b) The output model is the decision tree built on all 100 examples, and the performance is measured by testing that output tree on 10 * 10 examples.
After reading the description of X-Validation, I think a) is correct, but b) makes more sense to me, since the decision tree in a) might always be different from the actual output tree.
Which alternative is correct, and if it is a), am I right that the tree might always be different?
Cheers Q-Dog
Answers
The cross-validation operator indeed works like option a). Option b) does not make any sense at all: there you would evaluate on the training data, which is exactly the thing you want to not do with cross-validation!
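To see why evaluating on the training data is misleading, here is a rough sketch in plain Python. It is not RapidMiner code: the toy data, the memorising 1-nearest-neighbour learner, and the names `train_1nn` and `accuracy` are all assumptions made up for illustration.

```python
import random

# Hypothetical toy data: 100 examples, label is 1 when x > 0.5,
# with roughly 20% of the labels flipped as noise.
rng = random.Random(0)
examples = []
for i in range(100):
    x = i / 100
    y = int(x > 0.5)
    if rng.random() < 0.2:
        y = 1 - y
    examples.append((x, y))

def train_1nn(train):
    """Return a 1-nearest-neighbour classifier that simply memorises `train`."""
    def predict(x):
        return min(train, key=lambda ex: abs(ex[0] - x))[1]
    return predict

def accuracy(model, data):
    """Fraction of (x, label) pairs in `data` that `model` predicts correctly."""
    return sum(model(x) == y for x, y in data) / len(data)

# Option b): train on all 100 examples and test on those same examples.
model = train_1nn(examples)
training_accuracy = accuracy(model, examples)

# Honest check: hold out every 10th example and train on the other 90.
held_out = examples[::10]
rest = [ex for i, ex in enumerate(examples) if i % 10 != 0]
held_out_accuracy = accuracy(train_1nn(rest), held_out)

print("accuracy on training data:", training_accuracy)
print("accuracy on held-out data:", held_out_accuracy)
```

Because this learner memorises its training set and every x value is distinct, it always scores perfectly on the data it was trained on, noise included. That number says nothing about how it handles unseen examples, which is what the held-out figure measures.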
For more information read my first answer in the following thread:
http://rapid-i.com/rapidforum/index.php/topic,62.0.html
And also Steffen's answer in this thread:
http://rapid-i.com/rapidforum/index.php/topic,959.msg3598.html
In short: don't confuse error estimation (done by cross-validation) with model generation. Models are also built inside the cross-validation, one per fold, but talking about those models is not useful; the model you actually use is built outside of the cross-validation, on the complete data. The model port of the operator exists just for convenience, so that you get both the model and the estimated performance for this type of model.
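The separation between the two can be sketched in plain Python as follows. This is only an illustration of the idea under stated assumptions, not the operator's actual implementation: the one-split "stump" stands in for a real decision tree learner, and the names `train_stump` and `cross_validate` as well as the toy data are hypothetical.

```python
import random

def train_stump(examples):
    """Fit a one-split 'tree' on (x, label) pairs: pick the threshold and
    orientation with the fewest errors on the given examples."""
    best = None
    for threshold in sorted({x for x, _ in examples}):
        for left, right in ((0, 1), (1, 0)):
            errors = sum((left if x <= threshold else right) != y
                         for x, y in examples)
            if best is None or errors < best[0]:
                best = (errors, threshold, left, right)
    _, threshold, left, right = best
    return lambda x: left if x <= threshold else right

def cross_validate(examples, k=10, seed=0):
    """Return (final_model, estimated_accuracy, folds)."""
    data = examples[:]
    random.Random(seed).shuffle(data)
    folds = [data[i::k] for i in range(k)]  # k disjoint test sets

    accuracies = []
    for i in range(k):
        test = folds[i]
        train = [ex for j, fold in enumerate(folds) if j != i for ex in fold]
        fold_model = train_stump(train)  # throwaway model, used only for scoring
        accuracies.append(sum(fold_model(x) == y for x, y in test) / len(test))
    estimate = sum(accuracies) / k  # error estimation

    final_model = train_stump(data)  # model generation on the complete data
    return final_model, estimate, folds

# Hypothetical toy data: 100 examples, label is 1 when x > 0.5, with some noise.
rng = random.Random(42)
examples = []
for _ in range(100):
    x = rng.random()
    y = int(x > 0.5)
    if rng.random() < 0.1:
        y = 1 - y
    examples.append((x, y))

model, estimate, folds = cross_validate(examples, k=10)
print("estimated accuracy:", estimate)
print("prediction of the delivered model for x = 0.9:", model(0.9))
```

With 100 examples and k = 10, each fold model is trained on 90 examples and tested on the remaining 10. Those ten models are thrown away; only their averaged score is kept, as an estimate for the model that is returned, which is trained on all 100 examples.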
Cheers,
Ingo
The two links especially made it crystal clear to me.