Split-Validation Issue
@sgenzer I believe there may be an issue with the Split Validation operator. The model output by the split-validation process does not correspond to the model with which the validation performance metrics are computed.
I have attached an Excel spreadsheet showing the computations with formulas. The RMSE computed for the validation dataset (using the Performance operator) corresponds to the "ValidModel and ApplyModel" (in the Excel worksheet), which is one of the models exposed when the process is dissected with Remember/Recall operators and breakpoints. However, the RapidMiner process outputs a Linear Regression model that is the same as the "TrainModel" (in the Excel worksheet), whose RMSE does not match the one given by the Performance (Regression) operator. Why the discrepancy? Which is the correct model here?
I have reproduced this issue with multiple datasets and have documented it in a process using the sample Polynomial dataset. Any ideas on what may be going on here?
Best Answer
lionelderkrikor (Moderator, RapidMiner Certified Analyst)
Hi @avd,
I will try to explain this difference (RM staff, please correct me if I'm wrong):
There are indeed two models built in this process:
The first one is built with 60% of the data and then tested on the remaining 40%: you called this model "ValidModel and ApplyModel". It is used to calculate the performance of this first model on unseen data, i.e. the performance given by the Performance (Regression) operator.
However, the model delivered at the output port mod of the Split Validation operator (called "TrainModel" in your case) is built with 100% of the input ExampleSet: thus it is a different model from the first one described above. You can check this by reading the help section of the Split Validation operator:
"Output Model :
The training subprocess must return a model, which is trained on the input ExampleSet. Please note that the model built on the complete input ExampleSet is delivered from this port"
To sum up, the "ValidModel and ApplyModel" is built with 60% of the input ExampleSet, and the "TrainModel" is built with 100% of the input ExampleSet. This second model therefore has a different performance from the first one because it is a different model...
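To make the mechanism concrete, here is a minimal sketch in Python with scikit-learn of what Split Validation effectively does; this is an illustration under my own assumptions (synthetic data, illustrative variable names, a 60/40 split mirroring the process above), not RapidMiner's internal implementation:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Hypothetical regression data for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

# Step 1: 60/40 split, as in the process described above.
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, train_size=0.6, random_state=42
)

# Step 2: "ValidModel" -- fit on the 60%, score on the held-out 40%.
# This is the model behind the Performance (Regression) RMSE.
valid_model = LinearRegression().fit(X_train, y_train)
rmse = mean_squared_error(y_valid, valid_model.predict(X_valid)) ** 0.5
print(f"validation RMSE (60% model): {rmse:.4f}")

# Step 3: "TrainModel" / production model -- refit on 100% of the data.
# This is what the mod port delivers; its coefficients generally differ
# from valid_model's, hence the apparent RMSE mismatch.
train_model = LinearRegression().fit(X, y)
print("60% model coefficients: ", valid_model.coef_)
print("100% model coefficients:", train_model.coef_)

Refitting on all the data is the usual convention: the holdout RMSE estimates how a model trained this way performs on unseen data, while the delivered model uses every available example.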
Hope this helps,
Regards,
Lionel
NB: Sometimes, what you called "TrainModel" is called the "production model" (built with 100% of the data).