How to train a linear regression model effectively?
I'm a second-year computer science student trying to build a linear regression model to predict house prices (a Kaggle Quest). I built my model, but it does not seem impressive at all.
First, I ran a process to examine the relationships between attributes with the Correlation Matrix operator, which gave me a good grasp of how they relate and which ones I might manipulate later. Then I selected some appropriate attributes based on a mixture of common sense and the results from the correlation matrix. After that, I imputed the missing values, using the Optimize Parameters operator (nested with a Cross Validation operator and k-NN) to find the best value of k. Next, I detected and removed outliers.
Afterward, I wired up a Cross Validation operator with an ensemble model inside (SVM + Deep Learning + Gradient Boosted Trees + k-NN, with Linear Regression as the stacking model learner).
However, the results did not seem promising. I ran a few tests and the RMSE was always around 26,000 to 27,000, which makes me think my approach may be wrong.
Can anyone look at my model and advise?
Attributes Relation Process
Main Process
<operator activated="true" class="apply_model" compatibility="7.5.001" expanded="true" height="82" name="Apply Model (3)" width="90" x="45" y="34">
Optimise kNN, only improve by a bit though
<parameter key="activation" value="ExpRectifier"/>
<operator activated="true" class="apply_model" compatibility="7.5.001" expanded="true" height="82" name="Apply Model" width="90" x="45" y="34">
<operator activated="true" class="apply_model" compatibility="7.5.001" expanded="true" height="82" name="Apply Model (2)" width="90" x="581" y="187">
<connect from_op="Apply Model (2)" from_port="labelled data" to_port="result 2"/>
Answers
My initial thought is that you might have to take another look at your data set, break it up into subsets, and train multiple models: make the predictions separately and then append them into one data set.
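To make the subset idea concrete, here is a minimal Python sketch (plain standard library, not RapidMiner). It assumes a single numeric feature and a grouping column. The names `zoning`, `sqft`, and `price`, the helper functions, and the toy numbers are all made up for illustration. It fits one simple least-squares line per group and then collects the predictions back into one list.

```python
from collections import defaultdict

def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x on one feature."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx else 0.0
    return mean_y - slope * mean_x, slope

def fit_per_group(rows):
    """Fit one line per zoning value; rows are (zoning, sqft, price) tuples."""
    groups = defaultdict(list)
    for zoning, sqft, price in rows:
        groups[zoning].append((sqft, price))
    return {
        zoning: fit_line([s for s, _ in pts], [p for _, p in pts])
        for zoning, pts in groups.items()
    }

def predict_all(models, rows):
    """Predict each (zoning, sqft) row with its own group's model, keeping row order."""
    predictions = []
    for zoning, sqft in rows:
        intercept, slope = models[zoning]  # raises KeyError for an unseen zoning value
        predictions.append(intercept + slope * sqft)
    return predictions

# Made-up toy data: (zoning, square feet, sale price)
train = [
    ("RL", 1000, 150000), ("RL", 1500, 200000), ("RL", 2000, 250000),
    ("RM", 800, 90000), ("RM", 1200, 120000), ("RM", 1600, 150000),
]
models = fit_per_group(train)
predictions = predict_all(models, [("RL", 1200), ("RM", 1000)])
```

Comparing the RMSE of the combined predictions against a single model trained on all rows would show whether splitting actually helps.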
From what I know about the real estate market, zoning is critical, as are square footage (SF) and price per square foot ($/SF). You probably want to loop across the zoning subsets and see whether the RMSE improves or gets worse. Additionally, you might need to generate a few new features, such as $/SF and the difference between Year Built and Year Remodeled. The other pieces of data should be converted with dummy coding in the Nominal to Numerical operators. Unique integers imply an order, so they can screw up your test set.
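A rough Python sketch of those two steps (standard library only, not RapidMiner). The column names `sqft`, `price`, `year_built`, `year_remodeled`, and `zoning` and the two sample rows are assumptions for illustration, not the actual attribute names in the data set.

```python
def add_features(row):
    """Return a copy of row with price per square foot and years until remodel."""
    out = dict(row)
    # Needs the sale price, so it can only be computed where the price is known.
    out["price_per_sqft"] = row["price"] / row["sqft"]
    out["years_to_remodel"] = row["year_remodeled"] - row["year_built"]
    return out

def dummy_code(rows, column):
    """Replace a nominal column with one 0/1 column per category seen in rows."""
    categories = sorted({r[column] for r in rows})
    coded = []
    for r in rows:
        out = {k: v for k, v in r.items() if k != column}
        for c in categories:
            out[f"{column}={c}"] = 1 if r[column] == c else 0
        coded.append(out)
    return coded, categories

# Made-up sample rows
rows = [
    {"zoning": "RL", "sqft": 1000, "price": 150000,
     "year_built": 1990, "year_remodeled": 2005},
    {"zoning": "RM", "sqft": 800, "price": 90000,
     "year_built": 1975, "year_remodeled": 1975},
]
featured = [add_features(r) for r in rows]
coded, categories = dummy_code(featured, "zoning")
```

With dummy coding, "RL" and "RM" become two separate 0/1 columns, so no artificial ordering such as RL = 1 < RM = 2 is introduced.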
Nice optimization inside the Impute Missing Values operator. However, you should use a Normalize operator before the k-NN, because k-NN is susceptible to scaling problems. The neat thing about RapidMiner's Cross Validation is that you can put the Normalize operator on the training side and use a Group Models operator to pass the models to the testing side in order. This way the training data gets normalized first, which produces a preprocessing model; the k-NN is built on the transformed data; and then the preprocessing model gets passed to the testing side, where it applies the training set's normalization (the same mean and scale) to the test data before the k-NN model is applied and tested for performance.
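Outside RapidMiner, the same ordering can be sketched in a few lines of Python (standard library only). This is an illustration under assumptions: z-score normalization, a plain averaging k-NN, and made-up features (square feet, bedrooms) and prices. The key point is that the mean and standard deviation are learned from the training rows only and then reused on the query row.

```python
import math

def fit_scaler(rows):
    """Learn per-column mean and standard deviation from the training rows only."""
    n = len(rows)
    columns = list(zip(*rows))
    means = [sum(c) / n for c in columns]
    # Fall back to 1.0 for constant columns to avoid dividing by zero.
    stds = [math.sqrt(sum((v - m) ** 2 for v in c) / n) or 1.0
            for c, m in zip(columns, means)]
    return means, stds

def apply_scaler(scaler, rows):
    """Apply the training-side mean and standard deviation to any rows."""
    means, stds = scaler
    return [[(v - m) / s for v, m, s in zip(r, means, stds)] for r in rows]

def knn_predict(train_x, train_y, query, k):
    """Average the targets of the k nearest training rows (Euclidean distance)."""
    order = sorted(range(len(train_x)), key=lambda i: math.dist(train_x[i], query))
    return sum(train_y[i] for i in order[:k]) / k

# Made-up toy data: [square feet, bedrooms] and sale price
train_x = [[1000, 2], [1200, 5], [3000, 3], [3200, 4]]
train_y = [100000, 180000, 300000, 340000]
query = [1150, 2]

# Without scaling, square feet dominates the distance.
raw_pred = knn_predict(train_x, train_y, query, k=1)

# With scaling, the scaler is fit on training data and reused for the query.
scaler = fit_scaler(train_x)
scaled_train = apply_scaler(scaler, train_x)
scaled_query = apply_scaler(scaler, [query])[0]
scaled_pred = knn_predict(scaled_train, train_y, scaled_query, k=1)
```

In this toy example the unscaled k-NN picks the five-bedroom house as the nearest neighbour because the 50 square feet difference swamps the bedroom count, while the scaled version picks the two-bedroom house.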
The same applies to the Nominal to Numerical conversions. I checked out your Stacking operator, and I think the various algorithms in there could benefit from optimization. For now, I would just work on the model and forget the testing set; just work on getting the RMSE down. You can of course optimize for the RMSE too, so I would try that.
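As a small illustration of optimizing a parameter for RMSE, here is a hedged Python sketch (standard library only). It swaps in leave-one-out validation and a one-feature k-NN for brevity, which is not the same as the Optimize Parameters and Cross Validation setup in the process above. The data and the candidate values of k are made up.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(
        sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)
    )

def knn_predict(xs, ys, query, k):
    """Average the targets of the k training points closest to query (one feature)."""
    nearest = sorted(range(len(xs)), key=lambda i: abs(xs[i] - query))[:k]
    return sum(ys[i] for i in nearest) / k

def loo_rmse(xs, ys, k):
    """Leave-one-out RMSE: predict each point from all the other points."""
    predictions = []
    for i in range(len(xs)):
        rest_x = xs[:i] + xs[i + 1:]
        rest_y = ys[:i] + ys[i + 1:]
        predictions.append(knn_predict(rest_x, rest_y, xs[i], k))
    return rmse(ys, predictions)

# Made-up toy data: square feet and sale price
xs = [800, 1000, 1200, 1500, 1800, 2000, 2400, 3000]
ys = [90000, 110000, 125000, 160000, 185000, 210000, 240000, 310000]

# Try a few candidate values of k and keep the one with the lowest RMSE.
scores = {k: loo_rmse(xs, ys, k) for k in (1, 2, 3, 4)}
best_k = min(scores, key=scores.get)
```

The same loop structure applies to any learner inside the stack: pick a parameter grid, score each setting by validated RMSE, and keep the lowest.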