ratio training/testing in sliding windows validation

maurits_freriks Member Posts: 28 Contributor I
edited December 2018 in Help

Hi all,

I've been struggling with a prediction problem for months already. Now, after a few optimization runs (each taking days and days), I have (probably) run into an overfitting problem.

As you can see in the picture below, the 6th column represents the performance of the model. The 3rd, 4th and 5th columns are the parameters of the Sliding Window Validation (training width, step width, testing width). Probably the ratio of training to testing is too high, but if I decrease the ratio, the performance decreases as well. So I don't know what ratio to use so that the performance is no longer suspect.

So could anyone advise me on the right ratio with respect to my dataset:

https://drive.google.com/open?id=12XjPKw2diSLnc9-MtAv_--SVfntA3nR-

Screen Shot 2018-01-16 at 20.21.34.png
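To make the three window parameters concrete, here is a minimal sketch in plain Python (not the RapidMiner operator itself; the widths are example values only) of how training width, testing width and step width slice a series into consecutive train/test pairs:

# Illustrative sketch of how the three Sliding Window Validation parameters
# slice a series into consecutive train/test pairs. The widths below are
# example values, not a recommendation.
def sliding_windows(n_rows, train_width, test_width, step_width):
    """Yield (train_indices, test_indices) pairs over a series of n_rows."""
    start = 0
    while start + train_width + test_width <= n_rows:
        train = range(start, start + train_width)
        test = range(start + train_width, start + train_width + test_width)
        yield train, test
        start += step_width

# Example: 1000 rows, training width 100, testing width 25, step width 25
for train, test in sliding_windows(1000, 100, 25, 25):
    pass  # train the model on `train`, score it on `test`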

Below is the XML of the process. I used the score object to combine these values against my test set in a scoring process.















[The pasted process XML was garbled; the recoverable operator annotations outline the steps:]

Select the 'A' column
Lag the 'A' column for stripping out spikes
Calculate the std dev of 'A', push it to a macro
Extract the std dev value to use in Generate Attributes
Create a Maintenance attribute to help filter out the days it's in maintenance mode
Select only the non-maintenance-mode days
Select 'A' again
Train an SVM (class support_vector_machine, C = 9000.0) inside the validation's training subprocess
Optimize and store the optimized model
Store the optimized model
Sanity check: review the 'A' time series against the predicted 'A' time series from the training data set
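To show what the spike-filtering part of that outline amounts to outside RapidMiner, here is a rough sketch in pandas (hypothetical file and column names, and an assumed threshold; not the original operators):

import pandas as pd

# Hypothetical file and column names: 'date' and 'A' (the flow value being predicted).
df = pd.read_csv("flow_data.csv", parse_dates=["date"])

# Lag 'A' so a sudden drop (spike) can be detected day over day.
df["A_lag1"] = df["A"].shift(1)

# The standard deviation of 'A' plays the role of the macro used in Generate Attributes.
std_a = df["A"].std()

# Flag a day as maintenance when the drop from the previous day exceeds k * std dev.
k = 2.0  # assumed threshold, not taken from the original process
df["maintenance"] = (df["A_lag1"] - df["A"]) > k * std_a

# Keep only the non-maintenance days for training.
df_clean = df[~df["maintenance"]].copy()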

Screen Shot 2018-01-16 at 23.20.38.png

Answers

  • lionelderkrikor Moderator, RapidMiner Certified Analyst, Member Posts: 1,195 Unicorn

    Hi @maurits_freriks,

    For the ratio, I would say training width = 0.7 / 0.8 and, respectively, test width = 0.3 / 0.2, with an increased absolute value of the test width (test width = 5 is too low in my opinion).

    Alternatively, as said in the PM, you can use the RMSE of the Performance (Regression) operator to measure the performance of your model(s) in a more objective way.

    Best regards,

    Lionel

  • maurits_freriks Member Posts: 28 Contributor I

    @lionelderkrikor

    You mean change the performance operator from Forecasting Performance into Performance (Regression)?

  • lionelderkrikor Moderator, RapidMiner Certified Analyst, Member Posts: 1,195 Unicorn

    Hi Maurits,

    Exactly. The best model is the one that minimizes RMSE.
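    (For concreteness, RMSE is just the square root of the mean squared difference between actual and predicted values; a minimal sketch with made-up numbers:)

    import numpy as np

    def rmse(y_true, y_pred):
        """Root mean squared error between actual and predicted values."""
        y_true = np.asarray(y_true, dtype=float)
        y_pred = np.asarray(y_pred, dtype=float)
        return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

    # Made-up values, only to show the calculation:
    print(rmse([10.0, 12.0, 9.5], [9.0, 13.0, 10.0]))  # ~ 0.87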

    Best regards,

    Lionel

  • Thomas_Ott RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,761 Unicorn

    Yep, RMSE is definitely another way to look at this. My main concern has been those lower spikes. Can they be removed, or is there a specific reason that they must remain in?

  • maurits_freriks Member Posts: 28 Contributor I

    @Thomas_Ott

    Sorry for the late reply.

    Yes, there is a specific reason why those spikes are in the dataset: they reflect the actual flow on those days in the past. The spikes are caused either by maintenance (planned) or by tripping (unpredictable). The final goal is to automate the prediction process, so you have to pay attention to those spikes. I do have a planning_dump in which you can find what happened during the spikes.
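    (What I would like to automate, roughly, is joining that planning_dump onto the flow data so that spike days are flagged without manual work; a sketch with hypothetical file and column names, since the real dump layout may differ:)

    import pandas as pd

    # Hypothetical file and column names; the real planning_dump layout may differ.
    flow = pd.read_csv("flow_data.csv", parse_dates=["date"])
    planning = pd.read_csv("planning_dump.csv", parse_dates=["date"])  # e.g. columns: date, event

    # Mark days that have a planned-maintenance or trip entry in the dump.
    flow = flow.merge(planning[["date", "event"]], on="date", how="left")
    flow["spike_day"] = flow["event"].notna()

    # Exclude those days before training, or model them separately.
    flow_clean = flow[~flow["spike_day"]].copy()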

    @Thomas_Ott Could I send you a PM so that you could think about how to implement this in a RapidMiner process?

    With kind regards,

    Maurits Freriks


  • Thomas_Ott RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,761 Unicorn

    @maurits_freriks My suggestion is to ask your question in the community. I'm very crunched for time this week and won't be able to look at anything.
