Filtering out spikes from a data set using Lag Series and Extract Macro
Hi,
Because my datasets fluctuate heavily, I'm trying to filter the spikes out of them. Without the spikes I might get a better prediction. Now I'm wondering whether I'm doing this right. I'd rather not delete these data points: they really occurred, and by deleting them I might lose valuable information. These are my steps:
- Lag the series with lag = 1.
- Calculate the standard deviation.
- Generate a new variable, Maintainence. See picture above.
- Filter on this new variable being equal to 0.
- Finally select the modified data.
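The steps above can be sketched outside RapidMiner to see what they do to a series. This is only an illustrative Python sketch, not the process itself: the function name `flag_and_filter` and the sample values are made up, and leaving the first row (which has no lagged value) unflagged is an assumption. The flag uses the expression from the posted Generate Attributes parameter, `if(B < ([B-1]-B), 1, 0)`. Note that this expression never references the extracted standard deviation, and that it is algebraically the same as "B is less than half of the previous value".

```python
from statistics import stdev


def flag_and_filter(values):
    """Mirror the listed steps: lag by 1, compute std dev, flag rows, keep unflagged rows."""
    # Lag series with lag = 1: each row sees the previous value (None for the first row).
    lagged = [None] + list(values[:-1])
    # Standard deviation of the whole column, as produced by the aggregate/extract steps.
    sd = stdev(values)
    rows = []
    for current, previous in zip(values, lagged):
        # Posted expression: if(B < ([B-1] - B), 1, 0). The std dev is not used here.
        # Assumption: the first row has no lagged value and is left unflagged.
        flagged = 1 if previous is not None and current < (previous - current) else 0
        rows.append({"B": current, "B_lag": previous, "Maintainence": flagged})
    # Filter step: keep only rows where the flag equals 0.
    kept = [row["B"] for row in rows if row["Maintainence"] == 0]
    return sd, rows, kept


# Hypothetical sample series with one sharp drop.
sd, rows, kept = flag_and_filter([100, 102, 40, 101, 99])
print(kept)  # [100, 102, 101, 99]
```

With this sample, only the drop to 40 is removed. A spike upward would not be flagged by this expression at all, which may be one source of unexpected results.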
Because my results look strange, I'm wondering whether I'm doing this right. Could anyone confirm this or suggest another method?
Regards,
Maurits Freriks
Here is the process code, so you can check the detailed parameters. I didn't attach my datasets because there are several different ones, but the method should work on each of them.
<operator activated="true" class="process" compatibility="8.0.001" expanded="true" name="Process">
<parameter key="repository_entry" value="../data/flow ANJ Train"/>
<parameter key="attribute_name" value="A"/>
<parameter key="attribute_filter_type" value="single"/>
<parameter key="attribute" value="B"/>
<description align="center" color="transparent" colored="false" width="126">Select the 'data' column</description>
<parameter key="B" value="1"/>
<description align="center" color="transparent" colored="false" width="126">Lag 'A' column for striping out spikes</description>
<operator activated="true" class="aggregate" compatibility="8.0.001" expanded="true" height="82" name="Aggregate" width="90" x="447" y="34">
<parameter key="B" value="standard_deviation"/>
<description align="center" color="transparent" colored="false" width="126">Calculate std dev of the data, push to macro</description>
<parameter key="macro" value="stdev"/>
<parameter key="macro_type" value="data_value"/>
<parameter key="attribute_name" value="standard_deviation(B)"/>
<parameter key="example_index" value="1"/>
<description align="center" color="transparent" colored="false" width="126">extract std dev value to use in Generate Attributes</description>
<parameter key="Maintainence" value="if(B &lt; ([B-1]-B), 1, 0)"/>
<description align="center" color="transparent" colored="false" width="126">Create a Maintenance attribute to help filter out the days it's in maintenance mode</description>
<parameter key="filters_entry_key" value="Maintainence.eq.0"/>
<description align="center" color="transparent" colored="false" width="126">Select only non maintenance mode days</description>
<parameter key="attribute_filter_type" value="single"/>
<parameter key="attribute" value="B"/>
<description align="center" color="transparent" colored="false" width="126">Select 'A' again</description>
<parameter key="Validation.cumulative_training" value="true,false"/>
<parameter key="SVM.kernel_gamma" value="[0.01;1;5;logarithmic]"/>
<parameter key="SVM.C" value="[0;10000;4;linear]"/>
<parameter key="Validation.training_window_width" value="[40;60;5;linear]"/>
<parameter key="Validation.training_window_step_size" value="[4;6;2;linear]"/>
<parameter key="Validation.test_window_width" value="[3;5;2;linear]"/>
<parameter key="macro" value="day_ahead"/>
<parameter key="value" value="5"/>
<parameter key="window_size" value="%{day_ahead}"/>
<parameter key="create_label" value="true"/>
<parameter key="label_attribute" value="B"/>
<parameter key="window_size" value="%{day_ahead}"/>
<parameter key="training_window_width" value="60"/>
<parameter key="training_window_step_size" value="6"/>
<parameter key="test_window_width" value="5"/>
<parameter key="horizon" value="2"/>
<parameter key="kernel_type" value="radial"/>
<parameter key="C" value="10000.0"/>
<parameter key="horizon" value="2"/>
<parameter key="filename" value="tmp"/>
<parameter key="C" value="operator.SVM.parameter.C"/>
<parameter key="Gamma" value="operator.SVM.parameter.kernel_gamma"/>
<parameter key="Training Width" value="operator.Validation.parameter.training_window_width"/>
<parameter key="Step Width" value="operator.Validation.parameter.training_window_step_size"/>
<parameter key="Testing Width" value="operator.Validation.parameter.test_window_width"/>
<parameter key="Perf" value="operator.Validation.value.performance"/>
<parameter key="Set Macro Value" value="operator.Set Macro.value.macro_value"/>
<description align="center" color="transparent" colored="false" width="126">Optimize and store optimized model</description>
<parameter key="repository_entry" value="../data/Thomas ott test ANJ"/>
<description align="center" color="transparent" colored="false" width="126">Store optimized model</description>
<description align="center" color="transparent" colored="false" width="126">Sanity Check. Review 'A' time series against predicted 'A' time series from training data set.</description>
<parameter key="excel_file" value="/Users/Maurits/Documents/Stage/Tests/SVM/ANJ/Output RapidMiner Thomas ott ANJ Train.xlsx"/>
Answers
Hi!
Another method would be a moving average. It might be easier to apply to your data.
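To make the moving-average idea concrete, here is a minimal Python sketch. It is not a RapidMiner process: the function name `moving_average`, the trailing window, and the choice to average whatever is available at the start of the series are all assumptions made for illustration.

```python
def moving_average(values, window=3):
    """Trailing moving average; early positions average the values available so far."""
    if window < 1:
        raise ValueError("window must be at least 1")
    smoothed = []
    for i in range(len(values)):
        # Take the current value and up to window - 1 values before it.
        start = max(0, i - window + 1)
        chunk = values[start:i + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed


# Hypothetical series with a single spike of 40.
print(moving_average([10, 10, 40, 10, 10], window=3))  # [10.0, 10.0, 20.0, 20.0, 20.0]
```

The spike is kept in the data but damped and spread over the window, so no rows are deleted. A wider window smooths more but also reacts more slowly to real changes.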
Regards,
Balázs
Thanks @BalazsBarany
After importing this into my process I got errors. Could you help me out and put this into a process (or into mine) as an example? I don't know what to delete or how to implement this. My knowledge of RapidMiner is limited.
Thanks in advance.
As I mentioned to you before, those spikes create all kinds of havoc. You need to evaluate how to use or discard that information, which is why I created a Std Dev flag using the Generate Attributes operator. The idea was to create a new attribute column if the gas production exceeds 1 standard deviation.
While this may or may not be the right way to do it, the underlying premise was to try to work with these 'spikes', as they seemed to have been important to you.
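One way to read the Std Dev flag idea is sketched below in Python. This is an interpretation, not the original process: it assumes "exceeds 1 standard deviation" means that the step from the previous value is larger than one standard deviation of the whole series, and the name `add_spike_flag` and the parameter `n_std` are hypothetical. The flag is kept as an extra column instead of being used to delete rows, so the spikes stay available to the model.

```python
from statistics import stdev


def add_spike_flag(values, n_std=1.0):
    """Return (value, flag) pairs; flag is 1 when the step from the previous value exceeds n_std * sd."""
    # Assumption: the threshold is based on the standard deviation of the whole series.
    threshold = n_std * stdev(values)
    flagged = []
    previous = None
    for current in values:
        # The first value has no predecessor and is never flagged.
        is_spike = previous is not None and abs(current - previous) > threshold
        flagged.append((current, 1 if is_spike else 0))
        previous = current
    return flagged


# Hypothetical sample series with one sharp drop and recovery.
flags = [flag for _, flag in add_spike_flag([100, 102, 40, 101, 99])]
print(flags)  # [0, 0, 1, 1, 0]
```

Because the absolute step is used, both the drop and the recovery are flagged. Whether the recovery row should count as part of the spike depends on the data and is a modelling decision.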
Hi!
If you don't post the error messages here, it's very hard to help you ;-)
Here's a simple example process. If you execute it and look at the result with a Series chart (selecting a1 and average(a1)), you'll see what it does.
Regards,
Balázs