Classification and Feature Construction on Time Series Data
Hello everyone,
As part of a case study, I've been working on a time series classification task: the goal is to classify time series (each example in the dataset represents one time series) into 7 different classes. With a basic process (k-NN with Dynamic Time Warping) I got a classification accuracy of 98.93% and an RMSE of 0.011 +/- 0.103 (which is strange). Since I am new to time series classification, I built a simple process without any feature construction.
So I would like to have your comments on the process I have built, and on the feature engineering (preprocessing) techniques and RapidMiner operators that I can apply to time series data (each example represents a time series) for classification.
I have attached the sample data and the XML of the process. Please review the process and the data; it would be great if you could let me know the right way to handle this kind of time series data (one series per example) for a classification task in RapidMiner.
About the dataset:
* Each example (each row) represents a time series and has 34 regular attributes (features), which represent the different periods of the time series.
* The class label Type has 7 different classes (1, 2, ..., 7). See the picture below.
Your comments are valuable,
Many thanks and best regards,
Surya
<parameter key="logverbosity" value="init"/>
<parameter key="random_seed" value="2001"/>
<parameter key="send_mail" value="never"/>
<parameter key="notification_email" value=""/>
<parameter key="process_duration_for_mail" value="30"/>
<parameter key="encoding" value="SYSTEM"/>
<process expanded="true">
<parameter key="repository_entry" value="../data/Classfication_timeseries_with classnames"/>
<parameter key="ratio" value="0.8"/>
<parameter key="ratio" value="0.2"/>
<parameter key="sampling_type" value="automatic"/>
<parameter key="use_local_random_seed" value="false"/>
<parameter key="local_random_seed" value="1992"/>
<parameter key="attribute_filter_type" value="single"/>
<parameter key="attribute" value="Type"/>
<parameter key="attributes" value=""/>
<parameter key="use_except_expression" value="false"/>
<parameter key="value_type" value="attribute_value"/>
<parameter key="use_value_type_exception" value="false"/>
<parameter key="except_value_type" value="time"/>
<parameter key="block_type" value="attribute_block"/>
<parameter key="use_block_type_exception" value="false"/>
<parameter key="except_block_type" value="value_matrix_row_start"/>
<parameter key="invert_selection" value="true"/>
<parameter key="include_special_attributes" value="true"/>
<parameter key="create_complete_model" value="false"/>
<parameter key="training_window_width" value="10"/>
<parameter key="training_window_step_size" value="-1"/>
<parameter key="test_window_width" value="10"/>
<parameter key="horizon" value="1"/>
<parameter key="cumulative_training" value="false"/>
<parameter key="average_performances_only" value="true"/>
<process expanded="true">
<parameter key="k" value="1"/>
<parameter key="weighted_vote" value="false"/>
<parameter key="measure_types" value="NumericalMeasures"/>
<parameter key="mixed_measure" value="MixedEuclideanDistance"/>
<parameter key="nominal_measure" value="NominalDistance"/>
<parameter key="numerical_measure" value="DynamicTimeWarpingDistance"/>
<parameter key="divergence" value="GeneralizedIDivergence"/>
<parameter key="kernel_type" value="radial"/>
<parameter key="kernel_gamma" value="1.0"/>
<parameter key="kernel_sigma1" value="1.0"/>
<parameter key="kernel_sigma2" value="0.0"/>
<parameter key="kernel_sigma3" value="2.0"/>
<parameter key="kernel_degree" value="3.0"/>
<parameter key="kernel_shift" value="1.0"/>
<parameter key="kernel_a" value="1.0"/>
<parameter key="kernel_b" value="0.0"/>
<portSpacing port="source_training" spacing="0"/>
<portSpacing port="sink_model" spacing="0"/>
<portSpacing port="sink_through 1" spacing="0"/>
<process expanded="true">
<parameter key="create_view" value="false"/>
<parameter key="main_criterion" value="first"/>
<parameter key="accuracy" value="true"/>
<parameter key="classification_error" value="false"/>
<parameter key="kappa" value="false"/>
<parameter key="weighted_mean_recall" value="false"/>
<parameter key="weighted_mean_precision" value="false"/>
<parameter key="spearman_rho" value="false"/>
<parameter key="kendall_tau" value="false"/>
<parameter key="absolute_error" value="false"/>
<parameter key="relative_error" value="false"/>
<parameter key="relative_error_lenient" value="false"/>
<parameter key="relative_error_strict" value="false"/>
<parameter key="normalized_absolute_error" value="false"/>
<parameter key="root_mean_squared_error" value="true"/>
<parameter key="root_relative_squared_error" value="false"/>
<parameter key="squared_error" value="false"/>
<parameter key="correlation" value="false"/>
<parameter key="squared_correlation" value="false"/>
<parameter key="cross-entropy" value="false"/>
<parameter key="margin" value="false"/>
<parameter key="soft_margin_loss" value="false"/>
<parameter key="logistic_loss" value="false"/>
<parameter key="skip_undefined_labels" value="true"/>
<parameter key="use_example_weights" value="true"/>
<portSpacing port="source_model" spacing="0"/>
<portSpacing port="source_test set" spacing="0"/>
<portSpacing port="source_through 1" spacing="0"/>
<portSpacing port="sink_averagable 1" spacing="0"/>
<portSpacing port="sink_averagable 2" spacing="0"/>
<parameter key="create_view" value="false"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
<portSpacing port="sink_result 3" spacing="0"/>
<portSpacing port="sink_result 4" spacing="0"/>
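For readers who want to see what this kind of process does outside RapidMiner, here is a minimal Python sketch of the same idea: 1-nearest-neighbour classification with a dynamic time warping (DTW) distance, treating every row as one series. It is not an export of the process above. The function names and the toy series are made up for illustration, and the DTW variant used here (absolute-difference cost, no warping window) is an assumption that may differ from RapidMiner's DynamicTimeWarpingDistance.

```python
import math


def dtw_distance(a, b):
    """Plain dynamic time warping distance between two numeric sequences.

    Assumption: absolute difference as local cost and no warping window.
    """
    n, m = len(a), len(b)
    # prev[j] holds the best cumulative cost for the previous row of the DP table.
    prev = [math.inf] * (m + 1)
    prev[0] = 0.0
    for i in range(1, n + 1):
        curr = [math.inf] * (m + 1)
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of: insertion, deletion, match.
            curr[j] = cost + min(prev[j], curr[j - 1], prev[j - 1])
        prev = curr
    return prev[m]


def predict_1nn(train_series, train_labels, query):
    """Return the label of the training series closest to `query` under DTW."""
    best = min(
        range(len(train_series)),
        key=lambda i: dtw_distance(train_series[i], query),
    )
    return train_labels[best]


# Toy data (hypothetical): one row = one time series.
train_series = [
    [0, 0, 1, 5, 1, 0, 0],
    [0, 1, 2, 3, 4, 5, 6],
]
train_labels = ["spike", "ramp"]
query = [0, 0, 0, 1, 5, 1, 0]  # same spike, shifted by one period

print(predict_1nn(train_series, train_labels, query))
```

Because DTW tolerates shifts along the time axis, the shifted spike is still matched to the "spike" training row, which is the main reason to prefer it over a plain Euclidean distance for row-wise series.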
Answers
Hi @surya_mpad,
You are using a Sliding Window Validation operator, which is indeed used in time series problems.
But a priori your problem is a pure classification problem: you want to predict the class of the attribute "Type" from the values of your attributes Period-i, right?
So you have to use a Cross Validation operator combined with a Performance (Classification) operator.
I don't know how you obtained an accuracy of 98.93% (on the whole dataset? Have you set "Product Id" as "id" using Set Role?); such a high result is suspect.
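To make the evaluation idea concrete, here is a rough sketch of a stratified k-fold cross validation that reports classification accuracy. It is plain Python, not RapidMiner: the helper names and the toy data are hypothetical, a simple Euclidean 1-NN stands in for the learner, and the sketch assumes that any identifier column (such as "Product Id") has already been removed from the feature rows so it cannot leak into the distance.

```python
from collections import defaultdict


def stratified_folds(labels, k):
    """Assign row indices to k folds, spreading each class evenly (round robin)."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    pos = 0
    for label in sorted(by_class):
        for idx in by_class[label]:
            folds[pos % k].append(idx)
            pos += 1
    return folds


def nearest_neighbour(train_X, train_y, query):
    """1-NN with squared Euclidean distance (stand-in learner for the sketch)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    best = min(range(len(train_X)), key=lambda i: sq_dist(train_X[i], query))
    return train_y[best]


def cross_val_accuracy(X, y, k=5, predict=nearest_neighbour):
    """Return (mean accuracy, per-fold accuracies); each row is tested exactly once."""
    scores = []
    for test_idx in stratified_folds(y, k):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(X)) if i not in held_out]
        train_X = [X[i] for i in train_idx]
        train_y = [y[i] for i in train_idx]
        hits = sum(predict(train_X, train_y, X[i]) == y[i] for i in test_idx)
        scores.append(hits / len(test_idx))
    return sum(scores) / len(scores), scores


# Toy data (hypothetical): two well-separated classes, six rows each.
X = [[i * 0.1, 0, 0] for i in range(6)] + [[10 + i * 0.1, 10, 10] for i in range(6)]
y = [1] * 6 + [2] * 6

mean_acc, fold_accs = cross_val_accuracy(X, y, k=3)
print(mean_acc, fold_accs)
```

The point of the per-fold list is that every example is scored by a model that never saw it, so an accuracy close to 100% here is harder to explain by leakage than one measured on the training data.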
To answer your question about feature selection: indeed, you have a lot of attributes. To reduce their number (without losing precision), and thus gain simplicity, you can use the Optimize Selection (Evolutionary) operator (documentation about this algorithm here).
On my side, on your partial dataset, I obtain with the k-NN model:
         with Optimize Selection    without Optimize Selection
k = 1    95%                        89%
k = 2    88%                        89%
k = 3    89%                        89%
...
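As a rough illustration of wrapper-style feature selection, here is a small Python sketch. Note that it uses greedy forward selection, which is a simpler technique swapped in for brevity, not the evolutionary search behind Optimize Selection (Evolutionary). The function names and the toy data are hypothetical, and leave-one-out 1-NN accuracy is just one possible scoring choice.

```python
def loo_accuracy(X, y, cols):
    """Leave-one-out accuracy of 1-NN (squared Euclidean) using only columns `cols`."""
    hits = 0
    for i in range(len(X)):
        best_j, best_d = None, float("inf")
        for j in range(len(X)):
            if j == i:
                continue  # never compare a row with itself
            d = sum((X[i][c] - X[j][c]) ** 2 for c in cols)
            if d < best_d:
                best_j, best_d = j, d
        hits += y[best_j] == y[i]
    return hits / len(X)


def forward_select(X, y):
    """Greedily add the column that improves accuracy most; stop when nothing helps."""
    remaining = list(range(len(X[0])))
    chosen, best_score = [], 0.0
    while remaining:
        score, col = max((loo_accuracy(X, y, chosen + [c]), c) for c in remaining)
        if score <= best_score:
            break
        chosen.append(col)
        remaining.remove(col)
        best_score = score
    return chosen, best_score


# Toy data (hypothetical): column 0 separates the classes, column 1 is misleading noise.
X = [
    [0.0, 5.0], [0.1, -3.0], [0.2, 9.0],
    [1.0, -3.1], [1.1, 5.1], [1.2, 8.9],
]
y = [1, 1, 1, 2, 2, 2]

chosen, score = forward_select(X, y)
print(chosen, score)
```

On this toy set only column 0 is kept, because adding the noisy column lowers the leave-one-out accuracy. In a real run the selection should be scored inside the cross validation, otherwise the reported accuracy is optimistic.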
You can find my process here:
I hope it helps,
Regards,
Lionel
@surya_mpad I want to add that your time series data appears to have very low activity and then a sudden spike in volatility. Can you account for this?
Hi Thomas,
Thanks for the reply.
I think the time series you have drawn comes from a single attribute (period_1.0).
As I mentioned in my post, each example represents a time series, and the task is to classify the examples into categories (the attribute 'Type' is the label). So I think we need to analyse the time series of each example (please correct me if I am wrong).
Also, please keep in mind that the data was generated with a script, so it might be irregular.
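To illustrate what analysing each example as its own series could mean for feature construction, here is a small Python sketch that turns one row into a handful of summary features. The feature set and the function name are only an assumption for illustration, not something taken from the attached process; which features actually help would have to be checked on the real data.

```python
import statistics


def row_features(series):
    """Summarise one row (one time series, at least 2 values) as a dict of features."""
    n = len(series)
    # Period-to-period changes; a large step hints at a sudden spike.
    diffs = [b - a for a, b in zip(series, series[1:])]
    # Least-squares slope of value against period index 0..n-1.
    t_mean = (n - 1) / 2
    y_mean = statistics.fmean(series)
    slope = sum((t - t_mean) * (v - y_mean) for t, v in enumerate(series)) / sum(
        (t - t_mean) ** 2 for t in range(n)
    )
    return {
        "mean": y_mean,
        "std": statistics.pstdev(series),
        "min": min(series),
        "max": max(series),
        "slope": slope,
        "max_abs_step": max(abs(d) for d in diffs),
    }


# Toy rows (hypothetical): a steady ramp and a flat series with one spike.
ramp = row_features([0, 1, 2, 3, 4])
spike = row_features([0, 0, 0, 9, 0])
print(ramp)
print(spike)
```

Each row then becomes a short feature vector that any standard classifier can use, alongside or instead of the raw period values.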
Many Thanks
Surya