"Suggested way to cluster time-series"
Hello all,
I have this problem at hand: a flat file with three columns (id, date, efficiency), actually measuring the fuel consumption of engines, on which I need to perform clustering of the IDs to discover if groups of engine types or engines operated by the same person follow similar patterns. I am facing three major obstacles and I hope someone can help me:
1. The CSV file is big (around 20mil rows). Importing it with the date field as polynomial takes about 20sec but using date as the type hangs the application (without running out of memory) after 4-5min. I believe using a database might solve that, it is strange though. I need the date to be of "date" type so that I can calculate correlation (sample dates are different for each id)
2. I need to split the 20mil rows in order to create a time series for each id. The unique IDs are around 5000, making it impossible to do by hand.
3. I need, for the time series of a given ID, to calculate the correlation coefficients with all other IDs,
To do step 3, I will first perform some kind of temporal abstraction to convert "date" to "days after service" and "efficiency" to "% of efficiency in day 0".
Then, if needed, I can use an interpolation algorithm (which I have, and is quite accurate) to create data for the same days on two given series (say ID1 has valuse for days 1,2,5,7,9,20 and ID2 has days 2,5,8,20 I will create fake values for ID2 days 1,7,9 and drop day 8).
I can do that for any two IDs (provided I have completed step 2 and have separate datasets). But how can I do it in batch? I could use 5000 "day" attributes but how would I construct the dataset from the initial flat file? And still, would that be the optimal solution?
I have this problem at hand: a flat file with three columns (id, date, efficiency), actually measuring the fuel consumption of engines, on which I need to perform clustering of the IDs to discover if groups of engine types or engines operated by the same person follow similar patterns. I am facing three major obstacles and I hope someone can help me:
1. The CSV file is big (around 20mil rows). Importing it with the date field as polynomial takes about 20sec but using date as the type hangs the application (without running out of memory) after 4-5min. I believe using a database might solve that, it is strange though. I need the date to be of "date" type so that I can calculate correlation (sample dates are different for each id)
2. I need to split the 20mil rows in order to create a time series for each id. The unique IDs are around 5000, making it impossible to do by hand.
3. I need, for the time series of a given ID, to calculate the correlation coefficients with all other IDs,
To do step 3, I will first perform some kind of temporal abstraction to convert "date" to "days after service" and "efficiency" to "% of efficiency in day 0".
Then, if needed, I can use an interpolation algorithm (which I have, and is quite accurate) to create data for the same days on two given series (say ID1 has valuse for days 1,2,5,7,9,20 and ID2 has days 2,5,8,20 I will create fake values for ID2 days 1,7,9 and drop day 8).
I can do that for any two IDs (provided I have completed step 2 and have separate datasets). But how can I do it in batch? I could use 5000 "day" attributes but how would I construct the dataset from the initial flat file? And still, would that be the optimal solution?
Tagged:
0
Answers
If I understand correctly the task is to predict the ID based on some history of efficiency measurements.
Best regards,
Wessel
You would need quite a large number of rows to understand the problem.
The format is (id, date, eff). I don't want to predict a single ID. I want to cluster the IDs, to note any similar patterns.
If I reformat the data as (day, eff1, eff2, eff3, ...) I would look at the correlation matrix of eff1...effn.
Day would be (date - service date) or (date - service date)/service interval, to get either days after service or % of time between services. Operating time is almost the same for all, so there is no problem of operating hours vs calendar days.
My problem is how to split the data into 5000 time series (date,eff)
Then I either need to train a model with the 5000 different time series as input (which I don't know how to do)
or run the temporal abstraction to convert date to day, run the interpolation to have the same values for "day" in all time series, rejoin to get one dataset with 5000 attributes and get the correlation matrix (which I know how to do, not in rapidminer though, I would probably do it on an ETL application like kettle).
I certainly would prefer a way to run a clustering model with the input being a number of time series instead of rows but I don't know how to do it or if it is even possible. With enough attributes the time series can be a row but still, how would I convert (id, date, eff) to (id, eff-day1, eff-day2, eff-day3...) in rapidminer?
Some kind of windowing but only after running the abstraction and interpolation.
Or maybe having twice the attributes like (id, date1, eff1, date2, eff2, date3, eff3...) and then convert date to day.
For now, lets skip the discussion whether this is a classification or a clustering problem.
Let us try and generate some fake data in order to get the problem clear:
Probably your data-set will look something like this.
ID, date, eff
1, x1, y1
1, x2, y2
1, x3, y3
1, x4, y4
2, x5, y5
2, x6, y6
2, x7, y7
3, x8, y8
3, x9, y9
3, x10, y11
3, x11, y12
3, x12, y13
3, x13, y14
So for each ID, you have a variable number of efficiency measurements, and these measurements are not equally spaced out in time.
If each ID has at least 3 measurements you can make training examples containing 3 consecutive measurements.
For ID's for which there are more then 3 measurements, you are able to generate more training examples.
This would convert the fake data-set as follows.
ID, eff-0, eff-1, eff-2
1, x1, x2, x3
1, x2, x3, x4
2, x5, x6, x7
3, x8, y9, x10
3, x9, x10, x11
3, x10, x11, x12
3. x11, x12, x13
From your description I could not derive how to best utilize the attribute "date".
Maybe its a good idea to augment our data set with the difference in hours between two consecutive measurements?
The final data set would then look something like:
ID, eff-0, eff-1, eff-2, diff-0, diff-1,
1, x1, x2, x3, y1-y2, y2-y3
1, x2, x3, x4, y2-y3, y3-y4
2, x5, x6, x7, y5-y6, y6-y7
3, x8, y9, x10, y8-y9, y9-y10
3, x9, x10, x11, y9-y10, y10-y11
3, x10, x11, x12, y10-y11, y11-y12
3. x11, x12, x13, y11-y12, y12-y13
On this data-set you can run any clustering or classification algorithm.
I have a few issues on the solution you suggested.
You chose to have 3 days as columns. Is there any particular reason not to take 4000? Will the clustering result be the same or approximately the same?
As for the dataset transformation, the final dataset should look more like:
ID, day-0, day-1, day-2, eff-0, eff-1, eff-2
1, 0, x2-x1, x3-x1, 1, y2/y1, y3/y1
1, x2-x1, x3-x1, x4-x1, y2/y1, y3/y1, y4/y1
2, 0, x6-x5, x7-x5, 1, y6/y5, y7/y5
Since I am very new to rapidminer, I am having some issues on how to create these attributes.
How would I specify that the generated attribute would be x-x1 for id=1 and x-x5 for id=2?
Some combination of loop, filter and append I believe. How can the first x and y of each set (after filtered and sorted by x) be "stored" somewhere to be used at the generate attibute operator?
And after that a windowing operator.
How can I make the windowing operator split the row (and insert a missing value) if an ID has 5 instead of 6 days of data (or any non multiple of 3)?
The choice for embedding distance: '3' or '4000', depends on your expert domain knowledge and the availability of the data.
You need to choice an embedding distance such that it can capture important patterns, and at the same time generate enough training examples.
Furthermore once you made an estimate guess, you can rerun your entire experiment later with a different embedding distance value.
There are many different ways to make the data.
I would use the "Loop parameters" operator with "Filter examples" inside, to create datasets containing only a examples with a single ID.
然后我将使用窗口ing operator.
And then I would use the generate attributes operator.
Best regards,
Wessel
你真的非常有帮助。这个项目罗oks like it's on a way to success, thanks to your valuable assistance.