Time Series Forecasting for many examples
Hi All-
[Apologies in advance for any confusing or vague language I may use; I'm not a data scientist, so I don't know the proper terminology.]
Say I have a data set of sales volume over time for a retailer that sells screwdrivers. Their product catalog really runs the gamut: flathead, Phillips, Torx, long, short, every color you can think of, and on and on. If you wanted to forecast demand, you could build one model per product series (e.g. short, yellow, flathead screwdrivers, then medium-length, purple, Torx drivers with fat handles, etc.), or you could aggregate sales for all Phillips head screwdrivers, or for all the different types of screwdrivers, in order to collapse them into one series.
For some reason, though, let's say you wanted to use all the data from every type of screwdriver individually to train a single model. For each date, you would have data points for every type of screwdriver in inventory.
What is the "right way" to represent this in RapidMiner?
@sgenzer @tftemme
Best Answers
Telcontar120 (Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert)
If you don't want to aggregate any sales data, then aren't you back to forecasting sales for each individual item (or at whatever level of granularity your data currently exists)? I thought that was what you said you did not want to do in the OP. RapidMiner will certainly support it, though; you will just need to iterate through the series and provide a different target forecast attribute each time.
If that isn't correct, you'll probably need to post a sample data file to be more clear on what exactly it is you are trying to accomplish.
hughesfleming68 (Member)
Hi Noel,
I see what you are trying to do. In most cases, simpler is better. Treat each ID as an independent prediction and try to determine which of your attributes actually contains any signal. Select the attribute that you feel is contributing the most and, with a series of joins, build a table that consists of your assets and one windowed attribute, then run that through your cross validation. A real-world example would be using data from sector ETFs to predict overall market direction. Remember to set your cross validation to linear sampling or, better still, use a sliding window validation. Also take a look at your normalization. If you normalize first and then combine your assets, you will lose the relationship between them because you put them on the same scale. You might want to do this, but there are cases where you might not.
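As a rough illustration of the sliding window validation idea mentioned above, here is a minimal Python sketch. It is not RapidMiner's operator; the function name `sliding_window_splits` and its parameters are made up for this example. The only point it shows is that each test window sits strictly after its training window in time.

```python
def sliding_window_splits(n_examples, train_size, test_size, step=None):
    """Yield (train_indices, test_indices) pairs in chronological order.

    Each test window starts right after its training window, so the model
    is never validated on data older than what it was trained on.
    (Hypothetical helper for illustration, not a RapidMiner API.)
    """
    step = step or test_size
    start = 0
    while start + train_size + test_size <= n_examples:
        train = list(range(start, start + train_size))
        test = list(range(start + train_size, start + train_size + test_size))
        yield train, test
        start += step


# 10 time-ordered examples: train on 4, test on the next 2, slide by 2.
for train_idx, test_idx in sliding_window_splits(10, train_size=4, test_size=2):
    print(train_idx, test_idx)
```

A shuffled cross validation would mix future rows into the training folds, which is why a time-ordered split like this is usually preferred for series data.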
I am not sure that combining the attributes the way you are suggesting will give you the results you are looking for. Working up from the simplest model is always best, as it is already hard to separate signal from noise.
Please note that differencing to achieve a stationary time series may actually result in over-differencing. A partial solution is to use fractional differentiation and the Augmented Dickey-Fuller (ADF) test to estimate how much differencing is actually necessary to achieve a stationary time series. This may or may not be necessary, but it is worth investigating whether it gives you better results. PM me if you would like the Python code to test this. Rather than using ADF tests, I prefer to loop over a set of values for the fractional differentiation order and see what effect each has on my prediction. RapidMiner is great for this kind of testing.
Regards,
Alex
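For readers who want to try the loop over fractional differentiation orders described in the post above, the following is a minimal pure-Python sketch. It is not Alex's code: the function names, the fixed window of five weights, and the sample prices are all assumptions made for illustration. It uses the standard binomial-expansion weights of (1 - B)^d, so d = 1 reproduces ordinary first differences and d = 0 leaves the series unchanged.

```python
def frac_diff_weights(d, n_weights):
    """Weights of the fractional difference operator (1 - B)^d.

    w_0 = 1 and w_k = -w_{k-1} * (d - k + 1) / k.
    """
    weights = [1.0]
    for k in range(1, n_weights):
        weights.append(-weights[-1] * (d - k + 1) / k)
    return weights


def frac_diff(series, d, n_weights=5):
    """Apply a fixed-width fractional difference of order d to a list of floats.

    The first n_weights - 1 points are dropped because they lack a full window.
    (n_weights=5 is an arbitrary choice for this sketch.)
    """
    weights = frac_diff_weights(d, n_weights)
    out = []
    for t in range(n_weights - 1, len(series)):
        out.append(sum(w * series[t - k] for k, w in enumerate(weights)))
    return out


# Made-up prices, only to have something to transform.
prices = [100.0, 101.0, 103.0, 102.0, 105.0, 107.0, 106.0, 110.0]

# Loop over candidate orders. In practice each transformed series would be
# fed to the forecasting model and scored; that step is omitted here.
for d in (0.0, 0.25, 0.5, 0.75, 1.0):
    transformed = frac_diff(prices, d)
    print(d, [round(x, 3) for x in transformed])
```

If an ADF test is wanted instead of, or alongside, the prediction-based comparison, each transformed series could be passed to a stationarity test such as `adfuller` from statsmodels; that is left out here to keep the sketch dependency-free.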
Answers
Say I have daily sales numbers for every flavor of screwdriver carried by the retailer and do not want to aggregate the data into a single time series. (Unfortunately, my ability to use this analogy breaks down here because I can't think of why one would want to model sales of each type of screwdriver individually instead of aggregating them... I can try to come up with another analogy if it is helpful, but I'm afraid it will muddy the water.)
I have time series of five assets' prices and the values of the index to which they belong. I want to train a model on all five individually and forecast a value for the index one period into the future.
Intuitively, it feels like I should iterate over the five assets, window their attributes, and "feed" them to the model one at a time. The first window for Asset #1 would look something like:
and you'd do that four more times for Assets 2-5. I can't get this method to work, though, and as a novice in machine learning, I'm not even sure it makes sense. Joining all the assets' data together for each date also comes to mind:
Asset #1 Px - 2 | Asset #1 Px - 1 | Asset #1 Px - 0 | Asset #2 Px - 2 | Asset #2 Px - 1 | Asset #2 Px - 0 | ... other Assets, Index vals....
but with a lot of assets that have many attributes and potentially wide windows, I could see that getting out of hand.
I ended up looping through all the asset IDs, windowing each series, and appending the results one after another. The data set going into the model looks like: Asset #1 windowed data, Asset #2 windowed data, ... Asset #5 windowed data
Does this "scramble" the chronology? Does the model know it is seeing five examples for the same time frame and index values?
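To make the "window each asset, then append" layout concrete, here is a small Python sketch of it outside RapidMiner. Everything in it is assumed for illustration: the helper `window_series`, the two asset names, the prices, and the index values are made up. It keeps a date column on every windowed row, so rows appended asset by asset can be put back into chronological order with a sort.

```python
def window_series(dates, values, width):
    """Turn one series into rows of `width` consecutive values.

    Each row is keyed by the date of its most recent value.
    (Hypothetical helper for illustration.)
    """
    rows = []
    for end in range(width - 1, len(values)):
        rows.append((dates[end], values[end - width + 1 : end + 1]))
    return rows


# Made-up data: five periods, two assets, one index.
dates = ["d1", "d2", "d3", "d4", "d5"]
prices = {
    "asset_1": [10.0, 11.0, 12.0, 13.0, 14.0],
    "asset_2": [20.0, 19.0, 21.0, 22.0, 20.0],
}
index_by_date = dict(zip(dates, [100.0, 101.0, 103.0, 102.0, 104.0]))
next_date = dict(zip(dates, dates[1:]))  # d1 -> d2, ..., d4 -> d5

stacked = []
for asset_id, series in prices.items():
    for date, window in window_series(dates, series, width=3):
        if date in next_date:  # the last date has no next-period label
            label = index_by_date[next_date[date]]
            stacked.append(
                {"date": date, "asset": asset_id, "window": window, "label": label}
            )

# Appending asset after asset leaves rows grouped by asset, not by time.
# Sorting on the date column restores chronological order. (String dates
# sort correctly here only because of how they are named; real dates
# should be sorted as dates.)
stacked.sort(key=lambda row: (row["date"], row["asset"]))

for row in stacked:
    print(row)
```

In a layout like this, a typical learner treats each row as an independent example. It only "knows" that two rows belong to the same period if that information is present in a column, which is one reason to keep the date (and a time-ordered validation) alongside the windowed values.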
Any help would be greatly appreciated! (process and data attached).
Best,
Noel
@sgenzer @Telcontar120 @Pavithra_Rao @CraigBostonUSA @hughesfleming68
Scott