"Linear Regression, Trend Analysis and Prediction"

MacPhotoBikerMacPhotoBiker MemberPosts:60Contributor II
edited June 2019 inHelp
Hi,

I'd like to develop a model that predicts future sales quantities based on historic values.

Example

Let's say for Article A I know the weekly sales quantities:

Week 1: 10 pcs.
Week 2: 20 pcs.
Week 3: 30 pcs.
Week 4: 40 pcs.
[...]
Week 20: 200 pcs.

I would like to calculate the trend line and predict the consumption for the next 4 weeks, which in this simple example would be:

Prediction week 21: 210 pcs.
[...]
Prediction week 24: 240 pcs.

In Excel, one would create a chart and insert a trend line, showing also the formula for the trendline. Based on the formula one can calculate the predicted values along the trend line.

I know how I could calculate it manually (as shown here:
http://www.wired.com/wiredscience/2011/01/linear-regression-by-hand/in a "Generate Attribute" operator), but it looks cumbersome and I believe there has to be a simpler way.

I checked out the "Linear Regression" operator, but it does not seem to be what I'm looking for.

Simply put, I'm trying to calculate m and b in the famous formula "y = mx + b"

I'd be very thankful for any advice.
Tagged:

Answers

  • MariusHelfMariusHelf RapidMiner Certified Expert, MemberPosts:1,869Unicorn
    Just declare the quantity as label and learn a linear regression on it. Connect the mod output of the operator to the process output and you'll be presented the formula.

    Best regards,
    Marius
  • MacPhotoBikerMacPhotoBiker MemberPosts:60Contributor II
    Hi Marius,

    thank you very much for your reply. I did as you suggested, and indeed I got the formula. Is there a way to use the parameters for further processing? What I try to achieve is this:

    I'd like to calculate the weekly sales for the last 52 weeks for each article, then calculate the linear regression. Then, I want to

    a) predict sales for the next week (or 4 weeks)
    b) find articles with strongest in- or decrease.

    The generated table should look like this:
    Customer Week 0 Week 1 Week 2 Week n Week 52 Parameter m Parameter b Predicted Week 53
    Customer A 0 100 200 nnn 520 10 0 530
    我想为所有文章生成此表。Such table would help to both predict future sales, and to group articles by development (e. g. "increasing" and "decreasing".

    Thanks again for your advice!
  • rowan_growan_g MemberPosts:47Contributor II
    Hi MacPhotoBiker,

    Did you manage to solve this one?

    Cheers,
  • MacPhotoBikerMacPhotoBiker MemberPosts:60Contributor II
    Hi rowan.g,

    well, I did not find an operator that would allow me to do this, I will now simply create an attribute myself, and write a formula manually. It's a bit cumbersome, but actually not too complicated. I've done this before in MS Access creating a VBA function, following this pattern:

    http://www.google.ca/imgres?imgurl=&imgrefurl=http%3A%2F%2Fwww.docstoc.com%2Fdocs%2F46599202%2FSummary-formula-sheet-for-simple-linear-regression&h=0&w=0&sz=1&tbnid=9K_TXEI_eOUcKM&tbnh=255&tbnw=197&zoom=1&docid=bUSnrnOKeCjASM&ei=JdnNUZPXJMHXygGw-ICQBg&ved=0CAEQsCU

    One of the problems I encountered so far though is this: I want to calculate the linear regression based on weekly sales. My invoicing information shows the invoicing date, and it's simple to calculate the corresponding calendar week.

    However, for invoicing dates December 30 and 31, 2012, the result is week 1 in the following year 2013. It took me a while to understand that, but when one looks at a weekly calendar, this is how it's actually being shown. Just like the before mentioned dates: They do not belong to week 52, and there's no week 53, so they have to be part of week 1 of the following year.

    However, it just feels wrong to me to say sales of 31st of december belongs to week 1. I didn't work it out yet, but probably what I will do is that I'll manually "force" all last days of december that "officially" belong to week 1 into the last week of december anyway. (Something like "If month = December and week = 1, then week = 52" and year = year-1)

    If you are working on something similar, just let me know, maybe we can share some results - and some work;)
  • budbjamesbudbjames MemberPosts:1Contributor I
    The operator that you seek is called the "Windowing" operator.

    I have a process laid out as follows:

    读数据库(包含股票市场数据)- >设置Role (I choose my label for prediction) -> Windowing (This is the key) -> Neural Network -> Apply Model -> Both outputs of apply model are connected to the results ports of the process.

    The windowing operator is a little tricky, however, the operator will allow the model to iterate through your dataset and sample a "window" of records. So I've setup the Windowing operator to look at 1 example, and increment by one example. This will effectively make it read the first example, then the second.. etc.. There is also a setting where you can specify the number of records in which you want to predict after the window.

    I will provide the code of my process in case my explanation of the Windowing operator is not clear.

    From the help:
    Synopsis
    Creates examples from a multivariate value series data set by windowing the input data.

    Description
    This operator transforms a given example set containing series data into a new example set containing single valued examples. For this purpose, windows with a specified window and step size are moved across the series and the attribute value lying horizon values after the window end is used as label which should be predicted. In contrast to the Series2WindowExamples operator, this operator can also handle multivariate series data. In order to specify the dimension which should be predicted, one must use the parameter "label_dimension" (counting starts at 0). If you want to predict all dimensions of your multivariate series you must setup several process definitions with different label dimensions, one for each dimension.
    The series data must be given as ExampleSet. The parameter "series_representation" defines how the series data is represented by the ExampleSet:
    encode_series_by_examples
    the series index variable (e.g. time) is encoded by the examples, i.e. there is a set of attributes (one for each dimension of the multivariate series) and a set of examples. Each example encodes the value vector for a new time point, each attribute value represents another dimension of the multivariate series. * encode_series_by_attributes
    the series index variable (e.g. time) is encoded by the attributes, i.e. there is a set of examples (one for each dimension of the multivariate series) and a set of attributes. The set of attribute values for all examples encodes the value vector for a new time point, each example represents another dimension of the multivariate series.
    Please note that the encoding as examples is usually more efficient with respect to the memory usage.

    Input
    example set input: expects: ExampleSetMetaData: #examples: = 0; #attributes: 0
    , expects: ExampleSet, expects: ExampleSet

    Output
    example set output:
    original:

    Parameters
    series representation: This parameter defines how the series values will be represented. Range: encode_series_by_examples, encode_series_by_attributes; default: encode_series_by_examples
    horizon: The prediction horizon, i.e. the distance between the last window value and the value to predict. Range: integer; 0-+?; default: 0
    window size: The width of the used windows. Range: integer; 1-+?; default: 100
    step size: The step size of the used windows, i.e. the distance between the first values Range: integer; 1-+?; default: 1
    create single attributes: Indicates if the result example set should use single attributes instead of series attributes. Range: boolean; default: true
    create label: If checked a label will be created Range: boolean; default: false
    select label by dimension: If checked the label will be created using an index specifying the attribute Range: boolean; default: false
    label attribute: The name of the attribute which should be used for creating the label values. Range: string
    label dimension: The dimension which should be used for creating the label values (counting starts with 0). Range: integer; 0-+?
    add incomplete windows: Create windows for all examples, which causes missings in the first windows. Range: boolean; default: false


































    <连接from_op =“神经网络”from_port =“模型”_op="Apply Model" to_port="model"/>










    I hope this helps.

    Please check out my blogs for more RapidMiner and data mining tutorials.
    http://completebusinessanalytics.com/
    http://refactorthis.net/
  • rowan_growan_g MemberPosts:47Contributor II
    Thanks for that budbjames. It worked pretty well. Windowing was exactly what I was looking for.
  • MacPhotoBikerMacPhotoBiker MemberPosts:60Contributor II
    Hi budbjames,

    first of all, please accept my apology for checking back so late! I somehow missed the "notify" function, and it also slipped a bit my mind to check back - sorry about that!

    I do appreciate your information, and that you even put together a model.

    I will check it out shortly, and provide some feedback here.

    Again, sorry for my long absence!
Sign InorRegisterto comment.