Improving Test & Out of Sample Perf with Opt Selection and Auto Feature Generation

Noel Member Posts: 82 Maven
edited September 2019 in Help

Apologies for cc’ing everyone, but I really need some help!

I have a data set which started with 15 attributes plus the two calculations needed to create the labels in Excel (the label has three distinct values, but for my purposes I’m only interested in two of them). The data is a time series with about 1300 periods (5+ years) for training and 250 periods (1 year) for testing. In RapidMiner, I calculate 20-period aggregations and create windows of 10 periods.
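For readers who want to see the aggregation and windowing step outside RapidMiner, here is a minimal Python sketch. It is not the RapidMiner operators' implementation: the function names, the attribute naming scheme (`mean-0`, `mean-1`, ...) and the toy series are assumptions made for illustration only.

```python
from statistics import mean, pstdev

def rolling_aggregates(values, period):
    """One dict of aggregates per position, built from the `period` most
    recent values. Positions without a full history are skipped."""
    rows = []
    for end in range(period, len(values) + 1):
        chunk = values[end - period:end]
        rows.append({
            "mean": mean(chunk),
            "min": min(chunk),
            "max": max(chunk),
            "stdev": pstdev(chunk),
        })
    return rows

def make_windows(rows, window_size):
    """Flatten `window_size` consecutive rows into one example each.
    Lag 0 is the most recent row in the window."""
    examples = []
    for end in range(window_size, len(rows) + 1):
        example = {}
        for lag, row in enumerate(reversed(rows[end - window_size:end])):
            for name, value in row.items():
                example[f"{name}-{lag}"] = value
        examples.append(example)
    return examples

series = [float(i % 7) for i in range(60)]   # toy data, not the thread's data
aggs = rolling_aggregates(series, period=20)
examples = make_windows(aggs, window_size=10)
print(len(aggs), len(examples), len(examples[0]))
```

Under this sketch's convention, a series of 1300 periods would leave 1300 - 20 - 10 + 2 = 1272 examples, each with (number of aggregates) x 10 attributes per source column, so the attribute count grows quickly relative to the number of rows.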

Using the full feature set, I trained a GBT that is about 65% accurate in training:



The testing performance is “really bad”, however, less than 50% accuracy:



Note: I think I’ve attached all the relevant files: Data_Labeler_v16_help.xlsm (the labeler), help.xlsx (the data), 103b_create_AM_help.rmp (training process), and 103d_apply_AM_help.rmp (testing process).

I’ve been most focused on improving the testing accuracy and, following Ingo’s “Multi-Objective Feature Selection” series, have been working with the Optimize Selection (Evolutionary) and Automatic Feature Engineering operators. AFE has worked “best” so far.

Training the GBT using the AFE feature set achieves just over 65% accuracy:



and ~55% in testing...



Headed in the right direction, but still a ways to go.

I’ve also included the AFE training process, AM_afe_help.rmp, and apply_AM_afe_help.rmp (for testing). The AFE ran for a while so I also included the feature set “features_AM_afe_help” and model “model_AM_process_afe_help”.

My question is this: how can I squeeze some more accuracy from this data (especially testing/out of sample accuracy)? Any suggestions are appreciated… I’m trying to demonstrate a win for machine learning in my firm’s area of interest, but I only have another week to do it in.

Many thanks,
Noel
help.zip 2.2M

Best Answers

  • hughesfleming68 Member Posts: 323 Unicorn
    edited September 2019 Solution Accepted
    Hi Noel, I have just seen this but I will take a look. I will have to reread it, but my first thought is that your testing window might be too large and a window size of 10 might be too small for daily data. Are you logging your accuracy over those test periods or just looking at the average? You would expect a model that is actually predicting something to have higher accuracy close in time to your training data and then decay. This is typical for financial time series. A random model might show no pattern. Changes in market regime can be quite significant from year to year. You may have a couple of years where everything is just working and then all of a sudden things fall apart.
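One way to log accuracy over the test period rather than only its average is to score consecutive blocks in time order. The sketch below is a plain-Python illustration; the function name, block size and toy labels are assumptions, not part of the attached processes.

```python
def accuracy_by_block(actual, predicted, block_size):
    """Accuracy for consecutive blocks of the test period, in time order.
    Returns a list of (block_start_index, accuracy) pairs."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    results = []
    for start in range(0, len(actual), block_size):
        a = actual[start:start + block_size]
        p = predicted[start:start + block_size]
        hits = sum(1 for x, y in zip(a, p) if x == y)
        results.append((start, hits / len(a)))
    return results

def flip(label):
    return "down" if label == "up" else "up"

# Toy labels: predictions get worse the further we move from the training data.
actual = ["up", "down"] * 10
wrong_at = {7, 11, 13, 15, 16, 18, 19}
predicted = [flip(y) if i in wrong_at else y for i, y in enumerate(actual)]

blocks = accuracy_by_block(actual, predicted, block_size=5)
for start, acc in blocks:
    print(f"periods {start}-{start + 4}: {acc:.0%}")
```

A downward trend across blocks would be consistent with the decay described above, while a flat, noisy sequence around chance level would suggest the model is not picking up a signal at all.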
  • MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,404 RM Data Scientist
    Solution Accepted
    Hi,
    as a general point: if training and testing error diverge, you have most likely over-trained, either because of ID-like attributes or because the model is too complex.
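As a rough illustration of screening for ID-like attributes in a time series, the sketch below flags columns that are strictly monotonic in row order (row numbers, dates encoded as numbers, running counters). Reading "id-ish" this way is an assumption, and the function name and toy rows are hypothetical.

```python
def idish_candidates(rows):
    """Return attribute names whose values strictly increase or strictly
    decrease from row to row. Such columns identify the row (and its
    position in time) rather than describe it."""
    if len(rows) < 2:
        return []
    flagged = []
    for name in rows[0]:
        column = [row[name] for row in rows]
        pairs = list(zip(column, column[1:]))
        if all(a < b for a, b in pairs) or all(a > b for a, b in pairs):
            flagged.append(name)
    return flagged

# Toy rows: two ID-like columns and one ordinary feature.
rows = [
    {"row_id": i, "date_num": 20190101 + i, "ret": (-1) ** i * 0.01}
    for i in range(6)
]
print(idish_candidates(rows))
```

A tree model can split on such a column to memorize which period each training row came from, which helps in training and does nothing out of sample. A flagged column is only a candidate for removal; a genuinely trending feature would be flagged too.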

    Cheers,
    Martin
    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany

Answers

  • sgenzer Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,959 Community Manager
    Noel Tghadially
  • Noel Member Posts: 82 Maven
    @tftemme/ Fabian-

    Is it possible that the time series aspect of this data set (or the way I structured my process in terms of the GBT and sliding validation) is contributing to the disconnect between training and testing performance?

    Thanks,
    Noel

    (@IngoRM,@yyhuang,@varunm1,@hughesfleming68,@mschmitz,@sgenzer)
  • Noel Member Posts: 82 Maven
    One final plea for aid... (@IngoRM,@yyhuang,@varunm1,@hughesfleming68)

    I took a step back this weekend and tried to enumerate all the moving parts in my analysis:
    1. Label creation (criteria, related calculations, *alignment*)
    2. Matters relating to the TimeSeries aspect of my data (aggregation periods and types, window size, validation methodology)
    3. GBT tuning (both trees in general and boosting specifically: max depth, num trees, num bins, learning rate, min split improvement, etc.)
    4. Feature creation (some overlap with timeseries aggregations) and selection
    I read a bunch of posts in the community and came away thinking that it's best to configure the GBT (thank you, @mschmitz) and be sure to have a solid validation approach in place (thank you, @Telcontar120) before focusing on feature weighting, creation, selection, etc.

    So, I covered much of #2 and #3 (see below). If anyone has any suggestions for other GBT and timeseries tweaks, please let me know.

    At this point, is it all about the features? Current results; training on top, testing on bottom (process and data attached):


    Thanks,
    Noel
    -----

    TimeSeries: I went with the basic aggregations to start (mean, median, max, min, stdev) and looked at aggregation periods and window sizes:

    Aggregation period: 6, Window size: 5



    I looked carefully at the Sliding Window Validation operator. I had been using training and testing windows of 100 with step sizes of their combined width. I came across @sgenzer's timeseries challenge and tried the validation settings discussed therein (cumulative training, single-period test windows, multiple iterations), and none of it seemed to have any impact:
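The two validation layouts mentioned here (fixed windows stepped by their combined width, and cumulative training with short test windows) can be sketched as index ranges. This is an illustrative Python sketch, not the Sliding Window Validation operator's implementation; the function and parameter names are assumptions.

```python
def sliding_splits(n, train_size, test_size, step=None, cumulative=False):
    """Yield (train_indices, test_indices) pairs in time order.

    With cumulative=False every training window has `train_size` rows.
    With cumulative=True the training window always starts at row 0 and
    grows, so its first fold has `train_size` rows and later folds have more.
    The test window always follows the training window directly.
    """
    if step is None:
        step = train_size + test_size
    start = 0
    while start + train_size + test_size <= n:
        train_start = 0 if cumulative else start
        train_end = start + train_size
        yield range(train_start, train_end), range(train_end, train_end + test_size)
        start += step

# Fixed windows of 100/100 stepped by their combined width.
fixed = list(sliding_splits(500, train_size=100, test_size=100))

# Cumulative training with single-period test windows, moved 50 rows at a time.
growing = list(sliding_splits(500, train_size=100, test_size=1, step=50, cumulative=True))

print(len(fixed), len(growing))
```

Note how few folds the first layout produces: with windows of 100/100 and a step of 200, each fold sees only 100 training rows, and the averaged result rests on a handful of folds.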



    I also did my best to nail down the GBT parameters:



    Num trees vs Depth for three learning rates (0.09, 0.10, 0.11)
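A grid like the one plotted here can be enumerated in a few lines. The sketch below is a generic Python illustration: only the three learning rates come from the post, while the tree counts, depths and the `toy_score` function are placeholders standing in for "train a GBT and score it on held-out data".

```python
from itertools import product

def grid_search(evaluate, grid):
    """Evaluate every parameter combination and return (score, params)
    pairs sorted best-first. `evaluate` is supplied by the caller and
    must return a validation score where higher is better."""
    names = list(grid)
    results = []
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        results.append((evaluate(params), params))
    results.sort(key=lambda item: item[0], reverse=True)
    return results

grid = {
    "learning_rate": [0.09, 0.10, 0.11],   # the three rates from the post
    "num_trees": [50, 100, 200],           # illustrative values
    "max_depth": [2, 4, 6],                # illustrative values
}

def toy_score(params):
    # Placeholder scoring function, not a real model.
    return -abs(params["max_depth"] - 4) - abs(params["num_trees"] - 100) / 100

results = grid_search(toy_score, grid)
best_score, best_params = results[0]
print(len(results), best_params)
```

The important design choice is what `evaluate` measures: if it returns training accuracy, the search will favor deeper and larger ensembles and widen the training/testing gap rather than close it.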



    help.rmp 24.1K
    help.xlsx 931.3K
  • Noel Member Posts: 82 Maven
    Thanks, Alex. Much appreciated!
    I'll have a look at that. Great suggestion.
  • Noel Member Posts: 82 Maven
    Alex- while I think both of your observations were correct, I’m still not able to bridge the training and testing performance gap. If you have any more thoughts/suggestions, etc. I’m all ears.
    Thanks again.
  • hughesfleming68 Member Posts: 323 Unicorn
    edited October 2019
    Hi Noel, I am checking your data prep in 103b_create_AM. If I break on the first Filter Examples operator, it gives me a constant label. Is this correct? It could be way too early in the morning and I need more coffee.



    Let me know if I am at least reading the right files. Usually if you feel that something really should work better, the problem is most likely some transformation on your attributes that is killing your signal by mistake.

    Any kind of feature selection risks overfitting the training data, especially when the signal to noise ratio is low. It can certainly make a good base model better but watch out if it is making a really big difference. You may have to shift your data a few times to see if there is consistency with regard to which attributes are being thrown out.
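The consistency check suggested here can be summarized by counting how often each attribute survives selection across shifted runs. The sketch below assumes the selected attribute names of each run have already been collected; the function name, the attribute names and the 0.75 cut-off are illustrative assumptions.

```python
from collections import Counter

def selection_frequency(selected_sets):
    """Share of runs in which each attribute was kept by feature selection."""
    runs = len(selected_sets)
    if runs == 0:
        return {}
    counts = Counter(name for chosen in selected_sets for name in set(chosen))
    return {name: counts[name] / runs for name in sorted(counts)}

# Hypothetical results of running the same selection on four shifted training periods.
runs = [
    {"mean-0", "max-3", "stdev-1"},
    {"mean-0", "min-2"},
    {"mean-0", "max-3"},
    {"mean-0", "stdev-5"},
]
freq = selection_frequency(runs)
stable = [name for name, share in freq.items() if share >= 0.75]
print(freq)
print(stable)
```

Attributes that appear in only one or two runs are the ones most likely to have been picked because they fit noise in that particular training period.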

    What is really jumping out at me is that you are sampling down your training set to 1000 before automatic feature selection. I wouldn't do this. Try to keep the sequences intact and remove any randomness. Try using the last 1000 samples instead.

    Your process is complex but still much easier than digging through code. I see that you are down sampling a couple of times in your other processes and you are not using a local random seed. My fear of this may be unjustified. It might be fine to do this. I don't but that is just me. I am actually curious what other people think. Anyone?
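The two options discussed in this post (taking the most recent rows versus random down-sampling with a fixed seed) look like this in plain Python. The helper names are hypothetical and the list of integers stands in for time-ordered examples.

```python
import random

def last_n(rows, n):
    """Keep the most recent n rows, preserving time order."""
    return rows[-n:] if n > 0 else []

def seeded_sample(rows, n, seed):
    """Random sample of n rows that is reproducible for a given seed and
    returned in the original time order."""
    rng = random.Random(seed)   # local generator, global random state untouched
    picked = sorted(rng.sample(range(len(rows)), n))
    return [rows[i] for i in picked]

rows = list(range(1300))        # stand-in for 1300 time-ordered examples
recent = last_n(rows, 1000)
sample_a = seeded_sample(rows, 1000, seed=42)
sample_b = seeded_sample(rows, 1000, seed=42)
print(recent[0], recent[-1], sample_a == sample_b)
```

The seeded sample is repeatable but still leaves gaps in the sequence, whereas the tail keeps a contiguous stretch of history. Without a fixed seed, two runs of the same process can train on different rows, which makes run-to-run comparisons hard to interpret.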

    Alex

  • Noel Member Posts: 82 Maven
    Thanks for looking under the hood, Alex. That first process just uses an exported auto model process as a base. The most recent process I uploaded is much simpler to go through. (I should have reposted the other files so the second post was self-contained.) The labeler and source file from the first post work with the second process. I’ll repost.
  • Noel Member Posts: 82 Maven
    Alex- Here's the second process. No magic. Just calculating aggregations and windowing.
    help_ii.rmp 23.5K
    help.xlsx 931.3K
  • Noel Member Posts: 82 Maven
    edited October 2019
    Martin / @mschmitz - For a timeseries data set, how many periods of daily data are sufficient for training? Thanks, Noel
  • hughesfleming68 Member Posts: 323 Unicorn
    edited October 2019
    Hi Noel, I had to substitute the windowing operator for the older windowing operator from the value series extension for your process to work. I will see what is going on. Could you confirm that you are getting all three classes windowed properly in your help_ii process for your label?

    Thanks
    Noel
  • Noel Member Posts: 82 Maven
    Alex- Strange that it didn't run out of the box. When you say properly windowing all three classes, I'm not sure what you mean. I exclude all labels but the horizon from the data so the embedded information about the future does not leak through. I meant for only the numeric attributes to be aggregated and windowed.
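To make the leakage concern concrete, the sketch below pairs each window of past feature values with the single label that lies `horizon` steps after the window ends, so no label information from inside or beyond the window enters the features. The exact semantics of the Windowing operator's horizon may differ; the function name and toy data are assumptions.

```python
def windows_with_horizon(features, labels, window_size, horizon):
    """Return (window, target) pairs where `window` holds `window_size`
    consecutive feature values and `target` is the label `horizon` steps
    after the last value in the window."""
    if horizon < 1:
        raise ValueError("horizon must be at least 1 to avoid using the present label")
    if len(features) != len(labels):
        raise ValueError("features and labels must be aligned row by row")
    examples = []
    last_start = len(features) - window_size - horizon
    for start in range(last_start + 1):
        end = start + window_size             # window covers rows [start, end)
        target_index = end - 1 + horizon      # strictly after the window
        examples.append((features[start:end], labels[target_index]))
    return examples

features = list(range(8))                     # toy feature values 0..7
labels = [f"L{i}" for i in range(8)]          # toy labels, one per row
pairs = windows_with_horizon(features, labels, window_size=3, horizon=1)
print(pairs[0], pairs[-1], len(pairs))
```

If the label itself was computed from future prices (as a labeler spreadsheet might do), the effective horizon is longer than the one used here, and any attribute derived from the same calculation has to be excluded as well.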
  • hughesfleming68 Member Posts: 323 Unicorn
    edited October 2019
    Yes. I get the following error.

    It comes from the windowing operator. Substituting the value series windowing operator fixes the problem. We can continue this via private mail if you wish. I just want to make sure that I am seeing what you are seeing.

    I also had to adjust the filter examples attribute names for the data attribute.

    When it runs, I get this. Using GLM is slightly better.



    Check my adjusted version to see.


  • Noel Member Posts: 82 Maven
    That sounds good (private email), Alex. I tried to DM you, but I don't think it went through.
  • hughesfleming68 Member Posts: 323 Unicorn
    I just sent you a PM.
  • Noel Member Posts: 82 Maven
    Thanks to everyone who suffered through my posts to help out! I very much appreciate it!

    (@IngoRM,@yyhuang,@varunm1,@hughesfleming68,@mschmitz,@sgenzer @CraigBostonUSA @Pavithra_Rao)

  • tftemme Administrator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, RMResearcher, Member Posts: 164 RM Research
    Hi @Noel

    Sorry for not responding earlier. This seems to be solved, right? I just skimmed through the thread. There seemed to be an issue with the Windowing operator and the GBT; I think @hughesfleming68 is reporting on this. Is this still an issue?

    Best regards,
    Fabian
  • hughesfleming68 Member Posts: 323 Unicorn
    @tftemme Hi, Fabian. I have just started to use the new operators. I will try and reproduce the error later today. If I discover something, I will let you know.

    Alex
  • Noel Member Posts: 82 Maven
    Hi Fabian / @tftemme

    There are two issues. The first has to do with GBTs and time series data. For daily data, is there a "right" amount of training that is sufficient for the task, but avoids overfit and the divergence between model testing and training performance?

    The second issue, I think, has to do with the core Windowing operator's behavior in 9.4. It seems to change all the labels to a single value, which leads to the GBT complaining about the response being constant during validation (the error @hughesfleming68 reported).
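A quick way to confirm or rule out a constant response is to check the label values inside each training fold before the learner sees them. The sketch below is a generic Python check, not a reproduction of the 9.4 behavior; the function name, toy labels and fold layout are assumptions.

```python
from collections import Counter

def constant_label_folds(labels, folds):
    """Return (fold_number, only_label) for every fold whose training rows
    contain fewer than two distinct label values."""
    problems = []
    for number, (train_idx, _test_idx) in enumerate(folds):
        seen = Counter(labels[i] for i in train_idx)
        if len(seen) < 2:
            only = next(iter(seen), None)   # None if the fold is empty
            problems.append((number, only))
    return problems

# Toy labels: the first five rows all share one class.
labels = ["up"] * 5 + ["down", "up", "down", "up", "down"]
folds = [
    (range(0, 5), range(5, 7)),    # training rows are all "up"
    (range(2, 8), range(8, 10)),   # training rows contain both classes
]
print(constant_label_folds(labels, folds))
```

If every fold is flagged, the label was already constant before validation (pointing at the windowing or filtering step); if only some folds are flagged, the training windows are simply too short for the class mix in those periods.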

    Thanks,
    Noel