RM 9.1 feedback: Let's talk about the new Automatic Feature Engineering (FS)

lionelderkrikor (Moderator, RapidMiner Certified Analyst, Member; Posts: 1,195; Unicorn)
edited June 2019 in Help
Hi all,

First, I wish all readers a Merry Christmas.
Now sit comfortably, because this is going to be a long post...!

1/ (Minor) inconsistency between the chart and the table of the Feature Sets in Auto Model:





To reproduce this error:
- Click on Auto Model
- Select the Titanic dataset
- Enable the Decision Tree model (disable all the other models) and, of course, enable AFS (with the accurate option)
- Run the model

It seems to be just a coding error that causes a "shift" between the chart and the table of feature sets.

2/ Feedback on RM 9.1: my suggestions about AFE as implemented in Auto-Model

2.1 Plot the optimal trade-offs around the optimum:
"... a small diagram is better than a long speech..."
If I understood correctly, the selected feature set minimizes the "distance" between the selected point and the origin of the chart (0,0).
So, to show that the selected feature set is effectively the optimal one, it could be a good idea to continue plotting
the optimal trade-offs. If I'm wrong, please correct me.
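
To make the idea in 2.1 concrete, here is a minimal Python sketch of "pick the trade-off point closest to the origin". It only illustrates my understanding described above, not the actual Auto Model implementation. The candidate names, the (complexity, error) values, and the scaling of each axis by its maximum are all assumptions made for illustration.

```python
import math

# Hypothetical (complexity, error) pairs for candidate feature sets.
# These numbers are made up for illustration; they do not come from RapidMiner.
candidates = {
    "A": (1, 0.22),
    "B": (2, 0.15),
    "C": (4, 0.05),
    "D": (6, 0.04),
    "E": (5, 0.10),  # worse than C on both axes
}

def pareto_front(points):
    """Return the names of points not dominated by any other point (lower is better)."""
    front = []
    for name, (c, e) in points.items():
        dominated = any(
            c2 <= c and e2 <= e and (c2 < c or e2 < e)
            for other, (c2, e2) in points.items()
            if other != name
        )
        if not dominated:
            front.append(name)
    return front

def closest_to_origin(points, names):
    """Pick the point with the smallest Euclidean distance to (0, 0).

    Assumption: each axis is divided by its maximum so that complexity and
    error are comparable. Both maxima are assumed to be positive.
    """
    max_c = max(points[n][0] for n in names)
    max_e = max(points[n][1] for n in names)

    def distance(n):
        c, e = points[n]
        return math.hypot(c / max_c, e / max_e)

    return min(names, key=distance)

front = pareto_front(candidates)
best = closest_to_origin(candidates, front)
print("optimal trade-offs:", front)
print("closest to origin:", best)
```

Plotting the points in `front` on both sides of `best` is what would let a user check visually that the selected feature set really is the closest one.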


2.2 Clarify the definition of "fitness".
If I understand correctly, the fitness (displayed as an IOObjectCollection at the output of the AFE operator) is equated with the error. It could be a good idea
to clarify the definition of this notion. Once again, if I'm wrong, please correct me.

2.3 Set apply pruning and/or apply prepruning to FALSE for the DT model (and similar models - RF...) when AFS and/or Optimize Parameters are enabled:
I will try to describe my reasoning: from my point of view, when we deliberately choose to search for the optimal feature set (via AFE) and/or the best combination of parameters (in the case of DT, the best k), we expect the best possible model at the cost of some time (I understood that RM users are not very patient...). So the goal is to obtain a model that effectively uses the optimal feature set found and/or the optimal combination of parameters.
Once again take the example of Titanic with DT (Automatically Optimize enabled) and AFE enabled.
After execution, RM concludes that :
- Feature set = 4
- k (max depth) = 4
With apply pruning and/or apply prepruning set to FALSE, we obtain a model that effectively uses 4 features and k = 4, with an accuracy of 96.53%.
By default, in Auto-Model, apply pruning and/or apply prepruning are set to TRUE, so in practice we obtain a simple model with k = 1 that uses one feature:


and with a (relatively) poor accuracy of 77.87%. So in the end, by default, we lose all the benefit of having performed feature selection and/or parameter optimization.
So my conclusion is to set "apply pruning" and/or "apply prepruning" to "FALSE" for DT (and similar models) when the user chooses to set "Automatic Feature Selection" to "enabled".

What do you think about all these points?

Thank you for your patience and attention, and good luck to the RM staff with the next release of RapidMiner...

Regards,

Lionel


Answers

  • lionelderkrikor (Moderator, RapidMiner Certified Analyst, Member; Posts: 1,195; Unicorn)
    @IngoRM,

    Thank you for taking the time to respond.
    By the way, your webinar about "Feature Engineering" convinced me that this step is decisive in the methodology of a data-science project.

    Regards,

    Lionel
  • IngoRM (Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, Community Manager, RMResearcher, Member, University Professor; Posts: 1,751; RM Founder)
    Thanks. Yes, it can be very important. However, I really want to underline what I also mentioned in the webinar: often it is not necessarily about an improvement in model accuracy but really about reducing complexity while maintaining the accuracy of the model. This was a bit different in the old days (read: 10+ years ago), but today, whenever people report massive accuracy improvements from feature selection, they typically did not validate correctly but looked only at the "training" error of the feature selection outcome.
    Things are different for feature generation / extraction of course, since here you can sometimes get to a breakthrough representation that lets the ML algorithm find the underlying patterns which were hidden before. But unfortunately that does not happen in every data science project either. But it is always worth a try if time allows...
    Anyway, many thanks again for your questions, suggestions, and the bug report. This is much appreciated.
    Best,
    Ingo
  • IngoRM (Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, Community Manager, RMResearcher, Member, University Professor; Posts: 1,751; RM Founder)
    Hi again,
    We have looked into this and there was indeed a problem with the sorting of individuals which happened in certain cases only. We have applied a fix and the problem no longer occurs in our tests. But since this is a bit of a nasty problem which is difficult to spot, I would like to ask our community members to give it a try themselves in the upcoming beta release (will come soon).
    Thanks again for pointing this out & best,
    Ingo
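
Ingo's earlier remark about validation (large gains from feature selection often come from looking only at the "training" error of the selection outcome) can be illustrated with a small Python sketch. This is a toy demonstration under stated assumptions, not anything from RapidMiner: the data is pure random noise, the "model" is the trivial rule "predict the value of one feature", and all names and sizes are hypothetical.

```python
import random

random.seed(0)

# Hypothetical sizes: few training rows, many candidate features.
N_TRAIN, N_TEST, N_FEATURES = 40, 1000, 200

def make_noise_data(n_rows):
    """Binary features and binary labels drawn independently: no real signal."""
    X = [[random.randint(0, 1) for _ in range(N_FEATURES)] for _ in range(n_rows)]
    y = [random.randint(0, 1) for _ in range(n_rows)]
    return X, y

def accuracy(X, y, j):
    """Accuracy of the trivial rule 'predict the value of feature j'."""
    return sum(row[j] == label for row, label in zip(X, y)) / len(y)

X_train, y_train = make_noise_data(N_TRAIN)
X_test, y_test = make_noise_data(N_TEST)

# "Feature selection": keep the single feature that looks best on the training data.
best_j = max(range(N_FEATURES), key=lambda j: accuracy(X_train, y_train, j))

train_acc = accuracy(X_train, y_train, best_j)  # optimistic: same data used for selection
test_acc = accuracy(X_test, y_test, best_j)     # honest: data never seen during selection

print(f"selected feature: {best_j}")
print(f"accuracy on the data used for selection: {train_acc:.2f}")
print(f"accuracy on held-out data: {test_acc:.2f}")
```

Because the labels are independent of every feature, the selected feature has no real predictive power. Its score on the selection data still looks good, simply because it was the best of many random candidates, while the held-out score stays close to chance.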