RM 9.1 feedback : Let's talk of the new Automatic Feature Engineering (FS)
lionelderkrikor
Moderator, RapidMiner Certified Analyst, Member Posts: 1,195 Unicorn
Hi all,
First, I wish all readers a Merry Christmas.
Now sit comfortably, because this is going to be a long post...!
1/ (Minor) inconsistency between the chart and the table of feature sets in Auto Model:
To reproduce this error:
- Click on Auto Model
- Select the Titanic dataset
- Enable the Decision Tree model (disable all the other models) and, of course, enable AFS (with the accurate option)
- Run the model
It seems to be just a coding error that causes a "shift" between the chart and the table of feature sets.
2/ Feedback on RM 9.1: my suggestions about AFE as implemented in Auto Model
2.1 Plot the optimal trade-offs around the optimum:
"... a small diagram is better than a long speech..."
If I understood correctly, the selected feature set minimizes the "distance" between the selected point and the origin of the chart (0,0).
So, to show that the selected feature set really is the optimal one, it could be a good idea to continue plotting the optimal trade-offs around it. If I'm wrong, please correct me.
2.2 Clarify the definition of "fitness".
If I understand correctly, the fitness (displayed as an IOObjectCollection at the output of the AFE operator) is equivalent to the error. It could be a good idea to define this notion precisely. Once again, if I'm wrong, please correct me.
2.3 Set apply pruning and/or apply prepruning to FALSE for the DT model (and similar models - RF...) when AFS and/or Optimize Parameters are enabled:
I will try to describe my reasoning. From my point of view, when we deliberately choose to search for the optimal feature set (via AFE) and/or the best combination of parameters (in the case of DT, the best k), we expect the best possible model at the cost of some time (I understand that RM users are not very patient...). So the goal is to obtain a model that actually uses the optimal feature set found and/or the optimal combination of parameters.
Once again, take the example of Titanic with DT (Automatically Optimize enabled) and AFE enabled.
After execution, RM concludes that:
- Feature set = 4
- k (max depth) = 4
With apply pruning and/or apply prepruning set to FALSE, we obtain a model that actually uses 4 features and k = 4, with an accuracy of 96.53%.
By default, in Auto Model, apply pruning and/or apply prepruning are set to TRUE, so in practice we obtain a simple model with k = 1 that uses a single feature, with a (relatively) poor accuracy of 77.87%. So in the end, by default, we lose all the benefit of having performed feature selection and/or parameter optimization.
So my conclusion is to set "apply pruning" and/or "apply prepruning" to FALSE for DT (and similar models) when the user sets "Automatic Feature Selection" to "enabled".
What do you think about all these points?
Thank you for your patience and attention, and good luck to the RM staff with the next release of RapidMiner...
Regards,
Lionel
Best Answers
sgenzer Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,959 Community Manager
hello @lionelderkrikor - wow what an in-depth post! I'm going to leave the analysis of this to @IngoRM as this is all Auto Model. All I'm going to do is tag this as "Bug Report" and "Feature Request".
Happy holidays to you as well!
Scott
IngoRM Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, Community Manager, RMResearcher, Member, University Professor Posts: 1,751 RM Founder
Fantastic feedback - thanks for taking the time to write this down. Let me comment on some of the items right away (but rest assured that we will look into all of the items in depth as well).
1) "It seems that it is just a coding error which causes a "shift" between the chart and the table of feature sets." Yes, indeed. No clue how that happened - we have tested this 1,000s of times and this never happened. It indeed does not happen if you deselect the "Life Boat" column, which we actually do as part of the test script. Weird - we will of course look into this and fix it.
2.1) "Plot the optimal Trade-Offs around the optimum" I get where you are coming from. This does not make a lot of sense, though, since there are no other optimal trade-offs to be plotted (this is the whole concept of the Pareto front: no point is actually better than the others). There are no other points "north" of the top left point (since adding more complexity no longer leads to better models), and points to the top right are no longer Pareto-optimal and are therefore removed during the optimization run.
"If I good understood, the selected feature set minimize the "distance" between the selected point and the origin of the chart (0,0)." Yes and no. The optimization indeed tries to move the whole front towards the origin, but it is not the case that the point on the front with minimal distance is the best and will be selected at the end. In fact, there is no "best" point on the front (see above) - all points are equal in the sense that they represent the trade-off between the two competing criteria. See here for some basic information about multi-objective optimization and Pareto fronts: https://en.wikipedia.org/wiki/Pareto_efficiency (in case it helps, otherwise pls ignore...)
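In case a concrete example helps, here is a minimal Python sketch of the Pareto-front idea described above. It is not how Auto Model is implemented; the function name pareto_front and the (number of features, error) values are made up purely for illustration. Both criteria are minimized, and a point is dropped when another point is at least as good on both criteria.

```python
from math import hypot
from typing import List, Tuple

# (number of features, error rate); both are to be minimized.
Point = Tuple[int, float]


def pareto_front(points: List[Point]) -> List[Point]:
    """Return the points that no other point dominates."""
    front = []
    for p in points:
        # q dominates p if it is no worse on both criteria and differs from p.
        dominated = any(
            q != p and q[0] <= p[0] and q[1] <= p[1]
            for q in points
        )
        if not dominated:
            front.append(p)
    return sorted(front)


# Hypothetical candidates, invented for this example only.
candidates = [
    (1, 0.22), (2, 0.15), (3, 0.15), (4, 0.04),
    (5, 0.04), (6, 0.05), (2, 0.30),
]

front = pareto_front(candidates)

# The front point nearest to (0, 0) in raw units. It is only one member of
# the front, and which point is "nearest" changes if an axis is rescaled.
closest = min(front, key=lambda p: hypot(p[0], p[1]))

print(front)
print(closest)
```

In this toy data, adding a fifth or sixth feature no longer lowers the error, so those points drop out of the front. The nearest-to-origin point is simply the one-feature model here, because the feature count dominates the raw distance; that is one way to see why distance alone does not single out a "best" trade-off.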
2.2) "Precise the definition of fitness" Yes, we use "fitness" here since you could go with different types of error measurement (think unsupervised learning). Now thinking a bit more about this, the choice of words here is poor in any case since typically the fitness is something you want to maximize but we actually want to minimize it here. So we will probably "rebrand" this to "error" for classification / regression tasks and "DB-Index" for unsupervised tasks (not supported yet but will come in 2019).
2.3) "Set apply pruning and/or prepruning to FALSE for DT model (and similar models - RF...) when AFS and/or Optimize Parameters are enabled" Good thinking here. We will check this for the next release.
Thanks again for the fantastic feedback, very helpful indeed!
Best,
Ingo
Answers
Thank you for taking the time to respond.
By the way, your webinar about "Feature Engineering" convinced me that this step is decisive in the methodology of a data-science project.
Regards,
Lionel