Question: Automodel (AM)
Hello, I have a few questions on Automodel (AM):
1. How do "weights" (given under the "General" tab) differ from "feature sets"? For example, in one simulation, AM shows that a certain input has an importance of 1. However, when I view the "feature sets" of several algorithms that AM selected for this analysis (say 4 out of 7), those 4 algorithms do not select this particular input.
2. In the "Optimal trade-offs between complexity and error" graph, I can find a model with a complexity of 4 and an error of 15%. However, for this particular algorithm the accuracy was 72%. I am not sure how these two relate to each other.
3. Given the above, what is the best way to know the critical inputs in a dataset? Say that I am trying to identify critical inputs in one dataset using AM, and this is my thought process: find the critical inputs for GLM, LR, DL, DT, RF, GBT etc., so that I can pinpoint the inputs that recur between algorithms. This is my way of identifying such parameters (i.e. if they show up in different algorithms, then they are of high importance to the dataset). Any tips on this are appreciated. Thanks!
Best Answer
IngoRM (RM Founder)
Hey @mzn
Ok, here we go:
1. How do "weights" (given under the "General" tab) differ from "feature sets"?
The weights in the General tab are simply the correlations of the attributes with the label. They are independent of any modeling and only give some general guidance on what likely matters more. However, interactions between attributes picked up by models often make other variables much more important for a given model than those with the highest correlations, and sometimes a combination of variables with zero correlation can beat a single one with, let's say, 0.7. By the way, you can open up the process to see how the data is prepared and how the correlations are calculated.
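To make the point about interactions concrete, here is a minimal sketch on synthetic data. It is not what Auto Model runs internally; the inputs `a`, `b`, `c` and the XOR-style label are assumptions made up for illustration. Inputs `a` and `b` each have zero correlation with the label but determine it completely together, while `c` has a correlation of roughly 0.7 and still does worse on its own.

```python
from itertools import product
from math import sqrt


def pearson(xs, ys):
    """Plain Pearson correlation between two equal-length numeric lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def accuracy(predictions, truth):
    """Fraction of rows where the prediction equals the true label."""
    return sum(p == t for p, t in zip(predictions, truth)) / len(truth)


# Synthetic data (assumption): 100 rows cycling through all combinations
# of two binary inputs a and b.
rows = list(product([0, 1], repeat=2)) * 25
a = [r[0] for r in rows]
b = [r[1] for r in rows]

# The label depends only on the interaction of a and b (XOR).
label = [x ^ y for x, y in rows]

# c is a single input that copies the label but is wrong on 15 of 100 rows.
c = [1 - y if i % 20 < 3 else y for i, y in enumerate(label)]

corr_a = pearson(a, label)
corr_b = pearson(b, label)
corr_c = pearson(c, label)

# Using c alone as the prediction versus a rule that combines a and b.
# The XOR rule stands in for a model that can pick up the interaction.
acc_c_alone = accuracy(c, label)
acc_a_and_b = accuracy([x ^ y for x, y in zip(a, b)], label)

print(f"correlation with label: a={corr_a:.2f}, b={corr_b:.2f}, c={corr_c:.2f}")
print(f"accuracy using c alone: {acc_c_alone:.2f}")
print(f"accuracy using a and b together: {acc_a_and_b:.2f}")
```

A correlation-based weight would rank `c` first and `a` and `b` last, while a model that can use the interaction would prefer `a` and `b`. That is the kind of disagreement you can see between the weights and the feature sets.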
The chart in "Feature Sets", on the other hand, is model-specific and takes those interactions into account. The creation of those charts has actually been a direct output of my research, and if you are really bored, feel free to check out my PhD thesis here:
https://www-ai.cs.uni-dortmund.de/PublicPublicationFiles/mierswa_2008a.pdf
If you just want the highlights, I would recommend the following two webinars instead:
I also wrote a series of blog posts about this some time ago. Here is the link to the first one:
2. In the "Optimal trade-offs between complexity and error" graph, I can find a model with a complexity of 4 and an error of 15%. However, for this particular algorithm the accuracy was 72%. I am not sure how these two relate to each other.
The 15% error (= 85% accuracy) is the training error during the feature selection run for finding the points on this trade-off chart. The 72% accuracy (= 28% error rate) is the test error for this feature set on a hold-out set which was not used for running the feature engineering optimization. It is important that you do a correct validation for the feature selection as well, not just for the model building. Here you have a perfect example why: the error rate which can be expected in production is 28%, not just 15%! You can again open up the process for the particular model to see the details about the validation there.
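The gap between the two numbers can be reproduced with a small sketch. It assumes purely random synthetic data (500 binary inputs that have nothing to do with the label) and a deliberately naive selection that picks the single input agreeing most often with the label on the training rows. None of this is Auto Model's actual procedure; it only shows why an error measured on the rows used for the selection is too optimistic.

```python
import random

random.seed(0)  # arbitrary seed, only to make the run repeatable

N_FEATURES = 500   # all inputs are pure noise (assumption for illustration)
N_TRAIN = 30       # rows used for the feature selection
N_HOLDOUT = 1000   # rows never seen during the selection


def make_rows(n):
    """Generate n rows of (random binary inputs, random binary label)."""
    return [
        ([random.randint(0, 1) for _ in range(N_FEATURES)], random.randint(0, 1))
        for _ in range(n)
    ]


def accuracy(rows, j):
    """Accuracy of predicting the label directly from input j."""
    return sum(x[j] == y for x, y in rows) / len(rows)


train = make_rows(N_TRAIN)
holdout = make_rows(N_HOLDOUT)

# "Feature selection": keep the input that looks best on the training rows.
best = max(range(N_FEATURES), key=lambda j: accuracy(train, j))

train_acc = accuracy(train, best)      # optimistic, measured on selection data
holdout_acc = accuracy(holdout, best)  # honest estimate on unseen data

print(f"selected input: {best}")
print(f"accuracy on the selection data: {train_acc:.2f}")
print(f"accuracy on the hold-out data:  {holdout_acc:.2f}")
```

Since every input is noise, the hold-out accuracy should land near 50%, while the accuracy on the selection data looks much better simply because the best of 500 candidates was picked on those same rows.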
3. Given the above, what is the best way to know the critical inputs in a dataset?
I know this may sound a bit philosophical, but in my opinion there are no "critical inputs for the data set". There are only "critical inputs for a specific model on a data set". Each model type picks up different things in the data; some can work with feature interactions, some cannot. So different features are often important for one model but less so for others. I am not a big fan of "averaging" those rankings across different model types. I know that people are doing this, it just does not make a lot of sense to me. Even if you weight the ranks by model performance when building the average, things are not much better in my opinion.
I would rather argue that you should identify a good model (based on the validation performance) and then state which features are most critical for THAT model. If you run the feature selection with the setting "Accuracy", this will in general be the set in the top left corner of the Pareto trade-off chart.
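For readers who want to see the trade-off idea in code, here is a minimal sketch of picking a feature set from a complexity/error Pareto front. The candidate feature sets, their names and their error values are invented for illustration, and the errors are assumed to be validated errors rather than training errors.

```python
# Hypothetical candidates: each has a feature set, a complexity
# (number of features) and a validated error rate. All values are made up.
candidates = [
    {"features": ("s",), "complexity": 1, "error": 0.30},
    {"features": ("s", "h"), "complexity": 2, "error": 0.22},
    {"features": ("s", "w"), "complexity": 2, "error": 0.25},
    {"features": ("h", "w", "t"), "complexity": 3, "error": 0.24},
    {"features": ("s", "h", "w", "t"), "complexity": 4, "error": 0.15},
    {"features": ("s", "h", "w", "t", "d", "m"), "complexity": 6, "error": 0.15},
    {"features": ("s", "h", "w", "t", "d", "m", "k"), "complexity": 7, "error": 0.14},
]


def dominates(p, q):
    """True if p is at least as good as q on both criteria and better on one."""
    no_worse = p["complexity"] <= q["complexity"] and p["error"] <= q["error"]
    better = p["complexity"] < q["complexity"] or p["error"] < q["error"]
    return no_worse and better


def pareto_front(cands):
    """Keep only candidates that no other candidate dominates."""
    front = [c for c in cands if not any(dominates(o, c) for o in cands)]
    return sorted(front, key=lambda c: c["complexity"])


front = pareto_front(candidates)

# With accuracy as the only goal, take the lowest-error point on the front.
most_accurate = min(front, key=lambda c: c["error"])

for c in front:
    print(c["complexity"], c["error"], c["features"])
print("most accurate set:", most_accurate["features"])
```

The features of `most_accurate` are then the ones to report as critical for that one model, instead of an average over several model types.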
But this only tells you the set of features, not which feature is more or less important within that set. One way of figuring this out is to look at the colors of the Predictions entry. Columns with a lot of bolder colors across all rows in general are more important than those which are mostly light-colored. One of the next versions will put this into numbers, but for now you need to go visual. This is also true BTW if you do not apply feature selection at all.
Hope this helps,
Ingo
Answers
I really liked this statement
and I think this is what I was missing! Thanks again.
No, it is a numerical value (in this particular case, it is the spacing between two different components say columns in a building). Thank you
Sorry for the late reply. I am re-running the analysis on a different machine and will get back to you sometime tomorrow or Tuesday morning (I can definitely share the data and screenshots). Thank you for your time.
So, I have re-run the analysis on two different computers and found the following:
1. My home PC yields good results, as you can see here, where the factor (s) has a positive correlation (as expected):
2. My office PC shows that the factor "s" has a negative correlation (which is not quite true).
This is what I found: the database files are identical, except that one (the 2nd case) had two columns with rounded digits, so I guess this was the issue. Thank you!
Ingo