"Some experience with the (new) decision trees"
Hi,
I just wanted to share some experience I gained regarding the different decision tree realisations in RapidMiner.
As an exercise I wanted to test RapidMiner on a realistically large dataset from the Data Mining Cup 2007 (DMC2007, http://www.data-mining-cup.de/2007/Wettbewerb), which has 50000 records for training and 50000 for testing.
I first worked with RapidMiner 4.2 and got pretty good results with MetaCost and DecisionTree: a misclassification cost of -0.141 on the training set. (The more negative, the better. The best of the roughly 300 participants in the DMC2007 challenge achieved -0.1578. A very naive (bad) model that always predicts the most frequent class N would obtain 0.000.)
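For readers unfamiliar with this measure, here is a minimal sketch of how an average misclassification cost can be computed from a cost matrix. The cost values and the tiny label lists are made-up assumptions for illustration only (they are not the DMC2007 costs), and `average_cost` is a hypothetical helper, not a RapidMiner function. It shows why a model that always says "N" lands at exactly 0.000 when predicting "N" is free.

```python
from typing import Dict, Sequence, Tuple

# Hypothetical cost matrix keyed by (true_class, predicted_class).
# Negative entries are gains. Predicting "N" costs nothing either way.
# These numbers are illustrative assumptions, NOT the DMC2007 values.
COSTS: Dict[Tuple[str, str], float] = {
    ("N", "N"): 0.0,
    ("N", "A"): 1.0,
    ("A", "N"): 0.0,
    ("A", "A"): -3.0,
}


def average_cost(
    true_labels: Sequence[str],
    predicted_labels: Sequence[str],
    costs: Dict[Tuple[str, str], float] = COSTS,
) -> float:
    """Return the mean cost per record for the given predictions."""
    if not true_labels or len(true_labels) != len(predicted_labels):
        raise ValueError("label sequences must be non-empty and equally long")
    total = sum(costs[(t, p)] for t, p in zip(true_labels, predicted_labels))
    return total / len(true_labels)


# Toy data: "N" is the most frequent class.
truth = ["N", "N", "N", "A", "N", "A", "N", "N"]
always_n = ["N"] * len(truth)                            # the trivial model
some_model = ["N", "A", "N", "A", "N", "N", "N", "N"]    # one hit, one false alarm

print(average_cost(truth, always_n))    # 0.0
print(average_cost(truth, some_model))  # -0.25
```

With such a matrix, any model has to beat 0.000 to be worth more than the trivial "always N" answer.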
Then I switched to RapidMiner 4.4, since Steffen convinced me that it was better because of many other bug fixes (and it is). When rerunning the same process (with the revised DT of V4.4), my model was now 50x faster than in V4.2, but also very "naive": misclassification cost = 0.000, because it always said "N", the trivial choice. All trees had only one leaf.
I guessed that the new pre-pruning might be the problem and activated the parameter no_pre_pruning. This led to a heap space error after 6 min.
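To illustrate the guess about pre-pruning, here is a minimal sketch of a gain-threshold check on a skewed class distribution. It is not RapidMiner's actual implementation. The threshold name `min_gain`, its value, and the toy class counts are all assumptions. The idea is only that, when one class dominates, even a useful split can have a small information gain, so a pre-pruning threshold may reject every split and leave a single leaf.

```python
import math
from collections import Counter
from typing import List, Sequence


def entropy(labels: Sequence[str]) -> float:
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(labels: Sequence[str], partitions: List[Sequence[str]]) -> float:
    """Entropy of the parent minus the size-weighted entropy of the children."""
    n = len(labels)
    remainder = sum(len(p) / n * entropy(p) for p in partitions)
    return entropy(labels) - remainder


def should_split(labels, partitions, min_gain: float) -> bool:
    """Pre-pruning rule (assumed): split only if the gain reaches min_gain."""
    return information_gain(labels, partitions) >= min_gain


# Toy node: 90 x "N", 10 x "A" (a skewed distribution, values are made up).
parent = ["N"] * 90 + ["A"] * 10
left = ["N"] * 50 + ["A"] * 2     # candidate split, left branch
right = ["N"] * 40 + ["A"] * 8    # candidate split, right branch

gain = information_gain(parent, [left, right])
print(f"gain = {gain:.4f}")
print("split with min_gain=0.1:", should_split(parent, [left, right], 0.1))
print("split with min_gain=0.0:", should_split(parent, [left, right], 0.0))
```

In this toy case the right branch is clearly richer in "A" than the left one, yet the gain stays well below 0.1, so the split is refused under the stricter threshold and the node remains a leaf predicting "N".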
I assumed that maximal_depth=20 might be too complex and decreased this parameter to 10. Now it worked: it took only 1 min (compared to 12 min in V4.2) and produced a result. But the result was, at least for this dataset, quantitatively inferior to V4.2, with a misclassification cost of only -0.091.
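A rough back-of-the-envelope sketch of why the depth limit matters so much for memory once pre-pruning is off. It assumes binary splits, counts depth as the number of split levels, and assumes every leaf holds at least one training record. None of this is confirmed for the RapidMiner operator, and `max_leaves` is a hypothetical helper.

```python
def max_leaves(depth: int, n_records: int, branching: int = 2) -> int:
    """Upper bound on the leaf count of one tree.

    A tree with `branching` children per node and `depth` split levels has at
    most branching**depth leaves, and never more leaves than training records.
    """
    return min(branching ** depth, n_records)


n_train = 50000  # training records, as stated in the post

for depth in (10, 20):
    print(depth, max_leaves(depth, n_train))
```

Under these assumptions a depth of 10 caps a tree at 1024 leaves, while a depth of 20 is effectively no cap at all for 50000 records. MetaCost builds several such trees, so the difference multiplies.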
I then replaced the DecisionTree operator with WEKA's W-REPTree operator, with parameter L=12, and right from the start got results better than all the others before: misclassification cost = -0.147. And W-REPTree is incredibly fast (about 5 sec) and does not consume much heap space.
Since the datasets are a little too large to be uploaded here, I provide them (500KB zip) under http://www.gm.fh-koeln.de/~konen/DMDATA/dmc2007_train.zip, if anyone wants to reproduce the results.
Here is the code I used (you may disable W-REPTree and enable DecisionTree to reproduce the last DT-experiment):
My conclusion: Perhaps some more tests with the "new" DecisionTree could be done, if time permits. Although it is faster and probably gives better results on the datasets it was tested on, it seems that large datasets (or just this dataset) cause some problems for the DT with its current parameter settings.
The W-REPTree seems to be the better choice in this case. Another suggestion: Perhaps the Rapid-I team should consider making the 'old' DecisionTree of V4.2 available under V4.4 under some name like DecisionTree.4.2, at least for some time. For my taste the rate and depth of changes in the operators is a little bit too high ...
Best regards
Wolfgang
Answers
Thank you for sharing your experience and your experiment setup. I only downloaded it yesterday and have had just a brief look so far. Indeed, the decision tree seems to perform worse than a Weka tree or other RM learners. Anyway, the decision tree is already on my/our list for another revision. Presumably the pre-pruning is the reason for the less nice trees. I am not that fond of the idea of introducing multiple versions of an operator in one version of RM. In my opinion that would presumably confuse most users. Nevertheless, we will of course try to tackle the problem very soon and implement a decision tree version that is stable and performant. When I look at the tree, I will definitely consider your training example as well. So thanks again for sharing. We will keep you up to date on any progress.
Kind regards,
Tobias
I would also like to thank you for the nice description of your experiences. Reports like this really help us a lot, because without that information we are always a bit project-driven (which is at least better than being "UCI-driven").
Just two side notes: I fully understand your concern, and I would also like to prevent such changes if they occur too often, especially for core operators like the DT learner. On the other hand, the old implementation was ridiculously slow for both smaller and larger data sets, and we simply had to do something about this. And although I understand your point, I personally think that the agile and dynamic style of our development is actually something many users really like about RapidMiner. Just consider that people who are happy with everything do not post nearly as often in a public forum as people having problems. That is sad, but another topic.
Isn't it nice to have the ability to test all of those methods and choose the best?
Actually, the REPTree and the DecisionTree operator are like apples and bananas. A "fairer" comparison would be to compare "DecisionTree" with "W-J48". But if you already found that REPTree is best in your case this is of course great - and a nice motivation for Tobias to improve the DecisionTree in a way that it produces as good results in the same (or even better: in shorter) times :P
Thanks again for your nice comments and all the best,
Ingo
Having said this, let me comment on Tobias's point: Yes, I understand it, but on the other hand there are so many operators in RM that one or two more (perhaps in a special folder 'Deprecated') would not hurt too much, would they? But this is a matter of taste, and I understand your arguments as well.
Thanks again for the fast response, I think I get an idea now what the "Rapid" in "Rapid-I" stands for ...
Best regards
Wolfgang