Aggregation / compression instead of forecast / prediction
nicugeorgian
Member · Posts: 31 · Guru
Hi,
I have a data set with both nominal and numerical attributes and a numerical label.
I'm trying to fit some regression tree on this set.
I would like to use the regression tree as an aggregation / compression of the data set rows and not as a forecast. Concretely, my regression tree is not going to be applied to unseen data, so overfitting would not be a problem in this case. Of course, I should avoid ending up with as many tree leaves as there are rows in the data set (that wouldn't be an aggregation anymore).
The goal, however, is that the trained model (the regression tree) reflects the training data as closely as possible.
Would the regression tree (Weka W-M5P) be the best solution for this problem? If yes, how should I choose the algorithm's parameters?
I think it would be better if I select the "no-pruning" option ...
Any ideas?
Thanks!
Answers
Whether the regression tree is the best algorithm depends on your needs. If you want an understandable model, choose it. Otherwise, other learners are possible and may fit better. You might have to transform your data first, though, because LinearRegression and SVMs don't support nominal values.
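One common way to do that transformation is to replace each nominal attribute with one 0/1 column per distinct value. The snippet below is only a minimal sketch in plain Python, not what any particular tool does internally; the function name `one_hot_encode` and the example attributes `color` and `size` are made up for illustration.

```python
def one_hot_encode(rows, nominal_columns):
    """Replace each nominal column with one 0/1 column per distinct value."""
    # Collect the distinct values of every nominal column (sorted for stable naming).
    categories = {
        col: sorted({row[col] for row in rows}) for col in nominal_columns
    }
    encoded = []
    for row in rows:
        new_row = {}
        for col, value in row.items():
            if col in categories:
                # One indicator column per category, named "column=value".
                for cat in categories[col]:
                    new_row[f"{col}={cat}"] = 1.0 if value == cat else 0.0
            else:
                # Numerical attributes are passed through unchanged.
                new_row[col] = float(value)
        encoded.append(new_row)
    return encoded


# Hypothetical example data: one nominal and one numerical attribute.
rows = [
    {"color": "red", "size": 1.5},
    {"color": "blue", "size": 2.0},
    {"color": "red", "size": 0.5},
]
encoded = one_hot_encode(rows, ["color"])
for row in encoded:
    print(row)
```

After this step every attribute is numerical, so learners that cannot handle nominal values can be applied.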
The best parameters for a learner depend on your data, so you have to try them out.
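For the aggregation use case, "trying it out" could mean sweeping the parameter that limits leaf size and looking at two numbers per setting: how many leaves you get (the compression) and how large the error on the training data is (the fidelity). The following is a rough sketch under strong simplifications: it is a plain greedy regression tree with constant leaves on a single numerical attribute, not M5P, and `min_leaf` is a hypothetical stand-in for a minimum-instances-per-leaf setting. The data are synthetic.

```python
import math


def sse(values):
    """Sum of squared deviations from the mean (the error of a constant leaf)."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values)


def grow(points, min_leaf):
    """Greedily split (x, y) points sorted by x; return the leaves as lists of labels."""
    ys = [y for _, y in points]
    best_cut, best_cost = None, sse(ys)
    # Only consider cuts that leave at least min_leaf rows on each side.
    for cut in range(min_leaf, len(points) - min_leaf + 1):
        if points[cut - 1][0] == points[cut][0]:
            continue  # rows with identical attribute values cannot be separated
        cost = sse(ys[:cut]) + sse(ys[cut:])
        if cost < best_cost - 1e-12:
            best_cut, best_cost = cut, cost
    if best_cut is None:
        return [ys]  # no split improves the fit: this group becomes a leaf
    return grow(points[:best_cut], min_leaf) + grow(points[best_cut:], min_leaf)


def summarize(points, min_leaf):
    """Return (number of leaves, RMSE on the training data) for one setting."""
    leaves = grow(sorted(points), min_leaf)
    rmse = math.sqrt(sum(sse(leaf) for leaf in leaves) / len(points))
    return len(leaves), rmse


# Synthetic data: 20 rows, a step function with a small alternating offset.
points = [(float(x), 10.0 * (x // 5) + 0.5 * (x % 2)) for x in range(20)]

results = {m: summarize(points, m) for m in (10, 5, 2, 1)}
for m, (n_leaves, rmse) in results.items():
    print(f"min_leaf={m:2d}  leaves={n_leaves:2d}  training RMSE={rmse:.3f}")
```

With no limit on leaf size the tree reproduces the training data exactly, but only by using one leaf per row, which is no aggregation at all. The interesting settings are the ones in between, where the leaf count is still small and the training error is acceptable for your purpose.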
Greetings,
Sebastian