How does RM Decision Tree algorithm choose between two attributes that produce equally good splits?

rick_daviesrick_davies MemberPosts:11Contributor II
edited December 2018 inHelp

It seems possible that a DT algorithm will find cases that are split equally well into two different groups (branches) by say two different case attributes. How does it choose one attribute to use, in these circumstances? And what can we say about the likely structure of the tree that would have been made had the neglected attribute been used? I have not seen anything like this discussed in the literature on DTs

Best Answer

  • IngoRMIngoRM Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, Community Manager, RMResearcher, Member, University ProfessorPosts:1,751RM Founder
    Solution Accepted

    Well you could do this. However, I just do not see a chance for a massive improvement here to be honest. See my answer from before - I don't think that there really is a huge risk given that the whole tree is built based on choices which might be sub-optimal...

Answers

  • Marco_BoeckMarco_Boeck Administrator, Moderator, Employee, Member, University ProfessorPosts:1,984RM Engineering

    Hi,

    looking at the code, I can say that in such a case, we simply use whichever attribute comes first in the data.

    Regards,

    Marco

  • rick_daviesrick_davies MemberPosts:11Contributor II

    ...which is what some others have told me.

    In which case, how do we know that the other attribute, if it was used instead, does not lead on to a better performing DT structure?

  • Thomas_OttThomas_Ott RapidMiner Certified Analyst, RapidMiner Certified Expert, MemberPosts:1,761Unicorn

    You could re-order the attributes and test it?

  • IngoRMIngoRM Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, Community Manager, RMResearcher, Member, University ProfessorPosts:1,751RM Founder

    Interesting topic indeed. We do not know if the other attribute would not lead to a better DT. But having said this, this is actually true for all of the splits. Keep in mind that picking the one with the biggest information gain is a heuristic as well. There is no guarantee that a pick is leading to an optimal DT (optimal = best predictive power). So it could easily be that making a different pick at each split, let's say the one with the 2nd or 3rd best information gain, might lead to a more accurate tree. But the only way to find this out is to go bruteforce through all options which is not computationally feasible. And even then you might overfit to your test set you use to find out which split is the best.

    So to summarize: using information gain (and the other approaches) is a heuristic itself and does not guarantee an optimal tree to begin with. Making an arbitrary split between two good attributes does not make this heuristic much more, well, "heuristic-y" in my opinion ;-)

    Cheers,

    Ingo

    Thomas_Ott
  • rick_daviesrick_davies MemberPosts:11Contributor II

    I could do this, where I know that two attributes have been equal contenders

    But how do I find these in the first instance? Or otherwise be confident they dont exist?

    Perhaps for binary data, some sort of Hamming distance measure of similarity between all attributes could point to where there is such a risk?

  • rick_daviesrick_davies MemberPosts:11Contributor II

    Would some sort of "prior to DT analysis" feature selection process that maximised diversity among the attributes be a best bet against such a risk?

  • rick_daviesrick_davies MemberPosts:11Contributor II

    If this is the case then BigML should signal to the users when two or more attributes perform equally well, so they can explore what happens when one of these is deleted from the data set

Sign InorRegisterto comment.