"Bug in MinimalEntropyParitioning?"

Legacy User (Member, Posts: 0, Newbie)
edited May 2019 in Help
Hello everybody,

I get strange results when I apply MinimalEntropyPartitioning to some datasets and wonder whether this is due to a bug in the implementation.

Let me explain the problem: I have a dataset with one attribute ("X") and one label with two possible values.
There are 6 possible values for X, 1 to 6. In total, I have 1116 rows, with the following target label distributions:

X-value   #negatives   #positives   #rows
1.0       124          62           186
2.0       124          62           186
3.0       0            186          186
4.0       0            186          186
5.0       124          62           186
6.0       124          62           186

Now of course I would expect a discretization into [-infty, 2], ]2, 4], ]4, infty], each range containing 372 rows. Instead, I get:

range1 [-∞ - 2] (372), range2 [2 - 5] (558), range3 [5 - ∞] (186)

It seems like there is a bug in the operator that does not correctly distinguish open and closed interval limits.
Does anybody know of a solution or a workaround?
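
For what it's worth, here is a small stand-alone check (plain Java, not RapidMiner code; the class and method names are made up for illustration) that computes the weighted class entropy of both binnings from the counts above. The split at 2 and 4 gives the clearly lower value, so that is the partitioning an entropy-based discretization should prefer:

import static java.lang.Math.log;

// Illustration only: weighted class entropy of a binning, given the
// (#negatives, #positives) counts per bin from the table above.
public class EntropyCheck {

    static double weightedEntropy(int[][] binCounts) {
        int total = 0;
        for (int[] bin : binCounts) {
            total += bin[0] + bin[1];
        }
        double weighted = 0.0;
        for (int[] bin : binCounts) {
            int binSize = bin[0] + bin[1];
            double entropy = 0.0;
            for (int count : bin) {
                if (count > 0) { // treat 0 * ld(0) as 0
                    double p = (double) count / binSize;
                    entropy -= p * log(p) / log(2);
                }
            }
            weighted += (double) binSize / total * entropy;
        }
        return weighted;
    }

    public static void main(String[] args) {
        // Expected binning {1,2}, {3,4}, {5,6}: (#negatives, #positives) per bin
        int[][] expected = { { 248, 124 }, { 0, 372 }, { 248, 124 } };
        // Binning actually returned: {1,2}, {3,4,5}, {6}
        int[][] returned = { { 248, 124 }, { 124, 434 }, { 124, 62 } };
        System.out.println("expected binning: " + weightedEntropy(expected)); // ~0.61
        System.out.println("returned binning: " + weightedEntropy(returned)); // ~0.84
    }
}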

Best,

Henrik

Answers

  • land (RapidMiner Certified Analyst, RapidMiner Certified Expert, Member, Posts: 2,531, Unicorn)
    Hi Henrik,
    this seems to be a problem indeed. Perhaps you could add a tiny little bit of noise to your values; resolving the non-uniqueness should work around your problem.
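
    Something along these lines, just as an illustration (the class, method, and variable names are made up; this is not the RapidMiner API):

    import java.util.Random;

    // Illustration only: add a tiny jitter to the attribute values so that the
    // candidate split points become unique. The noise level should be much
    // smaller than the spacing between the distinct X values (here: 1.0).
    public class JitterWorkaround {
        static double[] jitter(double[] values, double noiseLevel, long seed) {
            Random random = new Random(seed);
            double[] result = new double[values.length];
            for (int i = 0; i < values.length; i++) {
                result[i] = values[i] + (random.nextDouble() - 0.5) * noiseLevel;
            }
            return result;
        }
        // e.g. double[] jittered = jitter(xColumn, 1e-6, 1992);
    }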

    But to solve it in general I will take a look at the code.

    Greetings,
    Sebastian
  • Legacy User (Member, Posts: 0, Newbie)
    Hi Sebastian,

    thanks for the reply. I also thought that the problem could be diminished if I had more continuous values. But of course it would be best if you could fix the problem in general.

    Best,

    Henrik
  • Legacy User (Member, Posts: 0, Newbie)
    Hi,

    In the meantime I found the bug and fixed it. The bug is in the method
    private Double getMinEntropySplitpoint(LinkedList truncatedExamples, Attribute label)

    in the class MinimalEntropyDiscretization. It does not consider the case where a split results in 0 examples of one class. Here is the fix:


    // Calculate entropies; skip classes with frequency 0, since ld(0) is not finite.
    double entropy1 = 0.0d;
    for (int i = 0; i < label.getMapping().size(); i++) {
        if (frequencies1[i] > 0.0d) {
            entropy1 -= frequencies1[i] * MathFunctions.ld(frequencies1[i]);
        }
    }
    double entropy2 = 0.0d;
    for (int i = 0; i < label.getMapping().size(); i++) {
        if (frequencies2[i] > 0.0d) {
            entropy2 -= frequencies2[i] * MathFunctions.ld(frequencies2[i]);
        }
    }
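
    The guard matters because the logarithm of 0 is not finite, so for a split that puts all examples of one class on one side, the corresponding 0 * ld(0) term evaluates to NaN and poisons the whole entropy sum. That would explain why the pure split at 4 is never selected. With the usual convention 0 * ld(0) = 0, these terms simply drop out of the sum.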


    Best,

    Henrik
  • IngoRM (Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, Community Manager, RMResearcher, Member, University Professor, Posts: 1,751, RM Founder)
    Hi Henrik,

    thanks for sending this in! We will check and integrate your suggestion as soon as possible.

    Cheers,
    Ingo