A bug in the Normalize operator and in Bugzilla

marcin_blachnik (Member, Posts: 61, Guru)
edited November 2018 in Help
The first bug is related to Bugzilla: when I tried to file a bug, I received the following error:
Software error:

Cannot determine local time zone
For help, please send mail to the webmaster ([email protected]), giving this error message and the time and date of the error.
So I decided to submit it here, as it looks like an important bug. It is related to the Normalize operator and the independence of its output ports.
The process below shows the problem. It simply loads the data and performs normalization. Only the data received from res0 should be normalized, and the data received from res1 should contain the original values, but in fact both outputs are normalized. The funny thing is that when I connect the "ori" output port to the process output, everything works fine.
The process:











<parameter key="repository_entry" value="//Samples/data/Iris"/>
















Best

Marcin

Answers

  • JEdward (RapidMiner Certified Analyst, RapidMiner Certified Expert, Member, Posts: 578, Unicorn)
    That's due to the way RapidMiner handles data in order to use less memory.
    Usually, when you build a RapidMiner process, each operator is applied as a rule to the data line (for example 'Generate Attributes'). This makes running the process efficient, as only one set of the data needs to be stored in memory. Multiply doesn't create a new copy of the underlying data; it just separates the streams.

    Some operators, such as Normalize and Obfuscate, change the underlying data in memory. To resolve this, use the Materialize Data operator to create a new ExampleSet in memory. See below for your example process with the Materialize Data operator added.
    This is pretty much what the "ori" output port does in the Normalize operator.

    You can find an in-depth explanation of how this works in the How to Extend RapidMiner guides, but I agree that RapidMiner itself could make this clearer.
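
    The view-versus-copy behaviour described above can be illustrated with a small toy model. This is only a sketch under stated assumptions, not RapidMiner's actual implementation: the classes `ExampleTable` and `ExampleSet` and the functions `multiply`, `materialize` and `normalize_in_place` are hypothetical Python stand-ins, and range (min-max) normalization is used purely to keep the numbers simple.

```python
import copy


class ExampleTable:
    """Hypothetical stand-in for the shared in-memory data store."""

    def __init__(self, columns):
        self.columns = columns  # dict: attribute name -> list of floats


class ExampleSet:
    """Hypothetical view onto an ExampleTable; it holds no data itself."""

    def __init__(self, table):
        self.table = table

    def values(self, name):
        return self.table.columns[name]


def multiply(example_set):
    # Two views over the SAME table: the streams are separated, the data is not copied.
    return ExampleSet(example_set.table), ExampleSet(example_set.table)


def materialize(example_set):
    # A fresh deep copy of the underlying table, so later changes stay local.
    return ExampleSet(copy.deepcopy(example_set.table))


def normalize_in_place(example_set, name):
    # Range normalization written straight into the shared column.
    column = example_set.table.columns[name]
    lo, hi = min(column), max(column)
    column[:] = [(x - lo) / (hi - lo) for x in column]


# Case 1: normalize one branch directly; the other branch sees the change.
res0, res1 = multiply(ExampleSet(ExampleTable({"a1": [2.0, 4.0, 6.0]})))
normalize_in_place(res0, "a1")
shared_result = list(res1.values("a1"))

# Case 2: materialize the branch first; the other branch keeps the original data.
res0, res1 = multiply(ExampleSet(ExampleTable({"a1": [2.0, 4.0, 6.0]})))
normalize_in_place(materialize(res0), "a1")
safe_result = list(res1.values("a1"))

print(shared_result)
print(safe_result)
```

    In this toy model the only difference between the two cases is the extra copy made before the in-place write, which is the role the Materialize Data operator plays in the workaround.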










    <parameter key="repository_entry" value="//Samples/data/Iris"/>


















  • marcin_blachnik (Member, Posts: 61, Guru)
    Thank you for your answer, but it is a bug.
    My process (with independent outputs) works perfectly in RapidMiner 5 and RapidMiner 6; it just doesn't work in RapidMiner 7.
    There is a rule that whenever an operator modifies data, it creates a new attribute in the exampleTable and switches the exampleSet so that it references the new, modified attribute (a view switch).
    My guess is that the developer didn't clone the exampleSet before applying the modifications, so all of the modifications are propagated to every copy of the dataset.
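
    The rule described above (add a new column, then switch the view) can be sketched as follows. This is a hypothetical toy model, not RapidMiner code: the class and function names, the `mapping` dictionary and the generated column key are all assumptions made for illustration, and range (min-max) normalization is used only to keep the values easy to check.

```python
class ExampleTable:
    """Hypothetical shared column store; columns are added, never overwritten."""

    def __init__(self, columns):
        self.columns = dict(columns)

    def add_column(self, key, values):
        self.columns[key] = list(values)


class ExampleSet:
    """Hypothetical view: maps attribute names to column keys in the table."""

    def __init__(self, table, mapping):
        self.table = table
        self.mapping = dict(mapping)  # private to this view

    def values(self, attribute):
        return self.table.columns[self.mapping[attribute]]

    def clone(self):
        # Clones share the table but get their own attribute mapping.
        return ExampleSet(self.table, self.mapping)


def normalize_as_view(example_set, attribute):
    # Clone first, so the caller's view is never touched.
    result = example_set.clone()
    original = example_set.values(attribute)
    lo, hi = min(original), max(original)
    # Store the normalized values in a NEW column (assumed naming scheme).
    new_key = f"{attribute}__normalized_{len(result.table.columns)}"
    result.table.add_column(new_key, [(x - lo) / (hi - lo) for x in original])
    # Switch only the cloned view to the new column.
    result.mapping[attribute] = new_key
    return result


table = ExampleTable({"a1": [2.0, 4.0, 6.0]})
original_view = ExampleSet(table, {"a1": "a1"})
normalized_view = normalize_as_view(original_view, "a1")

print(normalized_view.values("a1"))
print(original_view.values("a1"))
```

    Under these assumptions, skipping the `clone()` call (or writing into the existing column instead of adding a new one) would make the change visible through every view that shares the table, which matches the symptom reported in this thread.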
  • MartinLiebig (Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor, Posts: 3,404, RM Data Scientist)
    Hey,

    do you know about Materialize Data?

    ~Martin
    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany
  • marcin_blachnik (Member, Posts: 61, Guru)
    Well,

    I know what Materialize Data is and how to solve this problem for my own purposes,
    but please explain the following:
    1) In RM 7.0 this issue appears, while in RM 6.5 everything works OK. Does this mean that RM 7 is not compatible with RM 6.5 and that processes have to be rechecked?
    2) There is no line in the help that mentions this change or the requirement of materialization.
    3) Why does the Normalize operator work differently depending on how its output ports are connected? It works OK when you connect the "ori" output somewhere: then the data on the input side is not modified (in other words, the data on the output of the "Multiply" operator is not normalized). But when the "ori" output is not connected, Normalize also modifies the ExampleSet on the input side.

    To show what I'm talking about in 3) take a look at this process:










    <parameter key="repository_entry" value="//Samples/data/Iris"/>




















    Now everything works perfectly and the tree has the correct values on its edges, but when you remove the topmost process output (the connection between "ori" and the process output) you obtain a completely different tree. Even funnier, this issue doesn't appear on the Golf dataset; it is only a problem with Iris (both taken from the "Samples" repository). I guess this is because in Iris all regular features are numeric.