"Bug in ModelApplier?"

wokon (Member, Posts: 8, Contributor II)
edited May 2019 in Help
Intro: First of all, I would like to congratulate the Rapid-I team on this great piece of software. The user interface and the philosophy behind the data and operator handling are well designed and intuitive, and the set of algorithms and visualizations is very rich.

However, I stumbled over quite a bug when I tried to solve the DMC'2007 challenge with RapidMiner as an exercise. It seems to me that something is going wrong with the ModelApplier when combining MetaCost with certain datasets.

Bug: ModelApplier seems to change the label headings in a dataset, and this leads to completely different classification errors on the same data.

How to reproduce: There are two small datasets, dmc2007_test_small.csv and dmc2007_test_sm_2.csv, attached to this post. Each dataset contains exactly the same set of 149 records; the only difference is that the order of the records is slightly rearranged: labels are N…NBN…NA… in dmc2007_test_small.csv and N…NAN…NB… in dmc2007_test_sm_2.csv (only two lines interchanged).
When you run dmc2007_test_small.csv through the following script, the number of B-labels changes completely (from 11 to 23) when you pass the data through ModelApplier (see attached screenshots in my_results.pdf; the classification error goes from 30% to 26%). This is not the case with dmc2007_test_sm_2.csv; there everything is OK. The script is:



<parameter key="filename" value="dmc2007_test_small.csv"/>
<parameter key="id_column" value="1"/>
<parameter key="label_column" value="22"/>

<parameter key="model_file" value="dmc2007-dt.mod"/>

<parameter key="N" value="1.0"/>
<parameter key="A" value="999.0"/>
<parameter key="B" value="1.0"/>

<parameter key="classification_error" value="true"/>
<parameter key="keep_example_set" value="true"/>


Remark: The model dmc2007-dt.mod can be trained using the script below. dmc2007_test_sm_2.csv has the same order of label appearance as in the training data set. Here is the training script:



<parameter key="filename" value="dmc2007_train_small.csv"/>
<parameter key="id_column" value="1"/>
<parameter key="label_column" value="22"/>

<parameter key="keep_example_set" value="true"/>

<parameter key="model_file" value="dmc2007-dt2.mod"/>

<parameter key="N" value="1.0"/>
<parameter key="A" value="999.0"/>
<parameter key="B" value="1.0"/>

<parameter key="classification_error" value="true"/>
<parameter key="keep_example_set" value="true"/>

<parameter key="cost_matrix" value="[0.0 0.0 0.0;1.0 -3.0 1.0;1.0 1.0 -6.0]"/>
<parameter key="keep_exampleSet" value="true"/>



This seems somewhat disturbing to me, since ModelApplier changes the incoming data ("label"), which it is expected only to read.
And of course things can get much worse: if we put a record with label "B" as the first record of the dataset (again the set is exactly the same), we get an apparent classification error of 86% (which is again due to the wrong labels; the predictions of the model are exactly the same).

Recently I found out: The bug is not dependent on the MetaCost part of the training model, the same thing happens if we just use a decision tree as model.

Another topic: it is not clear to me how the rows and columns in the cost matrix connect to the labels (at least I cannot see it in the documentation; I found out by trial and error that the order of occurrence in the training set probably defines the rows). It would be nice to have the cost matrix interface extended in such a way that it is clear what is true and what is predicted (row or column?) and which line corresponds to which label.
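To illustrate why the row/column convention matters, here is a minimal Python sketch (not RapidMiner code). It assumes, purely for illustration, that rows stand for the true class and columns for the predicted class, and that the class order is N, A, B; neither assumption is confirmed by the documentation. The names `cost_table` and `average_cost` and the example pairs are made up. Attaching explicit labels to the matrix shows that the same numbers give a different average cost as soon as the class order changes.

```python
# Cost matrix values taken from the training script above.
COST_MATRIX = [
    [0.0, 0.0, 0.0],
    [1.0, -3.0, 1.0],
    [1.0, 1.0, -6.0],
]

def cost_table(matrix, class_order):
    """Return {(true_label, predicted_label): cost}.

    Assumption: rows are true classes, columns are predicted classes.
    """
    return {
        (true_label, predicted_label): matrix[i][j]
        for i, true_label in enumerate(class_order)
        for j, predicted_label in enumerate(class_order)
    }

def average_cost(pairs, table):
    """Average cost over a list of (true_label, predicted_label) pairs."""
    return sum(table[pair] for pair in pairs) / len(pairs)

# Hypothetical (true, predicted) pairs, invented for this example.
pairs = [("N", "N"), ("A", "A"), ("A", "A"), ("B", "N")]

avg_nab = average_cost(pairs, cost_table(COST_MATRIX, ["N", "A", "B"]))
avg_nba = average_cost(pairs, cost_table(COST_MATRIX, ["N", "B", "A"]))

print(avg_nab)  # cost when the second row/column means "A"
print(avg_nba)  # cost when the second row/column means "B"
```

With an explicit label on every row and column, the interpretation of the matrix would no longer depend on the order in which the labels happen to occur in the data.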

Wish you all the best for your product, we are currently considering using it in some of our Master and Bachelor Data Mining courses.

Best regards

Wolfgang Konen

Institut für Informatik,
FH Köln - Campus Gummersbach
Steinmüllerallee 1
51643 Gummersbach
www.gm.fh-koeln.de/~konen

P.S.: Since no one responded to my bug description ID: 2686544 in SourceForge's Rapid-I bug tracker (March 13th), I am posting it here again. I tried to put it in a more concise form so that you can see the error better :). Just as a note: if you solve this, the bug with ID: 2686544 is also done. Hope to see some sort of reaction this time...

P.P.S.: If you do not maintain the bug tracker at SourceForge (which I can understand, you already have lots to do with the forum), it would perhaps be nice to put a note saying so in http://sourceforge.net/tracker/?group_id=114160&atid=667390 ;)

WK



[attachment deleted by admin]

Answers

  • haddock (Member, Posts: 849, Maven)
    Hi,

    I've downloaded and unzipped the data and re-arranged the XML to include the two test files, and to keep the generated model going throughout, like this.


    <parameter key="filename" value="C:\Users\CJFP\Downloads\dmc2007_data_small\dmc2007_train_small.csv"/>
    <parameter key="label_column" value="22"/>
    <parameter key="id_column" value="1"/>

    <parameter key="keep_example_set" value="true"/>

    <parameter key="keep_model" value="true"/>

    <parameter key="keep_exampleSet" value="true"/>
    <parameter key="cost_matrix" value="[0.0 0.0 0.0;1.0 -3.0 1.0;1.0 1.0 -6.0]"/>

    <parameter key="filename" value="C:\Users\CJFP\Downloads\dmc2007_data_small\dmc2007_test_sm_2.csv"/>
    <parameter key="label_column" value="22"/>
    <parameter key="id_column" value="1"/>

    <parameter key="keep_model" value="true"/>

    <parameter key="filename" value="C:\Users\CJFP\Downloads\dmc2007_data_small\dmc2007_test_small.csv"/>
    <parameter key="label_column" value="22"/>
    <parameter key="id_column" value="1"/>

    <parameter key="keep_model" value="true"/>




    But if I run this code I cannot replicate your problem, because the decision tree just produces "N" in all cases, for all datasets, "A" and "B" never show up. I must have made a stupid mistake somewhere, but I'm damned if I can see where it is. On the other hand the cost evaluator does seem to switch the As and Bs around in the subsequent test datasets, which can't be good.

    Perhaps others can have a bash and lighten my darkness...



  • wokon (Member, Posts: 8, Contributor II)
    I followed the lines of your script and could get a decision tree which is not always saying "N" (although I cannot spot anything wrong in your script either).
    However, I am using RapidMiner 4.3; perhaps this makes a difference.

    Anyhow, I give you the following code below (which is similar to your script)



    <parameter key="filename" value="C:\user\datasets\Vorlesungen\DMC-Cup\DMC2007\dmc2007_train_small.csv"/>
    <parameter key="id_column" value="1"/>
    <parameter key="label_column" value="22"/>

    <parameter key="keep_example_set" value="true"/>

    <parameter key="keep_model" value="true"/>

    <parameter key="classification_error" value="true"/>

    <parameter key="model_file" value="C:\Dokumente und Einstellungen\wolfgang\Eigene Dateien\rm_workspace\DMC2007-rm\dmc2007-dt2.mod"/>

    <parameter key="model_file" value="C:\Dokumente und Einstellungen\wolfgang\Eigene Dateien\rm_workspace\DMC2007-rm\dmc2007-dt2.mod"/>

    <parameter key="filename" value="C:\user\datasets\Vorlesungen\DMC-Cup\DMC2007\dmc2007_test_sm_3.csv"/>
    <parameter key="id_column" value="1"/>
    <parameter key="label_column" value="22"/>

    <parameter key="keep_model" value="true"/>

    <parameter key="classification_error" value="true"/>
    <parameter key="keep_example_set" value="true"/>

    <parameter key="filename" value="C:\user\datasets\Vorlesungen\DMC-Cup\DMC2007\dmc2007_test_small.csv"/>
    <parameter key="id_column" value="1"/>
    <parameter key="label_column" value="22"/>

    <parameter key="keep_model" value="true"/>

    <parameter key="classification_error" value="true"/>
    <parameter key="keep_example_set" value="true"/>

    It has the DT-building operators disabled; instead it reads the DT model from dmc2007-dt2.mod, which I give you below as an attachment in the ZIP (along with dmc2007_test_sm_3.csv). With this you should be able to reproduce first a classification error of 86.5% and then a classification error of 24.1%, and you can see that the labels for "true B" and "true N" are interchanged.

    This leaves us with the bug in its cleanest form...

    Regards
    WK

    [attachment deleted by admin]
  • haddock (Member, Posts: 849, Maven)
    Hi Wolfgang,

    Yep, I get the same. For those of us who optimise classifications this is pretty scary stuff, but many thanks for bringing it to our attention.

    If it is related to the issues mentioned in http://rapid-i.com/rapidforum/index.php/topic,782.0.html then it is high time it was put to bed.

    Thanks again.
  • steffen (Member, Posts: 347, Maven)
    Hello Wolfgang, Haddock

    Same here. So I tried to load the files separately and saved them in RapidMiner format (that means *.aml, *.dat). As you can clearly see, the labels are interchanged because the internal mapping has changed:

    from testsmall:
    name = "COUPON"
    sourcecol = "22"
    valuetype = "nominal">
    N
    B
    A


    from trainsmall:
    name = "COUPON"
    sourcecol = "22"
    valuetype = "nominal">
    N
    A
    B

    What does this mean? The standard RapidMiner format for ExampleSets stores all data in an array of numbers. Nominal values are stored using a mapping, which maps every internal number to the real (external) string value.
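The effect can be sketched in a few lines of Python. This is a simplified model of the behaviour described in this thread, not RapidMiner's actual implementation; the helper `build_mapping` and the short label lists are made up for illustration. Each nominal value gets an index in order of first appearance, so two files with the same records in a different order end up with different mappings, and indices written under one mapping but read under the other come back as the wrong labels.

```python
def build_mapping(labels):
    """Assign each nominal value an index in order of first appearance."""
    mapping = {}
    for value in labels:
        if value not in mapping:
            mapping[value] = len(mapping)
    return mapping

# Same multiset of labels, but "A" and "B" first appear in a different order.
train_labels = ["N", "N", "A", "N", "B"]   # order of appearance: N, A, B
test_labels = ["N", "N", "B", "N", "A"]    # order of appearance: N, B, A

train_map = build_mapping(train_labels)
test_map = build_mapping(test_labels)

# The test labels are stored as numbers using the test file's own mapping ...
stored = [test_map[value] for value in test_labels]

# ... but are interpreted with the mapping that came with the trained model.
index_to_label = {index: value for value, index in train_map.items()}
interpreted = [index_to_label[index] for index in stored]

print(test_labels)   # what the file really contains
print(interpreted)   # what shows up after the mappings are mixed
```

In this toy case the predictions would not change at all; only the "true" labels are read back incorrectly, which is why the classification error jumps.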

    So... I have changed the sequence manually in the aml files... which results in a constant "quality" of 22,82%.

    Here is the process:
    • Run the complete process to see the interchanged labels and the jump in performance
    • Deactivate the first operator chain by disabling it, then change the sequence manually in the stored .aml files
    • Rerun the process to see what I have seen




    <parameter key="filename" value="dmc2007_train_small.csv"/>
    <parameter key="label_column" value="22"/>
    <parameter key="id_column" value="1"/>

    <parameter key="example_set_file" value="dmc2007_train_small.dat"/>
    <parameter key="attribute_description_file" value="dmc2007_train_small.aml"/>

    <parameter key="io_object" value="ExampleSet"/>

    <parameter key="filename" value="dmc2007_test_sm_3.csv"/>
    <parameter key="label_column" value="22"/>
    <parameter key="id_column" value="1"/>

    <parameter key="example_set_file" value="dmc2007_test_sm_3.dat"/>
    <parameter key="attribute_description_file" value="dmc2007_test_sm_3.aml"/>

    <parameter key="io_object" value="ExampleSet"/>

    <parameter key="filename" value="dmc2007_test_small.csv"/>
    <parameter key="label_column" value="22"/>
    <parameter key="id_column" value="1"/>

    <parameter key="example_set_file" value="dmc2007_test_small.dat"/>
    <parameter key="attribute_description_file" value="dmc2007_test_small.aml"/>

    <parameter key="io_object" value="ExampleSet"/>

    <parameter key="attributes" value="dmc2007_train_small.aml"/>

    <parameter key="keep_example_set" value="true"/>

    <parameter key="keep_model" value="true"/>

    <parameter key="classification_error" value="true"/>

    <parameter key="model_file" value="dmc2007-dt2.mod"/>

    <parameter key="model_file" value="dmc2007-dt2.mod"/>

    <parameter key="attributes" value="dmc2007_test_sm_3.aml"/>

    <parameter key="keep_model" value="true"/>

    <parameter key="keep_example_set" value="true"/>
    <parameter key="classification_error" value="true"/>

    <parameter key="attributes" value="dmc2007_test_small.aml"/>

    <parameter key="keep_model" value="true"/>

    <parameter key="keep_example_set" value="true"/>
    <parameter key="classification_error" value="true"/>





    Conclusion
    It is not a bug in ModelApplier; it is a bug in the way RM stores the data internally. Normally, the data storage should not affect the usage of the data (if the values are correctly retrieved). I guess this is the same problem as here (http://rapid-i.com/rapidforum/index.php/topic,281.0.html), which has not been fixed yet.

    Workaround
    Store the data in the RM format and adjust the critical parts of the *.aml - file manually.

    kind regards,

    Steffen

  • wokon (Member, Posts: 8, Contributor II)
    Hi Haddock,
    thanks for confirming my results.

    Hello Steffen,
    thanks for your reply, which came just minutes before I was about to post a similar workaround I found in the last hour. Yep, the workaround works! When I use the operator ExampleSource with appropriate AML and DAT files instead of CSVExampleSource, and when I take care that the order of the nominal label values is the same in each AML file, then I get the same results with each rearranged dataset.

    However, there is still a scary trap for a newcomer to RapidMiner. :-\

    But thanks again for your fast reply.

    Wolfgang
  • IngoRM (Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, Community Manager, RMResearcher, Member, University Professor; Posts: 1,751; RM Founder)
    Hi,

    "However, there is still a scary trap for a newcomer to RapidMiner."
    Yes, it really is. This is a basic problem of meta data management, which comes with data sources that do not provide meta data (actually all except the RM .aml files and ARFF).

    The good news is that we are currently working on a new way of meta data storage and handling for RM 5.0, which will allow you to (re-)use meta data stored together with the data in a data repository. Each operator will then transform the meta data accordingly, which has a nice side effect: in future versions you will also be able to see what the meta data looks like at almost arbitrary places in the process without running it...

    The bad news: until then, you have to ensure the correctness of the meta data yourself, which can easily be done by using the same .aml file for corresponding data sets (just replace the path to the data file in the header).
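The idea behind reusing one .aml file can be sketched in Python. This is only an analogy under stated assumptions, not RapidMiner code: the constant `CLASSES` and the helpers `encode` and `decode` are hypothetical, and the short label lists are made up. The point is that the value-to-index mapping is declared once and shared by every data set, instead of being derived from the order in which values happen to appear in each file.

```python
# One fixed class order, playing the role of the shared attribute description.
CLASSES = ["N", "A", "B"]

def encode(labels, classes=CLASSES):
    """Map label strings to indices using the shared, fixed class order."""
    index_of = {value: index for index, value in enumerate(classes)}
    # A label that is not declared in `classes` raises KeyError here,
    # which is preferable to silently assigning it a new index.
    return [index_of[value] for value in labels]

def decode(codes, classes=CLASSES):
    """Map indices back to label strings using the same class order."""
    return [classes[code] for code in codes]

# Two data sets in which "A" and "B" first appear in a different order.
first_set = ["N", "N", "B", "N", "A"]
second_set = ["N", "N", "A", "N", "B"]

first_round_trip = decode(encode(first_set))
second_round_trip = decode(encode(second_set))

print(first_round_trip)
print(second_round_trip)
```

Because both data sets go through the same mapping, the labels survive the round trip regardless of record order.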

    Cheers,
    Ingo
  • wotsiznamiz (Member, Posts: 9, Contributor II)
    I am still having issues -- I have used the following code, and before I run it, I confirm that the AML file is correctly representing my data. And yet it still somehow messes up my labels!

    By the way, I'm also getting this error msg:
    [Warning] Kernel Model: The order of attributes is not equal for the training and the application example set. This might lead to problems for some models.

    Here's my code - please help!






    <parameter key="logverbosity" value="init"/>
    <parameter key="logfile" value="OUT_%{process_name}_RootLog0.log"/>
    <parameter key="resultfile" value="OUT_%{process_name}_RootResults0.res"/>
    <parameter key="random_seed" value="2001"/>
    <parameter key="encoding" value="SYSTEM"/>

    <parameter key="attributes" value="C:\Desktop\RapidMiner\_20090404_NPScs_ALL_Dec08_KWA\OUT__20090404_csNPS_Dec08_WholeShebang_VAL_AttDescFile_ModVal.aml"/>
    <parameter key="sample_ratio" value="1.0"/>
    <parameter key="sample_size" value="-1"/>
    <parameter key="permutate" value="false"/>
    <parameter key="decimal_point_character" value="."/>
    <parameter key="column_separators" value=",\s*|;\s*|\s+"/>
    <parameter key="use_comment_characters" value="true"/>
    <parameter key="comment_chars" value="#"/>
    <parameter key="use_quotes" value="false"/>
    <parameter key="quote_character" value="&quot;"/>
    <parameter key="quoting_escape_character" value="\"/>
    <parameter key="trim_lines" value="false"/>
    <parameter key="datamanagement" value="double_array"/>
    <parameter key="local_random_seed" value="-1"/>

    <parameter key="model_file" value="C:\Desktop\RapidMiner\_20090404_NPScs_ALL_Dec08_KWA\OUT__20090404_csNPS_Dec08_WholeShebang_Model_ModDevOutput2.mod"/>

    <parameter key="keep_model" value="true"/>

    <parameter key="create_view" value="false"/>

    <parameter key="example_set_file" value="OUT_%{process_name}_ExampleSetFile_ModValOutput_LiftCurve.dat"/>
    <parameter key="format" value="special_format"/>
    <parameter key="special_format" value="$i $l $p $d"/>
    <parameter key="fraction_digits" value="-1"/>
    <parameter key="quote_nominal_values" value="true"/>
    <parameter key="zipped" value="false"/>
    <parameter key="overwrite_mode" value="overwrite"/>




  • wotsiznamiz (Member, Posts: 9, Contributor II)
    thanks to ingo, wolfgang, and others for their help - I finally figured out as a newbie how to do the workaround, and I have a validated model as a result.

    I knew I was going to love Rapidminer even more when i could get the model validated -- woo hoo!
  • IngoRM (Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, Community Manager, RMResearcher, Member, University Professor; Posts: 1,751; RM Founder)
    Hi,

    "I knew I was going to love Rapidminer even more when i could get the model validated -- woo hoo!"
    Welcome to the club :D

    Have fun and all the best,
    Ingo
  • haddock (Member, Posts: 849, Maven)
    We all agree that a label mangling model applier is as useful as a fart in a space-suit, the question is what to do about it, not least because of the volume of posts it might generate. When this issue surfaced I was convinced that I'd read something relevant about label values, and have since been searching for it in the darker recesses of what is left of my mind.

    Wait for it... (small drum roll)

    Warning: As the java ResultSetMetaData interface does not provide information
    about the possible values of nominal attributes, the internal indices the
    nominal values are mapped to will depend on the ordering they appear in the
    table. This may cause problems only when processes are split up into a training
    process and an application or testing process. For learning schemes which
    are capable of handling nominal attributes, this is not a problem. If a learning
    scheme like a SVM is used with nominal data, RapidMiner pretends that
    nominal attributes are numerical and uses indices for the nominal values as their
    numerical value. A SVM may perform well if there are only two possible values.
    If a test set is read in another process, the nominal values may be assigned different
    indices, and hence the SVM trained is useless. This is not a problem for
    label attributes, since the classes can be specified using the classes parameter
    and hence, all learning schemes intended to use with nominal data are safe to
    use.

    Rapidminer-4.3-tutorial.pdf page 103.

    So there we have it, we were all warned. Moving swiftly along, it now turns out that this problem can also be avoided completely if you use a database example set and fill in the blanks. Here's a rework of Wokon's example, which returns exactly what it should, namely 23.49% in both cases under 4.4 Enterprise.

    <parameter key="logverbosity" value="all"/>
    <parameter key="logfile" value="C:\Users\CJFP\Documents\rm_workspace\prof2.log"/>

    <parameter key="filename" value="C:\Users\CJFP\Documents\dmc2007_data_small-2\dmc2007_test_sm_2.csv"/>
    <parameter key="label_column" value="22"/>
    <parameter key="id_column" value="1"/>

    <parameter key="database_system" value="Microsoft SQL Server (Microsoft)"/>
    <parameter key="database_url" value="jdbc:sqlserver://localhost:1433;databaseName=Tradestation"/>
    <parameter key="username" value="sa"/>
    <parameter key="password" value="wL8/6ZO7YrXKa8XgQd4v7g=="/>
    <parameter key="table_name" value="Table1"/>
    <parameter key="overwrite_mode" value="overwrite"/>

    <parameter key="database_system" value="Microsoft SQL Server (Microsoft)"/>
    <parameter key="database_url" value="jdbc:sqlserver://localhost:1433;databaseName=Tradestation"/>
    <parameter key="username" value="sa"/>
    <parameter key="password" value="wL8/6ZO7YrXKa8XgQd4v7g=="/>
    <parameter key="table_name" value="Table1"/>
    <parameter key="label_attribute" value="COUPON"/>
    <parameter key="id_attribute" value="ID"/>
    <parameter key="classes" value="N A B"/>

    <parameter key="model_file" value="C:\Users\CJFP\Documents\dmc2007_data_small-2\dmc2007-dt2.mod"/>

    <parameter key="classification_error" value="true"/>

    <parameter key="filename" value="C:\Users\CJFP\Documents\dmc2007_data_small-2\dmc2007_test_small.csv"/>
    <parameter key="label_column" value="22"/>
    <parameter key="id_column" value="1"/>

    <parameter key="database_system" value="Microsoft SQL Server (Microsoft)"/>
    <parameter key="database_url" value="jdbc:sqlserver://localhost:1433;databaseName=Tradestation"/>
    <parameter key="username" value="sa"/>
    <parameter key="password" value="wL8/6ZO7YrXKa8XgQd4v7g=="/>
    <parameter key="table_name" value="Table2"/>
    <parameter key="overwrite_mode" value="overwrite"/>

    <parameter key="database_system" value="Microsoft SQL Server (Microsoft)"/>
    <parameter key="database_url" value="jdbc:sqlserver://localhost:1433;databaseName=Tradestation"/>
    <parameter key="username" value="sa"/>
    <parameter key="password" value="wL8/6ZO7YrXKa8XgQd4v7g=="/>
    <parameter key="table_name" value="Table2"/>
    <parameter key="label_attribute" value="COUPON"/>
    <parameter key="id_attribute" value="ID"/>
    <parameter key="classes" value="N A B"/>

    <parameter key="model_file" value="C:\Users\CJFP\Documents\dmc2007_data_small-2\dmc2007-dt2.mod"/>

    <parameter key="keep_model" value="true"/>

    <parameter key="keep_example_set" value="true"/>
    <parameter key="classification_error" value="true"/>



    The thing that saves the day is the classes parameter; having a similar, and probably required, parameter on all the file input operators would put this one to bed and save us from the "me too" bug posts.