Decision tree does not fit the ROC

cam8 Member Posts: 2 Contributor I
edited July 2020 in Help

Hello,

I am currently working on a data set for a decision tree solution inspired by your video "finding the right model".

The ROC comparison on the data set shows a good result for the decision tree and the random forest.

But when I build a simple decision tree model, I get a very low recall of 1.12% for the true positives.

With Optimize Parameters I can increase it to 47%.

But I have seen in the documentation that a true-positive recall of 99% is possible.

Is something wrong with my process?

The XML code:























<operator activated="true" class="numerical_to_binominal" compatibility="8.2.000" expanded="true" height="82" name="Numerical to Binominal" width="90" x="313" y="34">




































<connect from_port="training set" to_op="Decision Tree (2)" to_port="training set"/>
<connect from_op="Decision Tree (2)" from_port="model" to_port="model"/>
















<connect from_port="model" to_op="Apply Model (4)" to_port="model"/>
<connect from_port="test set" to_op="Apply Model (4)" to_port="unlabelled data"/>
<connect from_op="Apply Model (4)" from_port="labelled data" to_op="Performance (4)" to_port="labelled data"/>
<connect from_op="Performance (4)" from_port="performance" to_port="performance 1"/>








<connect from_port="input 1" to_op="DT Cross Validation (2)" to_port="example set"/>
<connect from_op="DT Cross Validation (2)" from_port="model" to_port="model"/>
<connect from_op="DT Cross Validation (2)" from_port="performance 1" to_port="performance"/>










<connect from_port="training set" to_op="Decision Tree" to_port="training set"/>
<connect from_op="Decision Tree" from_port="model" to_port="model"/>













<connect from_port="model" to_op="Apply Model" to_port="model"/>
<connect from_port="test set" to_op="Apply Model" to_port="unlabelled data"/>
<connect from_op="Apply Model" from_port="labelled data" to_op="Performance" to_port="labelled data"/>
<connect from_op="Performance" from_port="performance" to_port="performance 1"/>

















<connect from_port="train 1" to_op="Decision Tree (5)" to_port="training set"/>
<connect from_port="train 2" to_op="Naive Bayes (3)" to_port="training set"/>
<connect from_port="train 3" to_op="Rule Induction (3)" to_port="training set"/>
<connect from_port="train 4" to_op="Random Tree (3)" to_port="training set"/>
<connect from_port="train 5" to_op="k-NN (5)" to_port="training set"/>
<connect from_op="Decision Tree (5)" from_port="model" to_port="model 1"/>
<connect from_op="Naive Bayes (3)" from_port="model" to_port="model 2"/>
<connect from_op="Rule Induction (3)" from_port="model" to_port="model 3"/>
<connect from_op="Random Tree (3)" from_port="model" to_port="model 4"/>
<connect from_op="k-NN (5)" from_port="model" to_port="model 5"/>














<connect from_op="Retrieve transfusion.dataset1" from_port="output" to_op="Rename" to_port="example set input"/>
<connect from_op="Rename" from_port="example set output" to_op="Numerical to Binominal" to_port="example set input"/>
<connect from_op="Numerical to Binominal" from_port="example set output" to_op="Set Role" to_port="example set input"/>
<connect from_op="Set Role" from_port="example set output" to_op="Select Attributes" to_port="example set input"/>
<connect from_op="Select Attributes" from_port="example set output" to_op="Multiply (2)" to_port="input"/>
<connect from_op="Multiply (2)" from_port="output 1" to_op="Compare ROCs (3)" to_port="example set"/>
<connect from_op="Multiply (2)" from_port="output 2" to_op="Cross Validation" to_port="example set"/>
<connect from_op="Multiply (2)" from_port="output 3" to_op="DT Optimize Parameters" to_port="input 1"/>
<connect from_op="DT Optimize Parameters" from_port="performance" to_port="result 4"/>
<connect from_op="DT Optimize Parameters" from_port="model" to_port="result 5"/>
<connect from_op="Cross Validation" from_port="model" to_port="result 2"/>
<connect from_op="Cross Validation" from_port="performance 1" to_port="result 3"/>
<connect from_op="Compare ROCs (3)" from_port="rocComparison" to_port="result 1"/>











[Attachment: roc.png, 26.5K]

Best Answer

  • lionelderkrikor Moderator, RapidMiner Certified Analyst, Member Posts: 1,195 Unicorn
    Solution Accepted

    Hi @cam8,

    There is nothing wrong with your process.

    You have an imbalanced dataset (ratio true/false = 178/570). That generally leads to poor recall.
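
To make the effect of the imbalance concrete, here is a minimal Python sketch (not part of the RapidMiner process) of how accuracy and recall diverge when a model predicts "false" for almost everything. The class counts 178/570 come from the ratio above. The confusion-matrix values are hypothetical, picked so that the recall lands near the 1.12% reported in the question; the real confusion matrix is not shown in the thread.

```python
# Class counts quoted in the post: 178 "true" vs 570 "false" examples.
N_TRUE, N_FALSE = 178, 570


def recall(tp: int, fn: int) -> float:
    """Recall of the positive class: TP / (TP + FN)."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Share of all examples that were classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)


# Hypothetical model that almost always predicts the majority class "false".
# Finding only 2 of the 178 positives is an illustrative assumption.
tp, fn = 2, N_TRUE - 2
tn, fp = N_FALSE, 0

print(f"share of positives: {N_TRUE / (N_TRUE + N_FALSE):.1%}")
print(f"accuracy: {accuracy(tp, tn, fp, fn):.1%}")
print(f"recall:   {recall(tp, fn):.1%}")
```

With roughly 24% positives, such a model can reach about 76% accuracy while finding almost none of the true cases, which is why recall rather than accuracy is the number to watch here.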

    To increase the recall, you can sample your data. After sampling, I obtain a recall of 60% with a decision tree.
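
As a rough illustration of what sampling does, the sketch below randomly undersamples the majority class until both classes have the same size. This is one common way to balance data and only an assumption about the setup, since the exact Sample operator settings are not visible in the thread. The function name `undersample` and the placeholder rows are hypothetical.

```python
import random
from collections import Counter


def undersample(rows, label_of, seed=42):
    """Randomly drop examples so every class keeps as many rows as the rarest class."""
    rng = random.Random(seed)
    by_class = {}
    for row in rows:
        by_class.setdefault(label_of(row), []).append(row)
    n_min = min(len(group) for group in by_class.values())
    balanced = []
    for group in by_class.values():
        balanced.extend(rng.sample(group, n_min))
    rng.shuffle(balanced)
    return balanced


# Placeholder rows with the class counts from the post (178 true / 570 false).
rows = [{"id": i, "label": i < 178} for i in range(748)]
balanced = undersample(rows, label_of=lambda r: r["label"])
counts = Counter(r["label"] for r in balanced)
print(counts)
```

In the posted process the Sample operator sits on the training port inside the cross validation, so only the training folds are rebalanced and the test folds keep the original class ratio.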

    However, if your goal is to find the maximum performance, a decision tree is not the best choice: a better choice is the k-NN model. After training this model, you can obtain a recall > 70% (with 1 < k < 10), and you can obtain a theoretical recall > 90% by optimizing the value of k. But use this with caution: this last case corresponds to overfitting. Your model does indeed perform well on your training dataset, but it will perform poorly on future unseen data.
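
The overfitting warning can be illustrated with a small pure-Python k-NN. This is a sketch on synthetic, heavily overlapping data, not the transfusion data set, and the helper names (`knn_predict`, `recall`, `make_points`) are made up for the example. It shows one way an inflated recall can arise: with k = 1, every training point is its own nearest neighbour, so recall measured on the training data is perfect, while recall on fresh data is much lower.

```python
import random


def knn_predict(train, query, k):
    """Majority vote among the k training points closest to `query`."""
    nearest = sorted(
        train, key=lambda p: (p[0] - query[0]) ** 2 + (p[1] - query[1]) ** 2
    )[:k]
    votes = sum(1 for _, _, label in nearest if label)
    return votes * 2 > k


def recall(train, test, k):
    """Recall of the positive class on `test` for a k-NN model built from `train`."""
    positives = [p for p in test if p[2]]
    hits = sum(1 for p in positives if knn_predict(train, (p[0], p[1]), k))
    return hits / len(positives)


def make_points(rng, label, n, shift):
    """Draw n two-dimensional Gaussian points (x, y, label) centred at (shift, shift)."""
    return [(rng.gauss(shift, 1.0), rng.gauss(shift, 1.0), label) for _ in range(n)]


rng = random.Random(0)
# Hypothetical data: the minority class overlaps strongly with the majority class.
train = make_points(rng, True, 60, 0.5) + make_points(rng, False, 190, 0.0)
test = make_points(rng, True, 60, 0.5) + make_points(rng, False, 190, 0.0)

for k in (1, 5, 9):
    print(
        f"k={k}",
        f"train recall={recall(train, train, k):.2f}",
        f"test recall={recall(train, test, k):.2f}",
    )
```

Comparing the two columns for each k is the point of the sketch: a value of k chosen only because it maximises recall on data the model has already seen says little about unseen data.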

    I hope it helps,

    Regards,

    Lionel

    NB : The process:



































    <operator activated="true" class="numerical_to_binominal" compatibility="8.2.000" expanded="true" height="82" name="Numerical to Binominal" width="90" x="313" y="34">

















































    <connect from_port="training set" to_op="Sample (2)" to_port="example set input"/>
    <connect from_op="Sample (2)" from_port="example set output" to_op="Decision Tree (2)" to_port="training set"/>
    <connect from_op="Decision Tree (2)" from_port="model" to_port="model"/>
















    <connect from_port="model" to_op="Apply Model (4)" to_port="model"/>
    <connect from_port="test set" to_op="Apply Model (4)" to_port="unlabelled data"/>
    <connect from_op="Apply Model (4)" from_port="labelled data" to_op="Performance (4)" to_port="labelled data"/>
    <connect from_op="Performance (4)" from_port="performance" to_port="performance 1"/>








    <connect from_port="input 1" to_op="Extract Macro (2)" to_port="example set"/>
    <connect from_op="Extract Macro (2)" from_port="example set" to_op="DT Cross Validation (2)" to_port="example set"/>
    <connect from_op="DT Cross Validation (2)" from_port="model" to_port="model"/>
    <connect from_op="DT Cross Validation (2)" from_port="performance 1" to_port="performance"/>




































    <connect from_port="training set" to_op="Sample (3)" to_port="example set input"/>
    <connect from_op="Sample (3)" from_port="example set output" to_op="k-NN (2)" to_port="training set"/>
    <connect from_op="k-NN (2)" from_port="model" to_port="model"/>
















    <connect from_port="model" to_op="Apply Model (2)" to_port="model"/>
    <connect from_port="test set" to_op="Apply Model (2)" to_port="unlabelled data"/>
    <connect from_op="Apply Model (2)" from_port="labelled data" to_op="Performance (2)" to_port="labelled data"/>
    <connect from_op="Performance (2)" from_port="performance" to_port="performance 1"/>








    <connect from_port="input 1" to_op="Extract Macro (3)" to_port="example set"/>
    <connect from_op="Extract Macro (3)" from_port="example set" to_op="DT Cross Validation (3)" to_port="example set"/>
    <connect from_op="DT Cross Validation (3)" from_port="model" to_port="model"/>
    <connect from_op="DT Cross Validation (3)" from_port="performance 1" to_port="performance"/>
















    <connect from_port="train 1" to_op="Decision Tree (5)" to_port="training set"/>
    <connect from_port="train 2" to_op="Naive Bayes (3)" to_port="training set"/>
    <connect from_port="train 3" to_op="Rule Induction (3)" to_port="training set"/>
    <connect from_port="train 4" to_op="Random Tree (3)" to_port="training set"/>
    <connect from_port="train 5" to_op="k-NN (5)" to_port="training set"/>
    <connect from_op="Decision Tree (5)" from_port="model" to_port="model 1"/>
    <connect from_op="Naive Bayes (3)" from_port="model" to_port="model 2"/>
    <connect from_op="Rule Induction (3)" from_port="model" to_port="model 3"/>
    <connect from_op="Random Tree (3)" from_port="model" to_port="model 4"/>
    <connect from_op="k-NN (5)" from_port="model" to_port="model 5"/>



































    <connect from_port="training set" to_op="Extract Macro" to_port="example set"/>
    <connect from_op="Extract Macro" from_port="example set" to_op="Sample" to_port="example set input"/>
    <connect from_op="Sample" from_port="example set output" to_op="Decision Tree" to_port="training set"/>
    <connect from_op="Decision Tree" from_port="model" to_port="model"/>













    <connect from_port="model" to_op="Apply Model" to_port="model"/>
    <connect from_port="test set" to_op="Apply Model" to_port="unlabelled data"/>
    <connect from_op="Apply Model" from_port="labelled data" to_op="Performance" to_port="labelled data"/>
    <connect from_op="Performance" from_port="performance" to_port="performance 1"/>








    <connect from_op="Read CSV" from_port="output" to_op="Rename" to_port="example set input"/>
    <connect from_op="Rename" from_port="example set output" to_op="Numerical to Binominal" to_port="example set input"/>
    <connect from_op="Numerical to Binominal" from_port="example set output" to_op="Set Role" to_port="example set input"/>
    <connect from_op="Set Role" from_port="example set output" to_op="Select Attributes" to_port="example set input"/>
    <connect from_op="Select Attributes" from_port="example set output" to_op="Multiply (2)" to_port="input"/>
    <connect from_op="Multiply (2)" from_port="output 1" to_op="Compare ROCs (3)" to_port="example set"/>
    <connect from_op="Multiply (2)" from_port="output 2" to_port="result 4"/>
    <connect from_op="Multiply (2)" from_port="output 3" to_op="kNN Optimize Parameters (2)" to_port="input 1"/>
    <connect from_op="Multiply (2)" from_port="output 4" to_op="Cross Validation" to_port="example set"/>
    <connect from_op="Multiply (2)" from_port="output 5" to_port="result 8"/>
    <connect from_op="Multiply (2)" from_port="output 6" to_op="DT Optimize Parameters" to_port="input 1"/>
    <connect from_op="DT Optimize Parameters" from_port="performance" to_port="result 9"/>
    <connect from_op="DT Optimize Parameters" from_port="model" to_port="result 10"/>
    <connect from_op="DT Optimize Parameters" from_port="parameter set" to_port="result 11"/>
    <connect from_op="kNN Optimize Parameters (2)" from_port="performance" to_port="result 5"/>
    <connect from_op="kNN Optimize Parameters (2)" from_port="model" to_port="result 6"/>
    <connect from_op="kNN Optimize Parameters (2)" from_port="parameter set" to_port="result 7"/>
    <connect from_op="Compare ROCs (3)" from_port="rocComparison" to_port="result 1"/>
    <connect from_op="Cross Validation" from_port="model" to_port="result 2"/>
    <connect from_op="Cross Validation" from_port="performance 1" to_port="result 3"/>

















Answers

  • cam8 Member Posts: 2 Contributor I

    Hey @lionelderkrikor,

    thank you for your support.

    I was already worried about the model.

    I tested your solution with the macro and the Sample operator.

    The accuracy increased quickly.

    With your solution I tested a few other things and saw that the cross validation was an important factor in decreasing the accuracy.
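
For readers following along: cross validation scores each model only on examples it was not trained on, so its figures are usually lower than numbers measured on the training data itself. Below is a minimal Python sketch of the splitting step. The fold count of 10 and the interleaved fold assignment are assumptions for illustration, the function name `k_fold_indices` is hypothetical, and 748 is simply 178 + 570 from the ratio quoted in the accepted answer.

```python
def k_fold_indices(n_examples, n_folds):
    """Yield (train_idx, test_idx) pairs; each example lands in exactly one test fold."""
    indices = list(range(n_examples))
    for fold in range(n_folds):
        test_idx = indices[fold::n_folds]  # every n_folds-th example, offset by fold
        held_out = set(test_idx)
        train_idx = [i for i in indices if i not in held_out]
        yield train_idx, test_idx


seen = []
for train_idx, test_idx in k_fold_indices(n_examples=748, n_folds=10):
    assert not set(train_idx) & set(test_idx)  # training and test parts never overlap
    seen.extend(test_idx)

# Every example is used for testing exactly once across the folds.
print(sorted(seen) == list(range(748)))
```

A performance value computed without this separation can look better simply because the model is being graded on examples it has already seen.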

    Now I am working with Optimize Parameters and, directly below it, with your solution and the model without the cross validation.

    Again: Thank you very much.

    Best Regards

    cam8
