Decision tree does not fit the ROC

cam8 · May 2018

Hello,

I am currently working on a data set for a decision tree solution inspired by your video "Finding the right model".

The ROC comparison based on the data set shows a good result for decision tree and random forest.

But when I create a simple decision tree model, I get a very low recall of 1.12% for the true (positive) class.

With Optimize Parameters I can increase the recall for the true class to 47%.

But I see in the documentation that a recall of 99% for the true class is possible.

Is something wrong with my process?

The XML-Code:























<operator activated="true" class="numerical_to_binominal" compatibility="8.2.000" expanded="true" height="82" name="Numerical to Binominal" width="90" x="313" y="34">




































<connect from_port="training set" to_op="Decision Tree (2)" to_port="training set"/>
<connect from_op="Decision Tree (2)" from_port="model" to_port="model"/>
















<connect from_port="model" to_op="Apply Model (4)" to_port="model"/>
<connect from_port="test set" to_op="Apply Model (4)" to_port="unlabelled data"/>
<connect from_op="Apply Model (4)" from_port="labelled data" to_op="Performance (4)" to_port="labelled data"/>
<connect from_op="Performance (4)" from_port="performance" to_port="performance 1"/>








<connect from_port="input 1" to_op="DT Cross Validation (2)" to_port="example set"/>
<connect from_op="DT Cross Validation (2)" from_port="model" to_port="model"/>
<connect from_op="DT Cross Validation (2)" from_port="performance 1" to_port="performance"/>










<connect from_port="training set" to_op="Decision Tree" to_port="training set"/>
<connect from_op="Decision Tree" from_port="model" to_port="model"/>













<connect from_port="model" to_op="Apply Model" to_port="model"/>
<connect from_port="test set" to_op="Apply Model" to_port="unlabelled data"/>
<connect from_op="Apply Model" from_port="labelled data" to_op="Performance" to_port="labelled data"/>
<connect from_op="Performance" from_port="performance" to_port="performance 1"/>

















<connect from_port="train 1" to_op="Decision Tree (5)" to_port="training set"/>
<connect from_port="train 2" to_op="Naive Bayes (3)" to_port="training set"/>
<connect from_port="train 3" to_op="Rule Induction (3)" to_port="training set"/>
<connect from_port="train 4" to_op="Random Tree (3)" to_port="training set"/>
<connect from_port="train 5" to_op="k-NN (5)" to_port="training set"/>
<connect from_op="Decision Tree (5)" from_port="model" to_port="model 1"/>
<connect from_op="Naive Bayes (3)" from_port="model" to_port="model 2"/>
<connect from_op="Rule Induction (3)" from_port="model" to_port="model 3"/>
<connect from_op="Random Tree (3)" from_port="model" to_port="model 4"/>
<connect from_op="k-NN (5)" from_port="model" to_port="model 5"/>














<connect from_op="Retrieve transfusion.dataset1" from_port="output" to_op="Rename" to_port="example set input"/>
<connect from_op="Rename" from_port="example set output" to_op="Numerical to Binominal" to_port="example set input"/>
<connect from_op="Numerical to Binominal" from_port="example set output" to_op="Set Role" to_port="example set input"/>
<connect from_op="Set Role" from_port="example set output" to_op="Select Attributes" to_port="example set input"/>
<connect from_op="Select Attributes" from_port="example set output" to_op="Multiply (2)" to_port="input"/>
<connect from_op="Multiply (2)" from_port="output 1" to_op="Compare ROCs (3)" to_port="example set"/>
<connect from_op="Multiply (2)" from_port="output 2" to_op="Cross Validation" to_port="example set"/>
<connect from_op="Multiply (2)" from_port="output 3" to_op="DT Optimize Parameters" to_port="input 1"/>
<connect from_op="DT Optimize Parameters" from_port="performance" to_port="result 4"/>
<connect from_op="DT Optimize Parameters" from_port="model" to_port="result 5"/>
<connect from_op="Cross Validation" from_port="model" to_port="result 2"/>
<connect from_op="Cross Validation" from_port="performance 1" to_port="result 3"/>
<connect from_op="Compare ROCs (3)" from_port="rocComparison" to_port="result 1"/>

lionelderkrikor · May 2018

Hi @cam8,

There is nothing wrong with your process.

You have an imbalanced dataset (ratio true/false = 178/570). That generally leads to poor recall on the minority class.
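To make the effect of the imbalance concrete, here is a minimal Python sketch (outside RapidMiner; the helper names `accuracy` and `recall` are made up for illustration). It uses only the 178/570 class counts quoted above and a hypothetical baseline that predicts "false" for every example. That baseline already reaches about 76% accuracy while its recall on "true" is 0%, which is why a learner tuned for accuracy can end up with very low recall.

```python
# Class counts taken from the ratio quoted above (true/false = 178/570).
N_TRUE, N_FALSE = 178, 570

def accuracy(tp, tn, fp, fn):
    """Share of all examples that were classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)

def recall(tp, fn):
    """Share of the actual 'true' examples that the model found."""
    return tp / (tp + fn) if (tp + fn) else 0.0

# Hypothetical baseline: a model that predicts 'false' for every example.
tp, fn = 0, N_TRUE    # every 'true' example is missed
tn, fp = N_FALSE, 0   # every 'false' example is classified correctly

majority_accuracy = accuracy(tp, tn, fp, fn)
majority_recall = recall(tp, fn)

print(f"accuracy = {majority_accuracy:.1%}")
print(f"recall   = {majority_recall:.1%}")
```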

To increase the recall, you can sample your data so that the classes are more balanced. By sampling the data, I obtain a recall of 60% with a decision tree.
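As a rough illustration of the sampling idea, the following Python sketch balances the classes by random undersampling. This is an assumption about the kind of sampling meant here, not a reproduction of the Sample operator's settings, and `undersample` and the toy rows are hypothetical names and data.

```python
import random

def undersample(rows, label_of, seed=0):
    """Randomly drop rows until every class is as small as the smallest one."""
    rng = random.Random(seed)
    by_class = {}
    for row in rows:
        by_class.setdefault(label_of(row), []).append(row)
    smallest = min(len(group) for group in by_class.values())
    balanced = []
    for group in by_class.values():
        balanced.extend(rng.sample(group, smallest))
    rng.shuffle(balanced)
    return balanced

# Toy stand-in for a training set with 178 'true' and 570 'false' rows.
rows = [{"id": i, "label": i < 178} for i in range(748)]

balanced = undersample(rows, label_of=lambda r: r["label"])
n_true = sum(1 for r in balanced if r["label"])
n_false = len(balanced) - n_true
print(n_true, n_false)
```

In the process XML the Sample operator is fed from the "training set" port inside the cross validation, so only the training data is resampled and the test data keeps its original class ratio.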

However, if your goal is to find the maximum performance, a decision tree is not the best choice: a better choice is the k-NN model. After training this model, you can obtain a recall > 70% (with 1 < k < 10), and you can obtain a theoretical recall > 90% by optimizing the value of k. But use this with caution: this last case corresponds to overfitting. Your model does indeed perform well on your training dataset, but it will perform badly on future unseen data.
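The overfitting risk can be shown with an extreme case in plain Python. This is a sketch under stated assumptions: the data is synthetic, with two random features and labels that do not depend on them, and `knn_predict` and `recall_on` are illustrative helpers, not RapidMiner functions. With k = 1, every training point is its own nearest neighbour, so recall measured on the training data is perfect even though there is nothing to learn, while recall on unseen data typically stays near the share of positives.

```python
import random

def knn_predict(train, query, k):
    """Majority vote among the k nearest training points (squared Euclidean distance)."""
    def dist(point):
        (x, y), _ = point
        return (x - query[0]) ** 2 + (y - query[1]) ** 2
    nearest = sorted(train, key=dist)[:k]
    votes = sum(1 for _, label in nearest if label)
    return votes * 2 > k

def recall_on(train, eval_set, k):
    """Recall of the 'true' class when the model built on `train` scores `eval_set`."""
    positives = [features for features, label in eval_set if label]
    hits = sum(1 for features in positives if knn_predict(train, features, k))
    return hits / len(positives)

rng = random.Random(42)

def make_rows(n):
    # Labels are drawn independently of the features: roughly 24% 'true'.
    return [((rng.random(), rng.random()), rng.random() < 0.24) for _ in range(n)]

train, unseen = make_rows(400), make_rows(400)

train_recall = recall_on(train, train, k=1)    # scored on the data it memorised
unseen_recall = recall_on(train, unseen, k=1)  # scored on data it has never seen

print(f"recall on training data: {train_recall:.0%}")
print(f"recall on unseen data:   {unseen_recall:.0%}")
```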

I hope it helps,

Regards,

Lionel

NB: The process:



































<operator activated="true" class="numerical_to_binominal" compatibility="8.2.000" expanded="true" height="82" name="Numerical to Binominal" width="90" x="313" y="34">

















































<connect from_port="training set" to_op="Sample (2)" to_port="example set input"/>
<connect from_op="Sample (2)" from_port="example set output" to_op="Decision Tree (2)" to_port="training set"/>
<connect from_op="Decision Tree (2)" from_port="model" to_port="model"/>
















<connect from_port="model" to_op="Apply Model (4)" to_port="model"/>
<connect from_port="test set" to_op="Apply Model (4)" to_port="unlabelled data"/>
<connect from_op="Apply Model (4)" from_port="labelled data" to_op="Performance (4)" to_port="labelled data"/>
<connect from_op="Performance (4)" from_port="performance" to_port="performance 1"/>








<connect from_port="input 1" to_op="Extract Macro (2)" to_port="example set"/>
<connect from_op="Extract Macro (2)" from_port="example set" to_op="DT Cross Validation (2)" to_port="example set"/>
<connect from_op="DT Cross Validation (2)" from_port="model" to_port="model"/>
<connect from_op="DT Cross Validation (2)" from_port="performance 1" to_port="performance"/>




































<connect from_port="training set" to_op="Sample (3)" to_port="example set input"/>
<connect from_op="Sample (3)" from_port="example set output" to_op="k-NN (2)" to_port="training set"/>
<connect from_op="k-NN (2)" from_port="model" to_port="model"/>
















<connect from_port="model" to_op="Apply Model (2)" to_port="model"/>
<connect from_port="test set" to_op="Apply Model (2)" to_port="unlabelled data"/>
<connect from_op="Apply Model (2)" from_port="labelled data" to_op="Performance (2)" to_port="labelled data"/>
<connect from_op="Performance (2)" from_port="performance" to_port="performance 1"/>








<connect from_port="input 1" to_op="Extract Macro (3)" to_port="example set"/>
<connect from_op="Extract Macro (3)" from_port="example set" to_op="DT Cross Validation (3)" to_port="example set"/>
<connect from_op="DT Cross Validation (3)" from_port="model" to_port="model"/>
<connect from_op="DT Cross Validation (3)" from_port="performance 1" to_port="performance"/>
















<connect from_port="train 1" to_op="Decision Tree (5)" to_port="training set"/>
<connect from_port="train 2" to_op="Naive Bayes (3)" to_port="training set"/>
<connect from_port="train 3" to_op="Rule Induction (3)" to_port="training set"/>
<connect from_port="train 4" to_op="Random Tree (3)" to_port="training set"/>
<connect from_port="train 5" to_op="k-NN (5)" to_port="training set"/>
<connect from_op="Decision Tree (5)" from_port="model" to_port="model 1"/>
<connect from_op="Naive Bayes (3)" from_port="model" to_port="model 2"/>
<connect from_op="Rule Induction (3)" from_port="model" to_port="model 3"/>
<connect from_op="Random Tree (3)" from_port="model" to_port="model 4"/>
<connect from_op="k-NN (5)" from_port="model" to_port="model 5"/>



































<connect from_port="training set" to_op="Extract Macro" to_port="example set"/>
<connect from_op="Extract Macro" from_port="example set" to_op="Sample" to_port="example set input"/>
<connect from_op="Sample" from_port="example set output" to_op="Decision Tree" to_port="training set"/>
<connect from_op="Decision Tree" from_port="model" to_port="model"/>













<connect from_port="model" to_op="Apply Model" to_port="model"/>
<connect from_port="test set" to_op="Apply Model" to_port="unlabelled data"/>
<connect from_op="Apply Model" from_port="labelled data" to_op="Performance" to_port="labelled data"/>
<connect from_op="Performance" from_port="performance" to_port="performance 1"/>








<connect from_op="Read CSV" from_port="output" to_op="Rename" to_port="example set input"/>
<connect from_op="Rename" from_port="example set output" to_op="Numerical to Binominal" to_port="example set input"/>
<connect from_op="Numerical to Binominal" from_port="example set output" to_op="Set Role" to_port="example set input"/>
<connect from_op="Set Role" from_port="example set output" to_op="Select Attributes" to_port="example set input"/>
<connect from_op="Select Attributes" from_port="example set output" to_op="Multiply (2)" to_port="input"/>
<connect from_op="Multiply (2)" from_port="output 1" to_op="Compare ROCs (3)" to_port="example set"/>
<connect from_op="Multiply (2)" from_port="output 2" to_port="result 4"/>
<connect from_op="Multiply (2)" from_port="output 3" to_op="kNN Optimize Parameters (2)" to_port="input 1"/>
<connect from_op="Multiply (2)" from_port="output 4" to_op="Cross Validation" to_port="example set"/>
<connect from_op="Multiply (2)" from_port="output 5" to_port="result 8"/>
<connect from_op="Multiply (2)" from_port="output 6" to_op="DT Optimize Parameters" to_port="input 1"/>
<connect from_op="DT Optimize Parameters" from_port="performance" to_port="result 9"/>
<connect from_op="DT Optimize Parameters" from_port="model" to_port="result 10"/>
<connect from_op="DT Optimize Parameters" from_port="parameter set" to_port="result 11"/>
<connect from_op="kNN Optimize Parameters (2)" from_port="performance" to_port="result 5"/>
<connect from_op="kNN Optimize Parameters (2)" from_port="model" to_port="result 6"/>
<connect from_op="kNN Optimize Parameters (2)" from_port="parameter set" to_port="result 7"/>
<connect from_op="Compare ROCs (3)" from_port="rocComparison" to_port="result 1"/>
<connect from_op="Cross Validation" from_port="model" to_port="result 2"/>
<connect from_op="Cross Validation" from_port="performance 1" to_port="result 3"/>

cam8 · May 2018

Hey @lionelderkrikor,

Thank you for your support.

I was already worried about the model.

I tested your solution with the macro and the Sample operator.

The accuracy increased quickly.

With your solution I tested a few other things and saw that the cross validation was an important factor in decreasing the accuracy.

Now I am working with Optimize Parameters and, directly below it, with your solution and the model without the cross validation.

Again: Thank you very much.

Best Regards

cam8


Altair RapidMiner Community
