Decision tree does not fit the ROC
Hello,
I am currently working on a data set for a decision tree solution, inspired by your video "finding the right model".
The ROC comparison on the data set shows a good result for the decision tree and the random forest.
But when I create a simple decision tree model, I get a very low recall of only 1.12% for true positives.
With Optimize Parameters I can increase it to 47%.
But I have seen in documentation that a true-positive recall of 99% is possible.
Is something wrong with my process?
The XML code:
<operator activated="true" class="numerical_to_binominal" compatibility="8.2.000" expanded="true" height="82" name="Numerical to Binominal" width="90" x="313" y="34">
<connect from_port="training set" to_op="Decision Tree (2)" to_port="training set"/>
<connect from_op="Decision Tree (2)" from_port="model" to_port="model"/>
<connect from_port="model" to_op="Apply Model (4)" to_port="model"/>
<connect from_port="test set" to_op="Apply Model (4)" to_port="unlabelled data"/>
<connect from_op="Apply Model (4)" from_port="labelled data" to_op="Performance (4)" to_port="labelled data"/>
<connect from_op="Performance (4)" from_port="performance" to_port="performance 1"/>
<connect from_port="input 1" to_op="DT Cross Validation (2)" to_port="example set"/>
<connect from_op="DT Cross Validation (2)" from_port="model" to_port="model"/>
<connect from_op="DT Cross Validation (2)" from_port="performance 1" to_port="performance"/>
<connect from_port="training set" to_op="Decision Tree" to_port="training set"/>
<connect from_op="Decision Tree" from_port="model" to_port="model"/>
<connect from_port="model" to_op="Apply Model" to_port="model"/>
<connect from_port="test set" to_op="Apply Model" to_port="unlabelled data"/>
<connect from_op="Apply Model" from_port="labelled data" to_op="Performance" to_port="labelled data"/>
<connect from_op="Performance" from_port="performance" to_port="performance 1"/>
<connect from_port="train 1" to_op="Decision Tree (5)" to_port="training set"/>
<connect from_port="train 2" to_op="Naive Bayes (3)" to_port="training set"/>
<connect from_port="train 3" to_op="Rule Induction (3)" to_port="training set"/>
<connect from_port="train 4" to_op="Random Tree (3)" to_port="training set"/>
<connect from_port="train 5" to_op="k-NN (5)" to_port="training set"/>
<connect from_op="Decision Tree (5)" from_port="model" to_port="model 1"/>
<connect from_op="Naive Bayes (3)" from_port="model" to_port="model 2"/>
<connect from_op="Rule Induction (3)" from_port="model" to_port="model 3"/>
<connect from_op="Random Tree (3)" from_port="model" to_port="model 4"/>
<connect from_op="k-NN (5)" from_port="model" to_port="model 5"/>
<connect from_op="Retrieve transfusion.dataset1" from_port="output" to_op="Rename" to_port="example set input"/>
<connect from_op="Rename" from_port="example set output" to_op="Numerical to Binominal" to_port="example set input"/>
<connect from_op="Numerical to Binominal" from_port="example set output" to_op="Set Role" to_port="example set input"/>
<connect from_op="Set Role" from_port="example set output" to_op="Select Attributes" to_port="example set input"/>
<connect from_op="Select Attributes" from_port="example set output" to_op="Multiply (2)" to_port="input"/>
<connect from_op="Multiply (2)" from_port="output 1" to_op="Compare ROCs (3)" to_port="example set"/>
<connect from_op="Multiply (2)" from_port="output 2" to_op="Cross Validation" to_port="example set"/>
<connect from_op="Multiply (2)" from_port="output 3" to_op="DT Optimize Parameters" to_port="input 1"/>
<connect from_op="DT Optimize Parameters" from_port="performance" to_port="result 4"/>
<connect from_op="DT Optimize Parameters" from_port="model" to_port="result 5"/>
<connect from_op="Cross Validation" from_port="model" to_port="result 2"/>
<connect from_op="Cross Validation" from_port="performance 1" to_port="result 3"/>
<connect from_op="Compare ROCs (3)" from_port="rocComparison" to_port="result 1"/>
Best Answer
lionelderkrikor (Moderator, RapidMiner Certified Analyst)
Hi @cam8,
There is nothing wrong with your process.
You have an imbalanced dataset (ratio true/false = 178/570). That generally leads to poor recall.
To increase the recall, you can sample your data. By sampling the data, I obtain a recall of 60% with a decision tree.
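To make the imbalance point concrete, here is a minimal Python sketch. It is not part of the RapidMiner process: the helper names (`recall`, `accuracy`, `downsample_majority`) are made up for illustration, only the 178/570 class counts come from this thread, and whether the Sample operator in the process balances the classes exactly this way is an assumption. It shows that a model which always predicts the majority class still reaches a decent accuracy while its recall is zero, and how a balanced subset of the training rows could be drawn.

```python
import random


def recall(y_true, y_pred, positive=True):
    """Share of actual positives that the model also predicted as positive."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    return tp / (tp + fn) if (tp + fn) else 0.0


def accuracy(y_true, y_pred):
    """Share of all rows that were predicted correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def downsample_majority(labels, seed=0):
    """Return row indices of a balanced subset: every minority row plus
    an equally sized random draw from the majority class (hypothetical helper)."""
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y]
    neg = [i for i, y in enumerate(labels) if not y]
    minority, majority = (pos, neg) if len(pos) <= len(neg) else (neg, pos)
    return sorted(minority + rng.sample(majority, len(minority)))


# Class counts taken from the thread: 178 true, 570 false.
labels = [True] * 178 + [False] * 570

# A trivial "model" that always predicts the majority class.
always_false = [False] * len(labels)
print(f"accuracy: {accuracy(labels, always_false):.3f}")  # 570/748, about 0.762
print(f"recall:   {recall(labels, always_false):.3f}")    # 0.000

# Balanced subset: 178 true rows plus 178 randomly chosen false rows.
balanced = downsample_majority(labels)
print(len(balanced))  # 356
```

In the process below, the Sample operators are fed from the "training set" port inside the cross validation, so only the training folds are resampled and the test folds keep the original class ratio.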
However, if your goal is to find the maximum performance, a decision tree is not the best choice: a better choice
is the k-NN model. After training this model, you can obtain a recall > 70% (with 1 < k < 10), and you can obtain a theoretical recall > 90% by optimizing the value of k. But use this with caution: this last case corresponds to a situation of overfitting. Your model does indeed perform well on your training dataset, but it will perform badly on future 'unseen data'.
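The overfitting warning can be illustrated with a small pure-Python k-NN sketch. It uses synthetic two-dimensional data, not the transfusion data set, and it is not the RapidMiner process; the functions `knn_predict`, `recall` and `make_data` are hypothetical helpers written for this example. It shows the extreme case: with k = 1, every training row is its own nearest neighbour, so recall measured on the training data is perfect by construction, while recall on separate data is whatever the model has actually learned.

```python
import random
from collections import Counter


def knn_predict(train_X, train_y, x, k):
    """Majority vote among the k training points closest to x (squared Euclidean distance)."""
    order = sorted(
        range(len(train_X)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(train_X[i], x)),
    )
    return Counter(train_y[i] for i in order[:k]).most_common(1)[0][0]


def recall(y_true, y_pred, positive=True):
    """Share of actual positives that the model also predicted as positive."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    return tp / (tp + fn) if (tp + fn) else 0.0


def make_data(n_pos, n_neg, rng):
    """Synthetic, overlapping, imbalanced classes (an assumption for illustration only)."""
    X = [(rng.gauss(1.0, 1.5), rng.gauss(1.0, 1.5)) for _ in range(n_pos)]
    X += [(rng.gauss(0.0, 1.5), rng.gauss(0.0, 1.5)) for _ in range(n_neg)]
    y = [True] * n_pos + [False] * n_neg
    return X, y


rng = random.Random(42)
train_X, train_y = make_data(60, 190, rng)
test_X, test_y = make_data(60, 190, rng)

results = {}
for k in (1, 5, 9):  # odd k avoids tied votes with two classes
    train_pred = [knn_predict(train_X, train_y, x, k) for x in train_X]
    test_pred = [knn_predict(train_X, train_y, x, k) for x in test_X]
    results[k] = (recall(train_y, train_pred), recall(test_y, test_pred))
    print(f"k={k}: train recall={results[k][0]:.2f}, test recall={results[k][1]:.2f}")
```

The gap between the two columns is the thing to watch: a recall that only looks good on data the model has already seen says little about future data.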
I hope it helps,
Regards,
Lionel
NB: The process:
<operator activated="true" class="numerical_to_binominal" compatibility="8.2.000" expanded="true" height="82" name="Numerical to Binominal" width="90" x="313" y="34">
<connect from_port="training set" to_op="Sample (2)" to_port="example set input"/>
<connect from_op="Sample (2)" from_port="example set output" to_op="Decision Tree (2)" to_port="training set"/>
<connect from_op="Decision Tree (2)" from_port="model" to_port="model"/>
<connect from_port="model" to_op="Apply Model (4)" to_port="model"/>
<connect from_port="test set" to_op="Apply Model (4)" to_port="unlabelled data"/>
<connect from_op="Apply Model (4)" from_port="labelled data" to_op="Performance (4)" to_port="labelled data"/>
<connect from_op="Performance (4)" from_port="performance" to_port="performance 1"/>
<connect from_port="input 1" to_op="Extract Macro (2)" to_port="example set"/>
<connect from_op="Extract Macro (2)" from_port="example set" to_op="DT Cross Validation (2)" to_port="example set"/>
<connect from_op="DT Cross Validation (2)" from_port="model" to_port="model"/>
<connect from_op="DT Cross Validation (2)" from_port="performance 1" to_port="performance"/>
<connect from_port="training set" to_op="Sample (3)" to_port="example set input"/>
<connect from_op="Sample (3)" from_port="example set output" to_op="k-NN (2)" to_port="training set"/>
<connect from_op="k-NN (2)" from_port="model" to_port="model"/>
<connect from_port="model" to_op="Apply Model (2)" to_port="model"/>
<connect from_port="test set" to_op="Apply Model (2)" to_port="unlabelled data"/>
<connect from_op="Apply Model (2)" from_port="labelled data" to_op="Performance (2)" to_port="labelled data"/>
<connect from_op="Performance (2)" from_port="performance" to_port="performance 1"/>
<connect from_port="input 1" to_op="Extract Macro (3)" to_port="example set"/>
<connect from_op="Extract Macro (3)" from_port="example set" to_op="DT Cross Validation (3)" to_port="example set"/>
<connect from_op="DT Cross Validation (3)" from_port="model" to_port="model"/>
<connect from_op="DT Cross Validation (3)" from_port="performance 1" to_port="performance"/>
<connect from_port="train 1" to_op="Decision Tree (5)" to_port="training set"/>
<connect from_port="train 2" to_op="Naive Bayes (3)" to_port="training set"/>
<connect from_port="train 3" to_op="Rule Induction (3)" to_port="training set"/>
<connect from_port="train 4" to_op="Random Tree (3)" to_port="training set"/>
<connect from_port="train 5" to_op="k-NN (5)" to_port="training set"/>
<connect from_op="Decision Tree (5)" from_port="model" to_port="model 1"/>
<connect from_op="Naive Bayes (3)" from_port="model" to_port="model 2"/>
<connect from_op="Rule Induction (3)" from_port="model" to_port="model 3"/>
<connect from_op="Random Tree (3)" from_port="model" to_port="model 4"/>
<connect from_op="k-NN (5)" from_port="model" to_port="model 5"/>
<connect from_port="training set" to_op="Extract Macro" to_port="example set"/>
<connect from_op="Extract Macro" from_port="example set" to_op="Sample" to_port="example set input"/>
<connect from_op="Sample" from_port="example set output" to_op="Decision Tree" to_port="training set"/>
<connect from_op="Decision Tree" from_port="model" to_port="model"/>
<connect from_port="model" to_op="Apply Model" to_port="model"/>
<connect from_port="test set" to_op="Apply Model" to_port="unlabelled data"/>
<connect from_op="Apply Model" from_port="labelled data" to_op="Performance" to_port="labelled data"/>
<connect from_op="Performance" from_port="performance" to_port="performance 1"/>
<connect from_op="Read CSV" from_port="output" to_op="Rename" to_port="example set input"/>
<connect from_op="Rename" from_port="example set output" to_op="Numerical to Binominal" to_port="example set input"/>
<connect from_op="Numerical to Binominal" from_port="example set output" to_op="Set Role" to_port="example set input"/>
<connect from_op="Set Role" from_port="example set output" to_op="Select Attributes" to_port="example set input"/>
<connect from_op="Select Attributes" from_port="example set output" to_op="Multiply (2)" to_port="input"/>
<connect from_op="Multiply (2)" from_port="output 1" to_op="Compare ROCs (3)" to_port="example set"/>
<connect from_op="Multiply (2)" from_port="output 2" to_port="result 4"/>
<connect from_op="Multiply (2)" from_port="output 3" to_op="kNN Optimize Parameters (2)" to_port="input 1"/>
<connect from_op="Multiply (2)" from_port="output 4" to_op="Cross Validation" to_port="example set"/>
<connect from_op="Multiply (2)" from_port="output 5" to_port="result 8"/>
<connect from_op="Multiply (2)" from_port="output 6" to_op="DT Optimize Parameters" to_port="input 1"/>
<connect from_op="DT Optimize Parameters" from_port="performance" to_port="result 9"/>
<connect from_op="DT Optimize Parameters" from_port="model" to_port="result 10"/>
<connect from_op="DT Optimize Parameters" from_port="parameter set" to_port="result 11"/>
<connect from_op="kNN Optimize Parameters (2)" from_port="performance" to_port="result 5"/>
<connect from_op="kNN Optimize Parameters (2)" from_port="model" to_port="result 6"/>
<connect from_op="kNN Optimize Parameters (2)" from_port="parameter set" to_port="result 7"/>
<connect from_op="Compare ROCs (3)" from_port="rocComparison" to_port="result 1"/>
<connect from_op="Cross Validation" from_port="model" to_port="result 2"/>
<connect from_op="Cross Validation" from_port="performance 1" to_port="result 3"/>
Answers
Hey @lionelderkrikor,
thank you for your support.
I was already worried about the model.
I tested your solution with the macro and the Sample operator.
The accuracy increased quickly.
With your solution I tested a few other things and saw that the cross validation was an important factor in decreasing the accuracy.
Now I am working with Optimize Parameters and, directly below it, with your solution and the model without the cross validation.
Again: Thank you very much.
Best Regards
car8