Decision Stump
Synopsis
This operator learns a decision tree with only a single split. It can be applied to both nominal and numerical data sets.
Description
The Decision Stump operator generates a decision tree with only a single split. The resulting tree can be used for classifying unseen examples. This operator can be very efficient when boosted with operators like the AdaBoost operator. The examples of the given ExampleSet have several attributes, and every example belongs to a class (like yes or no). The leaf nodes of a decision tree contain the class name, whereas a non-leaf node is a decision node. A decision node is an attribute test, with each branch (leading to another subtree) corresponding to a possible value of the attribute. For more information about decision trees, please study the Decision Tree operator.
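To make the idea concrete, here is a minimal sketch in Python using scikit-learn rather than RapidMiner; the dataset and all parameter choices are illustrative only, not the operator's implementation. A decision stump is simply a decision tree whose depth is capped at one, and scikit-learn's AdaBoost uses exactly such a stump as its default base learner:

```python
# Minimal sketch of a decision stump, using scikit-learn instead of
# RapidMiner (illustrative only, not the operator's internals).
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# A stump: one decision node (a single attribute test) and two leaves.
stump = DecisionTreeClassifier(max_depth=1).fit(X_train, y_train)
print("stump accuracy:", stump.score(X_test, y_test))

# Weak on its own, but effective when boosted; AdaBoost's default base
# learner in scikit-learn is a depth-1 tree, i.e. a decision stump.
boosted = AdaBoostClassifier(n_estimators=100, random_state=0)
boosted.fit(X_train, y_train)
print("boosted accuracy:", boosted.score(X_test, y_test))
```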
Input
training set
This input port expects an ExampleSet. It is the output of the Retrieve operator in the attached Example Process. The output of other operators can also be used as input.
Output
model
The decision tree with just a single split is delivered from this output port. This classification model can now be applied to unseen data sets for the prediction of the label attribute.
example set
The ExampleSet that was given as input is passed through this port to the output without any changes. This is usually used to reuse the same ExampleSet in further operators or to view the ExampleSet in the Results Workspace.
Parameters
Criterion
This parameter specifies the criterion used to select the attribute for the split; a short sketch of how such criteria score a candidate split follows this list. It can have one of the following values:
- information_gain: The entropies of all attributes are calculated, and the attribute with the minimum entropy (that is, the maximum information gain) is selected for the split. This method has a bias towards selecting attributes with a large number of values.
- gain_ratio: A variant of information gain. It adjusts the information gain for each attribute to allow for the breadth and uniformity of the attribute values, which counteracts the bias towards attributes with many values.
- gini_index: A measure of the impurity of an ExampleSet. The attribute is selected whose split yields the greatest reduction in the weighted average Gini index of the resulting subsets.
- accuracy: The attribute is selected for the split that maximizes the accuracy of the whole tree.
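For intuition, the sketch below shows how two of these criteria would score a candidate binary split. The helper functions and the toy labels are made up for illustration and are not RapidMiner code:

```python
# Illustrative scoring of a candidate binary split under two criteria;
# function names and toy data are hypothetical, not RapidMiner's API.
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def split_scores(parent, subsets):
    """Return (information gain, Gini reduction) for a candidate split."""
    n = len(parent)
    weighted_entropy = sum(len(s) / n * entropy(s) for s in subsets)
    weighted_gini = sum(len(s) / n * gini(s) for s in subsets)
    return entropy(parent) - weighted_entropy, gini(parent) - weighted_gini

parent = ["yes"] * 5 + ["no"] * 5
left = ["yes"] * 4 + ["no"]          # mostly "yes" after the split
right = ["yes"] + ["no"] * 4         # mostly "no" after the split
print(split_scores(parent, [left, right]))
# The stump keeps the single split that scores best under the chosen criterion.
```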
Minimal leaf size
The size of a leaf node is the number of examples in its subset. The tree is generated in such a way that every leaf node subset has at least the minimal leaf size number of instances.
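As a hypothetical analogy outside RapidMiner, scikit-learn expresses the same constraint through its min_samples_leaf parameter; the dataset and the threshold of 20 below are illustrative only:

```python
# Hypothetical analogy in scikit-learn (not RapidMiner): min_samples_leaf
# plays the role of minimal leaf size for a depth-1 tree.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)  # 150 examples
stump = DecisionTreeClassifier(max_depth=1, min_samples_leaf=20)
stump.fit(X, y)

# Sample count per node; every leaf holds at least 20 examples.
print(stump.tree_.n_node_samples)
```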