Decision Tree

Synopsis

This Operator generates a decision tree model, which can be used for classification and regression.

Description

A decision tree is a tree-like collection of nodes intended to decide which class a value belongs to, or to estimate a numerical target value. Each node represents a splitting rule for one specific Attribute. For classification, this rule separates values belonging to different classes; for regression, it separates them so as to reduce the error in an optimal way for the selected parameter criterion.

The building of new nodes is repeated until the stopping criteria are met. A prediction for the class label Attribute is determined by the majority of Examples that reached a leaf during generation, while an estimate for a numerical value is obtained by averaging the values in a leaf.
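The two leaf rules described above can be illustrated with a minimal Python sketch. The function names are hypothetical and are not part of the Operator; the sketch only shows majority voting and averaging over the label values collected in one leaf.

```python
from collections import Counter
from statistics import mean


def predict_leaf_class(labels):
    """Return the most frequent class label among the Examples in a leaf."""
    return Counter(labels).most_common(1)[0][0]


def predict_leaf_value(values):
    """Return the average of the numerical label values in a leaf."""
    return mean(values)


# Illustrative data only: a leaf reached by three training Examples.
print(predict_leaf_class(["yes", "no", "yes"]))
print(predict_leaf_value([1.0, 2.0, 6.0]))
```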

This Operator can process ExampleSets containing both nominal and numerical Attributes. The label Attribute must be nominal for classification and numerical for regression.

After generation, the decision tree model can be applied to new Examples using the Apply Model Operator. Each Example follows the branches of the tree in accordance with the splitting rules until a leaf is reached.
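How an Example follows the branches can be sketched in Python as follows. The nested-dictionary tree layout, the key names ("attribute", "threshold", "children", "prediction") and the sample tree are assumptions made for illustration; they do not reflect the internal model format used by the Operator.

```python
def apply_tree(node, example):
    """Walk from the root to a leaf and return the leaf's prediction.

    Assumed layout: a leaf is {"prediction": ...}; an inner node has an
    "attribute", a dict of "children", and for numerical splits a "threshold".
    """
    while "prediction" not in node:
        value = example[node["attribute"]]
        if "threshold" in node:
            # Numerical split: two branches, at or below / above the threshold.
            branch = "le" if value <= node["threshold"] else "gt"
        else:
            # Nominal split: one branch per value seen during generation.
            # A value with no branch raises KeyError in this sketch.
            branch = value
        node = node["children"][branch]
    return node["prediction"]


# Hypothetical tree, for illustration only.
tree = {
    "attribute": "Outlook",
    "children": {
        "overcast": {"prediction": "yes"},
        "rain": {"prediction": "no"},
        "sunny": {
            "attribute": "Humidity",
            "threshold": 77.5,
            "children": {
                "le": {"prediction": "yes"},
                "gt": {"prediction": "no"},
            },
        },
    },
}

print(apply_tree(tree, {"Outlook": "sunny", "Humidity": 70}))
```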

To configure the decision tree, see the documentation of the parameters below.

Differentiation

CHAID

The CHAID Operator provides a pruned decision tree that uses a chi-squared based criterion instead of the information gain or gain ratio criteria. This Operator cannot be applied to ExampleSets with numerical Attributes, only to ExampleSets with nominal Attributes.

ID3

The ID3 Operator provides a basic implementation of an unpruned decision tree. It only works with ExampleSets with nominal Attributes.

Random Forest

The Random Forest Operator creates several random trees on different Example subsets. The resulting model is based on voting of all these trees. Due to this difference, it is less prone to overtraining.

Bagging

Bootstrap aggregating (bagging) is a machine learning ensemble meta-algorithm to improve classification and regression models in terms of stability and classification accuracy. It also reduces variance and helps to avoid 'overfitting'. Although it is usually applied to decision tree models, it can be used with any type of model.

Input

training set

The input data used to generate the decision tree model.

Output

model

The decision tree model is delivered from this output port.

example set

The ExampleSet that was given as input is passed unchanged to the output through this port.

weights

An ExampleSet containing Attributes and weight values, where each weight represents the feature importance of the given Attribute. A weight is the sum of the improvements that selecting the given Attribute provided at the nodes where it was chosen. The amount of improvement depends on the chosen criterion.
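The summation described above can be sketched in a few lines of Python. The function name, the (attribute, improvement) pair format and the numbers are hypothetical; how the Operator records the improvement at each node is not specified here.

```python
from collections import defaultdict


def attribute_weights(splits):
    """Sum the improvement of every split, grouped by the splitting Attribute.

    `splits` is an assumed list of (attribute_name, improvement) pairs,
    one pair per inner node of the tree.
    """
    weights = defaultdict(float)
    for attribute, improvement in splits:
        weights[attribute] += improvement
    return dict(weights)


# Illustrative values: "Outlook" was chosen at two nodes, "Humidity" at one.
splits = [("Outlook", 0.25), ("Humidity", 0.5), ("Outlook", 0.125)]
print(attribute_weights(splits))
```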

Parameters

Criterion

Selects the criterion by which Attributes are selected for splitting. For each of these criteria, the split value is optimized with regard to the chosen criterion. It can have one of the following values:

information_gain: The entropies of all the Attributes are calculated and the one with the least entropy is selected for the split. This method has a bias towards selecting Attributes with a large number of values.
gain_ratio: A variant of information gain that adjusts the information gain of each Attribute to allow for the breadth and uniformity of the Attribute values.
gini_index: A measure of inequality between the distributions of label characteristics. Splitting on a chosen Attribute results in a reduction in the average gini index of the resulting subsets.
accuracy: The Attribute whose split maximizes the accuracy of the whole tree is selected.
least_square: The Attribute whose split minimizes the squared distance between the average of the values in the node and the true values is selected.
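The criteria above can be made concrete with a short Python sketch based on the common textbook definitions of entropy, gini index and squared error. All function names are hypothetical, and the Operator's exact computation (for example, how it handles weights or missing values) may differ.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def gini(labels):
    """Gini index of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def weighted(impurity, subsets):
    """Size-weighted average impurity of the subsets produced by a split."""
    total = sum(len(s) for s in subsets)
    return sum(len(s) / total * impurity(s) for s in subsets)


def information_gain(parent, subsets):
    return entropy(parent) - weighted(entropy, subsets)


def gain_ratio(parent, subsets):
    # Split information: entropy of the subset sizes themselves.
    split_info = entropy([i for i, s in enumerate(subsets) for _ in s])
    return information_gain(parent, subsets) / split_info if split_info > 0 else 0.0


def gini_reduction(parent, subsets):
    return gini(parent) - weighted(gini, subsets)


def squared_error(values):
    """Sum of squared distances between each value and the node average."""
    avg = sum(values) / len(values)
    return sum((v - avg) ** 2 for v in values)


# Illustrative split that separates the two classes perfectly.
parent = ["a", "a", "b", "b"]
subsets = [["a", "a"], ["b", "b"]]
print(information_gain(parent, subsets), gain_ratio(parent, subsets))
print(gini_reduction(parent, subsets), squared_error([1.0, 3.0]))
```

In this sketch, a candidate split is better the larger its information gain, gain ratio or gini reduction, and, for regression, the smaller the summed squared error of the resulting subsets.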

Maximal depth

The depth of a tree varies depending upon the size and characteristics of the ExampleSet. This parameter is used to restrict the depth of the decision tree. If its value is set to '-1', the maximal depth parameter puts no bound on the depth of the tree. In this case the tree is built until other stopping criteria are met. If its value is set to '1', a tree with a single node is generated.
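The two special values can be sketched as a small Python check. The function name is hypothetical, and counting the root node as depth 1 is an assumption chosen so that a maximal depth of '1' yields a single-node tree, as described above.

```python
def depth_limit_reached(node_depth, maximal_depth):
    """Return True if a node at `node_depth` must not be split further.

    Assumption: the root has depth 1. A maximal depth of -1 means no bound.
    """
    if maximal_depth == -1:
        return False
    return node_depth >= maximal_depth


print(depth_limit_reached(1, 1))    # root with maximal depth 1
print(depth_limit_reached(25, -1))  # unbounded depth
```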

Apply pruning

The decision tree model can be pruned after generation. If checked, some branches are replaced by leaves according to the confidence parameter.

Confidence

This parameter specifies the confidence level used for the pessimistic error calculation of pruning.

Apply prepruning

This parameter specifies if more stopping criteria than the maximal depth should be used during generation of the decision tree model. If checked, the parameters minimal gain, minimal leaf size, minimal size for split and number of prepruning alternatives are used as stopping criteria.

Minimal gain

The gain of a node is calculated before splitting it. The node is split if its gain is greater than the minimal gain. A higher value of minimal gain results in fewer splits and thus a smaller tree. A value that is too high will completely prevent splitting and a tree with a single node is generated.

Minimal leaf size

The size of a leaf is the number of Examples in its subset. The tree is generated in such a way that every leaf has at least the minimal leaf size number of Examples.

Minimal size for split

The size of a node is the number of Examples in its subset. Only those nodes are split whose size is greater than or equal to the minimal size for split parameter.
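The three size and gain conditions used as prepruning stopping criteria can be combined in one Python sketch. The function name and the default values are illustrative assumptions, not the Operator's documented defaults, and the handling of prepruning alternatives is left out.

```python
def split_allowed(node_size, child_sizes, gain,
                  minimal_gain=0.01, minimal_leaf_size=2,
                  minimal_size_for_split=4):
    """Return True if a candidate split passes the prepruning checks.

    The default parameter values are placeholders for illustration.
    """
    # Only nodes with at least `minimal_size_for_split` Examples are split.
    if node_size < minimal_size_for_split:
        return False
    # The gain must be strictly greater than the minimal gain.
    if gain <= minimal_gain:
        return False
    # Every resulting leaf must hold at least `minimal_leaf_size` Examples.
    if any(size < minimal_leaf_size for size in child_sizes):
        return False
    return True


print(split_allowed(node_size=10, child_sizes=[6, 4], gain=0.2))
print(split_allowed(node_size=10, child_sizes=[9, 1], gain=0.2))
```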

Number of prepruning alternatives

When prepruning prevents a split at a certain node, this parameter sets the number of alternative nodes that are tested for splitting. This situation occurs because prepruning runs in parallel with the tree generation process, which may prevent splitting at certain nodes when splitting there does not add to the discriminative power of the entire tree. In such a case, alternative nodes are tried for splitting.