Random Tree

Synopsis

This operator learns a decision tree, using only a random subset of attributes for each split.

Description

The Random Tree operator works exactly like the Decision Tree operator with one exception: for each split only a random subset of attributes is available. It is recommended that you study the documentation of the Decision Tree operator for a basic understanding of decision trees.

This operator learns decision trees from both nominal and numerical data. Decision trees are powerful classification methods which can be easily understood. The Random Tree operator works similarly to Quinlan's C4.5 or CART, but it selects a random subset of attributes before it is applied. The size of the subset is specified by the subset ratio parameter.

Representing the data as a tree has the advantage over other approaches of being meaningful and easy to interpret. The goal is to create a classification model that predicts the value of the label based on several input attributes of the ExampleSet. Each interior node of the tree corresponds to one of the input attributes. The number of edges of an interior node is equal to the number of possible values of the corresponding input attribute. Each leaf node represents a value of the label given the values of the input attributes represented by the path from the root to the leaf. This description can be easily understood by studying the Example Process of the Decision Tree operator.
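The structure described above can be sketched in a few lines of Python. This is only an illustration for nominal attributes: the `Node` class, the `predict` function and the toy attributes "Outlook" and "Wind" are hypothetical names chosen for the example, not part of the operator.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Node:
    """Hypothetical tree node: a leaf carries a label, an interior node an attribute."""
    attribute: Optional[str] = None   # set for interior nodes only
    label: Optional[str] = None       # set for leaf nodes only
    # One outgoing edge per possible value of the node's attribute.
    children: Dict[str, "Node"] = field(default_factory=dict)


def predict(node: Node, example: Dict[str, str]) -> str:
    """Follow the edges matching the example's values until a leaf is reached."""
    while node.label is None:
        node = node.children[example[node.attribute]]
    return node.label


# Toy tree over made-up nominal attributes.
tree = Node(attribute="Outlook", children={
    "sunny": Node(label="no"),
    "overcast": Node(label="yes"),
    "rain": Node(attribute="Wind", children={
        "true": Node(label="no"),
        "false": Node(label="yes"),
    }),
})

prediction = predict(tree, {"Outlook": "rain", "Wind": "false"})
```

The path root, "rain", "false" ends in a leaf, and the label stored there is the prediction for that example.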

Pruning is a technique in which leaf nodes that do not add to the discriminative power of the decision tree are removed. This is done to convert an over-specific or over-fitted tree to a more general form in order to enhance its predictive power on unseen datasets. Pre-pruning is a type of pruning performed in parallel with the tree creation process. Post-pruning, on the other hand, is done after the tree creation process is complete.

Differentiation

Decision Tree

The Random Tree operator works exactly like the Decision Tree operator with one exception: for each split only a random subset of attributes is available.

Input

training set

This input port expects an ExampleSet. It is the output of the Retrieve operator in the attached Example Process. The output of other operators can also be used as input.

Output

model

The Random Tree is delivered from this output port. This classification model can now be applied to unseen data sets for the prediction of the label attribute.

example set

The ExampleSet that was given as input is passed without changes to the output through this port. This is usually used to reuse the same ExampleSet in further operators or to view the ExampleSet in the Results Workspace.

Parameters

Criterion

This parameter selects the criterion on which attributes will be selected for splitting. It can have one of the following values:

information_gain: The entropy of all the attributes is calculated. The attribute with minimum entropy is selected for the split. This method has a bias towards selecting attributes with a large number of values.
gain_ratio: A variant of information gain. It adjusts the information gain of each attribute to account for the breadth and uniformity of the attribute values.
gini_index: A measure of the impurity of an ExampleSet. Splitting on a chosen attribute gives a reduction in the average gini index of the resulting subsets.
accuracy: The attribute that maximizes the accuracy of the whole Tree is selected for the split.
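The first three criteria can be illustrated with their textbook definitions. The sketch below is not the operator's implementation; the function names and the toy data are made up for the example, and the accuracy criterion is left out because it depends on the whole tree.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of nominal values."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def gini(labels):
    """Gini impurity of a list of nominal values."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def partition(values, labels):
    """Group the labels by the attribute value of each example."""
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(v, []).append(y)
    return list(groups.values())


def information_gain(values, labels):
    n = len(labels)
    remainder = sum(len(g) / n * entropy(g) for g in partition(values, labels))
    return entropy(labels) - remainder


def gain_ratio(values, labels):
    # Normalise by the entropy of the attribute's own value distribution.
    split_info = entropy(values)
    return information_gain(values, labels) / split_info if split_info > 0 else 0.0


def gini_reduction(values, labels):
    n = len(labels)
    weighted = sum(len(g) / n * gini(g) for g in partition(values, labels))
    return gini(labels) - weighted


# Toy data: 'useful' separates the classes, 'row_id' is unique per example.
labels = ["yes", "yes", "no", "no"]
useful = ["x", "x", "y", "y"]
row_id = ["1", "2", "3", "4"]
```

On this toy data both attributes reach the same information gain, which is the bias towards many-valued attributes mentioned above, while the gain ratio of `row_id` is only half that of `useful`.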

Minimal size for split

The size of a node in a Tree is the number of examples in its subset. The size of the root node is equal to the total number of examples in the ExampleSet. Only those nodes are split whose size is greater than or equal to the minimal size for split parameter.

Minimal leaf size

The size of a leaf node in a Tree is the number of examples in its subset. The tree is generated in such a way that every leaf node subset has at least the minimal leaf size number of instances.

Minimal gain

The gain of a node is calculated before splitting it. The node is split if its gain is greater than the minimal gain. A higher value of minimal gain results in fewer splits and thus a smaller tree. A value that is too high prevents splitting entirely, and a tree with a single node is generated.

Maximal depth

The depth of a tree varies depending upon the size and nature of the ExampleSet. This parameter is used to restrict the size of the Tree. The tree generation process is not continued when the tree depth is equal to the maximal depth. If its value is set to '-1', the maximal depth parameter puts no bound on the depth of the tree, and a tree of maximum depth is generated. If its value is set to '1', a Tree with a single node is generated.
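The four size, gain and depth parameters described so far can be read together as a single check that decides whether a node may be split. The following is a minimal sketch under stated assumptions: `may_split` is a hypothetical helper, the default values are illustrative only, and the root is assumed to sit at depth 1 so that a maximal depth of 1 yields a single node.

```python
def may_split(node_size, depth, best_gain, child_sizes,
              minimal_size_for_split=4, minimal_leaf_size=2,
              minimal_gain=0.1, maximal_depth=-1):
    """Return True if a node passes all four checks (illustrative defaults)."""
    # maximal depth: -1 means unbounded; the root is assumed to be at depth 1.
    if maximal_depth != -1 and depth >= maximal_depth:
        return False
    # minimal size for split: the node needs at least this many examples.
    if node_size < minimal_size_for_split:
        return False
    # minimal gain: the gain of the best split must exceed the threshold.
    if best_gain <= minimal_gain:
        return False
    # minimal leaf size: every resulting subset must be large enough.
    if any(size < minimal_leaf_size for size in child_sizes):
        return False
    return True
```

For example, `may_split(10, 1, 0.3, [5, 5])` passes every check, while the same call with `maximal_depth=1` is rejected at the root.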

Confidence

This parameter specifies the confidence level used for the pessimistic error calculation of pruning.
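To give an idea of what a pessimistic error estimate looks like, the sketch below uses the upper confidence bound commonly described for C4.5-style pruning. Whether this operator computes exactly this bound is an assumption; the function name `pessimistic_error` and the confidence value of 0.25 used as default are illustrative.

```python
from math import sqrt
from statistics import NormalDist


def pessimistic_error(errors, n, confidence=0.25):
    """Upper bound on the true error rate of a node with `errors` mistakes in `n` examples."""
    # z is the standard normal quantile leaving `confidence` probability in the upper tail.
    z = NormalDist().inv_cdf(1.0 - confidence)
    f = errors / n  # observed error rate on the training examples
    numerator = f + z * z / (2 * n) + z * sqrt(f / n - f * f / n + z * z / (4 * n * n))
    return numerator / (1 + z * z / n)


# A node that misclassifies 2 of its 6 training examples.
estimate = pessimistic_error(errors=2, n=6)
```

The estimate is always above the observed error rate, and a smaller confidence value makes it more pessimistic, which leads to more aggressive pruning.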

Number of prepruning alternatives

As prepruning runs in parallel with the tree generation process, it may prevent splitting at certain nodes when splitting at that node does not add to the discriminative power of the entire tree. In such a case alternative nodes are tried for splitting. This parameter adjusts the number of alternative nodes tried for splitting when a split is prevented by prepruning at a certain node.

No prepruning

By default the Tree is generated with prepruning. Setting this parameter to true disables the prepruning and delivers a tree without any prepruning.

No pruning

By default the Tree is generated with pruning. Setting this parameter to true disables the pruning and delivers an unpruned Tree.

Guess subset ratio

This parameter specifies if the subset ratio should be guessed or not. If set to true, log(m) + 1 features are used as the subset, otherwise a ratio has to be specified through the subset ratio parameter.

Subset ratio

This parameter specifies the subset ratio of randomly chosen attributes.
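The two ways of sizing the random attribute subset can be sketched as follows. Several details are assumptions, since the text does not state them: m is taken to be the number of available attributes, the logarithm is taken as the natural logarithm, results are truncated to an integer, and at least one attribute is always kept. The helper names are hypothetical.

```python
import math
import random


def subset_size(m, guess_subset_ratio=True, subset_ratio=0.2):
    """Number of attributes to draw from m available attributes."""
    if guess_subset_ratio:
        k = int(math.log(m)) + 1      # assumed: natural log, truncated
    else:
        k = int(subset_ratio * m)     # assumed: ratio of m, truncated
    return max(1, min(m, k))          # keep the result within 1..m


def random_attribute_subset(attributes, rng, **kwargs):
    """Draw the random subset of attributes without replacement."""
    return rng.sample(attributes, subset_size(len(attributes), **kwargs))


attributes = [f"att{i}" for i in range(1, 11)]   # ten made-up attribute names
chosen = random_attribute_subset(attributes, random.Random(0))
```

With ten attributes the guessed size is 3, whereas `guess_subset_ratio=False` with `subset_ratio=0.5` would draw 5 attributes.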

Use local random seed

This parameter indicates if a local random seed should be used for randomization. Using the same value of the local random seed will produce the same randomization.
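The effect of a fixed seed can be demonstrated with Python's standard `random` module. This only illustrates the general idea of seeded randomization and says nothing about the generator the operator uses internally; the attribute names, the `draw` helper and the seed value are made up.

```python
import random

attributes = ["a1", "a2", "a3", "a4", "a5", "a6"]


def draw(seed):
    """Draw three attributes with a local generator seeded by `seed`."""
    rng = random.Random(seed)   # local generator, independent of the global random state
    return rng.sample(attributes, 3)


first_run = draw(1992)
second_run = draw(1992)
```

Both runs return the same three attributes in the same order, because the generator starts from the same seed each time.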

Local random seed

This parameter specifies the local random seed. This parameter is only available if the use local random seed parameter is set to true.