Cross Validation

Synopsis

This Operator performs a cross validation to estimate the statistical performance of a learning model.

Description

It is mainly used to estimate how accurately a model (learned by a particular learning Operator) will perform in practice.

The Cross Validation Operator is a nested Operator. It has two subprocesses: a Training subprocess and a Testing subprocess. The Training subprocess is used for training a model. The trained model is then applied in the Testing subprocess. The performance of the model is measured during the Testing phase.

The input ExampleSet is partitioned intoksubsets of equal size. Of theksubsets, a single subset is retained as the test data set (i.e. input of the Testing subprocess). The remainingk - 1subsets are used as training data set (i.e. input of the Training subprocess). The cross validation process is then repeatedktimes, with each of theksubsets used exactly once as the test data. Thekresults from thekiterations are averaged (or otherwise combined) to produce a single estimation. The valuekcan be adjusted using thenumber of foldsparameter.

The evaluation of the performance of a model on independent test sets yields a good estimation of the performance on unseen data sets. It also shows if 'overfitting' occurs. This means that the model represents the testing data very well, but it does not generalize well for new data. Thus, the performance can be much worse on test data.

Differentiation

Split Validation

This Operator is similar to the Cross Validation Operator but only splits the data into one training and one test set. Hence it is similar to one iteration of the cross validation.

Split Data

This Operator splits an ExampleSet into different subsets. It can be used to manual perform a validation.

Bootstrapping Validation

This Operator is similar to the Cross Validation Operator. Instead of splitting the input ExampleSet into different subset, the Bootstrapping Validation Operator uses bootstrapping sampling to get the training data. Bootstrapping sampling is sampling with replacement.

Wrapper Split Validation

This Operator is similar to the Split Validation Operator. It has an additional Attribute Weighting subprocess to evaluate the attribute weighting method individually.

Wrapper-X-Validation

This Operator is similar to the Cross Validation Operator. It has an additional Attribute Weighting subprocess to evaluate the attribute weighting method individually.

Input

example set

This input port receives an ExampleSet to apply the cross validation.

Output

model

This port delivers the prediction model trained on the whole ExampleSet. Please note that this port should only be connected if you really need this model because otherwise the generation will be skipped.

performance

This is an expandable port. You can connect any performance vector (result of a Performance Operator) to the result port of the inner Testing subprocess. The performance output ports of the Cross Validation Operator deliver the average of the performances over thenumber of foldsiterations.

example set

这个港口返回相同的ExampleSet蜜蜂n given as input.

test result set

This port delivers only an ExampleSet if the test set results port of the inner Testing subprocess is connected. If so, the test sets are merged to one ExampleSet and delivered by this port. For example with this output port it is possible to get the labeled test sets, with the results of the Apply Model Operator.

Parameters

Split on batch attribute

If this parameter is enabled, use the Attribute with the special role 'batch' to partition the data instead of randomly splitting the data. This gives you control over the exact Examples which are used to train the model in each fold. All other split parameters are not available in this case.

Leave one out

If this parameter is enabled, the test set (i.e. the input of the Testing subprocess) is only one Example from the original ExampleSet. The remaining Examples are used as the training data. This is repeated such that each Example in the ExampleSet is used once as the test data. Thus it is repeated 'n' times, where 'n' is the total number of Examples in the ExampleSet. The Cross Validation can take a very long time, as the Training and Testing subprocesses are repeated as many times as the number of Example. If set to true, thenumber of foldsparameter is not available.

Number of folds

This parameter specifies the number of folds (number of subsets) the ExampleSet should be divided into. Each subset has equal number of Examples. Also the number of iterations that will take place is the same as thenumber of folds. If the model output port is connected, the Training subprocess is repeated one more time with all Examples to build the final model.

Sampling type

The Cross Validation Operator can use several types of sampling for building the subsets. Following options are available:

linear_sampling: The linear sampling divides the ExampleSet into partitions without changing the order of the Examples. Subsets with consecutive Examples are created.
shuffled_sampling: The shuffled sampling builds random subsets of the ExampleSet. Examples are chosen randomly for making subsets.
stratified_sampling: The stratified sampling builds random subsets. It ensures that the class distribution (defined by the label Attribute) in the subsets is the same as in the whole ExampleSet. For example in the case of a binominal classification, stratified sampling builds random subsets such that each subset contains roughly the same proportions of the two values of the label Attribute.
automatic: The automated mode uses stratified sampling per default. If it isn't applicable e.g. if the ExampleSet doesn't contain a nominal label, shuffled sampling will be used instead.

Use local random seed

This parameter indicates if alocal random seedshould be used for randomizing Examples of a subset. Using the same value of thelocal random seedwill produce the same subsets. Changing the value of this parameter changes the way Examples are randomized, thus subsets will have a different set of Examples. This parameter is available only if shuffled or stratified sampling is selected. It is not available for linear sampling because it requires no randomization, Examples are selected in sequence.

Local random seed

If theuse local random seedparameter is checked this parameter determines the local random seed. The same subsets will be created every time if the same value is used.

Enable parallel execution

该参数使t的并行执行he inner processes. Please disable the parallel execution if you run into memory problems.