Batch-X-Validation (Deprecated)

Synopsis

This operator performs a cross-validation in order to estimate the statistical performance of a learning operator (usually on unseen data sets). It is mainly used to estimate how accurately a model will perform in practice. This operator does not split the ExampleSet randomly; it splits the ExampleSet on the basis of predefined batches.

Description

The Batch-X-Validation operator is a nested operator. It has two subprocesses: a training subprocess and a testing subprocess. The training subprocess is used for training a model. The trained model is then applied in the testing subprocess. The performance of the model is also measured during the testing phase.

This operator does not split the given ExampleSet randomly like the X-Validation operator. Instead, the ExampleSet is split on the basis of the attribute with batch role. All examples that have the same value of the batch attribute are considered to be one subset. Suppose there are k unique values of the batch attribute, which implies that there are k subsets of the ExampleSet. Of the k subsets, a single subset is retained as the testing data set (i.e. input of the testing subprocess), and the remaining k − 1 subsets are used as the training data set (i.e. input of the training subprocess). The cross-validation process is then repeated k times, with each of the k subsets used exactly once as the testing data. The k results from the k iterations can then be averaged (or otherwise combined) to produce a single estimate. In the X-Validation operator the value of k can be adjusted using the number of validations parameter. In the Batch-X-Validation operator, however, this value is taken from the given ExampleSet (i.e. the number of unique batch values). The batch attribute can be an arbitrary nominal or integer attribute where each possible value occurs at least once (since many learning schemes depend on this minimum number of examples).
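The batch-based splitting described above can be sketched in plain Python. This is an illustration only, not the operator's actual implementation; the function name `batch_folds`, the `batch_key` argument, and the list-of-dicts layout of the examples are assumptions made for this sketch.

```python
from collections import defaultdict


def batch_folds(examples, batch_key="batch"):
    """Yield (batch_value, training, testing) once per unique batch value.

    `examples` is a list of dicts; `batch_key` names the field that plays
    the batch role. Both are hypothetical, chosen for this sketch.
    """
    # Group example indices by batch value, in first-seen order.
    groups = defaultdict(list)
    for index, example in enumerate(examples):
        groups[example[batch_key]].append(index)

    for value, test_indices in groups.items():
        held_out = set(test_indices)
        # The examples of one batch form the testing set ...
        testing = [examples[i] for i in test_indices]
        # ... and the examples of all other batches form the training set.
        training = [ex for i, ex in enumerate(examples) if i not in held_out]
        yield value, training, testing


# Made-up data with three unique batch values, so k = 3.
examples = [
    {"x": 1.0, "label": "a", "batch": 1},
    {"x": 2.0, "label": "a", "batch": 1},
    {"x": 3.0, "label": "b", "batch": 2},
    {"x": 4.0, "label": "b", "batch": 3},
    {"x": 5.0, "label": "a", "batch": 3},
]

folds = list(batch_folds(examples))
for value, training, testing in folds:
    print(value, len(training), len(testing))
```

Note that the number of folds is not passed in anywhere: it follows from the number of distinct values found in the batch field, and the folds may differ in size.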

Usually the learning process optimizes the model parameters to make the model fit the training data as well as possible. If we then take an independent sample of testing data, it will generally turn out that the model does not fit the testing data as well as it fits the training data. This is called 'over-fitting', and is particularly likely to happen when the size of the training data set is small, or when the number of parameters in the model is large. Cross-validation is a way to predict the fit of a model to a hypothetical testing set when an explicit testing set is not available.

Differentiation

Split Validation

In contrast to the usual X-Validation operator, the Batch-X-Validation operator does not (randomly) split the data itself but uses the partition defined by the special attribute with batch role. This can be an arbitrary nominal or integer attribute where each possible value occurs at least once (since many learning schemes depend on this minimum number of examples).

Input

training

This input port expects an ExampleSet for training a model (training data set). The same ExampleSet will be used during the testing subprocess for testing the model.

Output

model

The training subprocess must return a model, which is trained on the input ExampleSet. Please note that a model built on the complete input ExampleSet is delivered from this port.

training

The ExampleSet that was given as input at the training input port is passed unchanged to the output through this port. This is usually used to reuse the same ExampleSet in further operators or to view the ExampleSet in the Results Workspace.

averagable

The testing subprocess must return a Performance Vector. This is usually generated by applying the model and measuring its performance. Two such ports are provided but more can also be used if required. Please note that the statistical performance calculated by this estimation scheme is only an estimate (instead of an exact calculation) of the performance which would be achieved with the model built on the complete delivered data set.
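The relationship between the averaged performance and the delivered model can be illustrated with a small sketch. It assumes a toy majority-class learner and accuracy as the performance measure; the helper names `train_majority` and `accuracy` and the data are hypothetical and are not part of the operator.

```python
from collections import Counter
from statistics import mean


def train_majority(rows):
    """Toy learner: the 'model' is the most frequent label in `rows`."""
    return Counter(label for _, label in rows).most_common(1)[0][0]


def accuracy(model, rows):
    """Fraction of rows whose label equals the constant prediction."""
    return sum(1 for _, label in rows if label == model) / len(rows)


# (batch, label) pairs; the values are made up for illustration.
data = [
    (1, "a"), (1, "a"), (1, "b"),
    (2, "a"), (2, "b"),
    (3, "a"), (3, "a"),
]

per_batch = []
for batch in sorted({b for b, _ in data}):
    # Train on every other batch, test on the held-out batch.
    training = [row for row in data if row[0] != batch]
    testing = [row for row in data if row[0] == batch]
    per_batch.append(accuracy(train_majority(training), testing))

# Averaged performance: an estimate for the model built on all of the data.
estimate = mean(per_batch)

# The delivered model is trained once more on the complete data set.
final_model = train_majority(data)
```

None of the per-batch models is the delivered model; each only contributes one performance value, which is why the averaged figure is an estimate rather than an exact measurement for `final_model`.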

Parameters

Average performances only

This is an expert parameter which indicates whether only performance vectors should be averaged, or all types of averagable result vectors.