Split Validation

Synopsis

This operator performs a simple validation, i.e. it randomly splits the ExampleSet into a training set and a test set and evaluates the model. The split validation estimates the performance of a learning operator (usually on unseen data sets). It is mainly used to estimate how accurately a model (learnt by a particular learning operator) will perform in practice.

Description

The Split Validation operator is a nested operator. It has two subprocesses: a training subprocess and a testing subprocess. The training subprocess is used for learning or building a model. The trained model is then applied in the testing subprocess. The performance of the model is also measured during the testing phase.

The input ExampleSet is partitioned into two subsets. One subset is used as the training set and the other one is used as the test set. The sizes of the two subsets can be adjusted through different parameters. The model is learned on the training set and is then applied on the test set. This is done in a single iteration, as compared to the Cross Validation operator, which iterates a number of times using different subsets for testing and training purposes.
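The single-iteration scheme described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the operator's implementation: the function name split_validation, the majority-class learner standing in for the training subprocess, and the default seed value are all hypothetical.

```python
import random
from collections import Counter


def split_validation(examples, labels, split_ratio=0.7, seed=42):
    """Hold-out sketch: one shuffled split, train once, test once."""
    indices = list(range(len(examples)))
    random.Random(seed).shuffle(indices)
    n_train = round(split_ratio * len(indices))
    train_idx, test_idx = indices[:n_train], indices[n_train:]

    # Stand-in for the training subprocess: learn the majority class.
    majority = Counter(labels[i] for i in train_idx).most_common(1)[0][0]

    def model(example):
        return majority

    # Stand-in for the testing subprocess: apply the model, measure accuracy.
    correct = sum(model(examples[i]) == labels[i] for i in test_idx)
    accuracy = correct / len(test_idx)
    return model, accuracy, train_idx, test_idx
```

A cross validation would repeat the body of this function several times with different test subsets and average the results; here it runs exactly once.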

Usually the learning process optimizes the model parameters to make the model fit the training data as well as possible. If we then take an independent sample of testing data, it will generally turn out that the model does not fit the testing data as well as it fits the training data. This is called 'over-fitting', and is particularly likely to happen when the size of the training data set is small, or when the number of parameters in the model is large. Split Validation is a way to predict the fit of a model to a hypothetical testing set when an explicit testing set is not available. The Split Validation operator also allows training on one data set and testing on another explicit testing data set.
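The gap between training fit and testing fit can be made concrete with a deliberately extreme example. The sketch below assumes a hypothetical rote learner (memorizer_fit) and synthetic data whose labels are pure noise; neither comes from the operator itself.

```python
import random


def memorizer_fit(train):
    """Learn by rote: store every (example, label) pair seen in training."""
    table = dict(train)
    default = train[0][1]  # arbitrary fallback for unseen examples
    return lambda x: table.get(x, default)


def accuracy(model, data):
    return sum(model(x) == y for x, y in data) / len(data)


rng = random.Random(0)
# Labels are random, so there is nothing real to generalize from.
data = [(i, rng.choice(["yes", "no"])) for i in range(200)]
train, test = data[:100], data[100:]

model = memorizer_fit(train)
train_accuracy = accuracy(model, train)  # perfect, because every pair was memorized
test_accuracy = accuracy(model, test)    # only as good as the fallback label
```

Measuring performance on the training data alone would report a perfect model here, which is exactly the over-optimism that a held-out test set is meant to expose.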

Input

training example set

This input port expects an ExampleSet for training a model (training data set). The same ExampleSet will be used during the testing subprocess for testing the model if no other data set is provided.

Output

model

The training subprocess must return a model, which is trained on the input ExampleSet. Please note that the model built on the complete input ExampleSet is delivered from this port.

training example set

The ExampleSet that was given as input at the training input port is passed without changing to the output through this port. This is usually used to reuse the same ExampleSet in further operators or to view the ExampleSet in the Results Workspace.

averagable

The testing subprocess must return a Performance Vector. This is usually generated by applying the model and measuring its performance. Two such ports are provided but more can also be used if required. Please note that the performance calculated by this estimation scheme is only an estimate (instead of an exact calculation) of the performance which would be achieved with the model built on the complete delivered data set.
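The relationship between the delivered model and the reported performance can be sketched as follows. This is a minimal illustration assuming a linear split and a hypothetical majority-class learner (fit_majority); the names are not part of the operator.

```python
from collections import Counter


def fit_majority(labels):
    """Hypothetical learner: the 'model' is simply the most frequent label."""
    return Counter(labels).most_common(1)[0][0]


def validate_then_refit(labels, n_train):
    train, test = labels[:n_train], labels[n_train:]

    # The performance is measured with a model trained on the training subset only.
    split_model = fit_majority(train)
    estimate = sum(split_model == y for y in test) / len(test)

    # The delivered model is built on the complete input.
    final_model = fit_majority(labels)
    return final_model, estimate
```

The returned estimate therefore describes a model trained on fewer examples than the one that is delivered, which is why it is an estimate rather than an exact figure.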

Parameters

Split

This parameter specifies how the ExampleSet should be split.

relative: If a relative split is required, the relative size of the training set should be provided in the split ratio parameter. The relative size of the test set is then calculated automatically by subtracting the value of the split ratio from 1.
absolute: If an absolute split is required, you have to specify the exact number of examples to use in the training or test set in the training set size parameter or in the test set size parameter. If either of these parameters is set to -1, its value is calculated automatically using the other one.

Split ratio

This parameter is only available when the split parameter is set to 'relative'. It specifies the relative size of the training set. It should be between 0 and 1, where 1 means that the entire ExampleSet will be used as training set.

Training set size

This parameter is only available when the split parameter is set to 'absolute'. It specifies the exact number of examples to be used as training set. If it is set to -1, the number of examples given by the test set size parameter will be used for the test set and the remaining examples will be used as training set.

Test set size

This parameter is only available when the split parameter is set to 'absolute'. It specifies the exact number of examples to be used as test set. If it is set to -1, the number of examples given by the training set size parameter will be used for the training set and the remaining examples will be used as test set.
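The size rules for the relative and absolute modes can be summarized in a short Python sketch. The function resolve_split_sizes is hypothetical, and two details are assumptions rather than documented behavior: the rounding used in relative mode, and the rejection of sizes that exceed the number of examples.

```python
def resolve_split_sizes(n_examples, split="relative", split_ratio=0.7,
                        training_set_size=-1, test_set_size=-1):
    """Return (n_train, n_test) for the given split settings."""
    if split == "relative":
        if not 0.0 <= split_ratio <= 1.0:
            raise ValueError("split ratio must be between 0 and 1")
        n_train = round(split_ratio * n_examples)  # rounding is an assumption
        return n_train, n_examples - n_train

    if split == "absolute":
        if training_set_size == -1 and test_set_size == -1:
            raise ValueError("at least one of the two sizes must be given")
        # A value of -1 means: derive this size from the other one.
        if training_set_size == -1:
            training_set_size = n_examples - test_set_size
        elif test_set_size == -1:
            test_set_size = n_examples - training_set_size
        if (training_set_size < 0 or test_set_size < 0
                or training_set_size + test_set_size > n_examples):
            raise ValueError("sizes do not fit the ExampleSet")
        return training_set_size, test_set_size

    raise ValueError("split must be 'relative' or 'absolute'")
```

For an ExampleSet of 100 examples, a split ratio of 0.7 yields 70 training and 30 test examples, while an absolute split with a test set size of 25 and a training set size of -1 yields 75 training examples.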

Sampling type

The Split Validation operator can use several types of sampling for building the subsets. The following options are available:

linear_sampling: The linear sampling simply divides the ExampleSet into partitions without changing the order of the examples i.e. subsets with consecutive examples are created.
shuffled_sampling: The shuffled sampling builds random subsets of the ExampleSet. Examples are chosen randomly for making subsets.
stratified_sampling: The stratified sampling builds random subsets and ensures that the class distribution in the subsets is the same as in the whole ExampleSet. For example, in the case of a binominal classification, stratified sampling builds random subsets such that each subset contains roughly the same proportions of the two values of the class labels.
automatic: The automatic mode uses stratified sampling by default. If it isn't applicable, e.g. if the ExampleSet doesn't contain a nominal label, shuffled sampling will be used instead.
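The three sampling types can be compared with a small Python sketch that works on example indices. The function names and the default seed are hypothetical, and the per-class rounding in the stratified variant is an assumption made for illustration.

```python
import random
from collections import defaultdict


def linear_split(n, n_train):
    """Keep the original order: the first n_train examples form the training set."""
    idx = list(range(n))
    return idx[:n_train], idx[n_train:]


def shuffled_split(n, n_train, seed=42):
    """Choose examples randomly by shuffling the indices before cutting."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return idx[:n_train], idx[n_train:]


def stratified_split(labels, split_ratio, seed=42):
    """Split each class separately so both subsets keep the class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)

    train, test = [], []
    for members in by_class.values():
        rng.shuffle(members)
        cut = round(split_ratio * len(members))
        train.extend(members[:cut])
        test.extend(members[cut:])
    return train, test
```

With 20 positive and 80 negative examples and a split ratio of 0.5, the stratified variant places 10 positive and 40 negative examples in each subset, whereas a linear split of the same ordered data would put every positive example into the training set.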

Use local random seed

Indicates if a local random seed should be used for randomizing examples of a subset. Using the same value of local random seed will produce the same subsets. Changing the value of this parameter changes the way examples are randomized, so the subsets will contain a different set of examples. This parameter is only available if shuffled or stratified sampling is selected. It is not available for linear sampling because linear sampling requires no randomization; examples are selected in sequence.

Local random seed

This parameter specifies the local random seed. This parameter is only available if the use local random seed parameter is set to true.
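The effect of a fixed seed on reproducibility can be shown with a minimal Python sketch. The function shuffled_indices and its default seed value are hypothetical; the sketch only mirrors the idea of the two seed parameters and makes no claim about how the operator generates random numbers.

```python
import random


def shuffled_indices(n, use_local_random_seed=False, local_random_seed=42):
    """Return a shuffled list of example indices."""
    if use_local_random_seed:
        rng = random.Random(local_random_seed)  # fixed seed: repeatable order
    else:
        rng = random.Random()  # unseeded: order may differ between runs
    idx = list(range(n))
    rng.shuffle(idx)
    return idx
```

Calling the function twice with the same seed returns the same order, so the resulting subsets contain the same examples. A different seed produces a different order.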