X-Validation (Deprecated)

Synopsis

This operator performs a cross-validation in order to estimate the statistical performance of a learning operator (usually on unseen data sets). It is mainly used to estimate how accurately a model (learnt by a particular learning operator) will perform in practice.

Description

The X-Validation operator is a nested operator. It has two subprocesses: a training subprocess and a testing subprocess. The training subprocess is used for training a model. The trained model is then applied in the testing subprocess. The performance of the model is also measured during the testing phase.

The input ExampleSet is partitioned into k subsets of equal size. Of the k subsets, a single subset is retained as the testing data set (i.e. input of the testing subprocess), and the remaining k − 1 subsets are used as the training data set (i.e. input of the training subprocess). The cross-validation process is then repeated k times, with each of the k subsets used exactly once as the testing data. The k results from the k iterations can then be averaged (or otherwise combined) to produce a single estimation. The value k can be adjusted using the number of validations parameter.
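The procedure above can be sketched in plain Python. This is only an illustration, not the operator's implementation: the function names, the consecutive (linear) partitioning, and the majority-class learner are all assumptions made for the example.

```python
from statistics import mean

def k_fold_indices(n_examples, k):
    """Split indices 0..n_examples-1 into k consecutive folds of near-equal size."""
    base, extra = divmod(n_examples, k)
    folds, start = [], 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)  # the first `extra` folds get one more example
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def cross_validate(examples, labels, k, train, evaluate):
    """Run k iterations; each fold is used exactly once as the testing data."""
    scores = []
    for test_idx in k_fold_indices(len(examples), k):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(examples)) if i not in held_out]
        model = train([examples[i] for i in train_idx],
                      [labels[i] for i in train_idx])
        scores.append(evaluate(model,
                               [examples[i] for i in test_idx],
                               [labels[i] for i in test_idx]))
    return mean(scores), scores  # averaged estimate plus the k individual results

# Hypothetical learner: always predicts the most frequent training label.
def train_majority(xs, ys):
    return max(sorted(set(ys)), key=ys.count)

def accuracy(model, xs, ys):
    return sum(1 for y in ys if y == model) / len(ys)

examples = list(range(10))
labels = ["a"] * 7 + ["b"] * 3
average, per_fold = cross_validate(examples, labels, k=5,
                                   train=train_majority, evaluate=accuracy)
```

Here k=5 plays the role of the number of validations parameter, and `average` is the single combined estimate.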

Usually the learning process optimizes the model to make it fit the training data as well as possible. If we then test this model on an independent set of data, it mostly does not perform as well as it did on the data that was used to generate it. This is called 'over-fitting'. The X-Validation operator estimates the fit of a model to hypothetical testing data. This can be especially useful when separate testing data is not available.
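A deliberately extreme sketch of over-fitting, assuming a hypothetical learner that simply memorizes its training examples: it is perfect on the data it was built from and noticeably worse on independent data. The data values and function names are invented for the illustration.

```python
def train_memorizer(xs, ys):
    """Memorize every training example; fall back to the majority label otherwise."""
    table = dict(zip(xs, ys))
    fallback = max(sorted(set(ys)), key=ys.count)
    return table, fallback

def predict(model, x):
    table, fallback = model
    return table.get(x, fallback)

def accuracy(model, xs, ys):
    return sum(predict(model, x) == y for x, y in zip(xs, ys)) / len(ys)

# Made-up data: the label has no real relationship to the attribute value.
train_x, train_y = [1, 2, 3, 4, 5, 6], ["a", "b", "a", "a", "b", "a"]
test_x, test_y = [7, 8, 9, 10], ["a", "b", "b", "a"]

model = train_memorizer(train_x, train_y)
train_accuracy = accuracy(model, train_x, train_y)  # measured on the data used to build the model
test_accuracy = accuracy(model, test_x, test_y)     # measured on independent data
```

Measuring performance only on the training data would report a perfect model here, which is why the estimate is taken on held-out data.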

Differentiation

X-Prediction

The X-Validation and X-Prediction operators work in the same way. The major difference is the object returned by each operator: the X-Validation operator returns a performance vector, whereas the X-Prediction operator returns a labeled ExampleSet.
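The difference in output can be illustrated with one cross-validation loop that yields both kinds of result. This is a sketch under assumptions: the round-robin fold assignment, the majority-class learner, and all names are hypothetical and do not reflect how either operator is implemented.

```python
def out_of_fold_predictions(labels, k):
    """Return one prediction per example, made by a model that never saw that example."""
    n = len(labels)
    predictions = [None] * n
    for fold in range(k):
        # Hypothetical fold assignment: example i belongs to fold i % k.
        test_idx = [i for i in range(n) if i % k == fold]
        train_labels = [labels[i] for i in range(n) if i % k != fold]
        majority = max(sorted(set(train_labels)), key=train_labels.count)
        for i in test_idx:
            predictions[i] = majority
    return predictions

labels = ["yes", "yes", "no", "yes", "yes", "no", "yes", "yes", "yes"]
predictions = out_of_fold_predictions(labels, k=3)

# X-Prediction style result: every example paired with its prediction.
labeled_example_set = list(zip(labels, predictions))

# X-Validation style result: a single performance figure.
performance = sum(truth == guess for truth, guess in labeled_example_set) / len(labels)
```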

Input

training example set

This input port expects an ExampleSet for training a model (training data set). The same ExampleSet will be used during the testing subprocess for testing the model.

Output

model

The training subprocess must return a model, which is trained on the input ExampleSet. Please note that the model delivered from this port is built on the complete input ExampleSet.

training example set

The ExampleSet that was given as input at the training input port is passed without changing to the output through this port. This is usually used to reuse the same ExampleSet in further operators or to view the ExampleSet in the Results Workspace.

averagable

The testing subprocess must return a Performance Vector. This is usually generated by applying the model and measuring its performance. Two such ports are provided but more can also be used if required. Please note that the statistical performance calculated by this estimation scheme is only an estimate (instead of an exact calculation) of the performance which would be achieved with the model built on the complete delivered data set.

Parameters

Average performances only

This is an expert parameter which indicates if only performance vectors should be averaged or all types of averagable result vectors.

Leave one out

As the name suggests, the leave one out cross-validation involves using a single example from the original ExampleSet as the testing data (in the testing subprocess), and the remaining examples as the training data (in the training subprocess). This is repeated such that each example in the ExampleSet is used once as the testing data. Thus, it is repeated n times, where n is the total number of examples in the ExampleSet. This is the same as applying the X-Validation operator with the number of validations parameter set equal to the number of examples in the original ExampleSet. This is usually very expensive computationally for large ExampleSets, because the training process is repeated once per example. If set to true, the number of validations parameter is ignored.
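A minimal sketch of the leave one out splits, assuming examples are addressed by index (the function name is hypothetical). It makes the cost visible: n examples produce n splits, so the model is trained n times.

```python
def leave_one_out_splits(n_examples):
    """Yield (train_indices, test_index) pairs, one per example."""
    for test_index in range(n_examples):
        train_indices = [i for i in range(n_examples) if i != test_index]
        yield train_indices, test_index

splits = list(leave_one_out_splits(4))
number_of_trainings = len(splits)  # equals the number of examples
```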

Number of validations

This parameter specifies the number of subsets the ExampleSet should be divided into (each subset has an equal number of examples). The same number of iterations will take place. Each iteration involves training a model and testing that model. If this is set equal to the total number of examples in the ExampleSet, it is equivalent to the X-Validation operator with the leave one out parameter set to true.

Sampling type

The X-Validation operator can use several types of sampling for building the subsets. The following options are available:

linear_sampling: The linear sampling simply divides the ExampleSet into partitions without changing the order of the examples i.e. subsets with consecutive examples are created.
shuffled_sampling: The shuffled sampling builds random subsets of the ExampleSet. Examples are chosen randomly for making subsets.
stratified_sampling: The stratified sampling builds random subsets and ensures that the class distribution in the subsets is the same as in the whole ExampleSet. For example, in the case of a binominal classification, stratified sampling builds random subsets such that each subset contains roughly the same proportions of the two values of class labels.
automatic: The automatic mode uses stratified sampling by default. If it isn't applicable, e.g., if the ExampleSet doesn't contain a nominal label, shuffled sampling will be used instead.
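The four options can be sketched as follows. This is an illustration only: the function names are hypothetical, the round-robin dealing in the stratified case is one possible way to balance classes, and treating string labels as "nominal" is an assumption of the example.

```python
import random
from collections import defaultdict

def linear_sampling(n_examples, k):
    """Consecutive blocks; the original order of the examples is kept."""
    base, extra = divmod(n_examples, k)
    folds, start = [], 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def shuffled_sampling(n_examples, k, seed):
    """Random subsets: shuffle all indices, then deal them into k folds."""
    rng = random.Random(seed)
    indices = list(range(n_examples))
    rng.shuffle(indices)
    return [sorted(indices[fold::k]) for fold in range(k)]

def stratified_sampling(labels, k, seed):
    """Random subsets that keep roughly the same class proportions in every fold."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    folds = [[] for _ in range(k)]
    for label in sorted(by_class):
        members = by_class[label]
        rng.shuffle(members)
        for position, index in enumerate(members):
            folds[position % k].append(index)  # deal each class round-robin
    return [sorted(fold) for fold in folds]

def automatic_sampling(labels, k, seed):
    """Stratified if the label looks nominal (assumed here: all strings), else shuffled."""
    if all(isinstance(label, str) for label in labels):
        return stratified_sampling(labels, k, seed)
    return shuffled_sampling(len(labels), k, seed)

labels = ["pos"] * 6 + ["neg"] * 3
stratified = stratified_sampling(labels, k=3, seed=42)
```

With six "pos" and three "neg" examples and k=3, every stratified fold holds two "pos" and one "neg" example, matching the 2:1 ratio of the whole set.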

Use local random seed

This parameter indicates if a local random seed should be used for randomizing examples of a subset. Using the same value of the local random seed will produce the same subsets. Changing the value of this parameter changes the way examples are randomized, thus subsets will have a different set of examples. This parameter is available only if Shuffled or Stratified sampling is selected. It is not available for Linear sampling, which requires no randomization because examples are selected in sequence.

Local random seed

This parameter specifies the local random seed. This parameter is available only if the use local random seed parameter is set to true.
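The effect of a local seed can be sketched with Python's standard random module. The function name and the seed values are arbitrary choices for the illustration, not defaults of the operator.

```python
import random

def shuffled_folds(n_examples, k, seed):
    """Build k random folds using a generator that is local to this call."""
    rng = random.Random(seed)  # local generator; the global random state is untouched
    indices = list(range(n_examples))
    rng.shuffle(indices)
    return [indices[fold::k] for fold in range(k)]

first = shuffled_folds(12, 3, seed=1992)
second = shuffled_folds(12, 3, seed=1992)  # same seed: identical subsets
other = shuffled_folds(12, 3, seed=7)      # different seed: generally different subsets
```

Because the generator is created from the seed inside the function, repeating a run with the same seed reproduces the same subsets regardless of any other randomization elsewhere in the program.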