
Optimize Selection

Synopsis

This operator selects the most relevant attributes of the given ExampleSet. Two deterministic greedy feature selection algorithms, 'forward selection' and 'backward elimination', are available for this purpose.

Description

The feature selection problem, i.e. the selection of the most relevant features for classification or regression problems, is one of the main data mining tasks. A wide range of search methods has been integrated into RapidMiner, including evolutionary algorithms. All search methods need a performance measurement which indicates how well a search point (a feature subset) will probably perform on the given data set.

A deterministic algorithm is an algorithm which, in informal terms, behaves predictably. Given a particular input, it will always produce the same output, and the underlying machine will always pass through the same sequence of states.

A greedy algorithm is an algorithm that follows the problem solving heuristic of making the locally optimal choice at each stage with the hope of finding a global optimum. On some problems, a greedy strategy may not produce an optimal solution, but nonetheless a greedy heuristic may yield locally optimal solutions that approximate a global optimal solution.
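A minimal illustration of this point, independent of this operator, is greedy coin change: the locally optimal choice (always take the largest coin that fits) does not always produce the globally optimal solution.

```python
def greedy_change(amount, coins):
    """Greedy coin change: repeatedly take the largest coin that still fits.
    Each step makes the locally optimal choice described above."""
    picked = []
    for coin in sorted(coins, reverse=True):
        while amount >= coin:
            amount -= coin
            picked.append(coin)
    return picked

# With coins {1, 3, 4}, greedy change for 6 is [4, 1, 1] (three coins),
# while the optimum is [3, 3] (two coins): greedy is fast but not always optimal.
```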

This operator implements the two deterministic greedy feature selection algorithms 'forward selection' and 'backward elimination'. However, some enhancements to the standard algorithms have been added, which are described below:

Forward Selection

  1. Create an initial population with n individuals, where n is the number of attributes in the input ExampleSet. Each individual uses exactly one of the attributes.
  2. Evaluate the attribute sets and select only the best k.
  3. For each of the k attribute sets: if there are j unused attributes, make j copies of the attribute set and add exactly one of the previously unused attributes to each copy.
  4. As long as the performance improved in the last p iterations, go to step 2.
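The forward selection loop above can be sketched in Python. All names here (`evaluate`, `keep_best`, `generations_without_improval`, `max_generations`) are illustrative stand-ins for the operator's subprocess evaluation and parameters, not RapidMiner API:

```python
def forward_selection(attributes, evaluate, keep_best=1,
                      generations_without_improval=1, max_generations=-1):
    """Greedy forward selection sketch. evaluate() scores an attribute subset
    (e.g. cross-validated accuracy); -1 disables the respective limit."""
    # Step 1: initial population, one individual per single attribute.
    population = [frozenset([a]) for a in attributes]
    best_score, best_set = float("-inf"), frozenset()
    stale = 0        # generations without improvement
    generation = 0
    while population:
        generation += 1
        # Step 2: evaluate the attribute sets and keep only the best k.
        survivors = sorted(population, key=evaluate, reverse=True)[:keep_best]
        top = evaluate(survivors[0])
        if top > best_score:
            best_score, best_set, stale = top, survivors[0], 0
        else:
            stale += 1
        # Step 4: stop when performance stalled or the generation limit is hit.
        if generations_without_improval != -1 and stale >= generations_without_improval:
            break
        if max_generations != -1 and generation >= max_generations:
            break
        # Step 3: copy each survivor once per unused attribute and add it.
        population = [s | {a} for s in survivors for a in attributes if a not in s]
    return best_set, best_score
```

With `keep_best=1` and `generations_without_improval=1` this reduces to standard forward selection, matching the default parameter values described below.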

Backward Elimination

  1. Start with an attribute set which uses all attributes.
  2. Evaluate all attribute sets and select the best k.
  3. For each of the k attribute sets: if there are j attributes in use, make j copies of the attribute set and remove exactly one of the used attributes from each copy.
  4. As long as the performance improved in the last p iterations, go to step 2.
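The backward elimination variant can be sketched analogously; again, `evaluate` and all other names are illustrative, not RapidMiner API:

```python
def backward_elimination(attributes, evaluate, keep_best=1,
                         generations_without_improval=1):
    """Greedy backward elimination sketch. evaluate() scores an attribute subset."""
    # Step 1: start with the attribute set that uses all attributes.
    population = [frozenset(attributes)]
    best_score, best_set = float("-inf"), frozenset(attributes)
    stale = 0        # generations without improvement
    while population:
        # Step 2: evaluate all attribute sets and keep the best k.
        survivors = sorted(population, key=evaluate, reverse=True)[:keep_best]
        top = evaluate(survivors[0])
        if top > best_score:
            best_score, best_set, stale = top, survivors[0], 0
        else:
            stale += 1
        # Step 4: stop once performance stalled for p generations.
        if stale >= generations_without_improval:
            break
        # Step 3: copy each survivor once per used attribute and remove it.
        population = [s - {a} for s in survivors for a in s if len(s) > 1]
    return best_set, best_score
```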

The parameter k can be specified by the keep best parameter; the parameter p can be specified by the generations without improval parameter. Both parameters have a default value of 1, which means that the standard selection algorithms are used. Using other values increases the runtime but might help to avoid local extrema in the search for the global optimum.

Another unusual parameter is the maximum number of generations parameter. This parameter bounds the number of iterations to this maximum of feature selections / de-selections. In combination with the generations without improval parameter, this allows several different selection schemes (described here for forward selection; backward elimination works analogously):

maximum number of generations = m and generations without improval = p:

Selects at most m features. The selection stops if no performance improvement was measured in the last p generations.

maximum number of generations = -1 and generations without improval = p:

Tries to select new features until no performance improvement was measured in the last p generations.

maximum number of generations = m and generations without improval = -1:

Selects at most m features. The selection is not stopped until all combinations with at most m features have been tried. However, the result might contain fewer features than that.

maximum number of generations = -1 and generations without improval = -1:

Tests all combinations of attributes (brute force; this might take a very long time and should only be applied to small attribute sets).
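The four schemes above reduce to a single combined stop criterion, where the value -1 disables the respective limit. The following helper is an illustrative sketch of that logic, not RapidMiner API:

```python
def should_stop(generation, stale_generations,
                max_generations, generations_without_improval):
    """Combined stop criterion for the selection loop (illustrative)."""
    if max_generations != -1 and generation >= max_generations:
        return True   # reached the maximal number of selections / de-selections
    if (generations_without_improval != -1
            and stale_generations >= generations_without_improval):
        return True   # no improvement measured in the last p generations
    return False      # with both limits set to -1 this never stops: brute force
```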

Differentiation

Optimize Selection (Evolutionary)

This is also an attribute set reduction operator but it uses a genetic algorithm for this purpose.

Input

example set in

This input port expects an ExampleSet. This ExampleSet is available at the first port of the nested chain (inside the subprocess) for processing in the subprocess.

through

This operator can have multiple through ports. When one input is connected with a through port, another through port becomes available, ready to accept another input (if any). The order of inputs remains the same: the object supplied at the first through port of this operator is available at the first through port of the nested chain (inside the subprocess). Do not forget to connect all inputs in the correct order, and make sure that you have connected the right number of ports at the subprocess level.

Output

example set out

The feature selection algorithm is applied to the input ExampleSet. The resulting ExampleSet with the reduced attribute set is delivered through this port.

weights

The attribute weights are delivered through this port.

performance

This port delivers the Performance Vector for the selected attributes. A Performance Vector is a list of performance criteria values.

Parameters

Selection direction

This parameter specifies which of the 'forward selection' and 'backward elimination' algorithms should be used.

Limit generations without improval

This parameter indicates if the optimization should be aborted after this number of generations without improvement. If unchecked, the maximal number of generations is always used.

Generations without improval

This parameter is only available when the limit generations without improval parameter is set to true. It specifies the stop criterion for early stopping, i.e. the algorithm stops after n generations without improvement in the performance, where n is the value of this parameter.

Limit number of generations

This parameter indicates if the number of generations should be limited to a specific number.

Keep best

The best n individuals are kept in each generation, where n is the value of this parameter.

Maximum number of generations

This parameter is only available when the limit number of generations parameter is set to true. It specifies the number of generations after which the algorithm should be terminated.

Normalize weights

This parameter indicates if the final weights should be normalized. If set to true, the final weights are normalized such that the maximum weight is 1 and the minimum weight is 0.
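Assuming this is standard min-max normalization (an assumption based on the description above), the mapping could look like:

```python
def normalize_weights(weights):
    """Min-max normalization sketch: maps the largest weight to 1 and the
    smallest to 0, as described above (illustrative, not RapidMiner API)."""
    lo, hi = min(weights.values()), max(weights.values())
    if hi == lo:
        # All weights equal: map them all to 1 by convention (an assumption).
        return {name: 1.0 for name in weights}
    return {name: (w - lo) / (hi - lo) for name, w in weights.items()}
```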

Use local random seed

This parameter indicates if a local random seed should be used for randomization. Using the same value of local random seed will produce the same randomization.

Local random seed

This parameter specifies the local random seed and is only available if the use local random seed parameter is set to true.

Show stop dialog

This parameter determines if a dialog with a stop button should be displayed which stops the search for the best feature space. If the search is stopped, the best individual found so far is returned.

User result individual selection

If this parameter is set to true, it allows the user to select the final result individual from the last population.

Show population plotter

This parameter determines if the current population should be displayed in the performance space.

Plot generations

This parameter is only available when the show population plotter parameter is set to true. The population plotter is updated in these generations.

Constraint draw range

This parameter is only available when the show population plotter parameter is set to true. This parameter determines if the draw range of the population plotter should be constrained between 0 and 1.

Draw dominated points

This parameter is only available when the show population plotter parameter is set to true. This parameter determines if only points which are not Pareto dominated should be drawn on the population plotter.

Population criteria data file

This parameter specifies the path to the file in which the criteria data of the final population should be saved.

Maximal fitness

This parameter specifies the maximal fitness. The optimization will stop if the fitness reaches this value.
