Optimize Selection (Evolutionary)
Synopsis
This operator selects the most relevant attributes of the given ExampleSet. A Genetic Algorithm is used for feature selection.
Description
Feature selection, i.e. the question of which features are most relevant for a classification or regression problem, is one of the main data mining tasks. A wide range of search methods has been integrated into RapidMiner, including evolutionary algorithms. All search methods need a performance measure that indicates how well a search point (a feature subset) is likely to perform on the given data set.
A genetic algorithm (GA) is a search heuristic that mimics the process of natural evolution. This heuristic is routinely used to generate useful solutions to optimization and search problems. Genetic algorithms belong to the larger class of evolutionary algorithms (EA), which generate solutions to optimization problems using techniques inspired by natural evolution, such as inheritance, mutation, selection, and crossover.
In a genetic algorithm for feature selection, 'mutation' means switching features on and off, and 'crossover' means interchanging used features. Selection is done by the selection scheme chosen with the selection scheme parameter. A genetic algorithm works as follows:
1. Generate an initial population consisting of p individuals. Each attribute is switched on with probability p_i. The numbers p and p_i can be adjusted by the population size and p initialize parameters respectively.
2. For all individuals in the population:
- Perform mutation, i.e. set used attributes to unused with probability p_m and vice versa. The probability p_m can be adjusted by the p mutation parameter.
- Choose two individuals from the population and perform crossover with probability p_c. The probability p_c can be adjusted by the p crossover parameter. The type of crossover can be selected by the crossover type parameter.
3. Perform selection: map all individuals according to their fitness and draw p individuals at random according to their probability, where p is the population size, which can be adjusted by the population size parameter.
4. As long as the fitness improves, go to step 2.
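The steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not RapidMiner's implementation: the names `evolve`, `toy_fitness`, and `patience` are hypothetical, crossover is assumed to be uniform, selection is assumed to be fitness-proportional over parents and offspring, and the fitness function is a made-up stand-in for the performance measured inside the subprocess.

```python
import random

def evolve(n_attributes, fitness, population_size=5, p_initialize=0.5,
           p_mutation=-1.0, p_crossover=0.5, max_generations=30,
           patience=2, seed=1):
    """Toy genetic feature selection; returns (best_mask, best_fitness)."""
    rng = random.Random(seed)
    # A negative p_mutation is read as "use 1/n", as described for p mutation.
    p_m = 1.0 / n_attributes if p_mutation < 0 else p_mutation

    # Step 1: each attribute is switched on with probability p_initialize.
    population = [[rng.random() < p_initialize for _ in range(n_attributes)]
                  for _ in range(population_size)]
    best_mask, best_fit, stale = None, float("-inf"), 0

    for _ in range(max_generations):
        # Step 2a: mutation flips each attribute with probability p_m.
        children = [[(not bit) if rng.random() < p_m else bit for bit in ind]
                    for ind in population]
        # Step 2b: uniform crossover (assumed type) with probability p_crossover.
        for child in children:
            if rng.random() < p_crossover:
                partner = rng.choice(children)
                for k in range(n_attributes):
                    if rng.random() < 0.5:
                        child[k], partner[k] = partner[k], child[k]

        # Step 3: draw the next population with probability proportional
        # to (shifted) fitness; the small offset keeps all weights positive.
        pool = population + children
        scores = [fitness(ind) for ind in pool]
        low = min(scores)
        weights = [s - low + 1e-9 for s in scores]
        population = rng.choices(pool, weights=weights, k=population_size)

        # Step 4: continue only while the best fitness keeps improving.
        top = max(scores)
        if top > best_fit:
            best_mask, best_fit, stale = list(pool[scores.index(top)]), top, 0
        else:
            stale += 1
            if stale >= patience:
                break
    return best_mask, best_fit

def toy_fitness(mask):
    """Hypothetical fitness: attributes 0 and 2 help, all others hurt slightly."""
    relevant = {0, 2}
    hits = sum(1 for i, on in enumerate(mask) if on and i in relevant)
    noise = sum(1 for i, on in enumerate(mask) if on and i not in relevant)
    return hits - 0.1 * noise

mask, score = evolve(6, toy_fitness)
print(mask, score)
```

In the operator itself, the fitness of an individual is the performance delivered by the nested subprocess for that attribute subset, so each evaluation is far more expensive than in this toy example.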
If the ExampleSet contains value series attributes with block numbers, the whole block will be switched on and off together. The exact, minimum, or maximum number of attributes in the combinations to be tested can be specified by the appropriate parameters. Many other options are also available for this operator; see the parameters section for more information.
Input
example set in
This input port expects an ExampleSet. This ExampleSet is available at the first port of the nested chain (inside the subprocess) for processing in the subprocess.
attribute weights in
This port expects attribute weights. It is not compulsory to use this port.
through
This operator can have multiple through ports. When one input is connected to a through port, another through port becomes available, ready to accept another input (if any). The order of inputs remains the same. The object supplied at the first through port of this operator is available at the first through port of the nested chain (inside the subprocess). Do not forget to connect all inputs in the correct order. Make sure that you have connected the right number of ports at the subprocess level.
Output
example set out
The genetic algorithm is applied to the input ExampleSet. The resulting ExampleSet with a reduced set of attributes is delivered through this port.
weights
The attribute weights are delivered through this port.
performance
This port delivers the Performance Vector for the selected attributes. A Performance Vector is a list of performance criteria values.
Parameters
Use exact number of attributes
This parameter determines if only combinations containing an exact number of attributes should be tested. The exact number is specified by the exact number of attributes parameter.
Exact number of attributes
This parameter is only available when the use exact number of attributes parameter is set to true. Only combinations containing this number of attributes will be generated and tested.
Restrict maximum
If set to true, the maximum number of attributes whose combinations will be generated and tested can be restricted. Otherwise all combinations of all attributes are generated and tested. This parameter is only available when the use exact number of attributes parameter is set to true.
Min of attributes
This parameter determines the minimum number of features used for the combinations to be generated and tested.
Max number of attributes
This parameter determines the maximum number of features used for the combinations to be generated and tested. This parameter is only available when the restrict maximum parameter is set to true.
Population size
This parameter specifies the population size, i.e. the number of individuals per generation.
Maximum number of generations
This parameter specifies the number of generations after which the algorithm should be terminated.
Use early stopping
This parameter enables early stopping. If it is not set to true, the maximum number of generations is always performed.
Generations without improval
This parameter is only available when the use early stopping parameter is set to true. It specifies the stop criterion for early stopping, i.e. the optimization stops after n generations without improvement in the performance, where n is specified by this parameter.
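The stop criterion can be illustrated with a short Python sketch. It assumes that "improvement" means a strictly higher best performance than any seen so far; the function name `generations_run` and the example history are hypothetical.

```python
def generations_run(best_per_generation, generations_without_improval):
    """Return how many generations run before early stopping triggers.

    best_per_generation lists the best performance of each generation.
    """
    best, stale = float("-inf"), 0
    for generation, score in enumerate(best_per_generation, start=1):
        if score > best:
            best, stale = score, 0      # improvement resets the counter
        else:
            stale += 1                  # one more generation without improvement
            if stale >= generations_without_improval:
                return generation
    return len(best_per_generation)     # early stopping never triggered

history = [0.5, 0.6, 0.6, 0.6, 0.7]
print(generations_run(history, 2))  # stops after generation 4
print(generations_run(history, 3))  # runs all 5 generations
```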
Normalize weights
This parameter indicates if the final weights should be normalized. If set to true, the final weights are normalized such that the maximum weight is 1 and the minimum weight is 0.
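The normalization described here corresponds to min-max scaling. A minimal sketch, assuming a plain list of weights; the function name `normalize_weights` is hypothetical, and the handling of the case where all weights are equal is an assumption, since the text does not specify it.

```python
def normalize_weights(weights):
    """Scale weights so that the minimum becomes 0 and the maximum becomes 1."""
    low, high = min(weights), max(weights)
    if high == low:
        # Assumption: with identical weights there is no range to scale,
        # so every weight is mapped to 1.
        return [1.0 for _ in weights]
    return [(w - low) / (high - low) for w in weights]

print(normalize_weights([2.0, 4.0, 6.0]))  # [0.0, 0.5, 1.0]
```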
Use local random seed
This parameter indicates if a local random seed should be used for randomization. Using the same value of local random seed will produce the same randomization.
Local random seed
This parameter specifies the local random seed. This parameter is only available if the use local random seed parameter is set to true.
Show stop dialog
This parameter determines if a dialog with a stop button should be displayed, which stops the search for the best feature space. If the search is stopped, the best individual found up to that point will be returned.
User result individual selection
If this parameter is set to true, it allows the user to select the final result individual from the last population.
Show population plotter
This parameter determines if the current population should be displayed in performance space.
Plot generations
This parameter is only available when the show population plotter parameter is set to true. The population plotter is updated in these generations.
Constraint draw range
This parameter is only available when the show population plotter parameter is set to true. This parameter determines if the draw range of the population plotter should be constrained between 0 and 1.
Draw dominated points
This parameter is only available when the show population plotter parameter is set to true. This parameter determines if only points which are not Pareto dominated should be drawn on the population plotter.
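Pareto dominance can be illustrated with a short Python sketch. It assumes that each point is a tuple of performance criteria for which higher is better; the function names and the example points are hypothetical.

```python
def dominates(a, b):
    """True if a is at least as good as b on every criterion and better on one."""
    return (all(x >= y for x, y in zip(a, b))
            and any(x > y for x, y in zip(a, b)))

def non_dominated(points):
    """Keep only the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]

points = [(0.9, 0.2), (0.8, 0.8), (0.7, 0.7), (0.2, 0.9)]
print(non_dominated(points))  # (0.7, 0.7) is dominated by (0.8, 0.8)
```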
Population criteria data file
This parameter specifies the path to the file in which the criteria data of the final population should be saved.
Maximal fitness
This parameter specifies the maximal fitness. The optimization will stop if the fitness reaches this value.
Selection scheme
This parameter specifies the selection scheme of this evolutionary algorithm.
Tournament size
This parameter is only available when the selection scheme parameter is set to 'tournament'. It specifies the fraction of the current population which should be used as tournament members.
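A minimal Python sketch of tournament selection with a fractional tournament size follows. It assumes that the fraction is rounded to a whole number of members (at least one) and that the fittest member wins; the function name `tournament_select` and these rounding details are assumptions, not taken from the operator's implementation.

```python
import random

def tournament_select(population, fitness, tournament_size=0.25, rng=None):
    """Pick one individual: sample a fraction of the population, keep the fittest."""
    if rng is None:
        rng = random.Random()
    # Assumption: the fraction is rounded, with at least one tournament member.
    k = max(1, round(tournament_size * len(population)))
    members = rng.sample(population, k)
    return max(members, key=fitness)

population = [(True, False), (True, True), (False, False)]
# With the whole population in the tournament, the fittest individual always wins.
print(tournament_select(population, fitness=sum, tournament_size=1.0))
```

Larger fractions increase the selection pressure, because weak individuals are less likely to win a tournament that contains most of the population.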
Start temperature
This parameter is only available when the selection scheme parameter is set to 'Boltzmann'. It specifies the scaling temperature.
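The effect of a scaling temperature can be illustrated with the textbook form of Boltzmann selection, in which the selection probability is proportional to exp(fitness / temperature). Whether the operator uses exactly this formula is an assumption; the function name `boltzmann_probabilities` is hypothetical.

```python
import math

def boltzmann_probabilities(fitnesses, temperature):
    """Selection probabilities proportional to exp(fitness / temperature)."""
    top = max(fitnesses)
    # Subtracting the maximum leaves the ratios unchanged and avoids overflow.
    scaled = [math.exp((f - top) / temperature) for f in fitnesses]
    total = sum(scaled)
    return [s / total for s in scaled]

hot = boltzmann_probabilities([0.6, 0.8], temperature=100.0)
cold = boltzmann_probabilities([0.6, 0.8], temperature=0.01)
print(hot)   # nearly uniform: low selection pressure
print(cold)  # almost all probability on the fitter individual
```

Under this reading, a high temperature makes selection nearly random, while a low temperature strongly favors the fittest individuals.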
Dynamic selection pressure
This parameter is only available when the selection scheme parameter is set to 'Boltzmann' or 'tournament'. If set to true, the selection pressure is increased to its maximum during the complete optimization run.
Keep best individual
If set to true, the best individual of each generation is guaranteed to be selected for the next generation.
Save intermediate weights
This parameter determines if the intermediate best results should be saved.
Intermediate weights generations
This parameter is only available when the save intermediate weights parameter is set to true. The intermediate best results will be saved every k generations, where k is specified by this parameter.
Intermediate weights file
This parameter specifies the file into which the intermediate weights should be saved.
P initialize
The initial probability for an attribute to be switched on is specified by this parameter.
P mutation
The probability for an attribute to be changed is specified by this parameter. If set to -1, the probability will be set to 1/n, where n is the total number of attributes.
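The handling of the special value -1 can be sketched in Python as follows. The function name `mutate` and the representation of an individual as a list of booleans (one per attribute) are assumptions made for this example.

```python
import random

def mutate(mask, p_mutation=-1.0, rng=None):
    """Flip each attribute's on/off state with probability p_mutation."""
    if rng is None:
        rng = random.Random()
    # -1 means "use 1/n", i.e. one flipped attribute per individual on average.
    p = 1.0 / len(mask) if p_mutation == -1 else p_mutation
    return [(not bit) if rng.random() < p else bit for bit in mask]

print(mutate([True, False, True], p_mutation=1.0))  # every attribute flips
print(mutate([True, False, True], p_mutation=0.0))  # nothing changes
```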
P crossover
The probability for an individual to be selected for crossover is specified by this parameter.
Crossover type
The type of crossover can be selected by this parameter.
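This page does not list the available crossover types. As an illustration only, the sketch below shows two common variants from the genetic algorithm literature, one-point and uniform crossover, applied to individuals represented as lists of booleans. The function names are hypothetical, and the sketch is not meant to describe the operator's actual options.

```python
import random

def one_point_crossover(a, b, rng):
    """Swap the tails of two individuals after a random cut point."""
    point = rng.randrange(1, len(a))  # requires at least two attributes
    return a[:point] + b[point:], b[:point] + a[point:]

def uniform_crossover(a, b, rng):
    """Swap each attribute between the two individuals with probability 0.5."""
    child_a, child_b = list(a), list(b)
    for k in range(len(a)):
        if rng.random() < 0.5:
            child_a[k], child_b[k] = child_b[k], child_a[k]
    return child_a, child_b

rng = random.Random(0)
parent_a = [True, True, True, True]
parent_b = [False, False, False, False]
print(one_point_crossover(parent_a, parent_b, rng))
print(uniform_crossover(parent_a, parent_b, rng))
```

In both variants every attribute position of the two children still holds one value from each parent, so crossover recombines existing attribute choices without introducing new ones; that is the role of mutation.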