Unsupervised Feature Selection

Synopsis

This operator performs a fully automated feature selection for centroid-based clustering techniques like k-Means.

Description

This is a new operator for simpler automatic feature selection for unsupervised learning. It provides much simpler settings and is more robust than the existing feature engineering operators. This operator also supports multi-objective feature selection and lets you define a balance value between 1 (few features) and 0 (most features, i.e. the cluster model which is closest to the original cluster model using all the input data). Based on this setting, the final solution is selected from the Pareto front. As a rule of thumb, a value of 0.5 reduces the number of features to roughly half.
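To illustrate how a balance value could pick one solution from a Pareto front, here is a minimal Python sketch. The selection rule is an assumption made for illustration: this text does not specify how the operator maps the balance value to a solution. The function name pick_by_balance, the representation of a solution as a (number of features, Davies-Bouldin index) pair, and the example values are all hypothetical.

```python
def pick_by_balance(front, balance):
    """Pick one (n_features, db_index) pair from a Pareto front.

    Assumed rule (not taken from the operator): rescale both objectives
    to [0, 1] over the front and minimize a weighted sum, where the
    balance value weights the feature count and (1 - balance) weights
    the Davies-Bouldin index.
    """
    if not 0.0 <= balance <= 1.0:
        raise ValueError("balance must lie between 0 and 1")
    if not front:
        raise ValueError("front must not be empty")

    sizes = [size for size, _ in front]
    scores = [score for _, score in front]

    def scale(value, low, high):
        # Map value into [0, 1]; a constant objective contributes 0.
        return 0.0 if high == low else (value - low) / (high - low)

    def cost(solution):
        size, score = solution
        return (balance * scale(size, min(sizes), max(sizes))
                + (1.0 - balance) * scale(score, min(scores), max(scores)))

    return min(front, key=cost)


# Hypothetical front: fewer features come with a worse (higher) index.
example_front = [(1, 2.0), (2, 1.5), (4, 0.9)]
few_features = pick_by_balance(example_front, 1.0)
most_features = pick_by_balance(example_front, 0.0)
middle = pick_by_balance(example_front, 0.5)
```

Under this assumed rule, a balance of 1 returns the smallest feature set on the front and a balance of 0 returns the solution with the lowest Davies-Bouldin index, which matches the two extremes described above.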

IMPORTANT: Unlike other optimization operators in RapidMiner, this one only works with a specific cluster validation measure (the Davies-Bouldin index) and only with centroid-based cluster models. Therefore, the inner process must deliver such a cluster model together with the clustered data. Both outputs are generated directly by the k-Means operator, for example.

The two basic working modes are "no selection" and "selection". In the first mode, the resulting feature set describes the complete input example set. In the second mode, the resulting feature set describes a subset of the input features. In both cases, other data sets (like scoring or validation data) can be brought to the same format by using the operator Apply Feature Set.

The operator uses a multi-objective evolutionary algorithm to find the best feature sets. Each feature set is Pareto-optimal with respect to complexity vs. cluster model performance. The complexity is calculated from the feature set, where each feature in the set contributes a complexity of one. The cluster model performance is measured by the Davies-Bouldin index, which is automatically calculated by this operator. Lower values of the Davies-Bouldin index indicate better cluster separation.
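For readers unfamiliar with the measure, the following Python sketch computes the Davies-Bouldin index in its standard form: for each cluster, take the worst ratio of summed within-cluster scatter to centroid distance against any other cluster, then average over all clusters. It is an illustration of the definition, not the operator's internal code. The function names and the example points are hypothetical, and Euclidean distance is assumed.

```python
import math


def euclidean(p, q):
    # Euclidean distance between two points of equal dimension.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def centroid(points):
    # Coordinate-wise mean of a non-empty list of points.
    n = len(points)
    dims = len(points[0])
    return tuple(sum(p[d] for p in points) / n for d in range(dims))


def davies_bouldin(clusters):
    """clusters: list of clusters, each a non-empty list of numeric tuples."""
    k = len(clusters)
    if k < 2:
        raise ValueError("at least two clusters are required")

    centroids = [centroid(points) for points in clusters]
    # Scatter: average distance of a cluster's points to its centroid.
    scatters = [
        sum(euclidean(p, c) for p in points) / len(points)
        for points, c in zip(clusters, centroids)
    ]

    total = 0.0
    for i in range(k):
        # Worst-case similarity of cluster i to any other cluster.
        total += max(
            (scatters[i] + scatters[j]) / euclidean(centroids[i], centroids[j])
            for j in range(k)
            if j != i
        )
    return total / k


# Two compact, distant clusters versus two spread-out, close clusters.
well_separated = [[(0.0, 0.0), (0.0, 1.0)], [(10.0, 0.0), (10.0, 1.0)]]
poorly_separated = [[(0.0, 0.0), (0.0, 4.0)], [(5.0, 0.0), (5.0, 4.0)]]
db_good = davies_bouldin(well_separated)
db_poor = davies_bouldin(poorly_separated)
```

In this example the compact, distant clusters yield the lower index, in line with the statement that lower values indicate better separation.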

The first output is the best feature set from the Pareto set according to the balancing parameter. The second output is the complete final population of the optimization run, i.e. the full Pareto front of all optimal trade-offs between complexity and model errors. Finally, the log data of best error rates, smallest feature set size, and largest feature set size for all generations are also delivered for plotting purposes.
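The notion of a Pareto front of trade-offs can be made concrete with a short Python sketch. It keeps only the candidates that no other candidate beats on both objectives at once. The representation of a candidate as a (number of features, Davies-Bouldin index) pair, the function names, and the example values are assumptions for illustration; the operator's evolutionary search itself is not reproduced here.

```python
def dominates(a, b):
    # a dominates b if it is no worse in both objectives (both are
    # minimized) and strictly better in at least one of them.
    return (a[0] <= b[0] and a[1] <= b[1]) and (a[0] < b[0] or a[1] < b[1])


def pareto_front(candidates):
    """Return the non-dominated (n_features, db_index) pairs, sorted."""
    front = [
        c for c in candidates
        if not any(dominates(other, c) for other in candidates)
    ]
    # Remove exact duplicates and order by feature count.
    return sorted(set(front))


# Hypothetical candidates evaluated during an optimization run.
candidates = [(1, 2.0), (2, 1.5), (3, 1.6), (4, 0.9), (4, 1.2), (2, 1.5)]
front = pareto_front(candidates)
```

Here (3, 1.6) is dropped because (2, 1.5) uses fewer features and has a lower index, and (4, 1.2) is dropped because (4, 0.9) has the same size with a lower index.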

Input

example set in

This input port expects a data set which is used as training data to create the best feature set.

Output

feature set

The resulting optimal feature set selected from the optimal trade-offs based on the balance parameter.

population

All optimal trade-offs between error rates and complexity.

optimization log

A table with log data about the optimization run.

Parameters

Mode

The mode for the feature engineering: keep all original features or feature selection.

Balance for accuracy

Defines a balance between 1 (few features) and 0 (most features) to pick the final solution from the Pareto front.

Show progress dialog

Indicates if a dialog should be shown during the optimization, offering details about the optimization progress. This should not be used if the process is run on systems without a graphical user interface, but it can be useful during process testing.

Use optimization heuristics

Indicates if heuristics should be used to determine a good population size and maximum number of generations.

Use time limit

Indicates if a time limit should be used to stop the optimization.

Time limit in seconds

The number of seconds after which the optimization will be stopped.