Subgroup Discovery

Synopsis

This operator performs an exhaustive subgroup discovery. The goal of subgroup discovery is to find rules describing subsets of the population that are sufficiently large and statistically unusual.

Description

This operator discovers subgroups (or induces a rule set) by generating hypotheses exhaustively. Generation proceeds by stepwise refinement of the empty hypothesis (which contains no literals). The loop for this task therefore iterates over the depth of the search space, i.e. the number of literals in the generated hypotheses. The maximum depth of the search can be specified by the max depth parameter. Furthermore, the search space can be pruned by specifying a minimum coverage for the hypotheses (the min coverage parameter) or by using only a given number of hypotheses with the highest coverage. From the hypotheses, rules are derived according to the user's preference. The operator can derive positive rules and negative rules separately, derive both, or derive only the rule that is most probable given the examples covered by the hypothesis (hence: the actual prediction for that subset). This behavior is controlled by the rule generation parameter. All generated rules are evaluated on the ExampleSet by a user-specified utility function (the utility function parameter) and stored in the final rule set if:

They exceed a minimum utility threshold (specified by the min utility parameter), or
They are among the k best rules (where k is specified by the k best rules parameter).

The desired behavior can be specified by the mode parameter.
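The generation loop described above can be sketched in a few lines of Python. This is a minimal illustration, not the operator's actual implementation: the function names, the representation of a literal as an (attribute, value) pair, and the choice to treat the coverage threshold as inclusive are all assumptions made for the example.

```python
def coverage(rows, hypothesis):
    """Fraction of rows matched by every (attribute, value) literal of the hypothesis."""
    if not rows:
        return 0.0
    covered = [r for r in rows if all(r.get(a) == v for a, v in hypothesis)]
    return len(covered) / len(rows)


def generate_hypotheses(rows, attributes, max_depth, min_coverage):
    """Level-wise refinement of the empty hypothesis, pruned by coverage (hypothetical helper)."""
    # Candidate literals: one per attribute value that occurs in the data.
    literals = sorted({(a, r[a]) for r in rows for a in attributes})
    level = [()]  # depth 0: the empty hypothesis, which contains no literals
    accepted = []
    for _depth in range(1, max_depth + 1):
        next_level = []
        for hyp in level:
            used = {a for a, _ in hyp}
            for lit in literals:
                # Use each attribute at most once, and only append literals that
                # sort after the last one so every conjunction is built once.
                if lit[0] in used or (hyp and lit <= hyp[-1]):
                    continue
                refined = hyp + (lit,)
                # Assumption: the threshold is inclusive. Coverage can only shrink
                # when a literal is added, so pruning here loses no valid hypothesis.
                if coverage(rows, refined) >= min_coverage:
                    next_level.append(refined)
        accepted.extend(next_level)
        level = next_level
    return accepted
```

Each pass of the outer loop adds one literal to the hypotheses of the previous level, so the loop counter corresponds to the search depth, and the coverage check removes whole branches of the search space before they are refined further.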

The problem of subgroup discovery has been defined as follows: Given a population of individuals and a property of those individuals we are interested in, find population subgroups that are statistically most interesting, e.g. are as large as possible and have the most unusual statistical (distributional) characteristics with respect to the property of interest. In subgroup discovery, rules have the form Class <- Cond, where the property of interest for subgroup discovery is the class value Class, which appears in the rule consequent, and the rule antecedent Cond is a conjunction of features (attribute-value pairs) selected from the features describing the training instances. As rules are induced from labeled training instances (labeled positive if the property of interest holds, and negative otherwise), the process of subgroup discovery is targeted at uncovering properties of a selected target population of individuals with the given property of interest. In this sense, subgroup discovery is a form of supervised learning. However, in many respects subgroup discovery is a form of descriptive induction, as the task is to uncover individual interesting patterns in data.
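To make the rule form Class <- Cond concrete, the sketch below represents a rule as a target class value plus a conjunction of attribute-value pairs and scores it with weighted relative accuracy (WRAcc), a measure commonly used in subgroup discovery that rewards rules that are both large and unusual. The Rule class, the wracc function, and the "label" column name are hypothetical and chosen only for this illustration; the text above does not say which measure the operator uses.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    """A rule Class <- Cond (hypothetical representation for illustration)."""
    target: str        # class value in the consequent (Class)
    conditions: tuple  # antecedent (Cond): conjunction of (attribute, value) pairs

    def covers(self, row):
        # A row is covered when every attribute-value pair of Cond matches.
        return all(row.get(a) == v for a, v in self.conditions)


def wracc(rule, rows, label="label"):
    """Weighted relative accuracy: coverage * (precision - prior of the target class)."""
    covered = [r for r in rows if rule.covers(r)]
    if not rows or not covered:
        return 0.0
    cov = len(covered) / len(rows)
    precision = sum(r[label] == rule.target for r in covered) / len(covered)
    prior = sum(r[label] == rule.target for r in rows) / len(rows)
    return cov * (precision - prior)
```

A rule whose covered examples have the same class distribution as the whole population scores zero, however large the subgroup is, which matches the idea that a subgroup must be statistically unusual as well as large.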

Rule learning is most frequently used in the context of classification rule learning and association rule learning. While classification rule learning is an approach to predictive induction (or supervised learning), aimed at constructing a set of rules to be used for classification and/or prediction, association rule learning is a form of descriptive induction (non-classification induction or unsupervised learning), aimed at the discovery of individual rules which define interesting patterns in data.

Let us emphasize the difference between subgroup discovery (as a task at the intersection of predictive and descriptive induction) and classification rule learning (as a form of predictive induction). The goal of standard rule learning is to generate models, one for each class, consisting of rule sets describing class characteristics in terms of properties occurring in the descriptions of training examples. In contrast, subgroup discovery aims at discovering individual rules or 'patterns' of interest, which must be represented in explicit symbolic form and which must be relatively simple in order to be recognized as actionable by potential users. Moreover, standard classification rule learning algorithms cannot appropriately address the task of subgroup discovery, because they construct rule sets with the covering algorithm, which hinders their applicability to subgroup discovery. Subgroup discovery is usually seen as different from classification, as it addresses different goals (discovery of interesting population subgroups instead of maximizing classification accuracy of the induced rule set).

Input

training set

This input port expects an ExampleSet. It is the output of the Generate Nominal Data operator in the attached Example Process. The output of other operators can also be used as input.

Output

model

The Rule Set is delivered from this output port.

example set

The ExampleSet that was given as input is passed unchanged to the output through this port. This is usually used to reuse the same ExampleSet in further operators or to view the ExampleSet in the Results Workspace.

Parameters

Mode

This parameter specifies the discovery mode.

minimum_utility: If this option is selected, the rules are stored in the final rule set if they exceed the minimum utility threshold specified by the min utility parameter.
k_best_rules: If this option is selected, the rules are stored in the final rule set if they are among the k best rules (where k is specified by the k best rules parameter).
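The two modes amount to two different filters over the scored rules. The sketch below illustrates the difference under stated assumptions: select_rules is a hypothetical helper, rules arrive as (rule, utility) pairs, and the threshold comparison is strict because the text says rules must exceed the threshold.

```python
import heapq


def select_rules(scored_rules, mode, min_utility, k_best_rules):
    """Filter (rule, utility) pairs according to the discovery mode (hypothetical helper)."""
    if mode == "minimum_utility":
        # Keep every rule whose utility exceeds the threshold; the result size is unbounded.
        return [(rule, u) for rule, u in scored_rules if u > min_utility]
    if mode == "k_best_rules":
        # Keep the k rules with the highest utility, best first, whatever their absolute score.
        return heapq.nlargest(k_best_rules, scored_rules, key=lambda pair: pair[1])
    raise ValueError(f"unknown mode: {mode!r}")
```

The practical difference is that minimum_utility may return no rules at all when the threshold is too strict, while k_best_rules always returns up to k rules even if none of them is particularly interesting.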

Utility function

This parameter specifies the desired utility function.

Min utility

This parameter specifies the minimum utility. This parameter is useful when the mode parameter is set to 'minimum utility'. The rules are stored in the final rule set if they exceed the minimum utility threshold specified by this parameter.

K best rules

This parameter specifies the number of required best rules. This parameter is useful when the mode parameter is set to 'k best rules'. The rules are stored in the final rule set if they are among the k best rules, where k is specified by this parameter.

Rule generation

This parameter determines which rules should be generated. The operator can derive positive rules and negative rules separately, derive both, or derive only the rule that is most probable given the examples covered by the hypothesis (hence: the actual prediction for that subset).
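The options described above can be illustrated by a small function that turns one hypothesis into one or two rules. This is a sketch only: the option names "positive", "negative", "both" and "prediction", the derive_rules helper, the "label" column and the tie-breaking toward the positive class are assumptions made for the example, not confirmed behavior of the operator.

```python
def derive_rules(hypothesis, rows, rule_generation,
                 label="label", positive="yes", negative="no"):
    """Return (target class, hypothesis) rules for one hypothesis (hypothetical helper)."""
    # Labels of the examples covered by the conjunction of (attribute, value) literals.
    covered = [r[label] for r in rows
               if all(r.get(a) == v for a, v in hypothesis)]
    if rule_generation == "positive":
        targets = [positive]
    elif rule_generation == "negative":
        targets = [negative]
    elif rule_generation == "both":
        targets = [positive, negative]
    elif rule_generation == "prediction":
        # Only the more probable class among the covered examples;
        # ties go to the positive class (an assumption of this sketch).
        n_pos = covered.count(positive)
        targets = [positive if n_pos >= len(covered) - n_pos else negative]
    else:
        raise ValueError(f"unknown rule generation option: {rule_generation!r}")
    return [(target, hypothesis) for target in targets]
```

With the prediction option, each hypothesis yields exactly one rule, so the resulting rule set is half the size of the one produced when both rules are derived.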

Max depth

This parameter specifies the maximum depth of the breadth-first search. The loop for this task iterates over the depth of the search space, i.e. the number of literals in the generated hypotheses, and stops once this depth is reached.

Min coverage

This parameter specifies the minimum coverage. Only rules whose coverage exceeds this threshold are considered.

Max cache

This parameter bounds the number of rules which are evaluated (only the most supported rules are used).