FP-Growth

Synopsis

This Operator efficiently calculates all frequently-occurring itemsets in an ExampleSet, using the FP-tree data structure.

Description

When online shopping, you will sometimes get a suggestion of the following form: "Customers who bought item X also bought item Y." This suggestion is an example of an association rule. To derive it, you first have to know which items on the market most frequently co-occur in customers' shopping baskets, and here the FP-Growth algorithm has a role to play.

The FP-Growth algorithm is an efficient algorithm for calculating frequently co-occurring items in a transaction database. To understand how it works, let's start with some terminology, using a customer transaction as an example:

item- any object that is sold on the market
篮子- a container for one or more items selected by the customer
itemset- any subset of items that are sold together, in the same shopping basket
transaction- the complete set of items in an individual shopping basket, at the moment of purchase
transaction database- the complete set of shopping baskets / transactions recorded by the merchant

Here, the words "basket" and "transaction" are used interchangably, because we identify the customer's shopping basket with the items that were purchased. To make these definitions concrete, consider the following transaction database:

transaction1 = (product1, product2, product7)
transaction2 = (product2, product5, product7)
transaction3 = (product6, product7, product8, product9)
transaction4 = (product1, product3, product4, product6, product7)

Nine distinct items are for sale, and there are four baskets / transactions, with a varying number of items. The item appearing most frequently, product7, appears four times in the transactions database. Each of the following itemsets occurs twice: (product1, product7), (product2, product7), (product6, product7).

An FP-tree data structure can be efficiently created, compressing the data so much that, in many cases, even large databases will fit into main memory. In the example above, the FP-tree would have product7, the most frequently occurring product, next to the root, with branches from product7 to product1, product2, and product6. If we insist that a product must appear more than once in the transaction database, then the remaining products are excluded from the FP-tree. The transaction database might have started out as a 4 x 9 (transactions x products) data table, with many zero entries, but now it is reduced to a minimalistic tree that captures only the relevant frequency data.

Even with an efficient tree structure, the number of itemsets considered by the algorithm can grow very large. With the help of the parametermax number of itemsets, you can if necessary reduce runtime and memory.

Remember that online shopping is merely an example; the FP-Growth algorithm can be applied to any problem that can be formulated in terms of items, itemsets, and baskets / transactions. The typical setting for the algorithm is a large transaction database (many baskets), with only a small number of items in each basket -- small compared to the set of all items.

support= (Number of times an item or itemset appears in the database) / (Number of baskets in the database)

In general, the concept of "minimum support" creates a cutoff, defining what is meant by frequent or not-so-frequent occurrences of an itemset. If an item or an itemset appears in only a few baskets, it is excluded, via the parametersmin supportormin frequency. The exclusion of infrequently-occurring items and itemsets helps to compress the data and improves the statistical significance of the results. On the other hand, if the value formin supportormin frequencyis set too high, the algorithm may find zero itemsets. Hence, this Operator provides two major modes, via the checkboxfind min number of itemsets:

if unchecked, with a fixed minimum support value, and
if checked, with a dynamic minimum support value, to ensure that the result includes a minimum number of itemsets.

FP-Growth supports several different formats for the input data. Please note the following requirements:

in the ExampleSet, one transaction = one row. For a discussion of the columns, see below.
all the item values must be nominal
a transaction ID is optional and, if present, it should have the "id" role, so that it is not identified as an item.

For the columns, the three available input formats are illustrated in the second tutorial, together with necessary pre-processing. Here's the summary:

item list in a column: All the items belonging to a transaction appear in a single column, separated byitem separators, in a CSV-like format. As with CSV files, the items can be quoted, and escape characters are available. You cantrim item names.
items in separate columns: All the items belonging to a transaction appear in separate columns. For each transaction, the first item name appears in the first column, the second item name in the second column, etc. The number of columns corresponds to the basket with the maximum number of items. Missing values indicate no item. You cantrim item names.
items in dummy coded columns: Every item in the set of all items has its own column, and the item name is the column name. For each transaction, the binominal values (true/false) indicate whether the item can be found in the basket. If your data is binominal but does not identify the values as true/false, you may have to set thepositive valueparameter.

Input

example set

This input port expects an ExampleSet. As discussed in detail in the description, this Operator supports several different formats for the input data.

Output

example set

The ExampleSet that was given as input is passed through without changes.

frequent sets

The frequently-occurring itemsets are delivered through this port. Operators such as Create Association Rules can use these frequently-occuring itemsets to generate association rules.

Parameters

Input format

See the second tutorial for examples. As discussed in detail in the description, this Operator supports several different formats for the input data.

item list in a column: All the items belonging to a transaction appear in a single column, separated byitem separators, in a CSV-like format.
items in separate columns: All the items belonging to a transaction appear in separate columns, with the first item name appearing in the first column, the second item name in the second column, etc.
items in dummy coded columns: Every item in the set of all items has its own column, and the item name is the column name. For each transaction, the binominal values (true/false) indicate whether the item can be found in the basket.

Item separators

This parameter defines the item separator. It can also be provided as a regular expression.

Use quotes

Check this parameter to define aquotes character. As in CSV files, ifitem separatorsare likely to appear in the item name, quotes can be used to prevent confusion. For example if (,) is the item separator and (") is thequotes character, then the row (a,b,c,d) will be interpreted as 4 items. On the other hand, ("a,b,c,d") will be interpreted as a single item, with value a,b,c,d.

Quotes character

This parameter defines thequotes characterand is only available ifuse quotes检查。

Escape character

This parameter defines theescape character, used to escape thequotes characteror the item separator. For example, if (") is thequotes characterand ('\') is theescape character, then ("yes") is interpreted as (yes) and (\"yes\") is interpreted as ("yes"). If ('|') is the item separator and ('\') is theescape character, then a row (a|b|c) is interpreted as two items, (a|b) and (c).

Trim item names

If this parameter is checked, whitespace at the beginning and the end of item names is deleted.

Positive value

In the case ofitems in dummy coded columns, with binominal Attributes, this parameter determines which value should be treated as positive, and hence which items belong to a transaction. If this parameter is left blank, the positive value is inferred from the ExampleSet.

Min requirement

This parameter makes available two different methods for defining a cutoff, eliminating infrequently-occurring itemsets.

support: The minimum support value (ratio of occurrences to ExampleSet size)
frequency:最低频率(occ的数量urrences)

Min support

Minimum support = (number of occurrences of an itemset) / (size of the ExampleSet)

Decrease this value to increase the number of itemsets in the result.

Min frequency

Minimum frequency = number of occurrences of an itemset

Decrease this value to increase the number of itemsets in the result.

Min items per itemset

The lower bound for the size of an itemset.

Max items per itemset

The upper bound for the size of an itemset (0: no upper bound).

Max number of itemsets

The upper bound for the number of itemsets (0: no upper bound). If you run out of memory, either decrease this value or increase the value formin supportormin frequency.

Find min number of itemsets

If this parameter is checked, the results will contain at least aminimum number of itemsets, those with highest support. The minimum support value is automatically decreased until the minimum number of itemsets is found.

Min number of itemsets

这个参数是只有当find min number of itemsets检查。This parameter specifies the minimum number of itemsets that should be included in the results.

Max number of retries

这个参数是只有当find min number of itemsets检查。When automatically decreasing the value for minimum support / minimum frequency, this parameter determines how many times the Operator may decrease the value before giving up. Increase this number to get more results.

需要ment decrease factor

这个参数是只有当find min number of itemsets检查。When automatically decreasing the value for minimum support / minimum frequency, this multiplicative factor determines the new cutoff value. A lower value results in fewer steps to find the desired number of itemsets.

Must contain list

This parameter specifies items that must be included in the frequently-occurring itemsets, if any, via a list of exact item names.

Must contain regexp

This parameter specifies items that must be included in the frequently-occurring itemsets, if any, via a regular expression.