Correlation Matrix

Synopsis

This Operator determines the correlation between all pairs of Attributes and can produce a weights vector based on these correlations. Correlation is a statistical technique that can show whether and how strongly pairs of Attributes are related.

Description

A correlation is a number between -1 and +1 that measures the degree of association between two Attributes (call them X and Y). A positive value for the correlation implies a positive association. In this case large values of X tend to be associated with large values of Y and small values of X tend to be associated with small values of Y. A negative value for the correlation implies a negative or inverse association. In this case large values of X tend to be associated with small values of Y and vice versa.

Suppose we have two Attributes X and Y, with means X' and Y' and standard deviations S(X) and S(Y), respectively. The correlation is computed by summing the products (X(i) - X')(Y(i) - Y') for i from 1 to n and then dividing this sum by (n - 1) S(X) S(Y), where n is the total number of Examples and i is the summation index:

\[ r = \frac{\sum_{i=1}^{n} (X(i) - X')\,(Y(i) - Y')}{(n - 1)\, S(X)\, S(Y)} \]

Other formulas and definitions exist, but this one is used here for simplicity.
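The formula can be sketched in a few lines of Python. This is only an illustration of the definition given here, not the Operator's implementation; the function names are hypothetical, and S(X) and S(Y) are assumed to be sample standard deviations (computed with n - 1 in the denominator).

```python
import math

def mean(values):
    """Arithmetic mean (X' in the text)."""
    return sum(values) / len(values)

def sample_std(values):
    """Sample standard deviation S(X), assuming the n - 1 denominator."""
    m = mean(values)
    return math.sqrt(sum((v - m) ** 2 for v in values) / (len(values) - 1))

def correlation(xs, ys):
    """Sum of (X(i) - X')(Y(i) - Y') divided by (n - 1) S(X) S(Y)."""
    n = len(xs)
    x_mean, y_mean = mean(xs), mean(ys)
    total = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    return total / ((n - 1) * sample_std(xs) * sample_std(ys))

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y_up = [2.0, 4.0, 6.0, 8.0, 10.0]    # rises with x
y_down = [10.0, 8.0, 6.0, 4.0, 2.0]  # falls as x rises

r_up = correlation(x, y_up)      # close to +1
r_down = correlation(x, y_down)  # close to -1
```

A perfectly linear increasing relationship gives a value at the upper end of the range, and a perfectly linear decreasing one gives a value at the lower end.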

As discussed earlier, a positive value for the correlation implies a positive association. Suppose that an X value was above average, and that the associated Y value was also above average. Then the product (X(i) - X')(Y(i) - Y') would be the product of two positive numbers, which would be positive. If the X value and the Y value were both below average, then the product above would be of two negative numbers, which would also be positive. Therefore, a positive correlation is evidence of a general tendency that large values of X are associated with large values of Y and small values of X are associated with small values of Y.

As discussed earlier, a negative value for the correlation implies a negative or inverse association. Suppose that an X value was above average, and that the associated Y value was instead below average. Then the product (X(i) - X')(Y(i) - Y') would be the product of a positive and a negative number, which would make the product negative. If the X value was below average and the Y value was above average, then the product above would also be negative. Therefore, a negative correlation is evidence of a general tendency that large values of X are associated with small values of Y and small values of X are associated with large values of Y.
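The sign argument can be checked on a small, made-up data set by looking at each product individually. The values below are purely illustrative.

```python
# Hypothetical values for two Attributes X and Y.
xs = [2.0, 4.0, 6.0, 8.0]
ys = [1.0, 3.0, 2.0, 6.0]

x_mean = sum(xs) / len(xs)
y_mean = sum(ys) / len(ys)

# One product (X(i) - X')(Y(i) - Y') per Example.
products = [(x - x_mean) * (y - y_mean) for x, y in zip(xs, ys)]

# Both values below average (first Example) or both above average
# (last Example) give a positive product; one above and one below
# (third Example) gives a negative product.
numerator = sum(products)
```

Here the positive products outweigh the negative one, so the numerator, and therefore the correlation, is positive.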

This Operator can be used for creating a correlation matrix that shows the correlations of all the Attributes of the input ExampleSet. The Attribute weights vector based on these correlations can also be returned by this Operator. Using this weights vector, highly correlated Attributes can be removed from the ExampleSet with the help of the Select by Weights Operator. Highly correlated Attributes can be removed more easily by using the Remove Correlated Attributes Operator. Correlated Attributes are usually removed because they behave similarly and therefore contribute little when calculating predictions. They may also increase run time and memory usage.
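As a rough sketch of the idea, the following Python code builds a correlation matrix for a small made-up data set and then drops every Attribute that is highly correlated with one already kept. The helper names, the greedy strategy, and the threshold of 0.95 are assumptions made for illustration; they do not describe how the Select by Weights or Remove Correlated Attributes Operators work internally.

```python
import math

def correlation(xs, ys):
    """Pearson correlation; the (n - 1) factors cancel in this form."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical ExampleSet: Attribute name -> column of values.
data = {
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.0, 4.0, 6.0, 8.0, 11.0],  # almost a multiple of "a"
    "c": [5.0, 1.0, 4.0, 2.0, 3.0],
}

# Correlation matrix as a nested dictionary: matrix[p][q].
matrix = {p: {q: correlation(data[p], data[q]) for q in data} for p in data}

def drop_correlated(matrix, threshold):
    """Keep an Attribute only if it is not highly correlated
    (in absolute value) with any Attribute kept before it."""
    kept = []
    for name in matrix:
        if all(abs(matrix[name][other]) < threshold for other in kept):
            kept.append(name)
    return kept

kept = drop_correlated(matrix, threshold=0.95)
```

In this example "b" is dropped because it is almost perfectly correlated with "a", while "c" is kept.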

Input

example set

This input port expects an ExampleSet on which the correlation matrix will be calculated.

Output

example set

The ExampleSet that was given as input is passed through without changes.

matrix

The correlations of all Attributes of the input ExampleSet are calculated and the resultant correlation matrix is returned from this port. The correlation for nominal Attributes is not well defined and results in a missing value. When Attributes contain missing values, only pairwise complete tuples are used for calculating the correlation.
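The handling of missing values can be illustrated as follows. In this sketch a missing value is represented by None, and an Example contributes to the correlation of two Attributes only if both values are present. The function name and the choice to return None when no correlation can be computed are assumptions for illustration.

```python
import math

def pairwise_correlation(xs, ys):
    """Correlation over pairwise complete tuples only."""
    pairs = [(x, y) for x, y in zip(xs, ys)
             if x is not None and y is not None]
    n = len(pairs)
    if n < 2:
        return None  # not enough complete pairs (assumed behaviour)
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    cov = sum((x - mx) * (y - my) for x, y in pairs)
    sx = math.sqrt(sum((x - mx) ** 2 for x, _ in pairs))
    sy = math.sqrt(sum((y - my) ** 2 for _, y in pairs))
    if sx == 0 or sy == 0:
        return None  # constant column, correlation undefined
    return cov / (sx * sy)

xs = [1.0, 2.0, None, 4.0, 5.0]
ys = [2.0, None, 6.0, 8.0, 10.0]

# Only the Examples at positions 0, 3 and 4 have both values.
r = pairwise_correlation(xs, ys)
```

Note that with this approach different pairs of Attributes may be computed from different subsets of Examples.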

weights

The Attribute weights vector based on the correlations of the Attributes is delivered through this output port.

Parameters

Attribute filter type

This parameter allows you to select the Attribute selection filter, that is, the method you want to use for selecting Attributes. It has the following options:

  • all: This option selects all the Attributes of the ExampleSet; no Attributes are removed. This is the default option.
  • single: This option allows the selection of a single Attribute. The required Attribute is selected by the attribute parameter.
  • subset: This option allows the selection of multiple Attributes through a list (see parameter attributes). If the meta data of the ExampleSet is known, all Attributes are present in the list and the required ones can easily be selected.
  • regular_expression: This option allows you to specify a regular expression for the Attribute selection. The regular expression filter is configured by the parameters regular expression, use except expression and except expression.
  • value_type: This option allows selection of all the Attributes of a particular type. It should be noted that types are hierarchical. For example, real and integer types both belong to the numeric type. The value type filter is configured by the parameters value type, use value type exception and except value type.
  • block_type: This option allows the selection of all the Attributes of a particular block type. It should be noted that block types may be hierarchical. For example, value_series_start and value_series_end block types both belong to the value_series block type. The block type filter is configured by the parameters block type, use block type exception and except block type.
  • no_missing_values: This option selects all Attributes of the ExampleSet which do not contain a missing value in any Example. Attributes that have even a single missing value are removed.
  • numeric_value_filter: All numeric Attributes whose Examples all match a given numeric condition are selected. The condition is specified by the numeric condition parameter. Please note that all nominal Attributes are also selected irrespective of the given numeric condition.

Attribute

The required Attribute can be selected from this option. The Attribute name can be selected from the drop down box of the parameter if the meta data is known.

Attributes

The required Attributes can be selected from this option. This opens a new window with two lists. All Attributes are present in the left list. They can be shifted to the right list, which is the list of selected Attributes that will make it to the output port.

Regular expression

Attributes whose names match this expression will be selected. The expression can be specified through the edit and preview regular expression menu. This menu gives a good idea of regular expressions, and it also allows you to try different expressions and preview the results simultaneously.

Use except expression

If enabled, an exception to the first regular expression can be specified. This exception is specified by the except regular expression parameter.

Except regular expression

This option allows you to specify a regular expression. Attributes matching this expression will be filtered out even if they match the first expression (the expression that was specified in the regular expression parameter).
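The interplay of the regular expression and the except regular expression can be sketched in Python. The function and the Attribute names are hypothetical, and the sketch assumes that an expression must match the whole Attribute name; the regular expression syntax accepted by the Operator may also differ in details from Python's re module.

```python
import re

def select_by_regex(names, expression, except_expression=None):
    """Keep names matching `expression` unless they also match
    `except_expression` (whole-name matching is assumed)."""
    selected = []
    for name in names:
        if not re.fullmatch(expression, name):
            continue  # does not match the first expression
        if except_expression is not None and re.fullmatch(except_expression, name):
            continue  # filtered out by the exception
        selected.append(name)
    return selected

names = ["att1", "att2", "att10", "label"]

# Select everything of the form att<digits>, except att10 to att19.
picked = select_by_regex(names, r"att\d+", except_expression=r"att1\d")
```

Here "att10" matches the first expression but is filtered out by the exception, and "label" never matches the first expression.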

Value type

This option allows you to select a type of Attribute. One of the following types can be chosen: nominal, numeric, integer, real, text, binominal, polynominal, file_path, date_time, date, time.

Use value type exception

If enabled, an exception to the selected type can be specified. This exception is specified by the except value type parameter.

Except value type

The Attributes matching this type will be removed from the final output even if they matched the previously selected type, specified by the value type parameter. One of the following types can be selected here: nominal, numeric, integer, real, text, binominal, polynominal, file_path, date_time, date, time.

Block type

This option allows you to select a block type of Attribute. One of the following types can be chosen: single_value, value_series, value_series_start, value_series_end, value_matrix, value_matrix_start, value_matrix_end, value_matrix_row_start.

Use block type exception

If enabled, an exception to the selected block type can be specified. This exception is specified by the except block type parameter.

Except block type

The Attributes matching this block type will be removed from the final output even if they matched the previously selected type, specified by the block type parameter. One of the following block types can be selected here: single_value, value_series, value_series_start, value_series_end, value_matrix, value_matrix_start, value_matrix_end, value_matrix_row_start.

Numeric condition

This is the numeric condition used by the numeric_value_filter type. A numeric Attribute is kept if all Examples match the specified condition for this Attribute. For example, the numeric condition '> 6' keeps all numeric Attributes whose value is greater than 6 in every Example. Conditions can be combined: '> 6 && < 11' or '<= 5 || < 0'. However, && and || cannot be used together in one numeric condition, so conditions like '(> 0 && < 2) || (> 10 && < 12)' are not allowed. Nominal Attributes are always kept, regardless of the specified numeric condition.
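The rules for numeric conditions can be sketched as a small parser. This is an illustrative reading of the description above, not the Operator's actual parser: the function names are hypothetical, and the set of supported comparison operators (>, >=, <, <=) is an assumption.

```python
import operator
import re

# Comparison operators supported by this sketch (an assumption).
OPS = {">": operator.gt, ">=": operator.ge,
       "<": operator.lt, "<=": operator.le}
CLAUSE = re.compile(r"\s*(>=|<=|>|<)\s*(-?\d+(?:\.\d+)?)\s*")

def parse_condition(text):
    """Turn a condition such as '>6 && <11' into a predicate on a value."""
    if "&&" in text and "||" in text:
        raise ValueError("&& and || cannot be used together")
    combine, separator = (all, "&&") if "&&" in text else (any, "||")
    clauses = []
    for part in text.split(separator):
        match = CLAUSE.fullmatch(part)
        if match is None:
            raise ValueError(f"cannot parse clause: {part!r}")
        clauses.append((OPS[match.group(1)], float(match.group(2))))
    return lambda value: combine(op(value, bound) for op, bound in clauses)

def keep_attribute(values, condition):
    """A numeric Attribute is kept only if every Example matches."""
    check = parse_condition(condition)
    return all(check(v) for v in values)

kept_a = keep_attribute([7, 8, 10], ">6 && <11")   # every value is in range
kept_b = keep_attribute([7, 12], ">6 && <11")      # 12 violates '<11'
kept_c = keep_attribute([-1, 3], "<= 5 || <0")     # each value matches one clause
```

A condition that mixes && and ||, such as the parenthesized example above, raises a ValueError in this sketch.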

Include special attributes

Special Attributes are Attributes with special roles. These are: id, label, prediction, cluster, weight and batch. Custom roles can also be assigned to Attributes. By default, all special Attributes are delivered to the output port irrespective of the conditions in the Select Attribute Operator. If this parameter is set to true, special Attributes are also tested against the conditions specified in the Select Attribute Operator, and only those Attributes that match the conditions are selected.

Invert selection

If this parameter is set to true, the selection is reversed. In that case all Attributes matching the specified condition are removed and the other Attributes remain in the output ExampleSet. Special Attributes are kept independent of the invert selection parameter as long as the include special attributes parameter is not set to true. If it is set to true, the condition is also applied to the special Attributes, and the selection is reversed if this parameter is checked.
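How the include special attributes and invert selection parameters interact can be sketched as follows. The function, the representation of an Attribute as a (name, role) pair, and the example condition are all hypothetical and only mirror the behaviour described above.

```python
def select_attributes(attributes, condition,
                      include_special=False, invert=False):
    """attributes: list of (name, role); role is None for regular ones."""
    selected = []
    for name, role in attributes:
        if role is not None and not include_special:
            # Special Attributes bypass the condition (and the inversion).
            selected.append(name)
            continue
        matches = condition(name)
        if matches != invert:  # inversion flips the outcome of the test
            selected.append(name)
    return selected

attributes = [("id", "id"), ("age", None), ("income", None), ("label", "label")]

def condition(name):
    # Hypothetical filter: matches the Attributes "id" and "age".
    return name in {"id", "age"}

default = select_attributes(attributes, condition)
with_special = select_attributes(attributes, condition, include_special=True)
inverted = select_attributes(attributes, condition, invert=True)
inverted_special = select_attributes(attributes, condition,
                                     include_special=True, invert=True)
```

With the defaults, "id" and "label" are kept because they are special. Once include_special is true, they are tested like any other Attribute, so inverting the selection then removes "id" and keeps "label".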

Normalize weights

This parameter indicates if the weights of the resultant Attribute weights vector should be normalized. If set to true, all weights are normalized such that the minimum weight is 0 and the maximum weight is 1.
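The normalization described here corresponds to min-max scaling, which can be sketched as below. The function name and the example weights are hypothetical, and returning 1 for every Attribute when all weights are equal is an assumption of this sketch.

```python
def normalize_weights(weights):
    """Rescale weights so that the minimum is 0 and the maximum is 1."""
    low, high = min(weights.values()), max(weights.values())
    if high == low:
        # All weights equal: no meaningful range (assumed behaviour).
        return {name: 1.0 for name in weights}
    return {name: (w - low) / (high - low) for name, w in weights.items()}

# Hypothetical weights for three Attributes.
weights = {"a": 1.0, "b": 2.0, "c": 5.0}
normalized = normalize_weights(weights)
```

For these weights, "a" is mapped to 0, "c" to 1, and "b" to a value in between.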

Squared correlation

This parameter indicates if the squared correlation should be calculated. If set to true, the correlation matrix shows squares of correlations instead of simple correlations.
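The effect of this parameter can be illustrated by squaring each entry of a small, made-up correlation matrix.

```python
# Hypothetical correlation matrix for two Attributes.
matrix = [
    [1.0, -0.5],
    [-0.5, 1.0],
]

# Square every entry; the sign of the association is lost.
squared = [[value ** 2 for value in row] for row in matrix]
```

Squared correlations lie between 0 and 1, so they show the strength of an association but no longer its direction.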