
Principal Component Analysis

Synopsis

This operator performs a Principal Component Analysis (PCA) using the covariance matrix. The user can either specify the amount of variance in the original data that the retained principal components should cover, or set the number of principal components manually.

Description

Principal component analysis (PCA) is an attribute reduction procedure. It is useful when you have obtained data on a number of attributes (possibly a large number of attributes), and believe that there is some redundancy in those attributes. In this case, redundancy means that some of the attributes are correlated with one another, possibly because they are measuring the same construct. Because of this redundancy, you believe that it should be possible to reduce the observed attributes into a smaller number of principal components (artificial attributes) that will account for most of the variance in the observed attributes.

Principal Component Analysis is a mathematical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated attributes into a set of values of uncorrelated attributes called principal components. The number of principal components is less than or equal to the number of original attributes. This transformation is defined in such a way that the first principal component's variance is as high as possible (accounts for as much of the variability in the data as possible), and each succeeding component in turn has the highest variance possible under the constraint that it should be orthogonal to (uncorrelated with) the preceding components.
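The transformation described above can be sketched in a few lines of NumPy. This is only an illustration of covariance-based PCA, not the operator's actual implementation; the function name `pca_covariance` and the synthetic data are assumptions made for the example.

```python
import numpy as np

def pca_covariance(X):
    """Covariance-based PCA sketch (hypothetical helper, not the operator's code).

    Returns the eigenvalues (component variances, largest first), the matching
    eigenvectors as columns, and the data projected onto the components.
    """
    X = np.asarray(X, dtype=float)
    centered = X - X.mean(axis=0)            # PCA works on mean-centered data
    cov = np.cov(centered, rowvar=False)     # attributes are columns
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigh: for symmetric matrices
    order = np.argsort(eigvals)[::-1]        # sort by variance, descending
    eigvals = eigvals[order]
    eigvecs = eigvecs[:, order]
    scores = centered @ eigvecs              # orthogonal transformation
    return eigvals, eigvecs, scores

# Synthetic data: the second attribute is largely redundant with the first.
rng = np.random.default_rng(0)
base = rng.normal(size=(200, 1))
X = np.hstack([
    base,
    2 * base + 0.1 * rng.normal(size=(200, 1)),
    rng.normal(size=(200, 1)),
])

eigvals, components, scores = pca_covariance(X)
score_cov = np.cov(scores, rowvar=False)     # covariance of the new attributes
```

In this sketch the off-diagonal entries of `score_cov` are numerically zero, which is what "uncorrelated" means for the principal components, and its diagonal reproduces `eigvals`, the variance carried by each component.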

Please note that PCA is sensitive to the relative scaling of the original attributes. This means that whenever different attributes have different units (like temperature and mass), PCA is a somewhat arbitrary method of analysis. For example, different results would be obtained if one used Fahrenheit rather than Celsius.
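The effect of units can be seen in a small NumPy experiment. The data, the attribute names and the helper functions below are made up for illustration. Standardizing the attributes before PCA is shown as one common way to remove the dependence on units; it is not something this operator is stated to do.

```python
import numpy as np

def first_component(X):
    """Return (loadings of the first component, its share of total variance)."""
    centered = X - X.mean(axis=0)
    eigvals, eigvecs = np.linalg.eigh(np.cov(centered, rowvar=False))
    top = np.argmax(eigvals)
    return eigvecs[:, top], eigvals[top] / eigvals.sum()

def standardize(X):
    """Scale every attribute to zero mean and unit variance (z-scores)."""
    return (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# Hypothetical measurements: temperature and a mildly correlated mass.
rng = np.random.default_rng(42)
temp_c = rng.normal(20.0, 5.0, size=2000)
mass_kg = 70.0 + 0.5 * (temp_c - 20.0) + rng.normal(0.0, 7.0, size=2000)
temp_f = temp_c * 9.0 / 5.0 + 32.0           # same temperatures in Fahrenheit

X_celsius = np.column_stack([temp_c, mass_kg])
X_fahrenheit = np.column_stack([temp_f, mass_kg])

# Column 0 is temperature, column 1 is mass.
load_c, share_c = first_component(X_celsius)
load_f, share_f = first_component(X_fahrenheit)

# After standardization the choice of unit no longer matters.
_, share_c_std = first_component(standardize(X_celsius))
_, share_f_std = first_component(standardize(X_fahrenheit))
```

With these synthetic numbers the first component is dominated by mass when temperature is in Celsius and by temperature when it is in Fahrenheit, simply because the Fahrenheit column has a larger variance. The standardized variants give the same variance share for both units.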

Input

example set

This input port expects an ExampleSet. In the attached Example Process it is the output of the Retrieve operator, but the output of other operators can also be used as input. Meta data must be attached to the input data because the attributes are specified in the meta data. The Retrieve operator provides meta data along with the data. Please note that this operator cannot handle nominal attributes; it works only on numerical attributes.

Output

example set

The Principal Component Analysis is performed on the input ExampleSet and the resultant ExampleSet is delivered through this port.

original

The ExampleSet that was given as input is passed unchanged to the output through this port. This is usually used to reuse the same ExampleSet in further operators or to view the ExampleSet in the Results Workspace.

preprocessing model

This port delivers the preprocessing model, which has information regarding the parameters of this operator in the current process.

Parameters

Dimensionality reduction

This parameter indicates which type of dimensionality reduction (reduction in number of attributes) should be applied.

  • none: if this option is selected, no component is removed from the ExampleSet.
  • keep_variance: if this option is selected, all the components with a cumulative variance greater than the given threshold are removed from the ExampleSet. The threshold is specified by the variance threshold parameter.
  • fixed_number: if this option is selected, only a fixed number of components are kept. The number of components to keep is specified by the number of components parameter.
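
The three options can be sketched as a small selection rule over the component variances (eigenvalues). This is a literal reading of the descriptions above, not the operator's source code: the function `components_to_keep` and its argument names are hypothetical, and how the operator treats a component whose cumulative variance lands exactly on the threshold is an assumption here.

```python
import numpy as np

def components_to_keep(eigenvalues, mode="none",
                       variance_threshold=0.95, number_of_components=1):
    """Return how many leading components survive (hypothetical helper).

    eigenvalues: variances of the principal components, in any order.
    mode: "none", "keep_variance" or "fixed_number".
    """
    # Components are ranked by variance, largest first.
    eigenvalues = np.sort(np.asarray(eigenvalues, dtype=float))[::-1]
    total = eigenvalues.size

    if mode == "none":
        return total                                  # nothing is removed
    if mode == "fixed_number":
        return min(number_of_components, total)       # cannot exceed what exists
    if mode == "keep_variance":
        # Cumulative share of the total variance covered by components 1..k.
        cumulative = np.cumsum(eigenvalues) / eigenvalues.sum()
        # Assumption: a component is removed once its cumulative share is
        # strictly greater than the threshold. Under this literal reading the
        # result can be 0 if the first component alone exceeds the threshold.
        return int(np.count_nonzero(cumulative <= variance_threshold))
    raise ValueError(f"unknown mode: {mode!r}")

# Component variances 4, 3, 2, 1 give cumulative shares 0.4, 0.7, 0.9, 1.0.
variances = [4.0, 3.0, 2.0, 1.0]
kept_none = components_to_keep(variances, mode="none")
kept_fixed = components_to_keep(variances, mode="fixed_number",
                                number_of_components=2)
kept_variance = components_to_keep(variances, mode="keep_variance",
                                   variance_threshold=0.8)
```

Under these assumptions a threshold of 0.8 keeps the first two components (cumulative shares 0.4 and 0.7) and removes the remaining two, whose cumulative shares exceed the threshold.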

Variance threshold

This parameter is available only when the dimensionality reduction parameter is set to 'keep variance'. All the components with a cumulative variance greater than this threshold are removed from the ExampleSet.

Number of components

This parameter is available only when the dimensionality reduction parameter is set to 'fixed number'. It specifies the number of components to keep.