# 5.6.1 Principal Component Analysis

## Summary

Principal Component Analysis is useful for reducing and interpreting large multivariate data sets with underlying linear structures, and for discovering previously unsuspected relationships.

We will start with data measuring protein consumption in twenty-five European countries for nine food groups. Using Principal Component Analysis, we will examine the relationship between protein sources and these European countries.

## Selecting Principal Components

To determine the number of principal components to be retained, we should first run Principal Component Analysis and then proceed based on its result:

1. Open a new project or a new workbook. Import the data file \samples\Statistics\Protein Consumption in Europe.dat
2. Select the entire worksheet and then select Statistics: Multivariate Analysis: Principal Component Analysis.
3. Accept the default settings in the open dialog box and click OK.
4. Select sheet PCA Report.
5. In the Eigenvalues of the Correlation Matrix table, we can see that the first four principal components explain 86% of the variance and the remaining components each contribute 5% or less. We will therefore retain four principal components.
6. A scree plot can be a useful visual aid for determining the appropriate number of principal components. The number of components to retain is indicated by the "elbow" point, after which the remaining eigenvalues are relatively small and all about the same size. This point is not very evident in the scree plot, but we can still treat the fourth point as our "elbow" point.
7. Click the lock icon in the results tree and select Change Parameters in the context menu. In the Settings tab, set Number of Components to Extract to 4. Do not close the dialog; in the next steps, we will request component plots.
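
The retention rule used in step 5 (keep the leading components until their cumulative share of the variance is large enough) can be sketched in a few lines of Python. This is only an illustration, not what Origin runs internally: the function names, the 0.85 threshold, and the eigenvalues below are hypothetical placeholders, and the eigenvalues are not the values from the PCA Report.

```python
def cumulative_variance(eigenvalues):
    """Return the cumulative proportion of variance after each component."""
    total = sum(eigenvalues)
    running = 0.0
    cumulative = []
    for value in eigenvalues:
        running += value
        cumulative.append(running / total)
    return cumulative


def components_needed(eigenvalues, threshold):
    """Return the smallest number of leading components whose
    cumulative proportion of variance reaches the threshold."""
    proportions = cumulative_variance(eigenvalues)
    for count, proportion in enumerate(proportions, start=1):
        if proportion >= threshold:
            return count
    return len(eigenvalues)


# Hypothetical eigenvalues in descending order (NOT taken from the report).
# For a correlation matrix of nine variables they sum to about 9.
example_eigenvalues = [4.0, 1.6, 1.1, 1.0, 0.5, 0.3, 0.25, 0.15, 0.1]

for index, share in enumerate(cumulative_variance(example_eigenvalues), start=1):
    print(f"PC{index}: cumulative proportion = {share:.3f}")

print("Components to keep:", components_needed(example_eigenvalues, 0.85))
```

To apply the same check to the tutorial data, replace the placeholder list with the eigenvalues from the Eigenvalues of the Correlation Matrix table.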

## Request Principal Component Plots

In the Plots tab of the dialog, users can choose whether to create a scree plot or component plots.

• Scree Plot
The scree plot is a useful visual aid for determining an appropriate number of principal components.
• Component Plot
Component plots show the component score of each observation or the component loading of each variable for a pair of principal components. In the Select Principal Components to Plot group, users can specify which pair of components to plot. The component plots include:
  • Loading Plot
  The loading plot shows the relationship between the original variables and the subspace dimensions. It is used to interpret relationships between variables.
  • Score Plot
  The score plot is a projection of the data onto the subspace. It is used to interpret relationships between observations.
  • Biplot
  The biplot shows both the loadings and the scores for two selected components in parallel.

1. In the dialog that was opened in the preceding steps, select the Plots tab. Make sure Scree Plot, Loading Plot, and Biplot are selected.
2. The first two components usually account for the bulk of the variance, so we will draw the component plots in the space of the first two principal components. In the Select Principal Components to Plot group, set Principal Component for X Axis to 1, and set Principal Component for Y Axis to 2. Click OK.

## Interpreting The Results

1. In the Correlation Matrix, we can see that the variables are highly correlated. Many values are greater than 0.3. Principal Component Analysis is an appropriate tool for removing the collinearity.
2. The principal component variables are defined as linear combinations of the original variables. The Extracted Eigenvectors table provides the coefficients for these equations.
$PC1=0.30261*RedMeat + 0.31056*WhiteMeat + 0.42668*Eggs + 0.37773*Milk + 0.13565*Fish - 0.43774*Cereals + 0.29725*Starch - 0.42033*Nuts - 0.11042*FruitsVegetables$
$PC2=-0.05625*RedMeat - 0.23685*WhiteMeat - 0.03534*Eggs - 0.18459*Milk + 0.64682*Fish - 0.23349*Cereals + 0.35283*Starch + 0.14331*Nuts + 0.53619*FruitsVegetables$
$PC3=-0.29758*RedMeat + 0.6239*WhiteMeat + 0.18153*Eggs - 0.38566*Milk - 0.32127*Fish + 0.09592*Cereals + 0.24298*Starch - 0.05439*Nuts + 0.40756*FruitsVegetables$
$PC4=0.64648*RedMeat - 0.03699*WhiteMeat + 0.31316*Eggs - 0.00332*Milk - 0.21596*Fish - 0.0062*Cereals - 0.33668*Starch + 0.33029*Nuts + 0.46206*FruitsVegetables$
3. The Loading Plot reveals the relationships between variables in the space of the first two components. In the loading plot, we can see that Red Meat, Eggs, Milk, and White Meat have similar heavy loadings for principal component 1. Fish and Fruits and Vegetables, however, have similar heavy loadings for principal component 2.
4. The biplot shows both the loadings and the scores for two selected components in parallel. The score points show the projection of each observation onto the subspace, and the plot can also reveal relationships between observations and variables in the subspace of the first two components. (Note: Double-click the graph to open and customize it.)
5. We can use the Data Reader tool to open the Data Info window and examine the plot in greater detail. Click on a data point to read component scores for each country. We can see that Spain and Portugal's protein sources differ from those of other European countries. Spain and Portugal rely on fruits and vegetables, while eastern European countries such as Albania, Bulgaria, Yugoslavia, and Romania prefer cereals and nuts.
To display country information in the Data Info window, as in the image above: Right-click the Data Info window and select Preferences... Highlight Country in the left panel, then click the Select button (the right-pointing arrow) to add Country to the Data Info display, then click OK.
Note: Since Origin 2019, you can simply hover on a data point to show a tooltip with data point coordinate information. Both the tooltip and the Data Info display are customizable. See The Data Info Window and Data Point Tooltip for more information.
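
The linear combinations in step 2 can be evaluated with a short Python sketch. The PC1 and PC2 coefficients are copied from the equations above. The function name and the example observation are hypothetical. Because the analysis is based on the correlation matrix, the sketch assumes each input value has already been standardized (mean subtracted, divided by the standard deviation); that is an assumption about how the scores are formed, not a statement of Origin's exact procedure.

```python
VARIABLES = [
    "RedMeat", "WhiteMeat", "Eggs", "Milk", "Fish",
    "Cereals", "Starch", "Nuts", "FruitsVegetables",
]

# Coefficients copied from the PC1 and PC2 equations in step 2.
PC1 = [0.30261, 0.31056, 0.42668, 0.37773, 0.13565,
       -0.43774, 0.29725, -0.42033, -0.11042]
PC2 = [-0.05625, -0.23685, -0.03534, -0.18459, 0.64682,
       -0.23349, 0.35283, 0.14331, 0.53619]


def pc_score(coefficients, standardized_values):
    """Return the component score: the sum of coefficient * standardized value."""
    if len(coefficients) != len(standardized_values):
        raise ValueError("coefficients and values must have the same length")
    return sum(c * z for c, z in zip(coefficients, standardized_values))


# Hypothetical standardized observation (NOT a real country from the data set),
# listed in the same order as VARIABLES.
example_z = [0.5, -1.0, 0.0, 0.2, 1.5, -0.3, 0.8, 0.1, 1.2]

print("PC1 score:", round(pc_score(PC1, example_z), 5))
print("PC2 score:", round(pc_score(PC2, example_z), 5))
```

The pair of scores gives the coordinates of one observation in the space of the first two components, which is what each point in the score plot and biplot represents.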