
8.1 Principal Components

In order to illustrate the contribution of data variables to the data variability, let us inspect Figure 8.1 where three datasets with a bivariate normal distribution are shown.

In Figure 8.1a, variables X and Y are uncorrelated and have the same variance, σ² = 1. The circle is the equal-density curve for a 2σ deviation from the mean. Any linear combination of X and Y corresponds, in this case, to a radial direction exhibiting the same variance. Thus, in this situation, X and Y are as good at describing the data as any other orthogonal pair of variables.

Figure 8.1. Bivariate, normally distributed datasets showing the standard deviations along X and Y with dark grey bars: a) Equal standard deviations (1); b) Very small standard deviation along Y (0.15); and c) Correlated variables of equal standard deviations (1.31), with a light-grey bar showing the standard deviation of the main principal component (3.42).


In Figure 8.1b, X and Y are uncorrelated but have different variances, namely a very small variance along Y, σ_Y² = 0.0225. The importance of Y in describing the data is tenuous. In the limit, with σ_Y² → 0, Y would be discarded as an interesting variable and the equal-density ellipse would converge to a line segment.

In Figure 8.1c, X and Y are correlated (ρ = 0.99) and have the same variance, σ² = 1.72. In this case, as shown in the figure, any equal-density ellipse leans along the regression line at 45º. Based only on the variances of X and Y, we might be led to the idea that two variables are needed in order to explain the variability of the data. However, if we choose an orthogonal co-ordinate system with one axis along the regression line, we immediately see that we have a situation similar to Figure 8.1b; that is, only one hidden variable (absent in the original data), say Z, with high variance (3.42) is needed (light-grey bar in Figure 8.1c). The other orthogonal variable is responsible for only a residual variance (0.02). A variable that maximises the data variance in this way is called a principal component of the data. Using only one variable, Z, instead of the two variables X and Y, amounts to a dimensional reduction of the data.
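As an illustration of the situation in Figure 8.1c, the following R lines (a minimal simulation sketch, assuming the MASS package is available; the mean vector is arbitrary and only for illustration) generate a bivariate normal sample with the quoted variance (1.72) and correlation (0.99) and check that almost all of the variability is captured by a single direction:

library(MASS)                                           # for mvrnorm (assumed available)
set.seed(1)
Sigma <- matrix(c(1.72, 1.70, 1.70, 1.72), nrow = 2)    # covariance of Figure 8.1c
xy <- mvrnorm(n = 10000, mu = c(3, 3), Sigma = Sigma)   # mu is an arbitrary, illustrative mean
ev <- eigen(cov(xy))
ev$values    # close to 3.42 and 0.02: the variances along the two principal directions
ev$vectors   # first column close to (0.7071, 0.7071): the 45 degree direction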

Consider a multivariate dataset, with x = [X_1 X_2 … X_d]', and let S denote the sample covariance matrix of the data (point estimate of the population covariance Σ), where each element s_ij is the covariance between variables X_i and X_j, estimated as follows for n cases (see A.8.2):

s_ij = [1/(n − 1)] Σ_{k=1}^{n} (x_ki − x̄_i)(x_kj − x̄_j).    8.1

Notice that covariances are symmetric, s_ij = s_ji, and that s_ii is the usual estimate of the variance of X_i, s_i². The covariance is related to the correlation, estimated as:

r_ij = s_ij/(s_i s_j) = [Σ_{k=1}^{n} (x_ki − x̄_i)(x_kj − x̄_j)] / [(n − 1) s_i s_j],  with r_ij ∈ [−1, 1].    8.2

Therefore, the correlation can be interpreted as a standardised covariance.
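As a quick numerical check of formulas 8.1 and 8.2, the following R lines (a minimal sketch using made-up data) compare the explicit sums with the built-in cov and cor functions:

x1 <- c(1.2, 2.3, 3.1, 4.8, 5.0)   # hypothetical sample of X_i
x2 <- c(2.0, 2.9, 4.2, 5.1, 6.3)   # hypothetical sample of X_j
n  <- length(x1)
s12 <- sum((x1 - mean(x1)) * (x2 - mean(x2))) / (n - 1)   # covariance, formula 8.1
r12 <- s12 / (sd(x1) * sd(x2))                            # correlation, formula 8.2
c(s12, cov(x1, x2))   # identical values
c(r12, cor(x1, x2))   # identical values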

In order to obtain the principal components of a dataset, we search for uncorrelated linear combinations of the original variables whose variances are as large as possible. The first principal component corresponds to the direction of maximum variance; the second principal component corresponds to an uncorrelated direction that maximises the remaining variance, and so on. Let us shift the co-ordinate system in order to bring the sample mean to the origin, x_c = x − x̄. The maximisation process needed to determine the ith principal component as a linear combination of the x_c co-ordinates, z_i = u_i'(x − x̄), is expressed by the following equation (for details see e.g. Fukunaga K, 1990, or Jolliffe IT, 2002):

(S − λ_i I) u_i = 0,    8.3


where I is the d × d unit matrix, λ_i is a scalar and u_i is a d × 1 column vector of the linear combination coefficients.

In order to obtain non-trivial solutions of equation 8.3, one needs to solve the determinant equation |S − λI| = 0. There are d scalar solutions λ_i of this equation, called the eigenvalues or characteristic values of S, which represent the variances of the new variables z_i. After solving the homogeneous system of equations for the different eigenvalues, one obtains a family of eigenvectors or characteristic vectors u_i, such that u_i' u_j = 0 for all i ≠ j (orthogonal system of uncorrelated variables). Usually, one selects from the family of eigenvectors those that have unit length, u_i' u_i = 1, ∀ i (orthonormal system).

We will now illustrate the computation of eigenvalues and eigenvectors for the covariance matrix of Figure 8.1c:

 1 . 72 1 . 7  S = 

The eigenvalues are computed as:

For λ 1 the homogeneous system of equations is:

from where we derive the unit length eigenvector: u 1 = [0.7071 0.7071] ’ ≡[ 1 / 2

1 / 2 ] . For ’ λ 2 , in the same way we derive the unit length eigenvector orthogonal

to u 1 :u 2 =[ −0.7071 0.7071] ≡ [− ’ 1 / 2 1 / 2 ] . Thus, the principal components ’

of the co-ordinates are Z 1 = (X 1 +X 2 )/ 2 and Z 2 =( –X 1 +X 2 )/ 2 with variances

3.42 and 0.02, respectively. The unit length eigenvectors make up the column vectors of an orthonormal

matrix U (i.e., U −1 = U ) used to determine the co-ordinates of an observation x in ’

the new uncorrelated system of the principal components:

z = U (x – x ). ’
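The same numbers can be checked in R (a minimal sketch; the matrix below is the covariance matrix quoted above for Figure 8.1c, and the observation and mean vectors are hypothetical, only for illustration):

S <- matrix(c(1.72, 1.70, 1.70, 1.72), nrow = 2)
e <- eigen(S)
e$values          # 3.42 and 0.02, the variances of Z1 and Z2
U <- e$vectors    # columns are unit-length eigenvectors (possibly with flipped signs)

# co-ordinates z of an observation x in the principal component system (formula 8.4)
x    <- c(2.5, 3.0)   # hypothetical observation
xbar <- c(3.0, 3.0)   # hypothetical sample mean
z <- t(U) %*% (x - xbar)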

These co-ordinates in the principal component space are often called “z-scores”. In order to avoid confusion with the previous meaning of z-scores – standardised data with zero mean and unit variance – we will use the term pc-scores instead.

The extraction of principal components is basically a variance-maximising rotation of the original variable space. Each principal component corresponds to a certain amount of variance of the whole dataset. For instance, in the example portrayed in Figure 8.1c, the first principal component represents λ_1/(λ_1 + λ_2) = 99% of the total variance. In short, u_1 alone contains practically all the information about the data; the remaining u_2 is residual “noise”.

Let Λ represent the diagonal matrix of the eigenvalues:

 8.5 K K K K  

The following properties are verified:

1. U'SU = Λ and S = UΛU'.    8.6

2. The determinant of the covariance matrix, |S|, is:

   |S| = |Λ| = λ_1 λ_2 … λ_d.    8.7

   |S| is called the generalised variance and its square root is proportional to the area or volume of the data cluster, since it is the product of the ellipsoid axes.

3. The traces of S and Λ are equal to the sum of the variances of the variables:

   tr(S) = tr(Λ) = s_1² + s_2² + … + s_d².    8.8

Based on this property, we measure the contribution of each principal component Z_k by λ_k/Σλ_i = λ_k/(s_1² + s_2² + … + s_d²), as we did previously.
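These three properties are easy to verify numerically; a brief R sketch, reusing the 2×2 covariance matrix of the worked example above:

S <- matrix(c(1.72, 1.70, 1.70, 1.72), nrow = 2)
e <- eigen(S)
U <- e$vectors

round(t(U) %*% S %*% U, 4)       # property 1: U'SU = Lambda (diagonal of eigenvalues)
c(det(S), prod(e$values))        # property 2: |S| = product of the eigenvalues
c(sum(diag(S)), sum(e$values))   # property 3: tr(S) = sum of the variances
e$values / sum(e$values)         # contribution of each principal component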

The contribution of each original variable X_j to each principal component Z_i can be assessed by means of the corresponding sample correlation between X_j and Z_i, often called the loading of X_j:

r_ij = (u_ji √λ_i) / s_j.    8.9

Function pccorr, implemented in MATLAB and R and supplied in Tools (see Commands 8.1), allows computing the r_ij correlations.
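For readers without the Tools functions, the loadings of formula 8.9 can also be computed directly. The following R function is only a minimal sketch of that computation (it is not the book's pccorr implementation), applied to a generic data matrix x:

loadings_pc <- function(x) {
  S <- cov(x)
  e <- eigen(S)
  s <- sqrt(diag(S))                            # standard deviations s_j
  # r_ij = u_ji * sqrt(lambda_i) / s_j   (formula 8.9)
  sweep(e$vectors, 2, sqrt(e$values), "*") / s  # rows: variables X_j; columns: components Z_i
}

Applied to the cork data of Example 8.1 (with ART and PRT loaded), loadings_pc(cbind(ART[1:50], PRT[1:50])) should give the r_ij correlations, up to an unimportant change of sign.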

Example 8.1

Q: Consider the best class of the Cork Stoppers’ dataset (first 50 cases). Compute the covariance matrix and its eigenvalues and eigenvectors using the original variables ART and PRT. Determine the algebraic expression and contribution of the main principal component, its correlation with the original variables, as well as the new co-ordinates of the first cork-stopper.

A: We use MATLAB to perform the necessary computations (see Commands 8.1). Let cork represent the data matrix with all 10 features. We then use:


» % Extract 1st class ART and PRT from cork
» x = [cork(1:50,1) cork(1:50,3)];
» S = cov(x);                  % covariance matrix
» [u,lambda,e] = pcacov(S);    % principal components
» r = pccorr(x);               % correlations

The results S, u, lambda, e and r are shown in Table 8.1. The scatter plots of the data using the original variables and the principal components are shown in Figure 8.2. The pc-scores can be obtained with:

» xc = x - ones(50,1)*mean(x);
» z = (u'*xc')';

We see that the first principal component, with algebraic expression −0.3501×ART − 0.9367×PRT, is highly correlated with the original variables and explains almost 99% of the total variance. The first cork-stopper, represented by [81 250]' in the ART-PRT plane, maps into:

The eigenvector components are the cosines of the angles subtended by the principal components in the ART-PRT plane. In Figure 8.2a, this result can only be visually appreciated after giving equal scales to the axes.

Table 8.1. Eigenvectors and eigenvalues obtained with MATLAB for the first class of cork-stoppers (variables ART and PRT).

Covariance      Eigenvectors      Eigenvalues      Explained variance, e (%)      Correlations for z_1, r_1j
0.1849  0.4482  −0.3501  −0.9367

An interesting application of principal components is in statistical quality control. The possibility afforded by principal components of having a much-reduced set of variables explaining the whole data variability is an important advantage: instead of controlling several variables, with the same type of Type I error degradation as explained in 4.5.1, sometimes only one variable needs to be controlled.

Furthermore, principal components afford an easy computation of the following Hotelling’s T² measure of variability:

T² = (x − x̄)' S⁻¹ (x − x̄) = z' Λ⁻¹ z = Σ_{i=1}^{d} z_i²/λ_i.    8.10

Critical values of T² are computed in terms of the F distribution as follows:

T²_{1−α} = [d(n − 1)/(n − d)] F_{d, n−d, 1−α}.    8.11


Figure 8.2. Scatter plots obtained with MATLAB of the cork-stopper data (first class) represented in the planes: a) ART-PRT with superimposed principal components; b) Principal components. The first cork is shown with a solid circle.


Figure 8.3. T² chart for the first class of the cork-stopper data. Case #20 is out of control.

Example 8.2

Q: Determine the Hotelling’s T² control chart for the previous Example 8.1 and find the corks that are “out of control” at a 95% confidence level.


A: The Hotelling’s T² values can be determined with the MATLAB princomp function. The 95% critical value for F_{2,48} is 3.19; hence, the 95% critical value for Hotelling’s T², using formula 8.11, is computed as 6.51. Figure 8.3 shows the corresponding control chart. Cork #20 is clearly “out of control”, i.e., it should be reclassified. Corks #34 and #39 are borderline cases.
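A rough R equivalent of this computation is sketched below (assuming the ART and PRT vectors of the cork dataset are already loaded, as in the R code shown with Commands 8.1):

x <- cbind(ART[1:50], PRT[1:50])   # first class of the cork data (assumed already loaded)
n <- nrow(x); d <- ncol(x)

p  <- prcomp(x)                                        # principal components
t2 <- rowSums(sweep(p$x, 2, p$sdev, "/")^2)            # T2 = sum_i z_i^2 / lambda_i

t2crit <- d * (n - 1) / (n - d) * qf(0.95, d, n - d)   # formula 8.11, approximately 6.51
which(t2 > t2crit)                                     # cases "out of control"
plot(t2, type = "b"); abline(h = t2crit)               # T2 chart similar to Figure 8.3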

Commands 8.1. SPSS, STATISTICA, MATLAB and R commands used to perform principal component and factor analyses.

SPSS         Analyze; Data Reduction; Factor

STATISTICA   Statistics; Multivariate Exploratory Techniques; Factor Analysis

MATLAB       [u,l] = eig(C)
             [pc, lat, expl] = pcacov(C)
             [pc, score, lat, tsq] = princomp(x)
             residuals = pcares(x,ndim)
             [ndim,p,chisq] = barttest(x,alpha)
             r = pccorr(x) ; f = velcorr(x,icov)

R            eigen(C) ; prcomp(x) ; princomp(x)
             screeplot(p)
             factanal(x,factors,scores,rotation)
             pccorr(x) ; velcorr(x,icov)

SPSS and STATISTICA commands are of straightforward use. SPSS and STATISTICA always use the correlation matrix instead of the covariance matrix for computing the principal components. Figure 8.4 shows the STATISTICA specification window for the selection of the two most important components with eigenvalues above 1. If one wishes to obtain all principal components, one should set the Min. eigenvalue to 0 and the Max. no. of factors to the data dimension.
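In R the same choice is explicit. A brief sketch, for a generic data matrix x, of the difference between covariance-based and correlation-based (i.e., standardised, as SPSS and STATISTICA do by default) principal components:

p.cov <- prcomp(x)                  # principal components of the covariance matrix
p.cor <- prcomp(x, scale. = TRUE)   # principal components of the correlation matrix,
                                    # i.e. of the standardised data
summary(p.cor)                      # standard deviations and explained variance per component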

The MATLAB eig function returns the eigenvectors, u, and eigenvalues, l, of C. The pcacov function determines the principal components of a covariance matrix C, which are returned in pc. The return vectors lat and expl store the variances and the contributions of the principal components to the total variance, respectively. The princomp function returns the principal components and eigenvalues of a data matrix x in pc and lat, respectively. The pc-scores and Hotelling’s T² are returned in score and tsq, respectively. The pcares function returns the residuals obtained by retaining the first ndim principal components of x. The barttest function returns the number of dimensions to retain, together with the Bartlett’s test probabilities, p, and χ² scores, chisq (see section 8.2).

The MATLAB-implemented pccorr function computes the correlations between the original variables and the principal components of a data matrix x. The velcorr function computes the Velicer partial correlations (see section 8.2) using matrix x either as a data matrix (icov ≠ 0) or as a covariance matrix (icov = 0).

The R eigen function behaves as the MATLAB eig function. For instance, the eigenvalues and eigenvectors of Table 8.1 can be obtained with eigen(cov(cbind(ART[1:50],PRT[1:50]))). The prcomp function computes among other things the principal components (curiously, called “rotation” or “loadings” in R) and their standard deviations (square roots of the eigenvalues). For the dataset of Example 8.1 one would use:

> p <- prcomp(cbind(ART[1:50],PRT[1:50]))
> p
Standard deviations:
[1] 117.65407  13.18348

Rotation:
           PC1        PC2
[1,] 0.3500541  0.9367295
[2,] 0.9367295 -0.3500541

We thus obtain the same eigenvectors ( PC1 and PC2) as in Table 8.1 (with an unimportant change of sign). The standard deviations are the square roots of the eigenvalues listed in Table 8.1. With the R princomp function, besides the principal components and their standard deviations, one can also obtain the data projections onto the eigenvectors (the so-called scores in R).

A scree plot (see section 8.2) can be obtained in R with the screeplot function using as argument an object returned by the princomp function. The R factanal function performs factor analysis (see section 8.4) of the data matrix x returning the number of factors specified by factors with the specified rotation method. Bartlett’s test scores can be specified with scores.
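For instance, continuing Example 8.1, one possible sequence is sketched below (the exact signs of the components may differ between functions):

pc <- princomp(cbind(ART[1:50], PRT[1:50]))   # assumes ART and PRT are loaded
pc$sdev           # standard deviations of the principal components
head(pc$scores)   # pc-scores: projections of the cases onto the eigenvectors
screeplot(pc)     # scree plot of the component variances (see section 8.2)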

The R implemented functions pccorr and velcorr behave in the same way as their MATLAB counterparts.

Figure 8.4. Partial view of STATISTICA specification window for principal component analysis with standardised data.

8.2 Dimensional Reduction