8.4 Factor Analysis
Let us again consider equation 8.4, which yields the pc-scores of the data using the d × d matrix U of the eigenvectors:

z = U’(x − x̄).
Conversely, with this equation we can obtain the original data from their principal components:

x = x̄ + Uz.    8.17
If we discard some principal components, using a reduced d × k matrix Uₖ, we no longer obtain the original data, but an estimate x̂:

x̂ = x̄ + Uₖzₖ.    8.18
Using 8.17 and 8.18, we can express the original data in terms of the estimation error e = x − x̂, as:

x = x̄ + Uₖzₖ + (x − x̂) = x̄ + Uₖzₖ + e.    8.19
When all principal components are used, the covariance matrix satisfies S = UΛU’ (see formula 8.6 in the properties mentioned in section 8.1). Using the reduced eigenvector matrix Uₖ, and taking 8.19 into account, we can express S in terms of an approximate covariance matrix Sₖ and an error matrix E:

S = UₖΛₖUₖ’ + E = Sₖ + E,    8.20

where Λₖ is the k × k diagonal matrix of the retained eigenvalues.
In factor analysis, the retained principal components are called common factors. Their correlations with the original variables are called factor loadings. Each original variable has a communality, hᵢ², which is the variability associated with the ith variable that is accounted for by the common factors:

hᵢ² = ∑ⱼ₌₁ᵏ λⱼ uᵢⱼ².    8.21

The communalities are the diagonal elements of Sₖ and make up a diagonal communality matrix H.
Example 8.8
Q: Compute the approximate covariance, communality and error matrices for Example 8.1.
A: Using MATLAB to carry out the computations, we obtain the approximate covariance Sₖ, the communality matrix H and the error matrix E = S − Sₖ.
In the previous example, we can appreciate that the matrix of the diagonal elements of E is the difference between the matrix of the diagonal elements of S and H:

diagonal(E) = diagonal(S) − diagonal(H).
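These relations are easy to verify numerically. The following is a minimal numpy sketch of the decomposition S = Sₖ + E, using synthetic data in place of the Example 8.1 measurements (which are not reproduced here):

```python
import numpy as np

# Synthetic stand-in for the Example 8.1 data (n cases x d variables);
# the actual dataset is not reproduced here.
rng = np.random.default_rng(0)
x = rng.normal(size=(100, 3)) @ np.array([[1.0, 0.8, 0.2],
                                          [0.0, 0.6, 0.4],
                                          [0.0, 0.0, 0.5]])

S = np.cov(x, rowvar=False)          # d x d covariance matrix
lam, U = np.linalg.eigh(S)           # eigenvalues in ascending order
idx = np.argsort(lam)[::-1]          # re-sort in descending order
lam, U = lam[idx], U[:, idx]

k = 2                                # number of retained components
Uk, lam_k = U[:, :k], lam[:k]

S_k = Uk @ np.diag(lam_k) @ Uk.T     # approximate covariance (8.20)
E = S - S_k                          # error matrix
H = np.diag(np.diag(S_k))            # diagonal communality matrix

# diagonal(E) = diagonal(S) - diagonal(H), as stated above
print(np.allclose(np.diag(E), np.diag(S) - np.diag(H)))   # → True
```
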
In factor analysis, one searches for a solution of equation 8.20 such that E is a diagonal matrix, i.e., one tries to obtain uncorrelated errors from the component estimation process. In this case, denoting by D the matrix of the diagonal elements of S, we have:

S = Sₖ + (D − H).    8.22
In order to cope with different units of the original variables, it is customary to carry out the factor analysis on correlation matrices:

R = Rₖ + (I − H).    8.23
There are several algorithms for finding factor analysis solutions, which basically improve current estimates of communalities and factors according to a specific criterion (for details see e.g. Jackson JE, 1991). One such algorithm, known as principal factor analysis, starts with an initial estimate of the communalities, e.g. the multiple R-square of the respective variable with all other variables (see formula 7.10), and uses a principal component strategy to iteratively obtain improved estimates of the communalities and factors.
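As a sketch of this iterative scheme (not the exact routine of STATISTICA or any other package), the principal factor algorithm can be written in a few lines of numpy; the one-factor correlation matrix at the end is only for illustration:

```python
import numpy as np

def principal_factor(R, k, n_iter=50):
    """Principal factor analysis of a correlation matrix R with k factors:
    iteratively re-estimate communalities and factor loadings."""
    # Initial communalities: multiple R-square of each variable with
    # all the others, R2_i = 1 - 1/(R^-1)_ii (cf. formula 7.10).
    h2 = 1.0 - 1.0 / np.diag(np.linalg.inv(R))
    for _ in range(n_iter):
        Rr = R.copy()
        np.fill_diagonal(Rr, h2)                  # reduced correlation matrix
        lam, U = np.linalg.eigh(Rr)
        idx = np.argsort(lam)[::-1][:k]           # k largest eigenvalues
        L = U[:, idx] * np.sqrt(np.maximum(lam[idx], 0))   # loadings
        h2 = np.sum(L**2, axis=1)                 # updated communalities
    return L, h2

# Illustration: a correlation matrix generated by exactly one common
# factor with loadings (0.9, 0.8, 0.7).
f = np.array([0.9, 0.8, 0.7])
R = np.outer(f, f)
np.fill_diagonal(R, 1.0)
L, h2 = principal_factor(R, k=1)
print(np.round(h2, 2))   # communalities close to f**2 = [0.81 0.64 0.49]
```

For this idealised one-factor matrix the iteration recovers the generating loadings; with real data the factors only approximate the correlation structure, leaving the residuals in E.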
In principal component analysis, the principal components are directly computed from the data. In factor analysis, the common factors are estimates of unobservable variables, called latent variables, which model the data in such a way that the remaining errors are uncorrelated. Equation 8.19 then expresses the observations x in terms of the latent variables zₖ and uncorrelated errors e. The true values of the observations x, before any error has been added, are values of the so-called manifest variables.
The main benefits of factor analysis when compared with principal component analysis are the non-correlation of the residuals and the invariance of the solutions with respect to scale change.
After finding a factor analysis solution, it is still possible to perform a new transformation that rotates the factors in order to achieve special effects, for example aligning the factors with the directions of maximum variability (varimax procedure).
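A compact implementation of the varimax rotation (the standard SVD-based formulation of Kaiser's criterion; a sketch, not STATISTICA's own routine) is:

```python
import numpy as np

def varimax(L, n_iter=100, tol=1e-8):
    """Rotate a p x k loadings matrix L by an orthogonal matrix T that
    maximises the varimax criterion (variance of the squared loadings)."""
    p, k = L.shape
    T = np.eye(k)
    crit_old = 0.0
    for _ in range(n_iter):
        A = L @ T                                   # current rotated loadings
        # SVD step of the varimax objective (Kaiser, 1958)
        B = L.T @ (A**3 - A * (np.sum(A**2, axis=0) / p))
        U, s, Vt = np.linalg.svd(B)
        T = U @ Vt
        crit = np.sum(s)
        if crit - crit_old < tol:
            break
        crit_old = crit
    return L @ T

# Rotation is orthogonal, so the reproduced correlations L L' are unchanged:
L0 = np.array([[0.8, 0.3],
               [0.7, 0.4],
               [0.2, 0.9],
               [0.3, 0.8]])
Lr = varimax(L0)
print(np.allclose(Lr @ Lr.T, L0 @ L0.T))   # → True
```

Because only an orthogonal rotation is applied, communalities and residuals are preserved; what changes is the interpretability of the individual factors.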
Example 8.9
Q: Redo Example 8.8 using principal factor analysis with the communalities computed by the multiple R square method.
A: The correlation matrix is:

R = | 1      0.945 |
    | 0.945  1     |
Starting with communalities = multiple R² = 0.893, STATISTICA (Communalities = Multiple R²) converges to a solution with communalities h² = 0.919 for both variables. For unit length eigenvectors, the corresponding factor loadings are √0.919 ≈ 0.959. Thus:

R₁ + (I − H) = | 1      0.919 |
               | 0.919  1     |

We see that the residual cross-correlations are only 0.945 − 0.919 = 0.026.
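The figures in this example are easy to reproduce with a single principal-factoring pass over the reduced correlation matrix (a numpy sketch):

```python
import numpy as np

R = np.array([[1.0, 0.945],
              [0.945, 1.0]])

h2 = 0.893                           # starting communalities (multiple R-square)
Rr = R.copy()
np.fill_diagonal(Rr, h2)             # reduced correlation matrix

lam, U = np.linalg.eigh(Rr)
j = np.argmax(lam)                   # single retained factor
loadings = U[:, j] * np.sqrt(lam[j])

R1 = np.outer(loadings, loadings)    # reproduced correlations
print(np.round(loadings**2, 3))      # communalities: [0.919 0.919]
print(round(R[0, 1] - R1[0, 1], 3))  # residual cross-correlation: 0.026
```
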
Example 8.10
Q: Redo Example 8.7 using principal factor analysis and varimax rotation.
A: Using STATISTICA with Communalities = Multiple R² checked (see Figure 8.4), in order to apply formula 8.21, we obtain the solution shown in Figure 8.11. The varimax procedure is selected in the Factor rotation box included in the Loadings tab (after clicking OK in the window shown in Figure 8.4).
The rock dataset projected onto the factor plane shown in Figure 8.11 leads us to the same conclusions as in Example 8.7, stressing the opposition SiO2-CaO and “aligning” the factors in a way that facilitates the interpretation of the data structure.
Figure 8.11. Partial view of the rock dataset projected onto the F1-F2 factor plane, after varimax rotation, overlaid with the factor loadings plot.
Exercises
8.1 Consider the standardised electrical impedance features of the Breast Tissue dataset and perform the following principal component analyses:
a) Check that only two principal components are needed to explain the data according to the Guttman-Kaiser, broken stick and Velicer criteria.
b) Determine which of the original features are highly correlated to the principal components found in a).
c) Using a scatter plot of the pc-scores, check that the {ADI, CON} class set is separated from all other classes by the first principal component only, whereas the discrimination of the carcinoma class requires the two principal components. (Compare with the results of Examples 6.17 and 6.18.)
d) Redo Example 6.16 using the principal components as classifying features. Compare the classification results with those obtained previously.
8.2 Perform a principal component analysis of the correlation matrix of the chemical and grading features of the Clays dataset, showing that:
a) The scree plot has a slow decay after the first eigenvalue. The Velicer criterion indicates that only the first two eigenvalues should be retained.
b) The pc correlations show that the first principal component reflects the silica-alumina content of the clays; the second principal component reflects the lime content; and the third principal component reflects the grading.
c) The scatter plot of the pc-scores of the first two principal components indicates a good discrimination of the two clay types (holocenic and pliocenic).
8.3 Redo the previous Exercise 8.2 using principal factor analysis. Show that only the first factor has a high loading with the original features, namely the alumina content of the clays.
8.4 Design a classifier for the first two classes of the Cork Stoppers dataset using the main principal components of the data. Compare the classification results with those obtained in Example 6.4.
8.5 Consider the CTG dataset with 2126 cases of foetal heart rate (FHR) features computed in normal, suspect and pathological FHR tracings (variable NSP). Perform a principal component analysis using the feature set {LB, ASTV, MSTV, ALTV, MLTV, WIDTH, MIN, MAX, MODE, MEAN, MEDIAN, V} containing continuous-type features.
a) Show that the two main principal components computed for the standardised features satisfy the broken-stick criterion.
b) Obtain a pc correlation plot superimposed onto the pc-scores plot and verify that: first, there is a quite good discrimination of the normal vs. pathological cases, with the suspect cases blending into the normal and pathological clusters; and second, that there are two pathological clusters, one related to a variability feature (MSTV) and the other related to FHR histogram features.
8.6 Using principal factor analysis, determine which original features are the most important in explaining the variance of the Firms dataset. Also compare the principal factor solution with the principal component solution of the standardised features and determine whether either solution is capable of conveniently describing the activity branch of the firms.
8.7 Perform a principal component and a principal factor analysis of the standardised features BASELINE, ACELRATE, ASTV, ALTV, MSTV and MLTV of the FHR-Apgar dataset, checking the following results:
a) The principal factor analysis affords a univariate explanation of the data variance related to the FHR variability features ASTV and ALTV, whereas the principal component analysis affords an explanation requiring three components. Also check the scree plots.
b) The pc-score plots of the factor analysis solution afford an interpretation of the Apgar index. For this purpose, use the varimax rotation and plot the categorised data using three classes for the Apgar at 1 minute after birth (Apgar1: ≤5; >5 and ≤8; >8) and two classes for the Apgar at 5 minutes after birth (Apgar5: ≤8; >8).
8.8 Redo the previous Exercise 8.7 for the standardised features EF, CK, IAD and GRD of the Infarct dataset showing that the principal component solution affords an explanation of the data based on only one factor highly correlated with the ejection fraction, EF. Check the discrimination capability of this factor for the necrosis severity score SCR > 2 (high) and SCR < 2 (low).
8 Data Structure Analysis
8.9 Consider the Stock Exchange dataset. Using principal factor analysis, determine which economic variable best explains the variance of the whole data.
8.10 Using Hotelling’s T² control chart for the wines of the Wines dataset, determine which wines are “out of control” at the 95% confidence level and present an explanation for this fact taking into account the values of the variables highly correlated with the principal components. Use only variables without missing data for the computation of the principal components.
8.11 Perform a principal factor analysis of the wine data studied in the previous Exercise 8.10 showing that there are two main factors, one highly correlated to the GLU-THR variables and the other highly correlated to the PHE-LYS variables. Use varimax rotation and analyse the clustering of the white and red wines in the factor plane superimposed onto the factor loading plane.
8.12 Redo the principal factor analysis of Example 8.10 using three factors and varimax rotation. With the help of a 3D plot interpret the results obtained checking that the three factors are related to the following original variables: SiO2-Al2O3-CaO (silica-lime factor), AAPN-AAOA (porosity factor) and RMCS-RCSG (resistance factor).