Measures of Association for Continuous Variables

2.3.4 Measures of Association for Continuous Variables

The correlation coefficient is the most popular measure of association for continuous type data. For a dataset with two variables, X and Y, the sample estimate of the correlation coefficient $\rho_{XY}$ (see definition in A.8.2) is computed as:

$$ r \equiv r_{XY} = \frac{s_{XY}}{s_X s_Y}, \tag{2.18} $$


where $s_{XY}$, the sample covariance of X and Y, is computed as:

$$ s_{XY} = \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) / (n - 1). \tag{2.19} $$
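Formulas 2.18 and 2.19 can be checked by hand on a small sample. The following is a minimal sketch in plain Python (not one of the tools used in this book); the function names and the data values are illustrative assumptions, not taken from the cork stopper dataset.

```python
import math

def sample_covariance(x, y):
    """Sample covariance s_XY as in formula 2.19 (divisor n - 1)."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length, at least 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    return sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / (n - 1)

def sample_correlation(x, y):
    """Sample correlation r_XY = s_XY / (s_X * s_Y) as in formula 2.18."""
    s_x = math.sqrt(sample_covariance(x, x))  # sample standard deviation of X
    s_y = math.sqrt(sample_covariance(y, y))  # sample standard deviation of Y
    return sample_covariance(x, y) / (s_x * s_y)

# Illustrative values only (hypothetical data).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

s_xy = sample_covariance(x, y)
r = sample_correlation(x, y)
print(f"s_XY = {s_xy:.4f}, r = {r:.4f}")
```

For these values the deviations from the means give a sum of cross-products of 6, so s_XY = 6/4 = 1.5, and r = 1.5 / sqrt(2.5 × 1.5) ≈ 0.77.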

Note that the correlation coefficient (also known as the Pearson correlation) is a dimensionless measure of the degree of linear association of two random variables, with values in the interval [−1, 1], where:

0: No linear association (X and Y are linearly uncorrelated);

1: Total linear association, with X and Y varying in the same direction;

−1: Total linear association, with X and Y varying in the opposite direction.

Figure 2.26 shows scatter plots exemplifying several situations of correlation. Figure 2.26f illustrates a situation where, although there is an evident association between X and Y, the correlation coefficient fails to measure it since X and Y are not linearly associated.
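The limiting cases can be reproduced numerically. The sketch below, in plain Python with made-up data (it does not use the datasets behind Figure 2.26), shows an exactly increasing linear relation, an exactly decreasing one, and a purely quadratic relation on values symmetric about zero, for which the correlation coefficient vanishes even though Y is completely determined by X.

```python
import math

def pearson_r(x, y):
    """Pearson correlation; the (n - 1) divisors cancel in the ratio."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    s_xx = sum((a - mean_x) ** 2 for a in x)
    s_yy = sum((b - mean_y) ** 2 for b in y)
    return s_xy / math.sqrt(s_xx * s_yy)

# Hypothetical x values, symmetric about zero.
x = [-2.0, -1.0, 0.0, 1.0, 2.0]

r_increasing = pearson_r(x, [2.0 * xi + 1.0 for xi in x])  # total linear association, same direction
r_decreasing = pearson_r(x, [-xi for xi in x])             # total linear association, opposite direction
r_quadratic = pearson_r(x, [xi ** 2 for xi in x])          # strong but non-linear association

print(r_increasing, r_decreasing, r_quadratic)
```

The quadratic case gives a zero cross-product sum because the positive and negative deviations of X pair with equal deviations of Y, so the linear measure sees no association.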

Note that, as described in Appendix A (section A.8.2), adding a constant to either or both variables, or multiplying either or both by a non-zero constant, does not change the magnitude of the correlation coefficient. Only a change of sign can occur, when exactly one of the multiplying constants is negative.
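This invariance is easy to verify numerically. The following sketch, in plain Python with hypothetical data and arbitrarily chosen constants, compares the correlation before and after shifting and rescaling the variables.

```python
import math

def pearson_r(x, y):
    """Pearson correlation; the (n - 1) divisors cancel in the ratio."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    s_xx = sum((a - mean_x) ** 2 for a in x)
    s_yy = sum((b - mean_y) ** 2 for b in y)
    return s_xy / math.sqrt(s_xx * s_yy)

# Hypothetical data.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

r_original = pearson_r(x, y)

# Positive multipliers plus shifts: r is unchanged.
r_positive = pearson_r([3.0 * xi + 10.0 for xi in x], [0.5 * yi - 2.0 for yi in y])

# Exactly one negative multiplier: same magnitude, opposite sign.
r_one_negative = pearson_r([-3.0 * xi + 10.0 for xi in x], [0.5 * yi - 2.0 for yi in y])

# Two negative multipliers: the sign changes cancel.
r_two_negative = pearson_r([-3.0 * xi + 10.0 for xi in x], [-0.5 * yi - 2.0 for yi in y])

print(r_original, r_positive, r_one_negative, r_two_negative)
```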

The correlation coefficients can be arranged, in general, into a symmetrical correlation matrix, where each element is the correlation coefficient of the respective column and row variables.
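As an illustration of how such a matrix is assembled, the sketch below builds it in plain Python from named columns. The variable names A, B and C and their values are hypothetical placeholders, not the cork stopper variables; in practice MATLAB's corrcoef or R's cor would be used, as listed in Commands 2.9.

```python
import math

def pearson_r(x, y):
    """Pearson correlation; the (n - 1) divisors cancel in the ratio."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    s_xx = sum((a - mean_x) ** 2 for a in x)
    s_yy = sum((b - mean_y) ** 2 for b in y)
    return s_xy / math.sqrt(s_xx * s_yy)

def correlation_matrix(columns):
    """Return (names, matrix), where matrix[i][j] is the correlation
    between the i-th (row) and j-th (column) variables."""
    names = list(columns)
    matrix = [[pearson_r(columns[a], columns[b]) for b in names] for a in names]
    return names, matrix

# Hypothetical data: three variables measured on five cases.
data = {
    "A": [1.0, 2.0, 3.0, 4.0, 5.0],
    "B": [2.0, 1.0, 4.0, 3.0, 5.0],
    "C": [5.0, 3.0, 4.0, 1.0, 2.0],
}

names, matrix = correlation_matrix(data)
for name, row in zip(names, matrix):
    print(name, " ".join(f"{value:6.2f}" for value in row))
```

The printed matrix is symmetric, since r_XY = r_YX, and has ones on the main diagonal, since every variable is perfectly correlated with itself.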

Table 2.9. Correlation matrix of five variables of the cork stopper dataset (rows and columns: N, ART, PRT, ARTG, PRTG).

Q: Compute the correlation matrix of the following five variables of the Cork Stoppers’ dataset: N, ART, PRT, ARTG, PRTG.

A: Table 2.9 shows the (symmetric) correlation matrix corresponding to the five variables of the cork stopper dataset (see Commands 2.9). Notice that the main diagonal elements (from the upper left corner to the lower right corner) are all equal to one. In a later chapter, we will learn how to correctly interpret the correlation values displayed.


In multivariate problems, concerning datasets described by n random variables, $X_1, X_2, \ldots, X_n$, one sometimes needs to assess the degree of association of two variables, say $X_1$ and $X_2$, under the hypothesis that they are linearly estimated by the remaining n – 2 variables. For this purpose, the correlation $\rho_{X_1 X_2}$ is defined in terms of the marginal distributions of $X_1$ or $X_2$ given the other variables, and is then called the partial correlation of $X_1$ and $X_2$ given the other variables. Details on partial correlations are postponed to Chapter 7.
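For the simplest case of a single controlling variable, the partial correlation can be obtained from the three pairwise correlations with the standard first-order formula $r_{12 \cdot 3} = (r_{12} - r_{13} r_{23}) / \sqrt{(1 - r_{13}^2)(1 - r_{23}^2)}$. The sketch below, in plain Python, is only an illustration under that assumption; the function name and the input correlations are hypothetical, and the general case with several controlling variables is the one treated in Chapter 7.

```python
import math

def partial_correlation_first_order(r12, r13, r23):
    """Partial correlation of X1 and X2 given a single variable X3,
    computed from the three pairwise correlations."""
    denominator = math.sqrt((1.0 - r13 ** 2) * (1.0 - r23 ** 2))
    if denominator == 0.0:
        raise ValueError("X3 is perfectly correlated with X1 or X2")
    return (r12 - r13 * r23) / denominator

# Hypothetical pairwise correlations, chosen only for illustration.
r12, r13, r23 = 0.8, 0.6, 0.7
r12_given_3 = partial_correlation_first_order(r12, r13, r23)
print(f"r12 = {r12:.2f}, r12 given X3 = {r12_given_3:.2f}")

# When X3 is uncorrelated with both variables, nothing is removed.
r_unchanged = partial_correlation_first_order(r12, 0.0, 0.0)
```

With these inputs the partial correlation is lower than the plain correlation, because part of the association between X1 and X2 is shared with X3.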

Figure 2.26. Sample correlation values for different datasets: a) r = 1; b) r = –1; c) r = 0; d) r = 0.81; e) r = –0.21; f) r = 0.04.


STATISTICA and SPSS can compute partial correlations, as indicated in Commands 2.9. For the previous example, the partial correlation of PRTG and ARTG, given PRT and ART, is 0.79. We see, therefore, that PRT and ART can “explain” about 20% of the high correlation (0.99) between PRTG and ARTG.

Another measure of association for continuous variables is the multiple correlation coefficient, which measures the degree of association of one variable Y in relation to a set of variables, $X_1, X_2, \ldots, X_n$, that linearly “predict” Y. Details on multiple correlation will be postponed to Chapter 7.
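For the special case of two predictors, the multiple correlation coefficient can be written in terms of pairwise correlations using the standard identity $R^2 = (r_{Y1}^2 + r_{Y2}^2 - 2 r_{Y1} r_{Y2} r_{12}) / (1 - r_{12}^2)$. The following plain Python sketch assumes that special case only; the function name and the input values are hypothetical.

```python
import math

def multiple_correlation_two_predictors(r_y1, r_y2, r_12):
    """Multiple correlation of Y with two predictors X1 and X2,
    computed from the pairwise correlations."""
    if abs(r_12) >= 1.0:
        raise ValueError("the two predictors must not be perfectly correlated")
    r_squared = (r_y1 ** 2 + r_y2 ** 2 - 2.0 * r_y1 * r_y2 * r_12) / (1.0 - r_12 ** 2)
    return math.sqrt(r_squared)

# Hypothetical pairwise correlations, chosen only for illustration.
big_r = multiple_correlation_two_predictors(0.6, 0.5, 0.3)
print(f"R = {big_r:.4f}")

# With uncorrelated predictors, R^2 is the sum of the squared correlations.
big_r_uncorrelated = multiple_correlation_two_predictors(0.6, 0.5, 0.0)
```

The multiple correlation is at least as large as the stronger of the two individual correlations with Y, since adding a predictor cannot worsen the best linear prediction.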

Commands 2.9. SPSS, STATISTICA, MATLAB and R commands used to obtain measures of association for continuous variables.

SPSS          Analyze; Correlate; Bivariate | Partial
STATISTICA    Statistics; Basic Statistics/Tables; Correlation matrices (Quick | Advanced; Partial Correlations)
MATLAB        corrcoef(x) ; cov(x)
R             cor(x,y) ; cov(x,y)

Partial correlations are computed in MATLAB and R as part of the regression functions (see Chapter 7).