Measures of Association for Ordinal Variables

2.3.5 Measures of Association for Ordinal Variables

2.3.5.1 The Spearman Rank Correlation

When dealing with ordinal data, the correlation coefficient described previously can be computed in a simplified way. Consider the ordinal variables X and Y with ranks between 1 and N. It seems natural to measure the lack of agreement between X and Y by means of the difference of the ranks, $d_i = x_i - y_i$, for each data pair $(x_i, y_i)$.

Using these differences we can express 2.18 as:

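$$ r = \frac{\sum_{i=1}^{N} x_i^2 + \sum_{i=1}^{N} y_i^2 - \sum_{i=1}^{N} d_i^2}{2\sqrt{\sum_{i=1}^{N} x_i^2 \, \sum_{i=1}^{N} y_i^2}} \tag{2.20} $$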

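Assuming the values of $x_i$ and $y_i$ are ranked from 1 through N, that there are no tied ranks in any variable, and that the sums of squares are taken about the mean rank (N + 1)/2, we have:

$$ \sum_{i=1}^{N} x_i^2 = \sum_{i=1}^{N} y_i^2 = \frac{N^3 - N}{12} . $$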
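The sum of squares used here can be checked directly, assuming $x_i$ denotes the deviation of the ith rank from the mean rank (N + 1)/2 (the same reasoning applies to $y_i$):

$$
\begin{aligned}
\sum_{i=1}^{N}\left(i-\frac{N+1}{2}\right)^{2}
&= \frac{N(N+1)(2N+1)}{6}-\frac{N(N+1)^{2}}{4} \\
&= \frac{N(N+1)\,\bigl[\,2(2N+1)-3(N+1)\,\bigr]}{12}
 = \frac{N(N+1)(N-1)}{12}
 = \frac{N^{3}-N}{12} .
\end{aligned}
$$

For instance, with N = 3 the deviations are −1, 0 and 1, whose squares sum to 2 = (27 − 3)/12.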

Applying this result to 2.20, the following Spearman’s rank correlation (also known as rank correlation coefficient) is derived:

70 2 Presenting and Summarising the Data

$$ r_s = 1 - \frac{6 \sum_{i=1}^{N} d_i^2}{N (N^2 - 1)} \tag{2.21} $$
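As a minimal sketch of formula 2.21, the following Python function computes the rank correlation from two sequences of untied ranks. The function name `spearman_no_ties` and the example ranks are illustrative assumptions, not part of the book's software.

```python
def spearman_no_ties(x_ranks, y_ranks):
    """Spearman rank correlation, r_s = 1 - 6*sum(d_i^2) / (N*(N^2 - 1)).

    Assumes both inputs hold the ranks 1..N with no ties.
    """
    n = len(x_ranks)
    if n != len(y_ranks) or n < 2:
        raise ValueError("need two rank sequences of equal length >= 2")
    # Sum of squared rank differences d_i = x_i - y_i.
    sum_d2 = sum((x - y) ** 2 for x, y in zip(x_ranks, y_ranks))
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))


# Hypothetical ranks, chosen only to illustrate the computation.
x = [1, 2, 3, 4, 5]
y = [2, 1, 4, 3, 5]
print(spearman_no_ties(x, y))
```

Here the rank differences are −1, 1, −1, 1 and 0, so the sum of squared differences is 4 and the result is 1 − 24/120 = 0.8. Identical rankings give 1 and exactly reversed rankings give −1.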

When tied ranks occur (i.e., two or more cases receive the same rank on the same variable), each of those cases is assigned the average of the ranks that would have been assigned had no ties occurred. When the proportion of tied ranks is small, formula 2.21 can still be used. Otherwise, the following correction factor is computed:

$$ T = \sum_{i=1}^{g} \left( t_i^3 - t_i \right) , $$

where g is the number of groupings of different tied ranks and $t_i$ is the number of tied ranks in the ith grouping. The Spearman’s rank correlation with correction for tied ranks is now written as:

$$ r_s = \frac{(N^3 - N) - 6 \sum_{i=1}^{N} d_i^2 - (T_x + T_y)/2}{\sqrt{(N^3 - N)^2 - (T_x + T_y)(N^3 - N) + T_x T_y}} , $$

where $T_x$ and $T_y$ are the correction factors for the variables X and Y, respectively.
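The tie-corrected computation can be sketched in Python as follows. This is an illustration under stated assumptions, not the book's implementation: the helper names `average_ranks`, `tie_correction` and `spearman_tied` are hypothetical, the correction factor is taken as the sum of t³ − t over the tie groups, and the data are made up for the example.

```python
from collections import Counter
from math import sqrt


def average_ranks(values):
    """Rank values from 1..N; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the group while the next sorted value equals the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        avg = (start + end) / 2 + 1  # mean of the 1-based positions start+1..end+1
        for k in range(start, end + 1):
            ranks[order[k]] = avg
        start = end + 1
    return ranks


def tie_correction(ranks):
    """Correction factor T: sum of (t^3 - t) over groups of t tied ranks."""
    return sum(t ** 3 - t for t in Counter(ranks).values() if t > 1)


def spearman_tied(x, y):
    """Spearman rank correlation with the correction for tied ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    m = n ** 3 - n
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    tx, ty = tie_correction(rx), tie_correction(ry)
    numerator = m - 6 * sum_d2 - (tx + ty) / 2
    # The denominator is zero if all values of one variable are tied.
    denominator = sqrt(m * m - (tx + ty) * m + tx * ty)
    return numerator / denominator


# Hypothetical data with one tie in x (the two values equal to 2).
print(spearman_tied([1, 2, 2, 3], [1, 2, 3, 4]))
```

With no ties, both correction factors are zero and the expression reduces to formula 2.21. With ties, the result coincides with the ordinary correlation coefficient computed on the averaged ranks, which offers a convenient cross-check.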

Table 2.10. Contingency table obtained with SPSS of the NC, PRTGC variables (cork stopper dataset).

[Table body not recovered. Layout: rows NC = 0, 1, 2, 3 and Total; columns by PRTGC category; each cell reports Count and % of Total.]

Q: Compute the rank correlation for the variables N and PRTG of the Cork Stopper dataset, using two new variables, NC and PRTGC, which rank N and PRTG into 4 categories according to whether their values fall into the 1st, 2nd, 3rd or 4th quartile intervals.

2.3 Summarising the Data

A: The new variables NC and PRTGC can be computed using formulas similar to the formula used in 2.1.6 for computing PClass. Specifically for NC, given the values of the three N quartiles, 59 (25%), 78.5 (50%) and 95 (75%), respectively, NC coded in {0, 1, 2, 3} is computed as:

NC = (N>59)+(N>78.5)+(N>95)
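The same coding can be sketched in Python, where each comparison contributes 1 when true. The function name `quartile_code` and the sample values are hypothetical and serve only to illustrate the formula; the cut points are the three N quartiles quoted above.

```python
def quartile_code(value, q1, q2, q3):
    """Code a value into {0, 1, 2, 3} by counting the quartile cut points it exceeds."""
    return int(value > q1) + int(value > q2) + int(value > q3)


# Quartiles of N quoted in the text.
N_QUARTILES = (59, 78.5, 95)

# Hypothetical N values, chosen only to illustrate the coding.
sample_n = [40, 59, 60, 78.5, 80, 96]
nc = [quartile_code(v, *N_QUARTILES) for v in sample_n]
print(nc)  # [0, 0, 1, 1, 2, 3]
```

Because the comparisons are strict, a value exactly equal to a quartile falls into the lower category.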

The corresponding contingency table is shown in Table 2.10. Note that NC and PRTGC are ordinal variables since their ranks do indeed satisfy an order relation. The rank correlation coefficient computed for this table (see Commands 2.10) is 0.715, which agrees fairly well with the 0.72 correlation computed for the corresponding continuous variables, as shown in Table 2.9.

2.3.5.2 The Gamma Statistic

Another measure of association for ordinal variables is based on a comparison of the values of both variables, X and Y, for all possible pairs of cases (x, y). Pairs of cases can be:

– Concordant (in rank order): The values of both variables for one case are higher (or are both lower) than the corresponding values for the other case. For instance, in Table 2.10 (X = NC; Y = PRTGC), the pair {(0, 0), (2, 1)} is concordant.

– Discordant (in rank order): The value of one variable for one case is higher than the corresponding value for the other case, and the direction is reversed for the other variable. For instance, in Table 2.10, the pair {(0, 2), (3, 1)} is discordant.

– Tied (in rank order): The two cases have the same value on one or on both variables. For instance, in Table 2.10, the pair {(1, 2), (3, 2)} is tied.
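The three categories above can be told apart by the sign of the product of the differences in X and in Y. The following Python sketch (the function name `classify_pair` is an assumption made for illustration) applies that rule to the three example pairs quoted from Table 2.10.

```python
def classify_pair(case_a, case_b):
    """Classify a pair of cases (x, y) as concordant, discordant or tied."""
    (xa, ya), (xb, yb) = case_a, case_b
    # Same sign of both differences -> concordant; opposite signs -> discordant;
    # a zero difference on either variable -> tied.
    product = (xa - xb) * (ya - yb)
    if product > 0:
        return "concordant"
    if product < 0:
        return "discordant"
    return "tied"


print(classify_pair((0, 0), (2, 1)))  # concordant
print(classify_pair((0, 2), (3, 1)))  # discordant
print(classify_pair((1, 2), (3, 2)))  # tied
```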

The following γ measure of association (gamma coefficient) is defined:

$$ \gamma = \frac{P(\text{Concordant}) - P(\text{Discordant})}{1 - P(\text{Tied})} = \frac{P(\text{Concordant}) - P(\text{Discordant})}{P(\text{Concordant}) + P(\text{Discordant})} $$

Let P and Q represent the total counts for the concordant and discordant cases, respectively. A point estimate of γ is then:

$$ G = \frac{P - Q}{P + Q} , \tag{2.24} $$

with P and Q computed from the counts $n_{ij}$ (of table cell ij) of a contingency table with r rows and c columns, as follows:

$$ P = \sum_{i=1}^{r-1} \sum_{j=1}^{c-1} n_{ij} N^{+}_{ij} ; \qquad Q = \sum_{i=1}^{r-1} \sum_{j=2}^{c} n_{ij} N^{-}_{ij} , \tag{2.25} $$


where $N^{+}_{ij}$ is the sum of all counts below and to the right of the ijth cell, and $N^{-}_{ij}$ is the sum of all counts below and to the left of the ijth cell.
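A minimal Python sketch of this computation follows. It is not the `gammacoef` function supplied with the book: the name `gamma_from_table` is an assumption, and the small tables are hypothetical ones chosen to show the limiting cases.

```python
def gamma_from_table(table):
    """Point estimate G = (P - Q) / (P + Q) from a contingency table.

    `table` is a list of rows of counts n_ij (r rows, c columns).
    """
    r, c = len(table), len(table[0])
    p = q = 0
    for i in range(r):
        for j in range(c):
            # N+_ij: counts below and to the right of cell ij.
            n_plus = sum(table[k][l] for k in range(i + 1, r) for l in range(j + 1, c))
            # N-_ij: counts below and to the left of cell ij.
            n_minus = sum(table[k][l] for k in range(i + 1, r) for l in range(j))
            p += table[i][j] * n_plus   # concordant pairs
            q += table[i][j] * n_minus  # discordant pairs
    if p + q == 0:
        raise ValueError("no concordant or discordant pairs; G is undefined")
    return (p - q) / (p + q)


# Hypothetical 2x2 tables.
print(gamma_from_table([[2, 0], [0, 3]]))  # all counts on the main diagonal
print(gamma_from_table([[0, 2], [3, 0]]))  # all counts on the other diagonal
print(gamma_from_table([[1, 1], [1, 1]]))  # concordant and discordant pairs balance
```

The loops run over every cell for simplicity; cells in the last row, the last column (for P) or the first column (for Q) contribute nothing because the corresponding sums are empty. The three calls give 1, −1 and 0, respectively.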

The gamma measure varies, as does the correlation coefficient, in the interval [−1, 1]. It will be 1 if all the frequencies lie on the main diagonal of the table (from the upper left corner to the lower right corner), as for all cases where there are no discordant contributions (see Figure 2.27a). It will be –1 if all the frequencies lie on the other diagonal of the table, and also for all cases where there are no concordant contributions (see Figure 2.27b). Finally, it will be zero when the concordant contributions balance the discordant ones.

The G value for the example of Table 2.10 is 0.785. We will see in Chapter 5 the significance of the G statistic.

There are other measures of association similar to the gamma coefficient that are applicable to ordinal data. For more details the reader can consult e.g. (Siegel S, Castellan Jr NJ, 1988).

Commands 2.10. SPSS, STATISTICA, MATLAB and R commands used to obtain measures of association for ordinal variables.

SPSS        Analyze; Descriptive Statistics; Crosstabs
STATISTICA  Statistics; Basic Statistics/Tables; Tables and Banners; Options
MATLAB      corrcoef(x) ; gammacoef(t)
R           cor(x) ; gammacoef(t)

Measures of association for ordinal variables are obtained in SPSS and STATISTICA as a result of applying contingency table analysis with the commands listed in Commands 5.7.

MATLAB Statistics toolbox and R stats package do not provide a function for computing the gamma statistic. We provide, however, MATLAB and R functions for that purpose in the book CD (see Appendix F).

Figure 2.27. Examples of contingency table formats for: a) G = 1 ($N^{-}_{ij}$ cells are shaded gray); b) G = –1 ($N^{+}_{ij}$ cells are shaded gray).
