Non-parametric Correlation Coefficients

8.4 Non-parametric Correlation Coefficients

As seen in the chapters on group differences, common non-parametric analyses are of ranked ordinal data in place of the original data. The application to ranked data results in a statistic that is more resistant to outliers than the corresponding parametric statistic, and also does not assume underlying normality. The same principles apply to the two non-parametric correlation coefficients provided by R, the Spearman and Kendall coefficients. To invoke either of these coefficients, add method="spearman" or method="kendall" to the calls to the ScatterPlot and Correlation functions.

method="spearman" option: Spearman correlation.

method="kendall" option: Kendall correlation.

8.4.1 Spearman Correlation

The Spearman correlation coefficient is the application of the Pearson formula directly to the ranked data. A perfect Spearman correlation results when both sets of ranked data align perfectly, which occurs when each person has the same rank on each of the two variables. Because the data are expressed as ranks, any transformation that preserves ranks of the values of one of the variables leaves the Spearman correlation unchanged. As such, the Spearman correlation also applies to nonlinear relationships.

Spearman correlation: Pearson correlation of ranks.
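The definition can be checked directly with base R, outside of lessR. The following is a minimal sketch that assumes the base R functions cor() and rank() and uses made-up values for x and y with no ties; it is not the lessR Correlation function itself.

```r
# Hypothetical data with no tied values (illustration only)
x <- c(2.1, 3.5, 1.2, 8.4, 5.0, 6.7, 4.4)
y <- c(10, 19, 12, 40, 22, 35, 15)

# Spearman coefficient as computed by base R
rho <- cor(x, y, method = "spearman")

# The Pearson formula applied to the ranks gives the same value
rho_ranks <- cor(rank(x), rank(y))   # default method is "pearson"

# A rank-preserving transformation of one variable, here exp(),
# leaves the Spearman coefficient unchanged
rho_exp <- cor(exp(x), y, method = "spearman")

print(c(rho = rho, rho_ranks = rho_ranks, rho_exp = rho_exp))
```

All three printed values should agree, because each calculation depends only on the ranks of the data values.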

If the relationship between two normally distributed variables is linear, the Pearson and Spearman coefficients tend to approximate each other. When the relationship is nonlinear but preserves order, or when outliers are present, the two coefficients tend to diverge. Consider a variable and a transformed variable in which each of the original data values is raised to the third power. The Pearson correlation between the two variables can be considerably less than 1, yet the corresponding Spearman correlation is exactly 1.0. As previously defined, two variables are related when, as one variable increases, the other tends to either increase or decrease. The Spearman coefficient assesses this relationship more generally because it does not require linearity.
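The third-power example can be sketched with simulated data. The sketch assumes base R's rnorm() and cor() in place of the lessR functions, and the seed and the sample size of 25 are arbitrary choices for illustration.

```r
set.seed(123)            # arbitrary seed so the sketch is reproducible
X  <- rnorm(25)          # simulated standard normal values
X3 <- X^3                # cubing preserves the order of the values

r_pearson  <- cor(X, X3)                       # affected by the curvature
r_spearman <- cor(X, X3, method = "spearman")  # depends only on the ranks

print(c(pearson = r_pearson, spearman = r_spearman))
```

Because cubing does not change the ordering of the values, the ranks of X and X3 are identical and the Spearman coefficient is 1, whereas the Pearson coefficient falls below 1 by an amount that varies with the simulated sample.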

Calculate the Spearman coefficient in place of the default Pearson coefficient.

lessR Input  Calculate the Spearman correlation coefficient
> Correlation(Years, Salary, method="spearman")

The usual name for the Spearman coefficient is rho, which appears in the output in Listing 8.10. The output includes not only the descriptive correlation coefficient in the sample, rho = 0.80, but also a hypothesis test that the population value is zero, or, more precisely, that the two variables are not related. Reject the null hypothesis of no relation.

Test of Spearman population correlation of 0: p-value = 0.000 < α = 0.05, reject H0


Spearman's rank correlation rho

Years, Years Employed in the Company
Salary, Annual Salary (USD)

Number of paired values with neither missing, n: 36
Number of cases (rows of data) deleted: 1

Sample Correlation of Years and Salary: rho = 0.800

Alternative Hypothesis: True rho is not equal to 0
  S-value: 1553.770, p-value: 0.000

Listing 8.10 Analysis of the Spearman correlation coefficient for Years and Salary.
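The S-value in the listing can be related to rho. The sketch below assumes that the reported S is the statistic returned by the base R function cor.test(), which the text does not state explicitly. In cor.test(), S and rho are linked by rho = 1 − 6S / (n(n² − 1)), and without tied values S equals the sum of the squared differences between the two sets of ranks. The data values are hypothetical, not the Years and Salary data.

```r
# Hypothetical paired data without ties (not the Years and Salary data)
x <- c(1, 3, 4, 6, 8, 9, 12, 15)
y <- c(31, 35, 33, 48, 44, 60, 52, 71)
n <- length(x)

res <- cor.test(x, y, method = "spearman")

S   <- unname(res$statistic)   # test statistic, labelled S by cor.test()
rho <- unname(res$estimate)    # sample Spearman coefficient

# Without ties, S is the sum of squared rank differences
d <- rank(x) - rank(y)
S_by_hand <- sum(d^2)

# rho recovered from S
rho_from_S <- 1 - 6 * S / (n * (n^2 - 1))

print(c(S = S, S_by_hand = S_by_hand, rho = rho, rho_from_S = rho_from_S))

# The same identity applied to the values reported in Listing 8.10
rho_listing <- 1 - 6 * 1553.770 / (36 * (36^2 - 1))
print(round(rho_listing, 3))
```

Applied to the listing, the identity returns the reported rho of 0.800 to three decimal places, which is consistent with the assumption about where the S-value comes from.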

8.4.2 Kendall Correlation

The Kendall correlation coefficient is based on a direct analysis of what are called concordant pairs. Consider any two pairs of data values, (X_i, Y_i) and (X_j, Y_j). If X_i − X_j and Y_i − Y_j have the same sign then the pair of data values is called concordant. Similarly, if X_i − X_j and Y_i − Y_j have the opposite sign, the pair of data values is called discordant. If the corresponding value of Y always increases as the value of X increases, all pairs of data values are concordant. Similarly, for an inverse relationship, if Y always decreases as X increases, all pairs of data values are discordant.

The numerator of the Kendall correlation coefficient is the number of concordant pairs minus the number of discordant pairs of data values. To normalize this result so that the resulting coefficient lies between −1 and 1, divide this value by the number of all possible pairs, n(n − 1)/2, where n is the sample size. Achieve the maximum value +1 if all n(n − 1)/2 pairs are concordant, and achieve the minimum value −1 if all pairs are discordant.

concordant pair of data values: The two values for each variable change in the same direction.

discordant pair of data values: The values for each variable change in opposite directions.

Kendall correlation: Based on number of concordant and discordant pairs of data values.
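The counting of concordant and discordant pairs can be sketched by hand and compared with base R. The example assumes the base R functions combn() and cor() and uses hypothetical data without ties. With tied values, cor() with method="kendall" applies an adjusted denominator, so the simple formula described above would no longer match it exactly.

```r
# Hypothetical paired data without ties (illustration only)
x <- c(1, 2, 3, 4, 5, 6)
y <- c(2, 1, 4, 3, 6, 5)
n <- length(x)

# All n(n - 1)/2 pairs of cases, one pair per column
idx <- combn(n, 2)
i <- idx[1, ]
j <- idx[2, ]

# The product of the two differences is positive for a concordant pair
# and negative for a discordant pair
s <- sign((x[i] - x[j]) * (y[i] - y[j]))
n_concordant <- sum(s > 0)
n_discordant <- sum(s < 0)

# Concordant minus discordant, divided by the number of possible pairs
tau_by_hand <- (n_concordant - n_discordant) / (n * (n - 1) / 2)
tau_base_r  <- cor(x, y, method = "kendall")

print(c(concordant = n_concordant, discordant = n_discordant,
        tau_by_hand = tau_by_hand, tau_base_r = tau_base_r))
```

For these values, 12 of the 15 pairs are concordant and 3 are discordant, so both calculations give (12 − 3)/15 = 0.6.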

To illustrate, return to the example for the correlation matrix of Pearson correlations in Figure 8.11. Now generate the corresponding matrix for the same variables but with Kendall correlation coefficients and store in mycor.

lessR Input  Correlation matrix with Kendall correlations
> mycor <- Correlation(c(m06, m07, m09, m10, m15, m02), method="kendall")

The matrix excerpted from the Correlation output appears in Listing 8.11 .

      m06   m07   m09   m10   m15   m02
m06  1.00  0.47  0.24  0.34 -0.14 -0.06
m07  0.47  1.00  0.30  0.36 -0.15 -0.09
m09  0.24  0.30  1.00  0.27 -0.17 -0.18
m10  0.34  0.36  0.27  1.00 -0.17 -0.22
m15 -0.14 -0.15 -0.17 -0.17  1.00  0.22
m02 -0.06 -0.09 -0.18 -0.22  0.22  1.00

Listing 8.11 Correlation matrix of Kendall correlations.


In this example, the Kendall correlations in Listing 8.11 are approximately the same as the corresponding Pearson correlations in Figure 8.11. The largest discrepancy occurs for the largest correlation, between Items m06 and m07. The Pearson correlation is 0.52 and the Kendall correlation is 0.05 lower, at 0.47.

Worked Problems

1 Refer to the Cars93 data set, which is part of lessR.

?Cars93 for more information.

> mydata <- Read("Cars93", format="lessR")

(a) Obtain the scatter plot and correlation for MPGcity and MPGhiway. Comment.

(b) Calculate the correlation matrix and scatter plot for the three prices for each car: MinPrice, MidPrice and MaxPrice. Comment.

(c) From the correlation matrix of all numeric variables, which five variables are most correlated with MPGcity?

(d) Why is the scatter plot matrix of all numeric variables not useful?

2 Compare the usual Pearson correlation with the corresponding non-parametric Spearman and Kendall correlations.

(a) Create a data vector X of 25 values of simulated data values from a random normal distribution with a mean of 0 and a standard deviation of 1. Create a second data vector, X3, which consists of the cubed values of X.

(b) Generate the scatter plot of X and X3. Comment. Is it linear?

(c) Calculate the Pearson correlation coefficient of X and X3, as well as the Spearman and Kendall correlation coefficients.

(d) Compare and account for the values of the three correlation coefficients.

CHAPTER 9

REGRESSION I