4.4.1 Testing a Correlation
When analysing two associated sample variables, one is often interested in knowing whether the sample provides enough evidence that the respective random variables are correlated. For instance, in data classification, when two variables are correlated and their correlation is high, one may contemplate the possibility of discarding one of the variables, since a highly correlated variable only conveys redundant information.
Let ρ represent the true value of the Pearson correlation mentioned in section 2.3.4. The correlation test is formalised as:

H0: ρ = 0, H1: ρ ≠ 0, for a two-sided test.

For a one-sided test the alternative hypothesis is:

H1: ρ > 0 or H1: ρ < 0.
Let r represent the sample Pearson correlation when the null hypothesis is verified and the sample size is n. Furthermore, assume that the random variables are normally distributed. Then, the (r.v. corresponding to the) following test statistic:
t* = r √(n – 2) / √(1 – r²),    4.6
has a Student’s t distribution with n – 2 degrees of freedom.
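The statistic in equation 4.6 is simple to compute directly. The following sketch (Python, used here purely for illustration; it is not one of the book's tools) implements the formula:

```python
import math

def corr_test_statistic(r, n):
    """Test statistic of equation 4.6: t* = r*sqrt(n - 2)/sqrt(1 - r^2),
    which follows a Student's t distribution with n - 2 degrees of freedom
    under H0 (and assuming normally distributed variables)."""
    if n < 3 or not -1 < r < 1:
        raise ValueError("need n >= 3 and |r| < 1")
    return r * math.sqrt(n - 2) / math.sqrt(1 - r * r)

# Values of Example 4.5: r = -0.5281, n = 16
print(round(corr_test_statistic(-0.5281, 16), 4))  # about -2.3269
```

Note that the statistic grows with both |r| and n: a weak correlation can still be significant in a large sample.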
The Pearson correlation test can be performed as part of the computation of correlations with SPSS and STATISTICA. It can also be performed using the Correlation Test sheet of Tools.xls (see Appendix F) or the Probability Calculator; Correlations of STATISTICA (see also Commands 4.2).
Example 4.5
Q: Consider the variables PMax and T80 of the meteorological dataset ( Meteo) for the “moderate” category of precipitation (PClass = 2) as defined in 2.1.2. We then have n = 16 measurements of the maximum precipitation and the maximum temperature during 1980, respectively. Is there evidence, at α = 0.05, of a negative correlation between these two variables?
A: The distributions of PMax and T80 for “moderate” precipitation are reasonably well approximated by the normal distribution (see section 5.1). The sample correlation is r = –0.53. Thus, the test statistic is:
r = –0.53, n = 16 ⇒ t* = –2.33.
Since t14,0.05 = –1.76, the value of t* falls in the critical region ]–∞, –1.76]; therefore, the null hypothesis is rejected, i.e., there is evidence of a negative correlation between PMax and T80 at that level of significance. Note that the observed significance of t* is 0.0176, below α.
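The observed significance quoted in the text can be checked by evaluating the Student's t left-tail probability numerically. The sketch below (Python with composite Simpson integration; an illustration, not the book's code) recovers a one-sided p of about 0.018 for t* = –2.3269 with 14 degrees of freedom:

```python
import math

def t_pdf(x, df):
    """Student's t probability density with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)

def t_cdf_left(t, df, lo=-40.0, steps=4000):
    """P(T <= t), approximated by Simpson's rule on [lo, t];
    lo is taken far enough in the left tail to be negligible."""
    h = (t - lo) / steps
    s = t_pdf(lo, df) + t_pdf(t, df)
    for i in range(1, steps):
        s += t_pdf(lo + i * h, df) * (4 if i % 2 else 2)
    return s * h / 3

# Example 4.5: t* = -2.3269, 14 degrees of freedom
p_one_sided = t_cdf_left(-2.3269, 14)
print(round(p_one_sided, 3))       # about 0.018 (one-sided)
print(round(2 * p_one_sided, 4))   # about 0.0355 (two-sided, as in the R output)
```

Doubling the one-sided tail gives the two-sided p-value reported by R's cor.test below.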
4 Parametric Tests of Hypotheses
Commands 4.2. SPSS, STATISTICA, MATLAB and R commands used to perform the correlation test.

SPSS        Analyze; Correlate; Bivariate
STATISTICA  Statistics; Basic Statistics and Tables; Correlation Matrices
            Probability Calculator; Correlations
MATLAB      [r,t,tcrit] = corrtest(x,y,alpha)
R           cor.test(x, y, conf.level = 0.95, ...)
As mentioned above, the Pearson correlation test can be performed as part of the computation of correlations with SPSS and STATISTICA. It can also be performed with the Correlations option of the STATISTICA Probability Calculator.
MATLAB does not have a correlation test function. We do provide, however, a function for that purpose, corrtest (see Appendix F). Assuming that the column vectors pmax, t80 and pclass described in 2.1.2.3 are available, Example 4.5 would be solved as:
>> [r,t,tcrit] = corrtest(pmax(pclass==2),t80(pclass==2),0.05)
r =
   -0.5281
t =
   -2.3268
tcrit =
   -1.7613
The correlation test can be performed in R with the function cor.test. In Commands 4.2 we only show the main arguments of this function. As usual, by default conf.level=0.95. Example 4.5 would be solved as:
> cor.test(T80[Pclass==2],Pmax[Pclass==2])

        Pearson’s product-moment correlation

data:  T80[Pclass == 2] and Pmax[Pclass == 2]
t = -2.3268, df = 14, p-value = 0.0355
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 -0.81138702 -0.04385491
sample estimates:
       cor
-0.5280802
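The confidence interval printed by cor.test is based on Fisher's z-transformation of the correlation coefficient: z = atanh(r) is approximately normal with standard error 1/√(n – 3). A minimal sketch of that computation (Python, shown only to illuminate the formula behind R's output):

```python
import math
from statistics import NormalDist

def pearson_ci(r, n, conf_level=0.95):
    """Approximate confidence interval for a Pearson correlation via
    Fisher's z-transformation: atanh(r) ~ Normal(atanh(rho), 1/(n - 3))."""
    z = math.atanh(r)
    se = 1 / math.sqrt(n - 3)
    q = NormalDist().inv_cdf(0.5 + conf_level / 2)  # e.g. 1.96 for 95%
    return math.tanh(z - q * se), math.tanh(z + q * se)

# Example 4.5: r = -0.5280802, n = 16
low, high = pearson_ci(-0.5280802, 16)
print(round(low, 4), round(high, 4))  # about -0.8114 and -0.0439, as in R
```

Since the whole interval lies below zero, it agrees with the rejection of H0 at the 5% level for the two-sided test.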
4.4 Inference on Two Populations
As a final comment, we draw the reader’s attention to the fact that correlation is by no means synonymous with causality. As a matter of fact, when two variables
X and Y are correlated, one of the following situations can happen:
– One of the variables is the cause and the other is the effect. For instance, if X = “nr of forest fires per year” and Y = “area of burnt forest per year”, then one usually finds that X is correlated with Y, since Y is the effect of X.
– Both variables have an indirect cause. For instance, if X = “% of persons daily arriving at a Hospital with yellow-tainted fingers” and Y = “% of persons daily arriving at the same Hospital with pulmonary carcinoma”, one finds that X is correlated with Y, but neither is cause nor effect. Instead, there is another variable that is the cause of both − volume of inhaled tobacco smoke.
– The correlation is fortuitous and there is no causal link. For instance, one may happen to find a correlation between X = “% of persons with blue eyes per household” and Y = “% of persons preferring radio to TV per household”. It would, however, be meaningless to infer causality between the two variables.