
5.3.2 Tests for Two Paired Samples

Commands 5.9. SPSS, STATISTICA, MATLAB and R commands used to perform non-parametric tests on two paired samples.

STATISTICA Statistics; Nonparametrics; Comparing two dependent samples (variables)

SPSS Analyze; Nonparametric Tests; 2 Related Samples

MATLAB [p,h,stats]=signrank(x,y,alpha) ; [p,h,stats]=signtest(x,y,alpha)

R mcnemar.test(x) | mcnemar.test(x,y)
wilcox.test(x,y,paired=TRUE)

5.3.2.1 The McNemar Change Test

The McNemar change test is particularly suitable for “before and after” experiments, in which each case can be in either of two categories or responses and is used as its own control. The test addresses the issue of deciding whether or not the change of response is due to chance. Let the responses be denoted by the + and – signs and a change denoted by an arrow, →. The test is formalised as:


H0: After the treatment, P(+ → –) = P(– → +);

H1: After the treatment, P(+ → –) ≠ P(– → +).

Let us use a 2 × 2 table for recording the before and after situations, as shown in Figure 5.5. We see that a change occurs in situations A and D, i.e., the number of cases that change response is A + D. If both changes of response are equally likely, the expected count in both cells is (A + D)/2.

The McNemar test uses the following test statistic:

χ² = Σ_{i=1}^{2} (O_i – E_i)² / E_i ,    5.34

where O_i and E_i are, respectively, the observed and expected counts in the two change cells (A and D).

The sampling distribution of this test statistic, when the null hypothesis is true, is asymptotically the chi-square distribution with df = 1. A continuity correction is often used, especially for small absolute frequencies, in order to make the computation of significances more accurate.

An alternative to using the chi-square test is to use the binomial test. One would then consider the sample with n = A + D cases, and assess the null hypothesis that the probabilities of both changes are equal to ½.
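A minimal R sketch of this binomial alternative is given below; the change counts A and D are hypothetical values used only for illustration:

A <- 5                          # hypothetical number of + -> - changes
D <- 15                         # hypothetical number of - -> + changes
binom.test(A, A + D, p = 0.5)   # H0: both changes equally likely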

(2 × 2 table: rows Before, columns After)

Figure 5.5. Table for the McNemar change test, where A, B, C and D are cell counts.

Example 5.16

Q: Consider that in an enquiry into consumer preferences of two products A and B, a group of 57 out of 160 persons preferred product A before reading a study of a consumer protection organisation. After reading the study, 8 persons that had preferred product A and 21 persons that had preferred product B changed opinion. Is it possible to accept, at a 5% level, that the change of opinion was due to chance?

A: Table 5.21a shows the respective data in a convenient format for analysis with STATISTICA or SPSS. The column “Number” should be used for weighting the cases corresponding to the cells of Figure 5.5, with “1” denoting product A and “2” denoting product B. Case weighting was already used in section 5.1.2.


Table 5.21b shows the results of the test; at a 5% significance level, we reject the null hypothesis that the change of opinion was due to chance. In R the test is run (with the same results) as follows:

> x <- array(c(49,21,8,82),dim=c(2,2))
> mcnemar.test(x)

Table 5.21. (a) Data of Example 5.16 in an adequate format for running the McNemar test with STATISTICA or SPSS; (b) Results of the test obtained with SPSS.

(a)
Before  After  Number
1       1      49
1       2      8
2       1      21
2       2      82

(b) SPSS output fields: BEFORE & AFTER; N; Asymp. Sig.

5.3.2.2 The Sign Test

The sign test compares two paired samples (x1, y1), (x2, y2), …, (xn, yn), using the sign of the respective differences: (x1 – y1), (x2 – y2), …, (xn – yn), i.e., using a set of dichotomous values (+ and – signs), to which the binomial test described in section 5.1.2 can be applied in order to assess the truth of the null hypothesis:

H0: P(xi > yi) = P(xi < yi) = ½.

Note that the null hypothesis can also be stated in terms of the differences xi – yi, by setting their median to zero. Prior to applying the binomial test, all cases with tied decisions, xi = yi, are removed from the analysis, and the sample size, n, is adjusted accordingly. The null hypothesis is rejected if too few differences of one sign occur.
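As a minimal sketch, this procedure can be carried out in R with binom.test applied to the signs of the paired differences (the function name sign_test below is ours, for illustration only):

sign_test <- function(x, y) {
  d <- x - y
  d <- d[d != 0]                        # remove tied cases
  binom.test(sum(d > 0), length(d), p = 0.5)
}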

The power-efficiency of the test is about 95% for n = 6, decreasing towards 63% for very large n. Although there are more powerful tests for paired data, an important advantage of the sign test is its broad applicability to ordinal data. Namely, when the magnitude of the differences cannot be expressed as a number, the sign test is the only possible alternative.

Example 5.17

Q: Consider the Metal Firms’ dataset containing several performance indices of a sample of eight metallurgic firms (see Appendix E). Use the sign test in order to analyse the following comparisons: a) leadership teamwork (TW) vs. leadership commitment to quality improvement (CI), b) management of critical processes (MC) vs. management of alterations (MA). Discuss the results.


A: All variables are of ordinal type, measured on a 1 to 5 scale. One must note, however, that the numeric values of the variables cannot be taken at face value. One could just as well use a scale of A to E or use “very poor”, “poor”, “fair”, “good” and “very good”. Thus, the sign test is the only appropriate two-sample comparison test here.

Running the test with STATISTICA, SPSS or MATLAB yields observed one-tailed significances of 0.0625 and 0.5 for comparisons (a) and (b), respectively. Thus, at a 5% significance level, we do not reject the null hypothesis of comparable distributions either for the TW-CI pair or for the MC-MA pair.

Let us analyse in detail the sign test results for the TW-CI pair of variables. The respective ranks are:

We see that there are 4 ties (marked with 0) and 4 positive differences TW – CI. Figure 5.6a shows the binomial distribution of the number k of negative differences for n = 4 and p = ½. The probability of obtaining as few as zero negative differences TW – CI, under H0, is (½)^4 = 0.0625.
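This probability can be checked directly with the binomial probability function in R (a one-line sketch):

> dbinom(0, size = 4, prob = 0.5)    # (1/2)^4 = 0.0625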

We now consider the MC-MA comparison. The respective ranks are:

Figure 5.6. Binomial distributions for the sign tests in Example 5.17: a) TW-CI pair, under H0; b) MC-MA pair, under H0; c) MC-MA pair for the alternative hypothesis H1: P(MC < MA) = ¼.

Figure 5.6b shows the binomial distribution of the number of negative differences for n = 7 and p = ½. The probability of obtaining at most 3 negative differences MC – MA, under H0, is ½, given the symmetry of the distribution. The critical value of the negative differences, k = 1, corresponds to a Type I Error of α = 8/128 ≈ 0.0625.


Let us now determine the Type II Error for the alternative hypothesis “positive differences occur three times more often than negative differences”. In this case, the distributions of MC and MA are not identical; the distribution of MC favours higher ranks than the distribution of MA. Figure 5.6c shows the binomial distribution for this situation, with p = P(MC < MA) = ¼. We clearly see that, in this case, the probability of obtaining at most 3 negative differences MC – MA increases. The Type II Error for the critical value k = 1 is the sum of all probabilities for k ≥ 2, which amounts to β = 0.56. Even if we relax the α level to 0.23 for a critical value k = 2, we still obtain a high Type II Error, β = 0.24. This low power of the binomial test, already mentioned in 5.1.2, renders the conclusions for small sample sizes quite uncertain.
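The α and β values above can be reproduced with the binomial distribution function in R; a short sketch:

n <- 7                                     # non-tied MC - MA differences
alpha_k1 <- pbinom(1, n, prob = 0.5)       # Type I Error for k = 1: 0.0625
alpha_k2 <- pbinom(2, n, prob = 0.5)       # relaxed level for k = 2: ~0.23
beta_k1  <- 1 - pbinom(1, n, prob = 0.25)  # Type II Error for k = 1: ~0.56
beta_k2  <- 1 - pbinom(2, n, prob = 0.25)  # Type II Error for k = 2: ~0.24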

Example 5.18

Q: Consider the FHR dataset containing measurements of basal heart rate frequency (beats per minute) made on 51 foetuses (see Appendix E). Use the sign test in order to assess whether the measurements performed by an automatic system (SPB) are comparable to the computed average (denoted AEB) of the measurements performed by three human experts.

A: There is a clear lack of fit of the distributions of SPB and AEB to the normal distribution. A non-parametric test has, therefore, to be used here. The sign test results, obtained with STATISTICA are shown in Table 5.22. At a 5% significance level, we do not reject the null hypothesis of equal measurement performance of the automatic system and the “average” human expert.

Table 5.22. Sign test results obtained with STATISTICA for the SPB-AEB comparison (FHR dataset).

(Columns: No. of Non-Ties; Percent v < V; p-level.)

5.3.2.3 The Wilcoxon Signed Ranks Test

The Wilcoxon signed ranks test uses the magnitude of the differences di = xi – yi, which the sign test disregards. One can, therefore, expect an enhanced power-efficiency of this test, which is in fact asymptotically 95.5% when compared with its parametric counterpart, the t test. The test ranks the di’s according to their magnitude, assigning a rank of 1 to the di with smallest magnitude, the rank of 2 to the next smallest magnitude, etc. As with the sign test, xi and yi ties (di = 0) are removed from the dataset. If there are ties in the magnitude of the differences, these are assigned the average of the ranks that would have been assigned without ties. Finally, each rank gets the sign of the respective difference. For the MC and MA variables of Example 5.17, the ranks are computed as:

Signed Ranks: 3 –3 3 3 –6.5 3 –6.5

Note that all the magnitude 1 differences are tied; we, therefore, assign the average of the ranks from 1 to 5, i.e., 3. Magnitude 2 differences are assigned the average rank (6+7)/2 = 6.5.
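As a small R sketch, the signed ranks (and the statistic T+ defined below) can be obtained from the seven non-zero differences read off the signed ranks above:

d <- c(1, -1, 1, 1, -2, 1, -2)                 # non-zero MC - MA differences
signed_ranks <- sign(d) * rank(abs(d))         # tied magnitudes get the average rank
signed_ranks                                   # 3 -3 3 3 -6.5 3 -6.5
T_plus <- sum(signed_ranks[signed_ranks > 0])  # sum of the positive ranks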

The Wilcoxon test uses the test statistic:

T+ = sum of the ranks of the positive di.    5.36

The rationale is that under the null hypothesis − samples are from the same population or from populations with the same median − one expects that the sum of the ranks for positive di will balance the sum of the ranks for negative di. Tables of the sampling distribution of T+ for small samples can be found in the literature. For large samples (say, n > 15), the sampling distribution of T+ converges asymptotically, under the null hypothesis, to a normal distribution with mean μ = n(n+1)/4 and variance σ² = n(n+1)(2n+1)/24.

A test procedure similar to the t test can then be applied in the large sample case. Note that instead of T+ the test can also use T–, the sum of the negative ranks.
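A minimal R sketch of this large-sample procedure, using the mean and variance given above and ignoring the variance correction for tied ranks (the function name is ours, for illustration):

wilcoxon_large_sample <- function(d) {
  d <- d[d != 0]                        # drop zero differences
  n <- length(d)
  T_plus <- sum(rank(abs(d))[d > 0])    # sum of the positive ranks
  mu <- n * (n + 1) / 4
  sigma <- sqrt(n * (n + 1) * (2 * n + 1) / 24)
  z <- (T_plus - mu) / sigma
  2 * pnorm(-abs(z))                    # approximate two-sided p-value
}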

Table 5.23. Wilcoxon test results obtained with SPSS for the SPB-AEB comparison (FHR dataset) in Example 5.19: a) ranks, b) significance based on negative ranks.

(a) Ranks of AE − SP (columns: N, Mean Rank, Sum of Ranks): Negative Ranks, N = 18; Positive Ranks; Total, N = 51.
(b) Asymp. Sig. (2-tailed).


Example 5.19

Q: Redo the two-sample comparison of Example 5.18, using the Wilcoxon signed ranks test.

A: The Wilcoxon test results obtained with SPSS are shown in Table 5.23. At a 5% significance level, we reject the null hypothesis of equal measurement performance of the automatic system and the “average” human expert. Note that the conclusion is different from the one reached using the sign test in Example 5.18.

In R the command wilcox.test(SPB, AEB, paired = TRUE) yields the same “p-value”. □

Example 5.20

Q: Estimate the power of the Wilcoxon test performed in Example 5.19 and the needed value of n for reaching a power of at least 90%.

A: We estimate the power of the Wilcoxon test using the concept of power-efficiency (see formula 5.1). Since Example 5.19 involves a large sample (n = 51), the power-efficiency of the Wilcoxon test is about 95.5%.

Figure 5.7a shows the STATISTICA specification window for the dependent samples t test. The values filled in are the sample means and sample standard deviations of the two samples, as well as the correlation between them. The “Alpha” value is the previous two-tailed observed significance (see Table 5.22).

The value of n, using formula 5.1, is n = nA = 0.955 × 51 ≈ 49. STATISTICA computes a power of 76% for these specifications.

The power curve shown in Figure 5.7b indicates that the parametric test reaches a power of 90% for nA = 70. Therefore, for the Wilcoxon test we need a sample size of nB = 70/0.955 ≈ 73 for the same power. □
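The arithmetic behind this power-efficiency adjustment can be sketched in R, assuming, as in formula 5.1, that PE = nA/nB relates the parametric sample size nA to the Wilcoxon sample size nB:

PE <- 0.955                   # asymptotic power-efficiency of the Wilcoxon test
nB <- 51                      # Wilcoxon sample size in Example 5.19
nA <- PE * nB                 # equivalent t-test sample size, ~49
nA_target <- 70               # t-test size giving 90% power (Figure 5.7b)
nB_target <- nA_target / PE   # ~73 cases needed for the Wilcoxon test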

Figure 5.7. Determining the power for a two-paired samples t test, with STATISTICA: a) Specification window, b) Power curve dependent on n.
