
4.2 Test Errors and Test Power

As described in the previous section, any decision derived from hypothesis testing has, in general, a certain degree of uncertainty. For instance, in the drill example there is always a chance that the null hypothesis is incorrectly rejected. Suppose that a sample of the good-quality drills has a mean lifetime of x̄ = 1190 hours. Then, as can be seen in Figure 4.1, we would incorrectly reject the null hypothesis at a 10% significance level. However, we would not reject the null hypothesis at a 5% level, as shown in Figure 4.3. In general, by lowering the chosen level of significance, typically 0.1, 0.05 or 0.01, we decrease the Type I Error:

Type I Error: α = P(H0 is true and, based on the test, we reject H0).

The price to be paid for the decrease of the Type I Error is the increase of the Type II Error, defined as:

Type II Error: β = P(H0 is false and, based on the test, we accept H0).

For instance, when in Figures 4.1 and 4.3 we decreased α from 0.10 to 0.05, the value of β increased from 0.10 to:

β = P(Z ≥ (x_α − µ_A)/σ_X̄) = P(Z ≥ (1172.8 − 1100)/77.94) = 0.177.

Note that a high value of β indicates that when the observed statistic does not fall in the critical region there is a good chance that this is not because the null hypothesis is true but, instead, because a sufficiently close alternative hypothesis is true. Figure 4.4 shows that, for the same level of significance, α, as the alternative hypothesis approaches the null hypothesis, the value of β increases, reflecting a decreased protection against an alternative hypothesis.

The degree of protection against alternative hypotheses is usually measured by the so-called power of the test, 1 − β, which measures the probability of rejecting the null hypothesis when it is false (and thus should be rejected). The values of the power for several alternative values of µ_A, using the computed values of β as shown above, are displayed in Table 4.1. The respective power curve, also called operational characteristic of the test, is shown with a solid line in Figure 4.5. Note that the power for the alternative hypothesis µ_A = 1100 is somewhat higher than 80%. This is usually considered a lower limit of protection that one must have against alternative hypotheses.


Figure 4.4. Increase of the Type II Error, β, for fixed α, when the alternative hypothesis approaches the null hypothesis.

Table 4.1. Type II Error and power for several alternative hypotheses of the drill example, with n = 12 and α = 0.05.

In general, for a given test and sample size, n, there is always a trade-off between decreasing α and decreasing β. In order to increase the power of a test for a fixed level of significance, one is compelled to increase the sample size. For the drill example, let us assume that the sample size is doubled, n = 24. The true standard deviation of the sample mean is then reduced by a factor of √2, i.e., σ_X̄ = 55.11. The distributions corresponding to the hypotheses are now more peaked; informally speaking, the hypotheses are better separated, allowing a smaller Type II Error for the same level of significance. Let us confirm this. The new decision threshold is now:


x_α = µ_B − 1.64 × σ_X̄ = 1300 − 1.64 × 55.11 = 1209.6,

which, compared with the previous threshold, deviates less from µ_B. The value of β for µ_A = 1100 is now:

β = P(Z ≥ (x_α − µ_A)/σ_X̄) = P(Z ≥ (1209.6 − 1100)/55.11) = 0.023.

Therefore, the power of the test improved substantially, to 98%. Table 4.2 lists values of the power for several alternative hypotheses. The new power curve is shown with a dotted line in Figure 4.5. For increasing values of the sample size n, the power curve becomes steeper, allowing a higher degree of protection against alternative hypotheses even for small deviations from the null hypothesis.
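These normal-based calculations are easy to reproduce with the R stats functions qnorm and pnorm. The following sketch assumes the drill example values µ_A = 1100, µ_B = 1300 and σ = 270, and uses the exact 5% quantile 1.645 instead of the rounded 1.64 used above, so the last digits differ slightly from those in the text; the helper function power.drill is just for illustration.

# Normal-based threshold, Type II Error and power for the drill example
# (assumed values: mu_A = 1100, mu_B = 1300, sigma = 270, alpha = 0.05).
mu_A <- 1100; mu_B <- 1300; sigma <- 270; alpha <- 0.05
power.drill <- function(n) {
  s.xbar  <- sigma / sqrt(n)                      # std. deviation of the sample mean
  x.alpha <- mu_B - qnorm(1 - alpha) * s.xbar     # decision threshold
  beta    <- 1 - pnorm((x.alpha - mu_A) / s.xbar) # Type II Error, P(Z >= (x.alpha - mu_A)/s.xbar)
  c(threshold = x.alpha, beta = beta, power = 1 - beta)
}
power.drill(12)   # threshold ~ 1172, beta ~ 0.18, power ~ 0.82
power.drill(24)   # threshold ~ 1209, beta ~ 0.02, power ~ 0.98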


Figure 4.5. Power curve for the drill example, with α = 0.05 and two values of the sample size n.

Table 4.2. Type II Error and power for several alternative hypotheses of the drill example, with n = 24 and α = 0.05.

STATISTICA and SPSS have specific modules (Power Analysis and SamplePower, respectively) for performing power analysis for several types of tests. The R stats package also has a few functions for power calculations. Figure 4.6 illustrates the power curve obtained with STATISTICA for the last example. The power is displayed in terms of the standardised effect, E_s, which measures the deviation of the alternative hypothesis from the null hypothesis, normalised by the standard deviation, as follows:

E_s = |µ_A − µ_B| / σ.    4.1

For instance, for n = 24 the protection against µ_A = 1100 corresponds to a standardised effect of (1300 − 1100)/270 = 0.74, and the power graph of Figure 4.6 indicates a value of about 0.94 for E_s = 0.74. The difference from the previous value of 0.98 in Table 4.2 is due to the fact that, as already mentioned, STATISTICA uses the Student's t distribution.
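A similar value can be obtained with the power.t.test function of the R stats package. The call below is only a sketch: it assumes that the STATISTICA computation corresponds to a two-sided one-sample t test with the drill example values; with a one-sided alternative the result is closer to the 0.98 computed above with the normal distribution.

# Power of a one-sample t test for the drill example (assumed settings:
# difference 1300 - 1100 = 200, sigma = 270, n = 24, alpha = 0.05, two-sided).
power.t.test(n = 24, delta = 200, sd = 270, sig.level = 0.05,
             type = "one.sample", alternative = "two.sided")
# the reported power is close to the 0.94 read off Figure 4.6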


Figure 4.6. Power curve obtained with STATISTICA for the drill example with α = 0.05 and n = 24.

Cohen (Cohen, 1983) provides some guidance on how to qualify the standardised effect:

Small effect size: E_s = 0.2.
Medium effect size: E_s = 0.5.
Large effect size: E_s = 0.8.

In the example we have been discussing, we are in the presence of a large effect size. As the effect size becomes smaller, one needs a larger sample size in order to obtain a reasonable power. For instance, imagine that the alternative hypothesis had precisely the same value as the sample mean, i.e., µ_A = 1260. In this case, the standardised effect is very small, E_s = 0.148. For this reason, we obtain very small values of the power for n = 12 and n = 24 (see the power for µ_A = 1250 in Tables 4.1 and 4.2). In order to “resolve” such close values (1260 and 1300) with low errors α and β, we need, of course, a much higher sample size. Figure 4.7 shows how the power evolves with the sample size in this example, for the fixed standardised effect E_s = −0.148 (the curve is independent of the sign of E_s). As can be appreciated, in order for the power to rise above 80%, we need n > 350.
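power.t.test can also solve for the sample size needed to reach a given power, by specifying the power and leaving n unspecified. The sketch below assumes the same two-sided one-sample t test settings as before; the resulting n is of the same order as the n > 350 read off Figure 4.7.

# Sample size needed for 80% power with the small standardised effect
# |1300 - 1260| / 270 = 0.148 (assumed two-sided one-sample t test, alpha = 0.05).
power.t.test(delta = 1300 - 1260, sd = 270, sig.level = 0.05,
             power = 0.80, type = "one.sample", alternative = "two.sided")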

Note that in the previous examples we have assumed alternative hypotheses that are always on one side of the null hypothesis: the mean lifetime of the lower quality drills. We then have a situation of one-sided or one-tail tests. We could equally well contemplate alternative hypotheses of drills with better quality than the one corresponding to the null hypothesis. We would then have to deal with two-sided or two-tail tests. For the drill example, a two-sided test is formalised as:

H0: µ = µ_B.
H1: µ ≠ µ_B.

We will deal with two-sided tests in the following sections. For two-sided tests the power curve is symmetric. For instance, for the drill example, the two-sided power curve would include the reflection of the curves of Figure 4.5 around the point corresponding to the null hypothesis, µ_B.
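This symmetry can be illustrated with a short normal-based sketch (assumed values σ = 270, n = 24, α = 0.05; the function name power.two.sided is just for illustration): the power at alternatives placed symmetrically around µ_B is the same.

# Normal-based power of the two-sided test H0: mu = 1300 at an alternative mean mu1
# (assumed sigma = 270, n = 24, alpha = 0.05).
power.two.sided <- function(mu1, mu0 = 1300, sigma = 270, n = 24, alpha = 0.05) {
  s.xbar <- sigma / sqrt(n)
  z.crit <- qnorm(1 - alpha / 2)
  # reject when the sample mean falls below mu0 - z.crit*s.xbar or above mu0 + z.crit*s.xbar
  pnorm(mu0 - z.crit * s.xbar, mean = mu1, sd = s.xbar) +
    (1 - pnorm(mu0 + z.crit * s.xbar, mean = mu1, sd = s.xbar))
}
power.two.sided(1100)   # power at an alternative 200 hours below mu_B
power.two.sided(1500)   # same value, at the mirror-image alternative 200 hours above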


Figure 4.7. Evolution of the power with the sample size for the drill example, obtained with STATISTICA, with α = 0.05 and E_s = −0.148.

A difficulty with tests of hypotheses is the selection of sensible values for α and β. In practice, there are two situations in which tests of hypotheses are applied:

1. The reject-support (RS) data analysis situation

This is by far the most common situation. The data analyst states H1 as his belief, i.e., he seeks to reject H0. In the drill example, the manufacturer of the new type of drills would formalise the test in an RS fashion if he wanted to claim that the new brand is better than brand A:


H0: µ ≤ µ_A = 1100.
H1: µ > µ_A.

Figure 4.8 illustrates this one-sided, single mean test. The manufacturer is interested in a high power. In other words, he wants the probability of wrongly deciding H0 (against his belief) when H1 is true to be very low. In the case of the drills, for a sample size n = 24 and α = 0.05, the power is 90% for the alternative µ = x̄, as illustrated in Figure 4.8. A power above 80% is often considered adequate to detect a reasonable departure from the null hypothesis.
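The 90% figure can be checked with the same normal-based reasoning as before; the sketch below assumes σ = 270, n = 24, α = 0.05 and the sample mean x̄ = 1260 mentioned earlier.

# RS test H0: mu <= 1100 vs H1: mu > 1100 (assumed sigma = 270, n = 24, alpha = 0.05).
s.xbar  <- 270 / sqrt(24)
x.alpha <- 1100 + qnorm(0.95) * s.xbar          # reject H0 for sample means above this threshold
1 - pnorm(x.alpha, mean = 1260, sd = s.xbar)    # power at the alternative mu = 1260: ~0.90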

On the other hand, society is interested in a low Type I Error, i.e., in a low probability of wrongly accepting the claim of the manufacturer when it is false. As we can see from Figure 4.8, there is again a trade-off between a low α and a low β. A very low α could make it impossible to detect a new, useful manufacturing method based on samples of reasonable size. There is a wide consensus that α = 0.05 is an adequate value for most situations. When the sample sizes are very large (say, above 100 for most tests), trivial departures from H0 may be detectable with high power. In such cases, one can consider lowering the value of α (say, to α = 0.01).

Figure 4.8. One-sided, single mean RS test for the drill example, with α = 0.05 and n = 24. The hatched area is the critical region.

2. The accept-support (AS) data analysis situation

In this situation, the data analyst states H0 as his belief, i.e., he seeks to accept H0. In the drill example, the manufacturer of the new type of drills could formalise the test in an AS fashion if his claim is that the new brand is at least as good as brand B:

H0: µ ≥ µ_B = 1300.
H1: µ < µ_B.


Figure 4.9 illustrates this one-sided, single mean test. In the AS situation, lowering the Type I Error favours the manufacturer.

On the other hand, society is interested in a low Type II Error, i.e., in a low probability of wrongly accepting the claim of the manufacturer, H0, when it is false. In the case of the drills, for a sample size n = 24 and α = 0.05, the power is 17% for the alternative µ = x̄, as illustrated in Figure 4.9. This is an unacceptably low power. Even if we relax the Type I Error to α = 0.10, the power is still unacceptably low (29%). Therefore, in this case, although there is no evidence supporting the rejection of the null hypothesis, there is no evidence to accept it either.
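The low power values quoted for the AS test can be checked in the same way. The sketch below assumes x̄ = 1260, σ = 270 and n = 24, and a normal-based calculation; it gives values close to (though, because of rounding, not exactly equal to) the 17% and 29% mentioned above.

# AS test H0: mu >= 1300 vs H1: mu < 1300 (assumed sigma = 270, n = 24).
s.xbar <- 270 / sqrt(24)
power.as <- function(alpha) {
  x.alpha <- 1300 - qnorm(1 - alpha) * s.xbar   # reject H0 for sample means below this threshold
  pnorm(x.alpha, mean = 1260, sd = s.xbar)      # power at the alternative mu = 1260
}
power.as(0.05)   # ~0.18, close to the 17% quoted above
power.as(0.10)   # ~0.29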

In the AS situation, society should demand that the test be done with a sufficiently large sample size in order to obtain an adequate power. However, given the omnipresent trade-off between a low α and a low β, one should not impose a very high power because the corresponding α could then lead to the rejection of a hypothesis that explains the data almost perfectly. Again, a power value of at least 80% is generally adequate.

Note that the AS test situation is usually more difficult to interpret than the RS test situation. For this reason, it is also less commonly used.

Figure 4.9. One-sided, single mean AS test for the drill example, with α = 0.05 and n = 24. The hatched area is the critical region.