5.1.4 The Kolmogorov-Smirnov Goodness of Fit Test

The Kolmogorov-Smirnov goodness of fit test is a one-sample test designed to assess the goodness of fit of a data sample to a hypothesised continuous distribution, $F_X(x)$. The null hypothesis is formalised as:

$H_0$: Data variable $X$ has a cumulative probability distribution $F_X(x) \equiv F(x)$.

Let $S_n(x)$ be the observed cumulative distribution of the random sample $x_1, x_2, \ldots, x_n$, also called the empirical distribution. Assuming the sample data are sorted in increasing order, the values of $S_n(x)$ are obtained by adding the successive frequencies of occurrence, $k_i/n$, for each distinct $x_i$.
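The construction of the empirical distribution can be sketched in a few lines of Python. This is only an illustration: the function name `empirical_cdf` and the small sample are hypothetical and are not part of the cork-stopper data.

```python
from collections import Counter

def empirical_cdf(sample):
    """Return the distinct sorted values x_i and S_n(x_i) for each of them."""
    n = len(sample)
    counts = Counter(sample)           # k_i = number of occurrences of each distinct x_i
    xs = sorted(counts)                # distinct values in increasing order
    cumulative = 0
    s_n = []
    for x in xs:
        cumulative += counts[x]        # add the next k_i
        s_n.append(cumulative / n)     # S_n(x_i) = (k_1 + ... + k_i) / n
    return xs, s_n

# Hypothetical sample, for illustration only (2.5 occurs twice)
sample = [3.1, 1.2, 2.5, 2.5, 4.0]
xs, s_n = empirical_cdf(sample)
for x, s in zip(xs, s_n):
    print(f"S_n({x}) = {s:.1f}")
```

The repeated value contributes a jump of 2/5 instead of 1/5, which is the role played by the frequencies $k_i/n$.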

Under the null hypothesis one expects to obtain small deviations of $S_n(x)$ from $F(x)$. The Kolmogorov-Smirnov test uses the largest of such deviations as a goodness of fit measure:

$$D_n = \max |F(x) - S_n(x)|, \quad \text{for every distinct } x_i. \tag{5.10}$$
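A minimal sketch of how the largest deviation might be computed is given below. It rests on one assumption beyond the text: because the empirical distribution is a step function, the sketch checks the deviation on both sides of each jump, which is the usual convention in statistical software. The names `normal_cdf` and `ks_statistic` and the sample values are hypothetical.

```python
import math

def normal_cdf(x, mu=0.0, sigma=1.0):
    """Normal cumulative distribution, written with the error function."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def ks_statistic(sample, cdf):
    """Largest absolute deviation between cdf and the empirical distribution."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        f = cdf(x)
        # At x the empirical distribution jumps from i/n to (i + 1)/n
        # (assumption: both sides of the jump are compared with F(x)).
        d = max(d, abs(f - i / n), abs((i + 1) / n - f))
    return d

# Hypothetical sample, tested against the standard normal distribution
sample = [-1.3, -0.4, 0.1, 0.7, 1.9]
print(f"D_n = {ks_statistic(sample, normal_cdf):.4f}")
```

Any function returning $F(x)$ can be passed as `cdf`, so the same sketch serves for other hypothesised continuous distributions.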

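The sampling distribution of $D_n$ is given in the literature. Unless $n$ is very small the following asymptotic result can be used:

$$\lim_{n \to \infty} P\left(\sqrt{n}\, D_n \le t\right) = 1 - 2 \sum_{i=1}^{\infty} (-1)^{i-1} e^{-2 i^2 t^2}. \tag{5.11}$$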
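The limiting distribution of $\sqrt{n}\,D_n$ is the classical Kolmogorov series, $1 - 2\sum_{i \ge 1} (-1)^{i-1} e^{-2 i^2 t^2}$, which can be evaluated numerically by truncating the sum. The sketch below assumes that 100 terms are enough, which holds for values of $t$ of practical interest but not for $t$ very close to zero. The function name `kolmogorov_cdf` is hypothetical.

```python
import math

def kolmogorov_cdf(t, terms=100):
    """Truncated series for the limit of P(sqrt(n) * D_n <= t)."""
    if t <= 0:
        return 0.0
    total = 0.0
    for i in range(1, terms + 1):
        # Alternating terms (-1)^(i-1) * exp(-2 i^2 t^2)
        total += (-1) ** (i - 1) * math.exp(-2.0 * i * i * t * t)
    return 1.0 - 2.0 * total

for t in (0.5, 1.0, 1.36, 2.0):
    print(f"t = {t:4.2f}  P(sqrt(n) D_n <= t) ~ {kolmogorov_cdf(t):.4f}")
```

The terms decay very quickly for moderate $t$, so in practice the first three or four already determine the result.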

The Kolmogorov-Smirnov test rejects the null hypothesis at level $\alpha$ if $D_n > d_{n,\alpha}$, where $d_{n,\alpha}$ is such that:

$$P_{H_0}\left(D_n > d_{n,\alpha}\right) = \alpha. \tag{5.12}$$

Using formula 5.11 the following critical points are obtained:

$$d_{n,0.01} = 1.63/\sqrt{n}; \quad d_{n,0.05} = 1.36/\sqrt{n}; \quad d_{n,0.10} = 1.22/\sqrt{n}. \tag{5.13}$$
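As a quick check, the coefficients 1.63, 1.36 and 1.22 can be inserted into the asymptotic tail probability $2\sum_{i \ge 1} (-1)^{i-1} e^{-2 i^2 t^2}$ to confirm that they correspond approximately to the levels 0.01, 0.05 and 0.10, with the critical point for a given $n$ obtained by dividing by $\sqrt{n}$. The sketch below is illustrative only, and its function names are hypothetical.

```python
import math

# Coefficients of the asymptotic critical points, indexed by level alpha
COEFFICIENTS = {0.01: 1.63, 0.05: 1.36, 0.10: 1.22}

def ks_critical_value(n, alpha):
    """Asymptotic critical point d_{n,alpha} = c_alpha / sqrt(n)."""
    return COEFFICIENTS[alpha] / math.sqrt(n)

def tail_probability(t, terms=100):
    """Asymptotic P(sqrt(n) * D_n > t), from the truncated Kolmogorov series."""
    return 2.0 * sum((-1) ** (i - 1) * math.exp(-2.0 * i * i * t * t)
                     for i in range(1, terms + 1))

for alpha, c in COEFFICIENTS.items():
    print(f"alpha = {alpha:.2f}: tail at {c} = {tail_probability(c):.4f}, "
          f"d_(50,alpha) = {ks_critical_value(50, alpha):.4f}")
```

For $n = 50$ and $\alpha = 0.05$ this gives a critical point of about 0.192.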

184 5 Non-Parametric Tests of Hypotheses

Note that when applying the Kolmogorov-Smirnov test, one often uses distribution parameters computed from the actual data. For instance, when assessing the normality of an empirical distribution, one often uses the sample mean and sample standard deviation. This is a source of uncertainty in the interpretation of the results: the critical points 5.13 assume a fully specified $F(x)$, and estimating its parameters from the same sample tends to make $D_n$ smaller, so the test becomes conservative.
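The effect of estimating the parameters from the data can be explored with a small simulation. The sketch below is an assumption-laden illustration, not a procedure taken from the text: it draws normal samples, fits the mean and standard deviation to each sample, and estimates the 95th percentile of the resulting statistic so that it can be compared with $1.36/\sqrt{n}$. The function names, the number of replications and the seed are all arbitrary choices.

```python
import math
import random

def normal_cdf(x, mu, sigma):
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def ks_statistic(sample, cdf):
    xs = sorted(sample)
    n = len(xs)
    # Compare F(x) with the empirical distribution on both sides of each jump
    return max(max(abs(cdf(x) - i / n), abs((i + 1) / n - cdf(x)))
               for i, x in enumerate(xs))

def simulated_critical_value(n, alpha=0.05, reps=2000, seed=1):
    """Monte Carlo estimate of the (1 - alpha) quantile of D_n when the
    normal parameters are estimated from each simulated sample."""
    rng = random.Random(seed)
    stats = []
    for _ in range(reps):
        x = [rng.gauss(0.0, 1.0) for _ in range(n)]
        m = sum(x) / n
        s = math.sqrt(sum((v - m) ** 2 for v in x) / (n - 1))
        stats.append(ks_statistic(x, lambda v: normal_cdf(v, m, s)))
    stats.sort()
    return stats[min(reps, math.ceil((1.0 - alpha) * reps)) - 1]

n = 50
print(f"simulated critical value (fitted parameters): {simulated_critical_value(n):.3f}")
print(f"asymptotic critical point 1.36/sqrt(n):       {1.36 / math.sqrt(n):.3f}")
```

The simulated value falls well below the asymptotic critical point, which is why a correction such as the Lilliefors test is preferred when the parameters are estimated.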

Example 5.8

Q: Redo the previous Example 5.7 (assessing the normality of ART for class 1 of the cork-stopper data), using the Kolmogorov-Smirnov test.

A: Running the test with SPSS we obtain the results displayed in Table 5.8, showing no evidence (p = 0.8) supporting the rejection of the null hypothesis (normal distribution). In R the test would be run as:

> x <- ART[1:50]
> ks.test(x, "pnorm", mean(x), sd(x))

The following results are obtained confirming the ones in Table 5.8:

D = 0.0922, p-value = 0.7891
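The reported figures can be related to the asymptotic result with a short calculation. The sketch below takes only $n = 50$ and $D = 0.0922$ from the example and evaluates the truncated Kolmogorov series. It is an approximation for illustration, and it need not reproduce the algorithm used internally by SPSS or R.

```python
import math

n, d = 50, 0.0922                      # sample size and statistic from Example 5.8

z = math.sqrt(n) * d                   # scaled statistic sqrt(n) * D_n
# Asymptotic observed significance: P(sqrt(n) * D_n > z), truncated series
p = 2.0 * sum((-1) ** (i - 1) * math.exp(-2.0 * i * i * z * z)
              for i in range(1, 101))
critical = 1.36 / math.sqrt(n)         # asymptotic critical point at the 5% level

print(f"Z = {z:.3f}, asymptotic p = {p:.3f}")
print(f"D = {d} versus critical point {critical:.3f}: "
      f"{'reject' if d > critical else 'do not reject'} H0")
```

The scaled statistic is about 0.652 and the asymptotic significance is close to 0.79, in line with the values quoted in the example.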

Table 5.8. Kolmogorov-Smirnov test results for variable ART obtained with SPSS in the goodness of fit assessment of normal distribution.

                                             ART
N                                            50
Normal Parameters          Mean              …
                           Std. Deviation    42.9969
Most Extreme Differences   Negative          −0.092
Kolmogorov-Smirnov Z                         0.652
Asymp. Sig. (2-tailed)                       …

In the goodness of fit assessment of a normal distribution it may be convenient to inspect cumulative distribution plots and normal probability plots. Figure 5.2 exemplifies these plots for the ART variable of Example 5.8. The cumulative distribution plot helps to detect the regions where the empirical distribution deviates most from the theoretical distribution, and can also be used to measure the statistic $D_n$ (formula 5.10). The normal probability plot displays, along the vertical axis, the z-scores of the data and those of the standard normal distribution. The latter lie on a straight line. Large deviations of the observed z-scores from this straight line are a symptom of poor normal approximation.
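The coordinates behind a normal probability plot can be sketched as follows. The plotting positions $(i - 0.5)/n$ used here are an assumption, since statistical packages adopt slightly different conventions. The function name and the sample are hypothetical.

```python
import math
from statistics import NormalDist

def normal_probability_points(sample):
    """Return (x, theoretical z-score, observed z-score) for each sorted value."""
    xs = sorted(sample)
    n = len(xs)
    mean = sum(xs) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in xs) / (n - 1))
    std_normal = NormalDist()          # standard normal distribution
    points = []
    for i, x in enumerate(xs, start=1):
        # Assumed plotting position (i - 0.5) / n for the i-th ordered value
        theoretical = std_normal.inv_cdf((i - 0.5) / n)
        observed = (x - mean) / sd     # z-score of the observed value
        points.append((x, theoretical, observed))
    return points

# Hypothetical sample, for illustration only
for x, t, o in normal_probability_points([112, 95, 140, 128, 170, 103, 151]):
    print(f"x = {x:5.1f}  theoretical z = {t:6.3f}  observed z = {o:6.3f}")
```

When the observed z-scores stay close to the theoretical ones over the whole range, the points fall near the straight line of the plot.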

5.1 Inference on One Population 185

Figure 5.2. Visually assessing the normality of the ART variable (cork stopper dataset) with MATLAB: a) Empirical cumulative distribution plot with superimposed normal distribution (smooth line); b) Normal probability plot.

Commands 5.5. SPSS, STATISTICA, MATLAB and R commands used to perform goodness of fit tests.

SPSS         Analyze; Nonparametric Tests; 1-Sample K-S
             Analyze; Descriptive Statistics; Explore; Plots; Normality plots with tests

STATISTICA   Statistics; Basic Statistics/Tables; Histograms
             Graphs; Histograms

MATLAB       [h,p,ksstat,cv] = kstest(x,cdf,alpha,tail)
             [h,p,lstat,cv] = lillietest(x,alpha)

R            ks.test(x, y, ...)

With STATISTICA the one-sample Kolmogorov-Smirnov test is not available as a separate test. It can, however, be performed together with other goodness of fit tests when displaying a histogram (Advanced option). SPSS also provides the goodness of fit tests together with the normality plots that can be obtained with the Explore command.

With the MATLAB commands kstest and lillietest, the meaning of the parameters and return values when testing the data sample x at level alpha is as follows:

cdf: Two-column matrix, with the first column containing the random sample x and the second column containing the hypothesised cumulative distribution.

tail: Type of test with values 0, −1, 1 corresponding to the alternative hypothesis $F(x) \ne S_n(x)$, $F(x) > S_n(x)$ and $F(x) < S_n(x)$, respectively.

h: Test result, equal to 1 if $H_0$ can be rejected, 0 otherwise.

p: Observed significance.

ksstat, lstat: Values of the Kolmogorov-Smirnov and Lilliefors statistics, respectively.

cv: Critical value for significant test.

Some of these parameters and return values can be omitted. For instance, h = kstest(x) performs only the normality test of x, against the standard normal distribution.

The arguments of the R function ks.test are as follows:

x: A numeric vector of data values.

y: Either a numeric vector of expected data values or a character string naming a distribution function.

...: Parameters of the distribution specified by y.

Commands 5.6. SPSS, STATISTICA, MATLAB and R commands used to obtain cumulative distribution plots and normal probability plots.

SPSS         Graphs; Interactive; Histogram; Cumulative histogram
             Analyze; Descriptive Statistics; Explore; Plots; Normality plots with tests | Graphs; P-P

STATISTICA   Graphs; Histograms; Showing Type; Cumulative
             Graphs; 2D Graphs; Probability-Probability Plots

MATLAB       cdfplot(x) ; normplot(x)

R            plot.ecdf(x) ; qqnorm(x)

The cumulative distribution plot shown in Figure 5.2a was obtained with MATLAB using the following sequence of commands:

» art = corkstoppers(1:50,3);
» cdfplot(art)
» hold on
» xaxis = 0:1:250;
» plot(xaxis,normcdf(xaxis,mean(art),std(art)))

Note the hold on command used to superimpose the normal distribution over the previous empirical distribution of the data. This facility is disabled with hold off. The normcdf command is used to obtain the normal cumulative distribution in the interval specified by xaxis, with the mean and standard deviation also specified (here the sample mean and sample standard deviation of art).
