For instance, since the mean in our example is 20.875 and the standard deviation is 7.0799, we can estimate from the above statement that approximately 95% of the scores will fall in the
range of 20.875 - 2(7.0799) to 20.875 + 2(7.0799), or between 6.7152 and 35.0348. This kind of information is a critical stepping stone toward comparing the performance of an
individual on one variable with their performance on another, even when the variables are measured on entirely different scales.
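The arithmetic behind this range can be checked with a few lines of MATLAB. This is only a minimal sketch: the variable names are illustrative, and the factor of 2 is the usual rounding used for roughly 95% coverage.

```matlab
% Approximate 95% range: mean plus or minus two standard deviations
m = 20.875;          % mean from the example
s = 7.0799;          % standard deviation from the example
lo = m - 2*s;        % lower end of the range
hi = m + 2*s;        % upper end of the range
fprintf('About 95%% of scores fall between %.4f and %.4f\n', lo, hi)
```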
C. Descriptive statistics for measurements of a single variable
1. The basic idea
We now deal with descriptive statistics for measurements of a single variable. It is imagined that we have a large population of values from which we take samples. The population
could consist of the diameters of automobile drive shafts produced in a given plant. To make sure the manufacturing equipment continues to operate satisfactorily, we measure the
diameter of every tenth drive shaft [1].
The measurements over a given time period are called “samples” of the “population” of all drive shafts. The measurements will vary somewhat,
both because of finite tolerances in the manufacturing equipment and because of uncertainties in the measurements themselves. From the samples, we wish to make
judgments about the underlying population, i.e. the actual diameters of all drive shafts made. For example, the mean (average) of the samples is expected to be approximately equal to the true
unknown mean of the population. The accuracy of this sample estimate of the population mean would be expected to improve as the sample size is increased. For example, if we
measured every other drive shaft, we would expect the mean of our measurements to be closer to the actual average diameter of all drive shafts than when we measured
only 1/10 of them. One of the primary objectives of statistics is to make such statements quantitative. For example,
rather than just saying that the average drive shaft diameter is approximately equal to the sample mean, we’d like to give a range of diameters within which the true mean lies with a
probability of 95%.
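The effect of sample size can be tried out numerically. The sketch below is an illustration under stated assumptions, not part of the original notes: the population values (25 mm mean, 0.05 mm standard deviation, 10000 shafts) are made up, the measurements are treated as independent, and the range uses the standard error of the mean, s/sqrt(n), a standard result that the text has not yet introduced. The `rng` call assumes a reasonably recent MATLAB release.

```matlab
% Hypothetical population of drive shaft diameters in mm (assumed values)
rng(1);                              % fixed seed so the run is repeatable
N = 10000;
diam = 25 + 0.05*randn(N,1);

every10 = diam(10:10:end);           % measure every tenth shaft
every2  = diam(2:2:end);             % measure every other shaft

% Standard error of the mean, s/sqrt(n), shrinks as n grows
se10 = std(every10)/sqrt(numel(every10));
se2  = std(every2)/sqrt(numel(every2));

% Approximate 95% range for the population mean: sample mean +/- 2 standard errors
fprintf('every 10th: mean %.4f, range %.4f to %.4f\n', ...
        mean(every10), mean(every10)-2*se10, mean(every10)+2*se10)
fprintf('every 2nd : mean %.4f, range %.4f to %.4f\n', ...
        mean(every2), mean(every2)-2*se2, mean(every2)+2*se2)
```

With five times as many measurements, the range for the every-other-shaft sample should come out a little more than twice as narrow.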
2. The normal distribution
The most common assumption made in statistical treatments of data is that the probability of a particular value x deviating from the population mean $\mu$
is inversely proportional to the exponential of the square of its deviation from the mean. This gives rise to the familiar “bell-shaped curve” of the
normal probability density function:
$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}} \qquad (2.2)$$

where $\sigma^2$ is the population variance, which is the mean of all values of $(x-\mu)^2$. The factor $1/(\sigma\sqrt{2\pi})$ was chosen so that $\int_{-\infty}^{\infty} f(x)\,dx = 1$. The probability that a given sample x lies between a and b is $\int_a^b f(x)\,dx$ [2], which gives the fundamental meaning of the probability density function f.

[1] While we could measure every drive shaft, this is unnecessarily expensive.
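Both properties of the normal density with mean $\mu$ and variance $\sigma^2$ (unit total area, and probability as the area between a and b) can be checked numerically. The following is a sketch only: the values of mu, sigma, a and b are chosen for illustration, `integral` is MATLAB's built-in numerical integrator, and the closed form in terms of `erf` is a standard identity not given in the text.

```matlab
mu = 5;  sigma = sqrt(2);            % example values (variance of 2)

% Normal probability density function
f = @(x) exp(-(x-mu).^2/(2*sigma^2)) / (sigma*sqrt(2*pi));

total = integral(f, -Inf, Inf);      % total area, should be close to 1

a = mu - 2*sigma;  b = mu + 2*sigma; % two standard deviations either side
p = integral(f, a, b);               % probability that x lies between a and b

% Same probability from the error function, for comparison
pErf = 0.5*(erf((b-mu)/(sigma*sqrt(2))) - erf((a-mu)/(sigma*sqrt(2))));

fprintf('total area = %.6f, P(a<x<b) = %.4f (erf gives %.4f)\n', total, p, pErf)
```

For these limits the probability comes out near 0.954, which is where the rule of thumb of about 95% within two standard deviations comes from.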
To illustrate the normal distribution, we present below a MATLAB program to generate normally-distributed random numbers and compare the resulting histogram with
equation 2.2. To save time, you can cut and paste this program into MATLAB’s Editor,
save it in your working directory as ranhys.m, and then execute it in MATLAB’s Command window by typing ranhys. Try it for several values of the mean, variance and number
of values, n. Notice how the histogram approaches the shape [3] of the normal distribution better and better as n is increased. A histogram for $\mu$ = 5, $\sigma^2$ = 2 and n = 500 is given as Figure 2.1.
% ranhys.m  W.R. Wilcox, Clarkson University, 1 June 2004.
% Comparison of a histogram of normally distributed random numbers
% with a normal distribution.
% n is the number of samples
% sigma is the sample standard deviation
% mu is the sample mean
% X is the vector of values
clear
n = input('Enter the number of values to be generated ');
mu = input('Enter the population mean ');
sigsq = input('Enter the population variance ');
sigma = sqrt(sigsq);
% Set the state for the random number generator
% (See help randn)
randn('state',sum(100*clock));
% Generate the random numbers desired
X = mu + sigma*randn(n,1);
% Plot the histogram with 10 bins (see help hist)
hist(X,10), xlabel('value'), ylabel('number in bin')
h = findobj(gca,'Type','patch');
set(h,'FaceColor','m','EdgeColor','w')
hold on
% Now create a curve for the normal distribution
% (with a maximum equal to 1/4 of the number of values n)
x = mu-4*sigma:sigma/100:mu+4*sigma;
f = 0.25*n*exp(-(x-mu).^2/(2*sigma.^2));
plot(x,f), legend('random number','normal distribution')
title('Comparison of random number histogram with normal distribution shape')
hold off

[2] That is, the area under the f(x) curve between a and b.
[3] Compare only the shape, as here the maximum in the normal distribution is arbitrarily set to n/4.
Do you get the same histogram if you use the same values again for $\mu$, $\sigma^2$ and n? Examine the code until you understand why.
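One way to explore this question is to control the seed directly rather than taking it from the clock. The sketch below uses `rng`, the seeding function in current MATLAB releases, in place of the older `randn('state',...)` syntax used in ranhys.m; the seed value 42 is arbitrary.

```matlab
rng(42);  A = randn(5,1);      % set the seed, then draw five values
rng(42);  B = randn(5,1);      % same seed again: the identical five values
C = randn(5,1);                % generator has moved on: different values

disp(isequal(A,B))             % 1 (true)
disp(isequal(A,C))             % 0 (false)
```

Replacing the clock-based seed in ranhys.m with a fixed number would make every run with the same inputs produce the same histogram.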
Figure 3. Sample histogram for $\mu$ = 5, $\sigma^2$ = 2 and n = 500.
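The n/4 scaling used in ranhys.m only allows the shapes to be compared. If a curve on the same vertical scale as the histogram is wanted, one option (not used in the original program) is to multiply the density by n times the bin width, which gives the expected number of values per bin. The sketch below assumes a fixed seed via `rng` and reuses the figure's parameter values; no plot is drawn, and the bin centers, observed counts and expected counts are printed side by side instead.

```matlab
rng(0);                                    % assumed fixed seed for repeatability
mu = 5;  sigma = sqrt(2);  n = 500;        % values used for the figure
X = mu + sigma*randn(n,1);

[counts, centers] = hist(X, 10);           % 10 bins, as in ranhys.m
binWidth = centers(2) - centers(1);

% Normal density, then expected number of values per bin
f = @(x) exp(-(x-mu).^2/(2*sigma^2)) / (sigma*sqrt(2*pi));
expected = n*binWidth*f(centers);

disp([centers(:) counts(:) expected(:)])   % bin center, observed, expected
```

Plotting `expected` against `centers` on top of `hist(X,10)` (with `hold on`) overlays a curve whose height matches the bars rather than only their shape.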
See http://www.shodor.org/interactivate/activities/NormalDistribution/ for a graphical illustration of the influence of population standard deviation on the normal distribution and
the influence of bin size on a histogram.
3. Tests to see if a population is normally distributed