One–way analysis of variance
In all of the regression models examined so far, both the target and predicting variables
have been continuous, or at least effectively continuous — with one exception. Our analysis
of the pooled / constant shift / full model hierarchy recognized that the existence of two
well–defined subgroups in the data could have predictive power for the target variable.
That is, a categorical predicting variable taking on the values 0 and 1 could be used to
address the effect of being in one or the other subgroup.
A natural question is to wonder if this can be generalized to more than two groups.
For example, does knowing the educational level of a person (say High school, College,
or Postgraduate) have predictive power for their annual salary? Does knowing the religion of a member of Congress (officially reported as Protestant, Catholic, Jewish, or
Other/unknown) say anything about how much money they accept from certain political
action committees (PACs)? Is the return on a stock related to the industry group of the
company? This is a regression question, but a special kind of regression question; in this
context, saying that group membership has predictive power for the target is the same as
saying that the average value of the target is different for different groups. That is, this is
a question of comparison of means.
Consider the simplest situation of one categorical predicting variable that takes on K
values. The one–way analysis of variance (ANOVA) model is as follows:
$$y_{ij} = \mu + \alpha_i + \varepsilon_{ij}, \qquad i = 1, \ldots, K, \quad j = 1, \ldots, n_i \qquad (1)$$

where yij is the value of y for the j th member of the ith group, µ is an overall level (roughly
corresponding to the overall mean), αi is the effect of being in the ith group, εij is the
error term, and ni is the number of observations that fall in the ith group.
The α terms represent the difference in E(y) that comes from being in any particular
group. It is natural to say that αi = 0 for all i if there is no difference between groups,
but we have to be careful. Say we have three groups corresponding to High school, College
and Postgraduate degree. If the average salary for each group was $30,000, that would
obviously correspond to there not being a Degree effect on salary. This could be modeled
in the natural way
$$\mu = 30{,}000; \qquad \alpha_1 = \alpha_2 = \alpha_3 = 0,$$
but it could also be modeled as
$$\mu = 20{,}000; \qquad \alpha_1 = \alpha_2 = \alpha_3 = 10{,}000.$$

The latter set of parameters doesn’t reflect what we want. For this reason, an additional
restriction is put on equation (1),
$$\sum_{i=1}^{K} \alpha_i = 0.$$
With this additional constraint, it is guaranteed that a situation with no group effect will be modeled with all of the αi equal to 0.
The model (1) can be written easily as a regression model:

$$y = \beta_0 x_0 + \beta_1 x_1 + \cdots + \beta_K x_K + \varepsilon, \qquad (2)$$

where
$$x_0 = \begin{pmatrix} 1 \\ \vdots \\ 1 \\ \vdots \\ 1 \end{pmatrix}, \qquad
x_1 = \begin{pmatrix} 1 \\ \vdots \\ 1 \\ 0 \\ \vdots \\ 0 \end{pmatrix}, \qquad \ldots, \qquad
x_K = \begin{pmatrix} 0 \\ \vdots \\ 0 \\ 1 \\ \vdots \\ 1 \end{pmatrix}$$
(so $x_0$ is a column of ones, and $x_i$ equals 1 for observations in group $i$ and 0 otherwise). Note that $x_0 = x_1 + \cdots + x_K$; this is why we need the condition $\sum_i \alpha_i = 0$. The regression form (2) shows that the overall F–test of the equality of the slope coefficients to zero ($\beta_1 = \cdots = \beta_K = 0$) is testing if $E(y_{ij}) = \mu$ for all $i$ (that is, no difference in expected target for the different groups). As would be expected, $\hat{y}_{ij} = \bar{y}_i$ (that is, the fitted value for any observation in group $i$ is the sample mean of $y$ for the observations in that group).
If you attempt to run a regression using x1 through xK as predictors you will get an error message, since the $\sum_i \alpha_i = 0$ condition is not being used. The ANOVA can be
fit using regression by regressing on K − 1, rather than K predictors. There are several
different ways to do this (it should be remembered that good statistical software usually
includes code devoted to one–way ANOVA, so it generally isn’t necessary to fit the model
explicitly as a regression).
(1) Drop any one indicator variable. If you do this, the group that corresponds to the

omitted variable represents a reference group. The constant term β̂0 corresponds to the estimated average y for that group, and each slope estimate β̂i corresponds to the difference in estimated average y between group i and the reference group. The individual t–
statistic for each variable can be used to test the significance of this difference. Thus,
if one group is a natural reference group, this is a natural way to fit the model (for
example, if y is the time until relapse of a medical condition, the groups represent
different dosages of a drug, and one group corresponds to a zero dosage [control]
group).
(2) If there is no natural reference group, a regression model where the coefficients don’t
treat one group as special is desirable. It’s possible to do this using special variables
called effect codings. Pick one group as a reference group (unlike for indicator variables,
it doesn’t matter which one). Say it’s group K. For i = 1, . . . , K −1, define a predictor
as
$$x_i = \begin{cases} 1 & \text{if the observation is in group } i \\ -1 & \text{if the observation is in group } K \\ 0 & \text{otherwise.} \end{cases}$$
Now the constant term β̂0 is an estimate of the overall level µ, and each slope estimate
β̂i corresponds to the effect of being in group i (αi ). Thus, this fit (rather than that
using indicator variables) is consistent with the notation of equation (2). The effect
of being in the reference group (αK) is simply $-\sum_{i=1}^{K-1} \beta_i$, since the α's must sum to 0. The individual t–statistic for each variable can be used to test whether αi = 0.

Effect codings also turn out to be useful in situations with more than one categorical
(grouping) variable.
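Although these notes refer to Minitab for fitting, both codings are easy to build in any regression package. The following is a minimal sketch in Python (numpy, pandas, and statsmodels, with made-up salary data; the group names, numbers, and library choice are illustrative assumptions, not part of the notes) showing the indicator-variable fit with a reference group and the effect-coding fit.

import numpy as np
import pandas as pd
import statsmodels.api as sm

# Made-up data: K = 3 education groups, 40 observations each (salaries in $1000s).
rng = np.random.default_rng(1)
groups = np.repeat(["HS", "College", "Postgrad"], 40)
true_means = {"HS": 30.0, "College": 40.0, "Postgrad": 55.0}
y = np.array([true_means[g] for g in groups]) + rng.normal(0, 5, groups.size)

# (1) Indicator variables, dropping one group ("HS" is the reference group).
dummies = pd.get_dummies(pd.Series(groups)).astype(float)
X_ind = sm.add_constant(dummies[["College", "Postgrad"]])
fit_ind = sm.OLS(y, X_ind).fit()
# The constant estimates the mean of the reference (HS) group; each slope
# estimates the difference between that group's mean and the HS mean.

# (2) Effect codings, with "Postgrad" playing the role of group K.
effects = pd.DataFrame({"HS": (groups == "HS").astype(float),
                        "College": (groups == "College").astype(float)})
effects[groups == "Postgrad"] = -1.0
fit_eff = sm.OLS(y, sm.add_constant(effects)).fit()
# Here the constant estimates the overall level mu, the slopes estimate
# alpha_HS and alpha_College, and alpha_Postgrad is minus their sum.
alpha_postgrad = -(fit_eff.params["HS"] + fit_eff.params["College"])

With the effect codings, the individual t–statistics for the slopes test αi = 0 directly, matching the discussion above.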
Whatever way the model is fit, it’s important to remember that these ANOVA models
are, in fact, regression models. All of the usual assumptions on εi still hold. A particularly
important one in this context is the constant variance assumption, since we know (by
definition) that well–defined subgroups do exist in the data.
Say the overall F –test is significant; that is, there is a significant difference in the
average target variable value between groups. Which groups are different from each other?

This is a multiple comparisons question. We could look at all $I = \binom{K}{2} = K(K-1)/2$ pairs, and test each using an indicator variable fit with one of the groups as the reference. However, at a .05 level (what is termed a pairwise error rate), 5% of the comparisons would be significant by random chance, even if no groups actually differ! So, for example, if there are 7 groups (K = 7), $I = \binom{7}{2} = 21$ tests would be made, implying that on average one pair would be assessed as statistically significantly different even when there is no difference between any of the groups (this approach is sometimes called the Fisher method, or the method of least significant difference).

Multiple comparisons procedures correct for this by controlling the experimentwise
error rate. An experimentwise rate of .05 says that in repeated sampling from a population
where there is no difference between groups, only 5% of the time would any pair of groups
be considered significantly different from each other. There are many different approaches
to handling multiple comparisons, the most common of which are the Bonferroni and
Tukey methods. The Bonferroni method argues that if the experimentwise error rate is
desired to be α, each pairwise test should be done at an α/I level. So, for example, for
K = 7, each pairwise t–test would be done at a significance level of .05/21 = .00238. The
Bonferroni method is very general and very easy to apply, and usually does a good job of
controlling the experimentwise error rate. Its only drawback is that it can sometimes be
too conservative (that is, it does not reject the null when it should).
The Tukey method is a multiple comparison method specifically derived for ANOVA
multiple comparisons. As such it is less general than the Bonferroni approach, but is
usually less conservative (particularly if the design is balanced).
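As a concrete sketch of both ideas (in Python with scipy and statsmodels, on made-up data; all names and numbers here are illustrative rather than part of the notes), the Bonferroni approach simply compares each pairwise p-value to α/I, while Tukey's method is available as a canned procedure:

import itertools
import numpy as np
from scipy import stats
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Made-up data: K = 4 groups of 25 observations; only group "D" truly differs.
rng = np.random.default_rng(2)
labels = np.repeat(["A", "B", "C", "D"], 25)
y = rng.normal(10, 2, labels.size)
y[labels == "D"] += 3

K = 4
I = K * (K - 1) // 2          # 6 pairwise comparisons
alpha = 0.05

# Bonferroni: do each pairwise test at level alpha / I.
# (Plain two-sample t-tests are used here for simplicity; the indicator-variable
# fits described above would use the pooled estimate of the error variance.)
for g1, g2 in itertools.combinations(["A", "B", "C", "D"], 2):
    t_stat, p = stats.ttest_ind(y[labels == g1], y[labels == g2])
    print(g1, g2, round(p, 4), "significant" if p < alpha / I else "not significant")

# Tukey's method, designed specifically for ANOVA pairwise comparisons.
print(pairwise_tukeyhsd(y, labels, alpha=alpha))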
An alternative approach to the multiple comparisons problem introduced in the last
10-15 years is based on controlling the false discovery rate, which is the expected proportion
of falsely rejected hypotheses among all rejected hypotheses. If all of the null hypotheses
are true (that is, in the ANOVA context all of the group means are equal to each other) this
is the same as the experimentwise rate controlled by the Bonferroni and Tukey methods,
but when some of the null hypotheses are not true it is easier to reject the null, therefore

making the test more sensitive and less conservative. Minitab does not provide this method
at this time.
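The best-known FDR-controlling procedure is the Benjamini–Hochberg method. Although Minitab does not provide it, it is available elsewhere; here is a minimal Python sketch using statsmodels, with made-up p-values standing in for the I pairwise tests:

import numpy as np
from statsmodels.stats.multitest import multipletests

# Made-up pairwise p-values (e.g., from the 6 comparisons above).
p_values = np.array([0.001, 0.004, 0.030, 0.200, 0.450, 0.620])

# Benjamini-Hochberg: control the expected proportion of false discoveries
# among all rejected hypotheses at 5%.
reject, p_adjusted, _, _ = multipletests(p_values, alpha=0.05, method="fdr_bh")
print(reject)       # which pairwise null hypotheses are rejected
print(p_adjusted)   # FDR-adjusted p-values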
As we noted earlier, the ANOVA situation (where there are by definition well-defined
subgroups in the data) is one where heteroscedasticity (nonconstant variance) is common,
with the errors for observations from different subgroups having different variances. This
is a clear violation of the assumptions of ordinary least squares, but fortunately, there is a
direct cure for the problem: weighted least squares.
The idea behind weighted least squares (WLS) is that least squares is still a good
thing to do if the target and predicting variables are transformed to give a model with
errors with constant variance. Say V (εi ) = σi2 . To keep things simple, consider a simple
regression model, although everything here carries over directly to multiple regression and
ANOVA situations. The regression model is
$$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i.$$
If we divide both sides of this equation by σi we get
$$\frac{y_i}{\sigma_i} = \beta_0\left(\frac{1}{\sigma_i}\right) + \beta_1\left(\frac{x_i}{\sigma_i}\right) + \frac{\varepsilon_i}{\sigma_i}.$$
This can be rewritten
$$y_i^* = \beta_0 z_{1i} + \beta_1 z_{2i} + \delta_i,$$
where yi∗ , z1i , z2i , and δi are the obvious substitutions from the previous equation and
V (δi ) = 1 for all i. Thus, ordinary least squares (OLS) estimation (without an intercept
term) of y ∗ on z1 and z2 gives fully efficient estimates of β0 and β1 . Note that using
a constant multiple of σi works just as well, since the only requirement is that V (δi ) be
constant for all i. Any good statistical package will include an option for providing a weight
variable for WLS; while the standardization described here is going on in the background,
it is completely invisible to the user.
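As a quick check of this equivalence, here is a sketch in Python using statsmodels (the data and the σi values are made up for illustration): supplying the weights 1/σi2 to a WLS routine reproduces OLS on the transformed variables.

import numpy as np
import statsmodels.api as sm

# Made-up simple regression data with known error standard deviations sigma_i.
rng = np.random.default_rng(5)
n = 200
x = rng.uniform(0, 10, n)
sigma = 0.5 + 0.4 * x
y = 1.0 + 2.0 * x + rng.normal(0, sigma)

# WLS, supplying the weight variable 1 / sigma_i^2.
wls = sm.WLS(y, sm.add_constant(x), weights=1.0 / sigma**2).fit()

# OLS with no intercept on the transformed variables
# y_i* = y_i / sigma_i, z_1i = 1 / sigma_i, z_2i = x_i / sigma_i.
Z = np.column_stack([1.0 / sigma, x / sigma])
ols_transformed = sm.OLS(y / sigma, Z).fit()

print(wls.params)               # estimates of beta_0 and beta_1
print(ols_transformed.params)   # the same estimates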
The value WTi = 1/σi2 is the ith value of the weighting variable. Ordinary least squares
is a special case of WLS with WTi = 1 for all i (and, in fact, most regression packages only
include code for WLS, with OLS the default special case). The problem is that σi2 is
unknown, and must be estimated. Fortunately, this is easy to do in the ANOVA situation.
Consider a situation where there is a predictor defining K subgroups in the data. The key
is to assume that the errors for all of the observations that come from group j (say) have
the same variance, σj2 (note that these values are allowed to be different from one group to
another). The weight for each of the observations from group j would then be 1/σ̂j2 , where
σ̂j2 is the estimate of the variance of the errors in the jth group. An estimate that is then
easily available is to just separate the residuals by group membership, and then estimate
σj2 using the sample variance of the residuals for the members of group j.
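In code, the two-step recipe just described looks something like the following Python/statsmodels sketch (the group names, means, and variances are made-up illustrations):

import numpy as np
import pandas as pd
import statsmodels.api as sm

# Made-up one-way ANOVA data with K = 3 groups and very different error variances.
rng = np.random.default_rng(3)
groups = np.repeat(["A", "B", "C"], 30)
group_mean = {"A": 2.0, "B": 5.0, "C": 9.0}
group_sd = {"A": 1.0, "B": 3.0, "C": 6.0}
y = np.array([group_mean[g] for g in groups]) + \
    rng.normal(0, [group_sd[g] for g in groups])

X = sm.add_constant(pd.get_dummies(pd.Series(groups), drop_first=True).astype(float))

# Step 1: fit by OLS and estimate sigma_j^2 as the sample variance of the
# residuals within each group.
ols_fit = sm.OLS(y, X).fit()
resid = np.asarray(ols_fit.resid)
sigma2_hat = {g: resid[groups == g].var(ddof=1) for g in ["A", "B", "C"]}

# Step 2: refit by WLS with weight 1 / sigma_hat_j^2 for every observation in group j.
weights = np.array([1.0 / sigma2_hat[g] for g in groups])
wls_fit = sm.WLS(y, X, weights=weights).fit()
print(wls_fit.params)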
It is important to recognize the reasons behind the use of weighted least squares. The
goal is not to improve measures of fit like R2 or F ; rather, the goal is to analyze the
data in an appropriate fashion. There are several advantages to addressing nonconstant
variance:
(1) The estimates of the regression coefficients are more efficient. That is, on average,
the WLS estimates should be closer to the true regression coefficients than the OLS
estimates are.
(2) More importantly, predictions are more sensible. If the underlying variability of a
certain type of observation is larger than that for another type of observation, the
prediction interval should reflect that. This is not done under OLS, but it is under
WLS. In particular, a rough prediction interval for the ith observation is no longer

±2s (using Minitab’s notation), but is rather $\pm 2s/\sqrt{WT_i}$, since that corresponds to ±2σ̂i.
Say we are in a situation where the categories have a natural ordering. We might
wonder if that ordering corresponds to a numerical scale. For example, say the target
variable is a person’s salary, and the grouping variable is the amount of schooling the
person has (High school, College, Postgraduate). Is the average change in salary when
going from High School education to College education roughly the same as when going
from College to Postgraduate? That is, is the relationship between salary and schooling
linear if schooling is on an equispaced scale of (say) 1, 2, 3? We can investigate this
question using a partial F –test.
Let Linear be a numerical variable that corresponds to the natural ordering of the
groups, such as 1, 2, 3. The question is whether an ordering in y is implied by the ordering
of the groups. Consider the following two situations:
Group            Average salary (first case)    Average salary (second case)
High school              $20,000                        $20,000
College                  $35,000                        $35,000
Postgraduate             $50,000                        $65,000

In the first case, salary is linearly related to education level, since each increase in
education level is associated with a constant change in average salary. A regression model
on only Linear would fit these data well. That is a good thing, since the model on only
Linear is simpler than the full ANOVA model (it requires only two parameters, rather
than three). On the other hand, the second case is one where the average increase in salary
from College to Postgraduate is twice as large as that from High school to College. This
is not a linear relationship, and the model based on only Linear would not fit the data.
A partial F –test comparing the full ANOVA model to the model on only Linear is a test
of whether the simpler model is adequate; if the test is not statistically significant, then
the simpler model is adequate.
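As an illustration, here is the partial F–test as a Python/statsmodels sketch, using made-up salary data with the first-case group means from the table above (all names and numbers are illustrative assumptions):

import numpy as np
import pandas as pd
import statsmodels.api as sm
from scipy import stats

# Made-up salary data (in dollars) with the case-1 group means from the table.
rng = np.random.default_rng(4)
groups = np.repeat(["High school", "College", "Postgraduate"], 50)
true_mean = {"High school": 20000.0, "College": 35000.0, "Postgraduate": 50000.0}
y = np.array([true_mean[g] for g in groups]) + rng.normal(0, 8000.0, groups.size)

# Full one-way ANOVA model: indicator variables for the groups.
X_full = sm.add_constant(pd.get_dummies(pd.Series(groups), drop_first=True).astype(float))
fit_full = sm.OLS(y, X_full).fit()

# Reduced model: the single equispaced Linear variable (1, 2, 3).
linear = pd.Series(groups).map({"High school": 1.0, "College": 2.0, "Postgraduate": 3.0})
X_linear = sm.add_constant(pd.DataFrame({"Linear": linear}))
fit_linear = sm.OLS(y, X_linear).fit()

# Partial F-test comparing the two fits.
df_num = fit_linear.df_resid - fit_full.df_resid
F = ((fit_linear.ssr - fit_full.ssr) / df_num) / (fit_full.ssr / fit_full.df_resid)
p = stats.f.sf(F, df_num, fit_full.df_resid)
print(F, p)   # a large p-value says the simpler Linear model is adequate
# Equivalently: fit_full.compare_f_test(fit_linear)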

c 2012, Jeffrey S. Simonoff


7