202 S.G. Rivkin Economics of Education Review 20 2001 201–209
Moreover, the available variables such as family income and parental education are unlikely to account for all factors
that are related to both outcomes and the choice of neighborhood and school. Consequently, single equation
techniques almost certainly do not identify true peer group effects regardless of the number of included
covariates.
The problem of endogeneity bias has prompted a search for alternative methods, leading some researchers
to use data that are aggregated to the state, county, or metropolitan area level as instruments for school or
neighborhood data. They argue that aggregation reduces problems introduced by the endogeneity of school and
neighborhood choice, because school and neighborhood location decisions tend to occur within metropolitan
areas or states.³ Yet aggregation may also exacerbate the biases that result from the omission of family, school, or
other factors that are correlated with both outcomes and peer group background. Though the theoretical effects of
aggregation on specification error are ambiguous, empirical evidence suggests that aggregation increases
rather than reduces omitted variables bias in the estimation of school resource effects, a closely related
topic.⁴
In an influential recent paper, Evans, Oates, and Schwab (1992) use aggregate metropolitan area characteristics
to identify the effects of school peer group background on teen pregnancy and high school dropout rates.
In sharp contrast to the single equation estimates, they find that there is no significant relationship between
outcomes and peer socioeconomic background once metropolitan area characteristics are used as instruments for
the school peer group measure. This pattern of results is consistent with the hypothesis that much if not all of the
observed relationship between outcomes and peer group variables results from the effects of unobserved family
influences.
However, the findings of Evans et al. (1992) do not provide convincing evidence that aggregation reduces
specification error. First, because the coefficient estimates are quite noisy, they are also consistent with
positive and sizeable peer group effects. Second, the statistical evidence offered in support of the validity of the
instruments is uninformative. Finally, the theoretical justification for the methodology is not compelling because
of the aforementioned ambiguous effect of aggregation on omitted variables bias. Given the lack of both clear
theoretical support for the methodology and statistical evidence of instrument validity, much stronger empirical
evidence is needed in order to evaluate the desirability of using aggregate data to identify peer group effects.

³ See Card and Krueger (1996) for a discussion of the advantages of aggregate data in the estimation of school resource
effects.
⁴ See Grogger (1996) and Hanushek, Rivkin, and Taylor (1996) for evidence on school resource effects, and Moffit
(1995) for a general discussion of aggregation and specification error.

This paper provides additional evidence on peer group effects using a sample of non-Hispanic Black and White
women from the sophomore cohort of the High School and Beyond Longitudinal Survey (HSB; US Department
of Education, 1986). Outcomes include standardized test scores, teen fertility, high school continuation, and
nonparticipation in either school or work following high school graduation. The HSB offers several advantages
over other data sets used in the investigation of peer group effects, such as the NLSY, in particular the large
number of students sampled in each school and the availability of test score data early in the high school career
that can be used to control for pre-existing differences in student achievement.
The empirical analysis focusses on the hypothesis that the use of aggregate information as instruments reduces
the magnitude of specification error. In contrast to the findings of Evans, Oates, and Schwab, the majority of
instrumental variable estimates are larger than the single equation estimates, and a number are statistically
significant at conventional levels. This pattern is consistent with prior evidence that aggregation tends to exacerbate
specification error in the estimation of education production functions, and it raises serious doubts about the
use of aggregate data as a way to identify peer group, school, or neighborhood effects of any kind.
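The contrast between single equation and instrumental variable estimates can be illustrated with a small simulation. The sketch below is not the specification estimated in this paper: it assumes one peer measure, one aggregate instrument, no covariates, and an invented data generating process, and all function names and parameter values are hypothetical. It shows that an area-level instrument recovers the true effect when it is unrelated to omitted factors, whereas an instrument that is itself correlated with omitted area-level factors can yield an IV estimate larger than the single equation estimate.

```python
import random


def covariance(a, b):
    """Sample covariance of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    return sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b)) / (n - 1)


def single_equation_slope(x, y):
    """Bivariate least squares slope of y on x."""
    return covariance(x, y) / covariance(x, x)


def iv_slope(z, x, y):
    """Just-identified IV (Wald) slope using z as the instrument for x."""
    return covariance(z, y) / covariance(z, x)


def simulate(area_omitted_effect, true_effect=0.5, n_areas=400, per_area=50, seed=0):
    """Hypothetical data in which students sort into schools within areas.

    area_omitted_effect is the direct effect of the area characteristic on the
    outcome (0.0 means the aggregate instrument is valid by construction).
    """
    rng = random.Random(seed)
    z, x, y = [], [], []
    for _ in range(n_areas):
        area = rng.gauss(0, 1)  # aggregate area characteristic (the instrument)
        for _ in range(per_area):
            family = rng.gauss(0, 1)  # unobserved family influence
            peer = area + family + rng.gauss(0, 1)  # sorting ties peers to family
            outcome = (true_effect * peer + 2.0 * family
                       + area_omitted_effect * area + rng.gauss(0, 1))
            z.append(area)
            x.append(peer)
            y.append(outcome)
    return single_equation_slope(x, y), iv_slope(z, x, y)


# Valid instrument: the single equation estimate is biased upward, IV is near 0.5.
ols_valid, iv_valid = simulate(area_omitted_effect=0.0)
# Instrument correlated with omitted area factors: the IV estimate lies further
# from 0.5 than the single equation estimate does.
ols_invalid, iv_invalid = simulate(area_omitted_effect=2.0)
print(ols_valid, iv_valid, ols_invalid, iv_invalid)
```

The second call is the case of interest here: nothing in the IV formula itself signals that the aggregate instrument is invalid, which is why the comparison with single equation estimates carries the argument.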
2. Data
The High School and Beyond Longitudinal Survey (HSB) is an ideal data set with which to investigate the
influence of peer group characteristics on academic and social outcomes. Approximately 24,000 non-Hispanic
Blacks and Whites were first interviewed in 1980 when they were high school sophomores. The base year data
contain information on family and student backgrounds. Students also completed a battery of standardized tests
as a part of the interview. Follow-up surveys were conducted in 1982, 1984, 1986, and 1992, providing twelve
years of information on schooling, employment and fertility as well as a second battery of standardized tests
completed during the 1982 first follow-up. Only women are included in this study.
Four high school outcomes are examined: (1) 12th grade test score; (2) teen fertility; (3) high school
continuation; and (4) nonparticipation in school or work in the autumn following graduation. The 12th grade test
score is a linear combination of mathematics and reading scores, in which the mathematics score is weighted three
times as heavily as the reading score. The weights were determined in a regression of high school continuation
on these two test scores and other background variables; consequently, the composite score reflects the relative
importance of mathematics and reading scores in predicting school attainment. Teen mother is a binary
outcome equal to one if a woman has a baby prior to February of her senior year in high school. High school
continuation is a binary outcome equal to one if a student does not leave high school prior to graduation as of
February of the senior year. Nonparticipation is also a binary outcome, equal to one if the total of months worked or
in school between August 1 and December 31 following high school graduation (the period corresponding to the
first college semester) is less than three.⁵
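As a minimal sketch of how two of these outcome variables might be coded: the function names, the monthly record layout, and the unnormalized 3:1 weights below are assumptions made for illustration. The text specifies only the relative weighting of the two test scores and the three-month threshold.

```python
def composite_score(math_score, reading_score, math_weight=3.0, reading_weight=1.0):
    """Linear combination with mathematics weighted three times as heavily as reading.

    Only the 3:1 ratio comes from the text; the scale of the weights is assumed.
    """
    return math_weight * math_score + reading_weight * reading_score


def is_nonparticipant(months, threshold=3):
    """Return 1 if fewer than `threshold` months were spent working or in school.

    `months` holds one (worked, in_school) pair of booleans for each month from
    August through December after graduation. Counting a month once when the
    student both worked and attended school is an assumption.
    """
    active_months = sum(1 for worked, in_school in months if worked or in_school)
    return int(active_months < threshold)


# Hypothetical student: worked in August only, enrolled in September only.
example_months = [(True, False), (False, True), (False, False), (False, False), (False, False)]
print(composite_score(50.0, 40.0), is_nonparticipant(example_months))
```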
Together these outcomes provide considerable information on social and labor market development.
Administrative data provide information on the high schools, including the percentage of students in the school
classified as economically disadvantaged.
Because family income information is used in determining eligibility for the school lunch program, the percentage
of students who are economically disadvantaged is a commonly available variable that is often used in
empirical research. Yet other than reasons of data availability, there is no compelling reason to use percent
disadvantaged as opposed to an alternative measure of peer group family background. One appealing alternative is
parental education, which tends to be a much better predictor of academic outcomes than income. Because
information on father’s education is missing for many students in the HSB, the average education of
schoolmates’ mothers is used as a second peer group measure.⁶
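A peer measure of this kind can be sketched as a leave-one-out school mean: each student is assigned the average mother's education among the other sampled students in the same school. The record layout, the function name, and the treatment of missing reports (dropped from the average) are assumptions for illustration, not the documented procedure.

```python
from collections import defaultdict


def schoolmate_mother_education(records):
    """Leave-one-out mean of mother's education within each school.

    `records` is a list of (school_id, mother_education) pairs, where
    mother_education is years of schooling or None if unreported.
    Returns one value per record, or None when no schoolmate reports a value.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for school_id, educ in records:
        if educ is not None:
            totals[school_id] += educ
            counts[school_id] += 1

    peer_means = []
    for school_id, educ in records:
        total, count = totals[school_id], counts[school_id]
        if educ is not None:  # remove the student's own report
            total -= educ
            count -= 1
        peer_means.append(total / count if count > 0 else None)
    return peer_means


sample = [("A", 12), ("A", 16), ("A", None), ("B", 10)]
print(schoolmate_mother_education(sample))
```

Excluding the student's own report keeps the peer measure from mechanically reflecting her own family background.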
Individual, family, and community characteristics are included as controls. The individual background characteristics
include gender and race dummies and a standardized pretest score. The family background measures
are parental schooling, family income, and dummy variables indicating that family income is missing and that
the students did not know their mother’s or father’s education. Region and community type dummy variables are
also included in all specifications, while the in-state tuition at the public university is included in the
nonparticipation specifications.
⁵ The use of a five month period corresponding to the fall semester of college rather than a single week to evaluate
participation has the advantage of ignoring brief transitions. In this taxonomy, nonparticipants demonstrate a lack of
attachment to both school and the labor market for a substantial time period.
⁶ This variable is constructed from information on other schoolmates sampled in the High School and Beyond Survey.
Despite the stratified sample design, the sampling of students within schools is random. However, the small number of
students (roughly 5 percent) who do not report mother’s education are likely to be a non-random group. This may
introduce a small amount of bias into the coefficient estimates.

Two sets of community characteristics are used as instruments. The first includes the four variables used by
Evans et al. (1992): the unemployment, college completion and poverty rates and median family income.⁷
The second set includes the male labor force nonparticipation rate and the female college completion rate.⁸
The nonparticipation rate was chosen because, unlike median family income and the poverty rate, it does not confound
geographic variation in the cost of living with real differences in economic activity and resources. The
unemployment rate is not used because prior evidence suggests that school continuation, employment and perhaps even
fertility decisions are affected by labor market conditions.⁹ In fact the local unemployment rate is included as an
explanatory variable in the probit specifications that use the second set of instruments.
Because HSB provides little information on community environment, the community characteristics were
taken from the 1980 US Census Public Use Microdata A Sample (US Department of Commerce, 1980).¹⁰
Two levels of aggregation are used to define communities.
The first is county groups as defined in the Census microdata. Some county groups comprise a number of
actual counties such as those located in rural areas, while others are composed of a single city or several
communities which are a part of a single county. The second definition of community is the standard
metropolitan statistical area, also as defined by the Census.
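For illustration, a community characteristic such as the male labor force nonparticipation rate could be computed from person-level census records along the following lines. The field names and the unweighted tabulation are assumptions; the age range of 20 to 49 and the separate calculation by race follow the footnoted variable description.

```python
from collections import defaultdict


def male_nonparticipation_rates(persons, min_age=20, max_age=49):
    """Share of men aged min_age..max_age who are out of the labor force,
    computed separately for each (community, race) cell.

    `persons` is an iterable of dicts with hypothetical keys:
    community, race, sex, age, in_labor_force.
    """
    out_of_labor_force = defaultdict(int)
    totals = defaultdict(int)
    for person in persons:
        if person["sex"] != "M" or not (min_age <= person["age"] <= max_age):
            continue
        cell = (person["community"], person["race"])
        totals[cell] += 1
        if not person["in_labor_force"]:
            out_of_labor_force[cell] += 1
    return {cell: out_of_labor_force[cell] / totals[cell] for cell in totals}


people = [
    {"community": "cg1", "race": "White", "sex": "M", "age": 30, "in_labor_force": True},
    {"community": "cg1", "race": "White", "sex": "M", "age": 45, "in_labor_force": False},
    {"community": "cg1", "race": "White", "sex": "M", "age": 60, "in_labor_force": False},
    {"community": "cg1", "race": "Black", "sex": "M", "age": 25, "in_labor_force": True},
    {"community": "cg1", "race": "White", "sex": "F", "age": 30, "in_labor_force": False},
]
print(male_nonparticipation_rates(people))
```

The same tabulation could be keyed on county group or on metropolitan area, matching the two community definitions.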
The analyses of test scores, teen fertility, high school continuation, and post-secondary nonparticipation use
different waves of the HSB survey. The first follow-up survey is used in the analyses of test scores, teen fertility,
and high school continuation in order to take advantage of the much larger first follow-up sample size. That
explains why teen fertility and high school continuation are examined as of February of the senior year.
Post-secondary nonparticipation is studied with a sample taken from the second follow-up survey.
⁷ Based on the variable descriptions of Evans et al. (1992), the college completion rate is computed over adults 23 to 64
years old, and the unemployment rate is computed over adults 19 to 64 years old.
⁸ These two community characteristics are computed over individuals 20–49 years old. Older residents are excluded
because the decisions of younger residents are more likely to be influenced by cohorts closer to their own age. Separate
calculations are performed by race.
⁹ See Rivkin (1995) for a discussion of labor market effects on schooling and employment decisions.
¹⁰ The information on university tuition is taken from Peterson’s Guide to Four Year Colleges (1983). See Rivkin (1995)
for a description of the community variables and the linking of schools to county groups in the HSB.
3. Empirical model