

Jacob L. Vigdor

Charles T. Clotfelter

Abstract

Using data on applicants to three selective universities, we analyze a college applicant's decision to retake the SAT. We model this decision as an optimal search problem, and use the model to assess the impact of college admissions policies on retaking behavior. The most common test score ranking policy, which utilizes only the highest of all submitted scores, provides large incentives to retake the test. This places certain applicants at a disadvantage: those with high test-taking costs, those attaching low values to college admission, and those with ‘‘pessimistic’’ prior beliefs regarding their own ability.

I. Introduction

As the nation's premier college entrance exam, the SAT holds an undeniably important role in who gets into college, particularly into the most selective colleges. Yet it has been the subject of intense scrutiny in recent years,1 with a growing list of colleges and the University of California system having made or proposed to make the test an optional admissions requirement.2

Jacob L. Vigdor is an assistant professor of public policy studies and economics at Duke University, Box 90245, Durham, NC 27708, e-mail: jvigdor@pps.duke.edu. Charles T. Clotfelter is Z. Smith Reynolds Professor of public policy studies, economics, and law at Duke University and National Bureau of Economic Research, Box 90245, Durham, NC 27708, e-mail: cltfltr@pps.duke.edu. The authors are grateful to Christopher Avery, Charles Brown, Philip Cook, Helen Ladd, three anonymous referees, and seminar participants at Duke, Vanderbilt, the APPAM 2001 fall conference, the 2002 AEA meetings, Chicago GSB, and the NBER higher education working group for helpful comments, to Gary Barnes for assistance in obtaining the data, and to Robert Malme and Margaret Lieberman for research assistance.

[Submitted March 2002; accepted May 2002]

ISSN 022-166X2003 by the Board of Regents of the University of Wisconsin System

1. The test is officially referred to as the SAT I. Formerly known as the Scholastic Aptitude Test, the exam is now named for its former acronym. We will refer to the exam as the SAT in this paper.
2. See Lemann (1999) and Schemo (2001).



Table 1
Frequency of Retaking the SAT, All U.S. Test Takers and Applicants to Three Universities

                              U.S., 1997             Applicants to Three
                              Graduating Class^a     Universities for Fall 1998^b
Number of students            1,119,984              22,678
Percentage who took SAT I
  Once                        50.7                   17.8
  Twice                       38.1                   48.7
  Three times                 9.6                    27.0
  Four times                  1.4                    —
  Four or more times          —                      6.5
  Five times                  0.2                    —
  Total                       100.0                  100.0

a. Based on students who took the SAT I one to five times in their junior or senior years. Source: College Board, Handbook for the SAT Program 1999–2000 (1999), Table 6b.

b. Source: College Board and unpublished admissions data on 1998 applicants to three universities, authors' calculations.

Critics of the test cite various kinds of bias and decry the test as an inappropriate apparatus for selecting a ruling class. ‘‘The Big Test,’’ as Lemann (1999) terms it, has drawn a significant amount of attention from academic researchers. Several studies have attempted to explain variation in SAT scores across states (Graham and Husted 1993) or individual test takers (Dynarski 1985). SAT scores have frequently been used as an outcome measure in evaluating characteristics of school systems (Dynarski and Gleason 1993; Southwick and Gill 1997; Card and Payne 1998; see Hanushek and Taylor 1990, for a critique of this strategy) or a measure of ability (Ballou and Podgursky 1995). Additional research has investigated the predictive relationship between SAT scores and college outcomes (Boldt, Centra, and Courtney 1986; Bowen and Bok 1998; Rothstein 2002).

Despite this considerable controversy and analysis, very little attention has been paid to test-taking behavior itself. One important component of this behavior is the tendency for many college applicants to take the test multiple times. Nationwide, roughly half of those who take the SAT do so more than once, and the rate of retaking appears to be even higher for students applying to selective institutions, as indicated by the calculated rates of retaking among applicants to three selective universities, presented in Table 1. If these students are retaking the SAT in hopes of improving their standing with college admissions offices, these hopes are often fulfilled, owing to two factors. The first is the widespread policy stated by college admissions offices to use only the highest score (actually, the sum of the highest verbal and the highest math score, even if these scores were obtained on different dates) for purposes of ranking applicants, ignoring the scores from all other attempts. Although this policy



is not used universally, it is by far the most common among institutions that publicly reveal their policies.3

The second reason why retaking the SAT often pays off is the actual tendency for test takers to score higher when they retake the test. This tendency, revealed both in our data on applicants to three colleges and in nationwide College Board data, could theoretically be attributable to selection into the pool of retakers. We present evidence below to suggest that the gains associated with retaking the test are too large to be attributed to selection alone, and thus reflect benefits associated with familiarity or increases in knowledge between administrations.

From the standpoint of public policy, the subject of retaking is worth exploring for both equity and efficiency reasons. First, retaking is important to the extent that, in combination with the highest-score policy noted above, it affects admissions outcomes. It seems intuitive that the current highest-score policy provides an advantage to applicants with low costs of taking the test. Our results confirm this intuition. In light of the growing scrutiny of race-conscious college admissions criteria within the larger national debate over affirmative action, it is important to ask whether the ‘‘high-cost’’ applicants now disadvantaged by current policy are drawn selectively from certain groups—including racial minorities and the poor. If so, then the current policy almost surely warrants scrutiny.

Another reason to study retaking is allocative efficiency: such activity employs resources that have valuable alternative uses, and our results suggest that the current highest-score policy strongly encourages retaking.4 If colleges could obtain much the same information by following other policies that result in less test-taking, the current highest-score policy could justifiably be faulted as inefficient. It might be

3. We examined the admissions websites of the 50 top-ranked universities and the 50 top-ranked liberal arts colleges, according to U.S. News and World Report. Of fourteen that made a statement about how multiple SAT scores are treated in admissions decisions, ten explicitly stated the highest-score policy as described in the text, three specified the highest combined score at one sitting, and one said ‘‘primary consideration’’ would be given to the highest individual scores. Eight of the sampled colleges do not require the SAT, and the remainder stated no explicit policy on multiple scores.

The possibility exists, of course, that admissions committees, contrary to their stated policies, in practice make some adjustment in the case of applicants who have taken the test many times. Christopher Avery has shown us unpublished evidence that applicants who take more than one SAT are less likely to be admitted to selective colleges, controlling for their highest math and verbal scores and certain other characteristics. This evidence might indicate that multiple takers are penalized, or that multiple test taking is correlated with negative factors observable to admissions officers but not econometricians. In any event, the magnitude of the penalty Avery observes is not sufficient to offset the benefits of retaking revealed below. The informal conversations we have had with admissions officers lead us to believe that most selective colleges do indeed follow the stated policy of using the highest math and verbal scores.

4. Consider the following rough estimate of the costs associated with retaking. The College Board reports that 1.3 million applicants seeking college admission in 2001 took the SAT, and the average applicant took the test 1.7 times. These figures imply that about one million unnecessary tests were administered to the high school class of 2001. Costs associated with each of these include the basic fee of $25, the value of time spent taking the test, and the disutility associated with the act of test-taking. Valuing the four hours of test-taking time at the Federal minimum wage, and setting the psychic costs equal to direct and opportunity costs, we arrive at an estimate of $90 million per year. A similar calculation can be performed for applicants taking the ACT rather than the SAT.
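Spelling the footnote's arithmetic out (the wage figure is not stated there; $5.15 per hour, the federal minimum wage then in effect, is assumed here):

$$
\underbrace{1.3\text{M} \times (1.7 - 1)}_{\approx\,1\text{ million extra tests}} \times \Big(\underbrace{\$25}_{\text{fee}} + \underbrace{4 \times \$5.15}_{\text{time}} + \underbrace{\$45.60}_{\text{psychic cost, set equal to fee plus time}}\Big) \approx \$90\text{ million per year.}
$$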



more efficient, for example, if colleges were to use the average of an applicant’s SAT scores instead of the highest, if such a policy reduced the amount of retaking without significant loss of information.5

This paper examines retaking and its consequences using data on the undergraduate applicants to three selective research universities. Section II describes the data and Section III examines the characteristics of those applicants who retake the SAT. We are especially interested in finding out whether the tendency to retake the SAT differs by gender, race, or socioeconomic status. The fourth section of the paper discusses the reasons why test scores tend to increase upon retaking. Section V discusses a model of retaking. The applicant's problem is analogous to one of optimal search: additional draws from a distribution of possible test scores can be had for a certain cost, and the applicant must decide whether the expected benefits of retaking exceed this cost. The sixth section reports the results of simulations that investigate the impact of college test score ranking policies on the frequency of retaking. The simulations are calibrated to match observed behavior under current policy. Alternative test score ranking policies are compared along four criteria: accuracy (does ranking reflect true ability?), precision (are the ranking errors small?), bias (does the policy disproportionately favor certain groups?), and resource cost (how costly is the policy in terms of time and money spent taking the test?). Of nine policies compared, the current highest-score policy turns out to be the costliest, least accurate, and most biased. (It may serve other interests of colleges, however, as we note in the paper's final section.) Following the simulations, we use our data to determine the impact that one particular policy change, limiting consideration to a student's first SAT only, might have on applicant rankings. Section VII concludes the analysis.

Although the data that we use are very instructive, two of their limitations should be noted at the outset. First, the institutions to which these data apply are not representative of all colleges and universities in the country. Thus, the results should be thought of as applying most to institutions with selective admissions. Second, no information is available in this data set on the potentially important activity of coaching and test preparation courses. If equity concerns are raised by differences between groups in the frequency of retaking, then differences in the access to test preparation courses should also be of concern.6 Given the nature of our data, we are simply unable to address this issue. Moreover, we would emphasize that our analysis is incomplete to the extent that it focuses on one aspect of behavior—retaking—but not on other aspects that might be involved as individuals respond to incentives created by colleges' admissions policies. Two such aspects are decisions about when to take the test and what kinds of preparation to make before taking the test. We return to this issue in Section VII.

5. As discussed in the concluding section, efficiency gains from changing the test score ranking policy may be reduced or reversed if applicants respond to the policy change by substituting into costlier forms of securing advantages in college admissions. Moreover, we may overstate the efficiency losses associated with retaking if applicants actually increase their human capital by learning something in the process.
6. Powers and Rock (1999) examine coaching and the claims made by some companies providing such services. Although coaching increases scores less than what is suggested in some of the claims, it does appear to lead to some improvement in scores. In addition, those receiving coaching tended to have higher incomes than those who did not.



II. Data

The data used in the present analysis are based on first-year undergraduate applicants to three research universities in the South, two public and one private. All three are selective institutions. For the class enrolling in the fall of 1998, the three institutions accepted an average of 42 percent of their applicants, and their average yield rate was 50 percent.7 All three require the SAT I as part of the complete application. For each applicant, information was obtained from the Educational Testing Service (ETS) from its Student Descriptive Questionnaire (SDQ), which is filled out by students taking the SAT. This questionnaire provides the applicant's race, gender, residence, high school academic performance, and self-assessed ability as well as information on the income and education of the applicant's parents. The ETS also provided a complete history of SAT scores, regardless of whether the student reported all those scores to the institution. These data were matched with information from the college applications.

The resulting sample included 22,678 students who applied to at least one of the three institutions for the fall of 1998 and who also took the SAT at least once.8 Of these, more than 82 percent took the SAT at least twice, compared to the 49 percent who were multiple test-takers nationwide, as shown in Table 1. This large difference most likely reflects the comparatively selective character of the institutions in the current sample and therefore the possibly more competitive nature of the applicants to those institutions. The differences may be further influenced by the fact that relatively few applicants in the South take the ACT, as compared to the Midwest and West, where many applicants might conceivably be taking the SAT only once. Whatever the cause, this difference should be noted.

III. Who Retakes the SAT?

Table 2 presents distributions showing what kinds of students in our sample most often took the SAT multiple times. Retaking was significantly more common among students who received lower scores on the first test administration. By gender, women were more likely to take the test more than once; whereas 20 percent of men took the test only once, only 16 percent of women did. Most of this difference is reflected in the percentages who took the test three times. By race, blacks were somewhat more likely to take the test multiple times (83.5 versus 82.2 percent). Hispanic applicants were about as likely as blacks to take the test more than once but less likely than blacks or whites to take it more than twice. The most distinctive racial group was Asian Americans, who exceeded all other groups in their rate of retaking. Whereas 32.5 percent of whites took the SAT three or more times, 42.6 percent of Asian Americans did. With respect to parents' income, no very clear

7. Data from Peterson's Guide to 4 Year Colleges, 30th Edition (2000).

8. Excluded from the sample are 1,664 applicants we could not match to neighborhood demographic data on the basis of their reported zip code. Inclusion of these applicants, where feasible, does not substantially affect any of the results presented here.



Table 2
Number of Times Taking SAT, by Selected Characteristics

                                              Number of    Percentage, by number of times taking SAT
                                              Applicants       1       2       3    4 or more
All test takers                                 22,678       17.8    48.7    27.0      6.5
Combined score on initial test
  1500 and above                                   661       83.1    15.0     2.0      0.0
  1300–1490                                      5,870       29.0    54.2    15.4      1.5
  1100–1290                                      9,084       12.2    51.8    30.2      5.8
  1090 or below                                  7,063        9.7    43.1    34.8     12.4
Gender
  Male                                          10,286       20.0    48.7    25.0      6.3
  Female                                        12,392       16.0    48.6    28.6      6.8
Race
  White                                         16,935       18.2    49.3    26.4      6.1
  Black                                          2,358       16.5    49.0    27.0      7.5
  Native American                                  147       17.7    47.6    27.2      7.5
  Asian American                                 2,096       15.1    42.4    32.9      9.7
  Hispanic                                         642       16.4    53.3    25.4      5.0
  Other                                            500       24.4    47.0    22.4      6.2
Approximate parents' income
  Less than $40,000                              3,561       17.9    49.0    26.5      6.7
  $40–$60,000                                    3,404       15.9    48.9    27.9      7.3
  $60–$80,000                                    3,396       15.6    48.9    27.9      7.5
  $80–$100,000                                   2,530       17.2    48.8    28.3      5.7
  More than $100,000                             5,675       19.4    48.2    26.4      6.0
  Unknown, not reported                          4,112       19.5    48.5    25.8      6.2
Self-reported class rank
  Top 10%                                        9,967       19.6    47.3    26.6      6.5
  11% to 40%                                     8,084       15.8    50.1    27.4      6.7
  Bottom 60%                                     1,229       16.2    49.3    28.5      6.0
  Unknown, not reported                          3,398       18.1    48.9    26.6      6.4
Self-reported math ability
  Highest 10%                                    9,168       21.4    47.4    25.3      5.9
  Above average, not top 10%                     8,801       14.9    50.4    28.1      6.6
  Average or below average                       2,776       14.3    49.3    29.0      7.3
  Unknown, not reported                          1,933       18.9    45.7    27.0      8.4
Self-reported writing ability
  Highest 10%                                    7,148       22.2    48.6    23.6      5.6
  Above average, not top 10%                     9,567       15.8    49.0    28.7      6.5
  Average or below average                       4,000       14.3    48.5    28.9      7.4
  Unknown, not reported                          1,963       18.9    45.8    26.9      8.4
Average income of home ZIP code
  Less than 20% of households $50,000 or more    9,601       15.2    46.5    29.4      8.8
  20–100% of households $50,000 or more         13,077       19.7    50.2    25.2      4.9
Urbanization of home ZIP code
  Less than 80% urban                            7,812       15.3    47.6    28.8      8.3
  80–100% urban                                 14,866       19.2    49.2    26.0      5.6
Percentage black of home ZIP code
  Less than 20% black                           17,860       18.6    49.6    26.0      5.7
  20–100% black                                  4,818       14.9    45.1    30.5      9.6

Note: Based on those who took the SAT at least once and graduated from high school in 1998. Row percentages may not add up to 100.0 due to rounding. Source: College Board and unpublished data on 1998 applicants to three universities, authors' calculations.

patterns emerge.9 Retaking was slightly more prevalent among applicants from families in the middle income categories, but the differences were not large. By contrast, the patterns of retaking differed markedly according to the student's reported class rank. Those ranked in the top 10 percent of their high school class were least likely to take the test more than once. Applicants who ranked themselves among the highest 10 percent of students in either math or writing ability were significantly less likely to retake the test than those applicants with lower self-rankings.

The last three sets of categories shown in Table 2 apply to characteristics of the ZIP code where the student resided. Perhaps surprisingly, those in more affluent areas and in highly urbanized areas were less likely than others to take the test multiple times. With respect to the racial composition of ZIP codes, those living in areas with higher percentages of blacks were more likely to retake the SAT.

To summarize, Table 2 identifies several groups that were more likely than others to take the SAT multiple times: those with low initial scores, women, Asian Americans, those who rate themselves as average or below in ability, and those who live in less affluent, rural, or predominantly black neighborhoods. On their face, these simple correlations seem to dispel any notion that retaking is the exclusive or even preponderant domain of the affluent or urbanized.

For a fuller answer to the question of who retakes the test, it is necessary to examine the partial effects of various characteristics, holding other things constant. Our model, described in Section V below, suggests that there are three basic reasons why two individuals with the same initial test scores might be differentially likely to retake the test. First, individuals might have different expectations regarding the scores they would receive on the next test. Second, they may face different direct and indirect costs of retaking the test. Finally, they may attach different values to

9. In contrast, Boldt, Centra, and Courtney (1986, p. 4), using data for 87 colleges, found the highest rates of retaking among whites and upper-income applicants. This difference in findings may be attributable to the differences in selectivity in the institutions studied.



Table 3
Probit Equations Explaining Taking SAT at Least Two, Three, or Four Times

Dependent variable: indicator for whether the applicant takes the nth test, conditional on having taken n − 1 tests.

Variables                               n = 2               n = 3               n = 4
Previous SAT math score             −0.005* (0.0002)    −0.004* (0.0002)    −0.003* (0.0003)
Previous SAT verbal score           −0.004* (0.0002)    −0.002* (0.0002)    −0.002* (0.0003)
Female                               0.038 (0.023)       0.058* (0.021)     −0.051 (0.036)
Family income
  Less than $40,000                  (excluded)          (excluded)          (excluded)
  $40,000–$60,000                    0.035 (0.043)       0.060 (0.037)       0.060 (0.062)
  $60,000–$80,000                    0.077 (0.044)       0.085 (0.038)       0.083 (0.064)
  $80,000–$100,000                   0.083 (0.048)       0.112* (0.042)     −0.042 (0.072)
  More than $100,000                 0.078 (0.043)       0.181* (0.037)      0.093 (0.064)
  Unknown, not reported              0.082 (0.049)       0.103 (0.044)       0.044 (0.079)
Father's education
  Up to high school graduate         (excluded)          (excluded)          (excluded)
  Some college                       0.045 (0.048)       0.110* (0.041)      0.082 (0.073)
  College graduate                   0.142* (0.047)      0.150* (0.041)      0.216* (0.071)
  Professional degree                0.139* (0.049)      0.156* (0.043)      0.305* (0.076)
  Unknown, not reported              0.129 (0.100)       0.051 (0.086)       0.018 (0.157)
Mother's education
  Up to high school graduate         (excluded)          (excluded)          (excluded)
  Some college                       0.001 (0.043)       0.065 (0.037)       0.026 (0.065)
  College graduate                   0.110 (0.044)       0.126* (0.038)      0.015 (0.065)
  Professional degree                0.111 (0.047)       0.071 (0.041)       0.055 (0.072)
  Unknown, not reported             −0.057 (0.112)       0.175 (0.099)       0.171 (0.169)
Class rank
  Top 10%                            (excluded)          (excluded)          (excluded)
  11 to 40%                         −0.187* (0.029)     −0.168* (0.026)     −0.148* (0.044)
  Bottom 60%                        −0.371* (0.060)     −0.245* (0.051)     −0.280* (0.090)
  Unknown, not reported             −0.103 (0.043)      −0.114* (0.040)     −0.117 (0.072)
Self-reported math ability
  Among highest 10%                  (excluded)          (excluded)          (excluded)
  Above average, not in top 10%     −0.032 (0.029)      −0.065 (0.026)      −0.063 (0.046)
  Average or below average          −0.274* (0.049)     −0.162* (0.041)     −0.103 (0.070)
  Unknown, not reported              0.075 (0.183)       0.028 (0.152)       0.075 (0.215)
Self-reported writing ability
  Among highest 10%                  (excluded)          (excluded)          (excluded)
  Above average, not in top 10%      0.017 (0.028)      −0.015 (0.026)      −0.116 (0.045)
  Average or below average          −0.126* (0.040)     −0.154* (0.035)     −0.209* (0.060)
  Unknown, not reported              0.055 (0.174)       0.064 (0.146)       0.110 (0.206)
Race
  White                              (excluded)          (excluded)          (excluded)
  Native American                   −0.093 (0.147)      −0.163 (0.127)      −0.150 (0.212)
  Asian American                     0.334* (0.042)      0.402* (0.035)      0.358* (0.056)
  Hispanic                           0.071 (0.069)      −0.045 (0.063)       0.184 (0.115)
  Other                              0.011 (0.070)       0.141 (0.072)       0.283 (0.125)
Percent black in home ZIP code       0.002 (0.001)       0.001 (0.001)       0.001 (0.001)
Percent Hispanic in home ZIP code    0.002 (0.002)       0.003 (0.001)      −0.002 (0.002)
Percent urban in home ZIP code       0.002* (0.0004)     0.002* (0.0004)     0.002* (0.001)
Percent of households with income
  more than $75,000 in home ZIP code 0.003* (0.001)     −0.0001 (0.0008)    −0.004 (0.002)
First SAT in Fall 1995               4.107* (0.259)      1.611* (0.418)      0.303 (0.652)
First SAT in Spring 1996             4.319* (0.230)      1.418* (0.414)      0.209 (0.650)
First SAT in Fall 1996               3.652* (0.213)      0.858 (0.412)      −0.375 (0.650)
First SAT in Spring 1997             3.103* (0.209)      0.057 (0.412)      −0.857 (0.649)
First SAT in Fall 1997               1.446* (0.210)     −0.935 (0.417)      −0.745 (0.672)
First SAT in Spring 1998             (excluded)          (excluded)          (excluded)

Sample size                          22,678              18,638              7,631
Log likelihood                      −8,006.6            −10,548.6           −3,418.0
Pseudo R²                            0.2465              0.1636              0.1368

Notes: Table entries are probit coefficients. Standard errors are in parentheses. * denotes coefficients significant at the 1 percent level.

being admitted to a selective college. To examine the basic relationships between observed applicant characteristics and these underlying traits, Table 3 presents a series of probit equations explaining the decision to take the test twice, the decision to take the test a third time conditional on two administrations, and the decision to take it a fourth time conditional on three administrations. Included as explanatory variables are scores from the preceding test administration, gender, family income, father's and mother's education, self-reported class rank, self-reported math and writing ability, race, and characteristics of the student's ZIP code area. Finally, the equations include dummy variables for the date of initial test administration.

In sharp contrast to the impression given by the simple distributions shown above, the effect of holding constant previous scores and other variables is to reveal that retaking is indeed associated with greater affluence and parental education, among other things. The equations make clear, first of all, that retaking is strongly associated with scores on the previous SAT, those scoring high being less likely to retake the test. For example, an increase of 50 points on both the math and verbal scores on



the initial SAT test is associated with a decrease of 8 percentage points in the probability of taking it a second time.10

Holding constant previous scores, family income now has a much clearer association than it appears to have in Table 2. Those whose parents made more than $60,000 had a 1.5 percentage point higher probability of retaking the test than those whose family incomes were below $40,000. Conditional on taking twice, applicants from these higher-income families were between 3.3 and 7 percentage points more likely to take the test a third time. Income does not significantly influence whether a three-time taker returns for a fourth test administration. Similarly, both mother's and father's education have a statistically significant effect, with those whose parents were college graduates more likely to retake the test, other things equal. As with income, the strongest effects of parental education appear in the decision to take the test a third time. Two-time takers whose fathers obtained a college degree were 6 percentage points more likely to retake the test than otherwise identical applicants with a high-school-educated father. The marginal effect of mother's education is of somewhat smaller, though still significant, magnitude.

High school rank also has a significant association, with those in the top 10 percent being most likely to retake the test a second, third, or fourth time, with all other characteristics (including prior test scores) held constant. Similarly, those who assigned relatively low ratings to their own math and writing ability were less likely to retake the test than those putting themselves in the top 10 percent. As with family income and parental education, the impact of class rank and self-assessment appears strongest in the decision to take the test a third time. In that case, either a class rank outside the top 10 percent, a low math self-assessment, or a low verbal self-assessment predicts a 6 to 9 percentage point decrease in the probability of retaking.

By race, blacks were less likely than whites to take the test two or three times. Other things equal, the probability of a black student taking the test at least twice was 4.5 percentage points less than that of an otherwise identical white student; conditional on taking the test twice, the black-white differential in the probability of taking the test a third time was 5.9 percentage points.11 Asian Americans, on the other hand, were consistently more likely to retake the test: they were 5.5 percentage points more likely than an otherwise identical white student to take the test a second time. Conditional on taking twice, Asian applicants were 15.9 percentage points more likely to take the test a third time relative to otherwise identical white applicants. The test-taking behavior of other racial or ethnic groups is not distinguishable from that of whites.

Applicants living in urban areas were more likely to retake the test a second, third or fourth time. As with many indicators, the strongest effect of urbanicity on retaking

10. Table 3 reports the actual probit coefficients, which have no natural interpretation. Our interpretation of the estimated effects in the text considers the impact of a unit change in one variable when all other variables are set equal to their means. In general, the effect of a variable on the probability of retaking will depend on the values of all variables in a probit equation.
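Stated generally, the calculation described in footnote 10 amounts to evaluating the following difference (this is a restatement of the footnote, not an additional estimate; β̂ and x̄ denote the fitted coefficients and sample means):

$$
\Delta P \;=\; \Phi\big(\bar{x}'\hat{\beta} + \hat{\beta}_{\text{math}}\,\Delta_{\text{math}} + \hat{\beta}_{\text{verbal}}\,\Delta_{\text{verbal}}\big) \;-\; \Phi\big(\bar{x}'\hat{\beta}\big),
$$

so the 8-percentage-point figure in the text corresponds to this difference with Δ_math = Δ_verbal = 50 and the n = 2 coefficients from Table 3.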

11. This finding relates to Bowen and Bok’s (1998) result that controlling for other factors, SAT scores are less predictive of college class rank for blacks than for whites. Controlling for other factors, blacks retake the test less frequently than whites, implying that their reported SAT scores will be noisier point estimates of their true ability. Greater measurement error should lead, in turn, to poorer predictive power when SAT scores are used as an explanatory variable.



occurs in the decision to take the test a third time, where an applicant from a completely urban ZIP code was 8 percentage points more likely to retake the test than an applicant from a completely rural ZIP code. Other ZIP code characteristics, such as racial composition and income, do not display a consistent relationship with retaking.

Finally, those who took their first SAT early were generally more likely than others to retake the test. This result is not surprising, since those applicants who initially took the SAT on a late date would simply not have had many chances to retake the exam.

In conclusion, the empirical analysis of who retakes the SAT indicates significant differences by race, income, parental education, self-reported class rank and ability, and type of community. Most of these relationships are obscured in the raw data, presumably by the extremely strong tendency for students who score well on the test to refrain from taking it again. These explanatory variables might measure differential expectations regarding future test scores, variation in test-taking costs, or variation in the benefits associated with admission. Many of the applicant characteristics associated with a lower propensity to retake are also correlated with lower overall SAT scores, suggesting that applicants may form expectations in a manner that resembles statistical discrimination.12 The greatest amount of selection into the pool of retakers appears to occur in the decision to take the test a third time.

IV. Explaining the Increase in Scores

The tendency for SAT scores to increase is evident in the averages presented in Table 4. Using both nationwide data and data from our three-institution sample, the table shows that students taking the test on average improve their scores with each successive administration.13 This tendency applies to both math and verbal tests. Consider, for example, those who took the SAT three times. Among all those in the 1997 national cohort, the average score among this group on the verbal test increased from 493 in the first taking to 515 on the third and from 510 to 537 on the math. Within our sample of applicants to three institutions, the comparable increases were 573 to 602 for the verbal and 555 to 583 for the math. For all those taking the test at least twice, the average increase on the second try in the national sample was 13 points for the verbal and 16 points for math. The comparable increases for the three-university sample were both about 16 points. Both samples show the same thing: retaking the test is associated with higher scores.

What explains these score increases? At least three possible reasons for this tendency suggest themselves. First, improvement might arise because of students' increased familiarity with the SAT test, its format, and the kinds of questions it includes. Second, rising scores may reflect the general increase in knowledge that one

12. A simple regression of first-time SAT scores on the covariates (other than SAT scores) in Table 3 reveals that income, parental education, class rank, self-reported math and writing ability, residence in an urban or wealthy ZIP code, and Asian racial background are all significantly positively correlated with test scores. Black, American Indian, Hispanic, and female applicants receive significantly lower scores on their first test. The R² for this regression, with 22,678 observations, is 0.52.



Table 4
Average SAT Scores for Students, by Number of Times Taking the Test

Panel A: 1997 Graduating Cohort

                                        Number of times taken                       Average score increase
                             1          2          3          4          5          over previous test^c
Number of applicants^a    567,495    426,569    107,870     15,633      2,417
Percentage^b                50.6       38.0        9.6        1.4        0.2
Average score—Verbal
  First test                 492        507        493        468        442
  Second test                           520        504        480        453                13
  Third test                                       515        488        460                11
  Fourth test                                                 499        469                11
  Fifth test                                                             480                11
Average score—Math
  First test                 492        512        510        495        481
  Second test                           528        525        511        496                16
  Third test                                       537        522        507                12
  Fourth test                                                 532        518                10
  Fifth test                                                             526                 8

Panel B: Three-University Sample

                                   Number of times taken                Average score increase
                             1          2          3          4+        over previous test^c
Number of applicants^d     4,040     11,007      6,000      1,631
Percentage^b                17.8       48.7       27.0        6.5
Average score—Verbal
  First test                 649        602        573        538
  Second test                           617        591        562              15.9
  Third test                                       602        575              11.1
  Fourth test                                                 582               7.7
Average score—Math
  First test                 641        589        555        515
  Second test                           606        572        536              16.2
  Third test                                       583        548              10.8
  Fourth test                                                 557               9.7

a. Data are based on 1,119,984 students who took the SAT I one to five times in their junior or senior years.

b. Rows sum to 100.0.

c. Calculated for all those who took the test at least the indicated number of times.

d. Data are based on 22,678 students who graduated in 1998 and took the SAT I one to four or more times from the spring of their sophomore year through their senior year.



expects to correspond to aging and time in school. Third, the increase could arise out of a selection effect, whereby those who had performed badly (relative to their expectations) constituted the bulk of retakers, in which case the improvement in their scores might arise from ordinary regression to the mean. The first two possible causes could be thought of as ‘‘real,’’ as opposed to mere selection.

Determining whether scores truly tend to ‘‘drift upward’’ upon retaking is an important precursor to modeling retaking behavior. To test whether the observed increases in test scores are consistent with selection effects, we employ the two-stage Heckman sample-selection procedure (Heckman 1979).14 The first stage of the procedure consists of probit regressions identical to those reported in Table 3 above, which predict the probability that any particular applicant will enter the sample of retakers. In the second stage, we estimate the following equations for both math and verbal test scores:

(1)  (Predicted Test Score Gain)_ij = β̂_0j + β̂_1j λ̂_i,

where i indexes students, j represents either the verbal or math score, and λ̂_i is an applicant's inverse Mills ratio as estimated by the first-stage probit regression.15 This procedure allows us to predict selection-corrected test score gains for applicants both in and out of sample.16 Table 5 presents estimates of β̂_0j and β̂_1j.

We estimate separate second-stage equations for test score changes from the first to the second, second to third, and third to fourth test administrations; one set each is estimated for the verbal and math parts of the test. For all six equations, the selection coefficient β̂_1j is negative, indicating that, as theory would predict, those individuals most likely to retake the test are those with the highest expected score gains. Estimates of selection into the pool of two- and three-time takers are statistically significant; selection into the pool of four-time takers is not.

To give an idea of the importance of this selection effect on the amount of gains, we use the second-stage Heckman results to calculate simple predicted score changes between the nth and (n+1)th administrations for all individuals who took the nth test, regardless of whether they actually took the (n+1)th. We compare these selection-corrected average test score gains with the observed average score gain, equating the difference in these values with a selection effect.
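For readers who want to see the mechanics, the sketch below implements the two-step procedure in Python. The use of statsmodels, the variable names, and the stand-in covariate matrix are ours rather than the paper's; the first stage mirrors the Table 3 probits and the second stage mirrors Equation 1 and footnotes 15 and 16 below.

```python
import numpy as np
import statsmodels.api as sm
from scipy.stats import norm

def heckman_two_step(X, retook, gain):
    """Two-step selection correction for mean score gains (illustrative sketch).

    X      : (n, k) covariates, standing in for the Table 3 regressors
    retook : (n,) indicator, 1 if the applicant took the next test
    gain   : (n,) observed score change; only entries with retook == 1 are used
    """
    X1 = sm.add_constant(np.asarray(X, dtype=float))

    # First stage: probit for the decision to retake (as in Table 3).
    probit = sm.Probit(np.asarray(retook), X1).fit(disp=0)
    xb = X1 @ np.asarray(probit.params)
    mills = norm.pdf(xb) / norm.cdf(xb)          # inverse Mills ratio (footnote 15)

    # Second stage (Equation 1): regress the gain on an intercept and the
    # inverse Mills ratio, using retakers only (footnote 16).
    sel = np.asarray(retook) == 1
    ols = sm.OLS(np.asarray(gain)[sel], sm.add_constant(mills[sel])).fit()
    beta0, beta1 = ols.params

    # Selection-corrected expected gain, predicted for everyone who took the
    # nth test, whether or not they went on to take the (n+1)th.
    corrected = float(np.mean(beta0 + beta1 * mills))
    observed = float(np.asarray(gain)[sel].mean())
    return observed, corrected, observed - corrected   # last term: selection effect
```

A negative second-stage coefficient on the Mills ratio, as in Table 5, indicates that applicants with a high estimated probability of retaking (and hence a low Mills ratio) tend to have the largest latent score gains.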

14. The Heckman selection correction procedure presumes that the error terms in the sample selection equation (the probit equations reported in Table 3) and the outcome equations (where outcomes here are increases in test scores) follow a bivariate normal distribution, with some correlation between the error terms. In other words, individuals with exceptionally low (latent) increases in test scores might also be exceptionally unlikely to retake the test.

15. The inverse Mills ratio for each observation is computed as

λ̂_i = φ(x_i γ̂) / Φ(x_i γ̂),

where x_i is the vector of characteristics included on the right-hand side of the probit equation, γ̂ is the vector of estimated coefficients, and φ and Φ are the density and cumulative distribution functions of the standard normal distribution, respectively. Relatively low values of the inverse Mills ratio in this case correspond to individuals with a relatively high probability of retaking the test.

16. We restrict the right-hand side of Equation 1 to include only an intercept term and the inverse Mills ratio because we are interested only in obtaining a mean predicted value from this regression, rather than estimating consistent values of other parameters. Adding additional explanatory variables would change individual predicted values, but would not influence the mean predicted value.



Table 5
Explaining Improvement in SAT Scores and the Role of Selection for Those Taking SAT at Least Two, Three, or Four Times

Dependent variable: change in score from the (n − 1)th to the nth test

                                                       n = 2        n = 3        n = 4
Math
  Estimates of Heckman second-stage parameters
    β̂_0j                                              23.2*        21.8*        20.0
                                                      (0.652)      (1.73)       (8.60)
    β̂_1j                                             −29.4*       −14.1*       −8.09
                                                      (2.63)       (2.52)       (8.39)
  Decomposition of test score changes
    Observed score gain                                16.2         10.8         10.5
    Expected score gain, corrected for selection       14.0          7.6          8.0
    Difference (due to selection)                       2.2          3.2          2.5
Verbal
  Estimates of Heckman second-stage parameters
    β̂_0j                                              23.6*        19.7*        13.8
                                                      (0.657)      (1.69)       (8.15)
    β̂_1j                                             −32.7*       −11.1*       −5.19
                                                      (2.66)       (2.45)       (7.94)
  Decomposition of test score changes
    Observed score gain                                15.9         11.1          7.7
    Expected score gain, corrected for selection       13.4          8.6          6.1
    Difference (due to selection)                       2.5          2.5          1.6

N (first stage)                                       22,678       18,638        7,631

Note: Standard errors in parentheses.

* denotes coefficients significant at the 1 percent level.

Consistent with the negative point estimates of β̂_1j, at least part of observed test score gains can be attributed to selection into the pool of retakers in all six cases. This procedure also shows, however, that most of the observed gain in test scores associated with retaking cannot be attributed to selection. Between 70 and 90 percent of the observed test score gain in each instance is robust to selection correction.17

17. Simple ‘‘back-of-the-envelope’’ calculations confirm the notion that the observed test score increases are unlikely to result from selection effects. Consider the test score changes from the first to the second administration. Among the 82 percent of applicants who retake the test, the average score increase is roughly 16 points on both the math and verbal scales. If the average test score increase in the entire population were zero, the remaining 18 percent of applicants would have to expect average test score decreases of 73 points on both the math and verbal scales. The College Board reports that the test-retest standard deviation of SAT scores is roughly 30 points. The selection-only hypothesis therefore implies



These residual test score gains can be considered real and attributable either to a gain from familiarity with the test or to gains due to learning more over time.18

V. Model and Implications

To better understand the origins and implications of retaking and to consider how behavior might change under alternative policy regimes, it is helpful to think about what goes into the decision to retake the test. We find it useful to begin by thinking of the SAT as simply a series of questions designed so that a given student faces the same probability ρ of answering each question correctly. We refer to ρ as true ‘‘ability,’’ although by using this term we do not mean to weigh in on the question of what exactly the test does measure. In this case, the percentage of questions a student actually answers correctly, p, is an estimate of the true parameter ρ. As described, this setup is equivalent to a series of Bernoulli trials, and the distribution of p is given by a binomial. We extend this logic to two different kinds of ability, mathematical and verbal, with the student's true mathematical ability ρ_m and true verbal ability ρ_v. Following the binomial analogy, we assume the population distribution of these parameters falls between zero and one, with the potential that these two ability measures may be correlated with one another. Importantly, we also assume that applicants do not know their true ability parameters, but make inferences about them on the basis of information received through test scores and other sources.

We envision a simple admissions process in which the college's admissions office attempts to rank its applicants according to ability, based on the point estimates received through applicant test scores. The applicant's objective in deciding how many times to take the test is to maximize the probability of admission, subject to cost-related constraints. Consistent with the notion that applicant ability as measured by the SAT is not the only criterion for admission, we presume that there is no set of SAT scores that guarantees admission.
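Under this setup, the sampling distribution of the fraction of questions answered correctly follows directly from the Bernoulli structure (n here denotes the number of questions per section; the simulations in Section VI set n = 60 and map scores as 200 + 10 times the number of correct answers):

$$
np \sim \mathrm{Binomial}(n, \rho), \qquad E[p] = \rho, \qquad \mathrm{SD}(p) = \sqrt{\rho(1-\rho)/n},
$$

so a single administration carries a score-scale standard deviation of about $10\sqrt{60\,\rho(1-\rho)}$, roughly 30 to 40 points for mid-range abilities, in line with the test-retest variability noted in footnote 24 below.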

Applicants who retake the SAT are effectively submitting multiple point estimates of their true mathematical and verbal ability. It is clear that applicants' incentives to retake the test will be influenced by how a college treats these multiple point estimates. Were colleges to accept only the first set of point estimates, {p_m^1, p_v^1}, then applicants would never retake the test, so long as the cost of taking the test were positive. When colleges consider point estimates other than the first set in determining their ranking, students can be expected to retake the test whenever they believe the benefits of doing so will exceed the costs.

The highest-score policy commonly used by college admissions offices translates into the following rule: for a student who has taken the test n times, use the maximum

that 18 percent of the population score more than 2.5 standard deviations above their individual means on the first administration. This presumption strikes us as untenable. As additional support for this conclusion, the simulation results discussed in Section VI below also suggest that selection plays at most a small role.
18. We have estimated the average expected ‘‘upward drift’’ for applicants in our sample. It is reasonable to think that expected upward drift varies within the population. Some individuals may learn more quickly than others, or adapt their test-taking behavior more rapidly. Variation in expected upward drift could provide another reason for individuals with identical initial test scores to make different retaking decisions. The effect of this variation would be similar to variation in prior beliefs, as described in our model.



mathematical score, max{p_m^1, . . . , p_m^n}, and the maximum verbal score, max{p_v^1, . . . , p_v^n}, as point estimates of ρ_m and ρ_v. Applicants will choose to retake the test when they judge that the increase in the probability of admission associated with expected changes in either their mathematical or verbal score is sufficiently large to justify the costs of retaking. The applicant's problem becomes analogous to one of optimal search with the possibility of recall (Stigler 1961, DeGroot 1968). An applicant who faces dollar-denominated test-taking costs, including fees, opportunity costs, and psychic costs, equal to c, places a dollar value V on admission, and has received maximum math and verbal scores of p_m^* and p_v^* in their previous test administrations will retake the test if and only if:

(2)  V Σ_{p_m} Σ_{p_v} [a(max{p_m, p_m^*}, max{p_v, p_v^*}) − a(p_m^*, p_v^*)] f(p_m) f(p_v) > c,

where a(p_m, p_v) represents the probability of admission given test scores p_m and p_v, f(p_m) is the applicant-specific probability density function for math point estimates, and f(p_v) is the corresponding probability density for verbal point estimates.19 Embedded in the equation is the assumption that in each test administration, the point estimates p_m and p_v are drawn independently from their respective marginal distributions. The equation also assumes that the point estimates take on a finite number of values—a reasonable assumption, since there are only 61 unique scores on the SAT math and verbal scales.20

If applicants knew the value of their underlying ability parameters with certainty, their optimal decision would be to continue taking the test until they had achieved some ‘‘reservation test score.’’21 But because we presume that students do not know their underlying ability parameters with certainty, the typical reservation test score rule will not apply (Rothschild 1974). To determine an applicant's decision rule in this scenario, we presume that applicants receive and process information in a Bayesian manner.

We begin by assuming that applicants receive prior information regarding their subjective distributions f(ρ_m) and f(ρ_v) by receiving ‘‘practice draws’’—pre-test administration Bernoulli trials that might be thought of as information contained in school grades, previous standardized test scores, and the like. Along with the scores from their first test administration, these draws form their information set as they decide whether to take the test a second time. Based on the information contained in their first test scores and practice draws, applicants form a posterior probability distribution for the underlying parameters for use in their decision on whether to retake the test. Following any subsequent test administrations, applicants once again

19. We assume here that all applicants face the same acceptance probabilities, a(p_m, p_v). The simulation we perform is unaffected if the acceptance probability surface mapped in Figure A2 shifts up or down uniformly. To the extent, however, that the acceptance probability surface differs substantially in shape across categories of applicants, we are overlooking an important source of variation in behavior. Affirmative action programs, for example, may result in a leveling up in the admission probability surface for some groups, which might in turn explain their reduced likelihood of retaking the test.

20. Scores range from 200 to 800 in increments of 10.

21. Because there are two elements to the SAT score, there would be no unconditional reservation values of p_m or p_v. Rather, there would be conditional reservation values: the value of p_m that will induce an applicant to stop retaking depends on the value of p_v already obtained, and vice versa.



update their posterior probability distributions. As changes in an applicant’s beliefs about her true ability influence the probability she attaches to receiving any particular test score, her ‘‘reservation test scores’’ may change over time.

Two applicants who receive identical scores on their first test may be differentially likely to retake the test for three basic reasons. First, they may face different costs of retaking the test. Those with part-time jobs, for instance, will tend to face higher opportunity costs of taking a test than other applicants. Applicants may have differential psychic costs of undergoing a testing procedure. Even testing fees themselves, which are generally constant, may impose differential utility costs. Second, the value they attach to being admitted to a college may differ. These first two factors can be consolidated into one: applicants may differ in the ratio of their test-taking costs to the benefits they attach to admission—the ratio c/V. Third, their prior beliefs, based on their practice draws, may lead them to expect different scores on their next test.

VI. Simulating Test-Taking Behavior

There are two basic reasons to simulate test-taking behavior. First, simulations provide further evidence on the relationship between an applicant's retaking behavior, the test-taking costs faced, the benefits attached to being admitted, and prior beliefs under the current policy. Second, simulations allow us to predict the impact of changes in SAT score ranking policy without actually ‘‘living through’’ the alternative policies.

A. Calibration and Results under the Current Policy

The simulation exercise we undertake here is calibrated in the sense that we choose parameter values that result in simulated behavior under the current SAT score ranking policy that closely resembles actually observed behavior under that same policy. To the extent that our simulation provides a reasonable facsimile of reality as observed in actual data, we have confidence that the procedure can suggest what changes in behavior might reasonably be associated with policy changes.

The simulation procedure involves the following steps:

1. For each of 1,000 simulated applicants, draw a value for ρ_m and ρ_v, the applicant's true math and verbal ability. We derive the population distribution of values for ρ_m and ρ_v from our data on applicants to three selective universities. Specifically, these values are based on the distribution of first-time SAT scores in our data.22 Scores are translated into ability parameters first by subtracting 200 (the minimum score) from each, then dividing the

22. To the extent that first-time SAT scores are not representative of applicants’ true ability in our data, we will be unable to fully match the behavior of our simulated applicants to patterns in the data. Since our data consist exclusively of applicants to selective institutions, it is quite likely that individuals with low observed first test scores are not representative of the entire population with low initial test scores. The implications of this selection issue are discussed below.



result by 600.23 By deriving ability parameters directly from applicant SAT scores in our data, we are assuming that ρ_m and ρ_v each take on one of 61 discrete values, like the scores themselves.

2. Randomly draw an initial value of c/V, the ratio of costs of test-taking to benefits of admission, for each applicant. For simplicity, the cost-to-benefit ratio will take on two values corresponding to ‘‘high cost’’ and ‘‘low cost’’ applicants. The values of c/V are calibrated to yield a pattern of retaking similar to that found in our data. In this simulation, the ratio of test-taking costs to admission benefits increases linearly in the number of times previously taken.

3. Administer 120 ‘‘practice’’ Bernoulli trials, 60 each with probability of success ρ_m and ρ_v. Using the results of these practice trials, each applicant forms a prior probability distribution that indicates their perception of their own ability prior to any actual test-taking. Applicants will never learn their true ability parameters; they will only receive information about them based on how they perform on tests.

4. Administer a simulated SAT, which consists of 60 independent Bernoulli trials each for the math and verbal scores, with probability of success equal to ρ_m and ρ_v, respectively.24 Calculate the applicants' SAT scores by multiplying the number of successes by 10, then adding 200. Applicants then update their beliefs regarding the true values of their ability parameters ρ_m and ρ_v.

5. Applicants use their newly calculated posterior distribution on ρ_m and ρ_v, their value of c/V, and probabilities of admission conditional on test scores to decide whether to retake the SAT. Applicants are aware that they can expect their scores to drift upwards if they decide to retake. If applicants decide to refrain from retaking the test, the simulation stops.

6. For applicants that decide to retake the SAT, we administer an additional simulated SAT. Because our evidence presented in Section IV above indicates that individuals' scores increase upon retaking, we increase the probabilities of success on the math and verbal exams. These increases in ρ_m and ρ_v reflect the presumption that each time an applicant retakes the SAT, she can expect both her math and verbal scores to increase by about 10 points. Applicants use the information in their newest set of SAT scores to update their beliefs regarding ρ_m and ρ_v, then return to Step 5. (A code sketch of Steps 1–6 appears after the footnotes below.)

23. There are two exceptions to this translation. SAT scores of 800 are translated into ability parameters of 0.99 rather than 1, and SAT scores of 200 are translated into ability parameters of 0.01 rather than 0. Setting ability parameters equal to zero or one in our simulation exercise would eliminate all uncertainty in an applicant's test scores.

24. The use of 120 Bernoulli trials (60 each for math and verbal) to simulate an SAT administration can be justified on three grounds. First, the number of successes in 60 trials translates easily to the SAT scale. Second, the standard deviation of an applicant’s score distribution closely matches that observed in actual SAT scores (30 to 40 points). Third, the number of questions actually used to compute SAT math and verbal scores is roughly sixty.
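The following Python sketch pulls Steps 1–6 together. It is illustrative only: the Beta-posterior belief updating, the logistic admission-probability surface, the placeholder c/V values, and the Beta distribution used to draw abilities are assumptions of this sketch, standing in for the calibration to applicant data and to Figure A2 described in the Appendix; the sketch also omits the applicants' anticipation of upward drift when forming expectations in Step 5.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
N_Q = 60                                    # questions per section (footnote 24)
SCORES = 200 + 10 * np.arange(N_Q + 1)      # score = 200 + 10 * (number correct)

def admit_prob(m, v):
    # Placeholder admission-probability surface a(p_m, p_v); the paper calibrates
    # this surface from actual admissions outcomes (Figure A2).
    return 1.0 / (1.0 + np.exp(-(m + v - 1250.0) / 100.0))

def sit_test(rho_m, rho_v):
    """Step 4: one administration, 60 Bernoulli trials per section."""
    return (200 + 10 * rng.binomial(N_Q, rho_m),
            200 + 10 * rng.binomial(N_Q, rho_v))

def expected_gain(best_m, best_v, pred_m, pred_v):
    """Expected rise in admission probability from one more sitting under the
    highest-score policy: the left-hand side of Equation 2 divided by V."""
    base = admit_prob(best_m, best_v)
    w_m = pred_m.pmf(np.arange(N_Q + 1))            # predictive weights, math
    w_v = pred_v.pmf(np.arange(N_Q + 1))            # predictive weights, verbal
    new_m = np.maximum(SCORES, best_m)[:, None]     # best math score after retake
    new_v = np.maximum(SCORES, best_v)[None, :]     # best verbal score after retake
    return float(np.sum(np.outer(w_m, w_v) * (admit_prob(new_m, new_v) - base)))

def simulate_applicant(rho_m, rho_v, c_over_v, max_sittings=4):
    # Step 3: prior beliefs from 60 practice Bernoulli trials per section,
    # summarized as a Beta posterior (an assumption of this sketch).
    k_m, k_v = rng.binomial(N_Q, rho_m), rng.binomial(N_Q, rho_v)
    a_m, b_m = 1 + k_m, 1 + N_Q - k_m
    a_v, b_v = 1 + k_v, 1 + N_Q - k_v

    best_m = best_v = 0
    sittings = 0
    while sittings < max_sittings:
        # Steps 4 and 6: ability drifts up by about one question (10 points) per retake.
        drift = sittings / N_Q
        m, v = sit_test(min(rho_m + drift, 0.99), min(rho_v + drift, 0.99))
        sittings += 1
        best_m, best_v = max(best_m, m), max(best_v, v)

        # Bayesian update of beliefs about rho_m and rho_v from the new scores.
        k_m, k_v = (m - 200) // 10, (v - 200) // 10
        a_m, b_m = a_m + k_m, b_m + N_Q - k_m
        a_v, b_v = a_v + k_v, b_v + N_Q - k_v

        # Step 5: retake only if the expected admission gain exceeds c/V, with c/V
        # rising linearly in the number of sittings already taken (Step 2).
        pred_m = stats.betabinom(N_Q, a_m, b_m)     # posterior-predictive score draw
        pred_v = stats.betabinom(N_Q, a_v, b_v)
        if expected_gain(best_m, best_v, pred_m, pred_v) <= c_over_v * sittings:
            break
    return sittings, best_m, best_v

# Illustrative run: 1,000 applicants with abilities from an arbitrary Beta
# distribution and two placeholder cost types (Steps 1 and 2 use the paper's
# empirical score distribution and calibrated c/V values instead).
takes = [simulate_applicant(rng.beta(8, 4), rng.beta(8, 4),
                            c_over_v=rng.choice([0.02, 0.08]))[0]
         for _ in range(1000)]
print(np.bincount(takes, minlength=5)[1:])          # 1-, 2-, 3-, 4-time takers
```

The final lines tabulate how many simulated applicants stop after one, two, three, or four sittings, which is the quantity the calibration reported in Table 6 targets.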



Table 6
Simulation Results under the Current Score Ranking Policy

                                              One-time    Two-time    Three-time    Four-time
                                              takers      takers      takers        takers
First scores (math/verbal)                    585/600     565/581     601/611       588/613
Second scores (math/verbal)                               580/601     610/617       593/620
Third scores (math/verbal)                                            626/637       595/623
Fourth scores (math/verbal)                                                         623/644
Percent of sample                             13.5%       48.2%       28.7%         9.6%
Mean true ability parameters (math/verbal)    573/582     570/587     604/614       586/610
Percent ‘‘high-cost’’ type                    35%         19%         8%            0%

Details regarding the specific assumptions and parameter values used in the simulation can be found in the Appendix. The simulation was calibrated to match the probability of a simulated applicant taking the test a second, third, or fourth time to the observed probability of an applicant's taking the test a second, third, or fourth time. As Table 6 shows, the calibration exercise performed relatively well in matching the observed probability of retaking. Among our simulated applicants, the probability of taking the SAT two or more times under the current score ranking policy is 86.5 percent; the probability of taking the SAT three or more times is 38.3 percent, and the probability of taking the SAT four times is 9.6 percent. Since very few actual applicants take the SAT five or more times, our simulation stops after the fourth test administration.

A comparison of Tables 4 and 6 suggests that our simulation fails to capture the exact nature of selection into the pool of retakers in two ways. First, in our data on actual applicants, the set of individuals stopping after one test administration obtains significantly higher scores, on average, than any other group. In our simulation, that is not the case. Second, our simulation suggests that applicants with exceptionally high test score gains are more likely to refrain from taking the test an additional time. Individuals who experience moderate increases, conversely, are more likely to take the test again. In our actual data, test score gains are spread more evenly through the population: the set of individuals who stop after the third administration, for example, experience roughly the same gain between the second and third administrations as do those who choose to take the test a fourth time.

The most plausible explanation for this divergence is our inability to model the selection of SAT takers into the pool of applicants to one of our three sample universities. Our simulation procedure explicitly equates an applicant's true ability with her scores on the first SAT administration. In reality, the applicants with low initial SAT scores in our sample are probably not representative of the overall population with low initial SAT scores, since our sample consists of selective institutions. In our actual data, individuals with low initial scores are more likely to retake, presumably because they believe that their initial scores underestimate their true ability. To compensate for this underestimate of retaking in a subset of the sample, we


(20)

overesti-mate the extent of the retaking in the general population—implying that our simu-lated applicants must score higher, relative to expectations, than actual applicants before deciding to stop taking the test.

This caveat should be considered carefully for two reasons. First, it implies that our simulation may not perfectly capture the degree of applicant response to changes in SAT score ranking policies. Second, it suggests that we are omitting one important source of applicant response to a change in test score ranking policies: the decision to apply in the first place. Bearing these concerns in mind, we will proceed with our analysis of simulation results.

Table 7 examines the determinants of retaking by presenting probit regressions analogous to the ones performed with actual data in Table 3 above. The first regression, which predicts the probability of an applicant deciding to take the test a second time, indicates that cost-to-benefit ratios, prior beliefs, and first test scores each enter significantly into the equation. Comparing a high-cost and a low-cost applicant with all other variables alike and equal to the mean values, the high-cost applicant is 20 percent less likely to retake the test. When all other variables are set equal to their respective means, an increase of 50 points in both the SAT math score and verbal score reduces the probability of retaking by about ten percentage points, a magnitude quite similar to that derived from our actual data. Prior beliefs display a quadratic relationship with retaking behavior. For most applicants, the probability of retaking increases as the number of practice trial successes increases. This tendency decreases as the number of practice successes approaches the maximum value of 60. Holding other things constant, then, applicants with more ‘‘pessimistic’’ prior beliefs are less likely to retake the test.
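For readers who want to reproduce regressions of this form on their own simulated data, the n = 2 equation can be estimated with standard probit routines. The sketch below is our own illustration and assumes a pandas DataFrame whose column names (retake2, high_cost, math_prior, verbal_prior, math1, verbal1) are hypothetical.

    import pandas as pd
    import statsmodels.api as sm

    def fit_retake_probit(df):
        # Probit for the decision to take a second test, mirroring the
        # specification in the first column of Table 7.
        X = pd.DataFrame({
            "high_cost": df["high_cost"],
            "math_prior": df["math_prior"],
            "math_prior_sq": df["math_prior"] ** 2,
            "verbal_prior": df["verbal_prior"],
            "verbal_prior_sq": df["verbal_prior"] ** 2,
            "first_math": df["math1"],
            "first_verbal": df["verbal1"],
        })
        X = sm.add_constant(X)
        return sm.Probit(df["retake2"], X).fit(disp=0)

    # model = fit_retake_probit(sim_df); print(model.summary())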

These basic results persist when analyzing the decision to take the test a third time or a fourth time. With each retaking, a greater fraction of high-cost types drop out. A high-cost applicant with mean values of all other variables is about 36 percent less likely to take the test a third time when compared to an identical low-cost applicant. As shown in Table 6, no high-cost applicants choose to take the test a fourth time. These results bear a distinct resemblance to those discussed in Section III above, which indicated that the greatest degree of selection occurred in the decision to take the test a third time conditional on taking twice.

Interestingly, the probability of retaking the test appears to depend only on the most recently obtained set of SAT scores. Controlling for the most recent scores, previously received scores do not significantly affect the probability of retaking. Prior beliefs continue to significantly affect retaking, however.

The results from this analysis point to the same conclusions that we derived from our analysis of actual applicant data. In light of the caveats discussed above, this is encouraging. Here, we show that applicants with pessimistic prior beliefs are significantly less likely to retake the test. In Table 3, we showed that individuals with lower self-reported ability and lower class rank were less likely to retake the test, conditional on initial scores. The role of prior beliefs may also explain why many groups with lower average test scores, including African-Americans and those from low-income families, are less likely to retake the test conditional on initial scores. These groups might also face higher test-taking costs, another factor shown to be important in the simulation.



Table 7
Explaining Retaking Behavior in Simulated Data
Dependent Variable: Indicates whether applicant chooses to take the nth test, conditional on having taken (n − 1)

Independent Variable                 n = 2        n = 3        n = 4

High cost type indicator             −1.345**     −1.265**     —a
                                     (0.166)      (0.161)
Math prior successes                 0.390**      0.352**      0.420**
                                     (0.038)      (0.046)      (0.127)
Math prior successes squared         −0.003**     −0.003**     −0.004**
                                     (0.0004)     (0.0006)     (0.002)
Verbal prior successes               0.395**      0.545**      0.319**
                                     (0.041)      (0.057)      (0.140)
Verbal prior successes squared       −0.003**     −0.005**     −0.003*
                                     (0.0005)     (0.0006)     (0.002)
First math score                     −0.158**     −0.011       0.020
                                     (0.020)      (0.012)      (0.021)
First verbal score                   −0.150**     0.008        0.011
                                     (0.178)      (0.012)      (0.024)
Second math score                    —            −0.046**     0.002
                                                  (0.012)      (0.022)
Second verbal score                  —            −0.107**     −0.001
                                                  (0.013)      (0.025)
Third math score                     —            —            −0.092**
                                                               (0.020)
Third verbal score                   —            —            −0.075**
                                                               (0.021)
Log likelihood                       −219.99      −434.72      −175.00
N                                    1,000        865          360

Note: Standard errors in parentheses. Coefficients are derived from probit estimation of each equation. Test scores and number of prior successes each take on integer values between 0 and 60.
** Denotes a coefficient significant at the 5 percent level, * the 10 percent level.
a. Exactly zero high-cost types choose to take the test a fourth time; thus, the high cost indicator is dropped from this probit equation.



B. Evaluating the Current Score Ranking Policy

Our data on simulated applicants have one central advantage over data on actual applicants: we are able to observe the ability parameter that SAT scores are intended to estimate. We can therefore examine the effectiveness of current and alternative college test score ranking policies in providing a high-quality point estimate of an applicant's true ability. We use four different criteria to determine the quality of a ranking policy; a short computational sketch of the four measures follows the list.

1. Accuracy. This is simply the average difference between the estimate of an ability parameter derived from a policy and the true value of that parameter (‘‘ranking error’’). Both positive and negative values are theoretically possible.

2. Precision. This measure equals the standard deviation of ranking errors associated with a particular policy. A policy can be inaccurate yet precise, if ranking errors are more or less the same for all applicants. An imprecise policy is one where the ranking errors vary quite a bit from applicant to applicant. Precision can never be negative; values closer to zero are preferable, other things equal.

3. Bias. This measure should not be confused with accuracy, which in a statistical sense could be referred to as bias. Here, we refer to bias as the degree to which the test score ranking policy places high test-taking cost types at a disadvantage. It equals the difference between the average ranking error for low-cost types and the average ranking error for high-cost types. We presume that zero is the most preferred bias value.25

4. Cost. The cost of a ranking policy is simply the average number of test administrations per applicant observed under that policy. Other things equal, a ranking policy that induces a lower frequency of test-taking is considered superior. In using this criterion, we presume that the value of resources consumed in retaking the test exceeds the value of any benefits, such as learning, that accrue to the applicant in the process.
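A minimal sketch of how the four measures could be computed from simulated output, assuming the estimates, true abilities, cost types, and test counts are held in NumPy arrays (all names are ours):

    import numpy as np

    def evaluate_policy(estimate, truth, high_cost, n_tests):
        # estimate, truth: ability point estimates and true abilities (SAT-scale points)
        # high_cost: boolean array flagging high test-taking-cost applicants
        # n_tests: number of administrations each applicant chose under the policy
        err = estimate - truth                                   # ranking errors
        accuracy = err.mean()                                    # 1. average ranking error
        precision = err.std()                                    # 2. spread of ranking errors
        bias = err[~high_cost].mean() - err[high_cost].mean()    # 3. low-cost minus high-cost
        cost = n_tests.mean()                                    # 4. average administrations
        return accuracy, precision, bias, cost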

Table 8 presents our calculations of the accuracy, precision, bias, and cost of the most common current SAT score ranking policy, along with those of several alternative policies to be discussed in the following section.

Under the current SAT score ranking policy, our simulation suggests that admissions officers' point estimate of the typical applicant's ability is significantly higher than the true value.26

25. It is conceivable that colleges might wish to implement a biased score ranking policy. Recall that it is not possible to separate individuals with a high test-taking cost from those who place a low value on admission. If colleges determined that variation in benefits were much more important than variation in costs, and they wished to provide an advantage to students attaching the highest value to admission, then a biased policy would appear attractive.

26. Because we allow an applicant's expected SAT score to drift upward upon retaking, there is some ambiguity as to what exactly should be considered ‘‘true’’ ability. In this analysis, we equate true ability with the applicant's expected score on the first test administration.



Table 8
Comparing Student Test Score Ranking Policies

Policy Alternative                                                              Accuracy     Precision   Bias        Cost

1. Current: use highest math score and highest verbal score,                   +31m/+31v    29m/30v     +15m/+12v   2.3
   no correction for upward drift.
2. Use highest math score and highest verbal score, correct for upward drift.  +20m/+17v    28m/27v     +2m/+5v     1.9
3. Use first submitted score only.                                             −1m/+1v      35m/35v     +4m/0v      1.0
4. Average of all scores submitted, no correction for upward drift.            +3m/+5v      32m/32v     +1m/+2v     1.4
5. Average of all scores submitted, correct for upward drift.                  +2m/+3v      34m/32v     +3m/−2v     1.2
6. Use last submitted score only, no correction for upward drift.              +16m/+17v    33m/33v     +5m/−1v     1.7
7. Use last submitted score only, correct for upward drift.                    +8m/+7v      33m/33v     +6m/−4v     1.4
8. Mandatory retake, use average of first two scores only,                     +4m/+5v      26m/24v     −1m/−1v     2.0
   no correction for upward drift.
9. Mandatory retake, use average of first two scores only,                     −1m/0v       26m/24v     −1m/−1v     2.0
   correct for upward drift.

Note: Results are based on simulations described in Section IV of the text. The simulation assumes that applicants receive prior information equivalent to one test administration before taking their first real test. Upward drift in test scores is equal to 10 points each on the math and verbal segments. Applicant cost-to-benefit ratios equal 0.015 for ‘‘low cost’’ types and 0.025 for ‘‘high cost’’ types. Roughly 85 percent of simulated applicants are assigned the ‘‘low cost’’ designation. The marginal cost of retaking the test is assumed to increase linearly with the number of administrations. ‘‘Accuracy’’ is equal to the average difference between an applicant's test scores and true ability under the ranking policy indicated. ‘‘Precision’’ is equal to the standard deviation of differences between an applicant's test scores and true ability. ‘‘Bias’’ is the difference in accuracy measures between low and high cost types. Cost is equal to the average number of test administrations per applicant under the indicated policy.



The average deviation between the test score used to estimate an applicant's ability and that applicant's true ability, which we define as the ‘‘ranking error,’’ is approximately 30 points on the verbal scale and 30 points on the math scale. Because current policy effectively picks the most positive outlier among competing point estimates, it is not surprising that applicants are consistently rated above their true ability. The tendency for scores to drift upward upon retaking exacerbates ranking errors.

Ranking errors also differ appreciably among applicants under the current policy. The precision values convey this basic fact, and the bias values show one component of the variance in ranking errors across applicants. In our simulation, high-cost applicants, who were approximately 20 percent less likely to retake the test and 36 percent less likely to take the test a third time conditional on taking twice, are consistently ranked lower than low-cost applicants of equal true ability. By taking the test more frequently, low-cost applicants are receiving more chances to draw a positive outlier from their distribution of possible test scores. These applicants also are more likely to benefit from upward drift in test scores. If applicants are ranked according to the sum of their math and verbal scores, the average high-cost applicant in this simulation is placed at a 27-point disadvantage relative to an equivalent low-cost rival.
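The mechanics of the overstatement are easy to verify directly: taking the maximum of several noisy draws from a fixed score distribution pushes the resulting estimate upward, and further administrations push it further. A small check using the Bernoulli score model sketched earlier (our own illustration, ignoring upward drift):

    import numpy as np

    rng = np.random.default_rng(1)
    ability, n_items = 0.667, 60            # roughly a 600 true section score

    def mean_highest_of(k):
        # Average of the highest score across k administrations for the same applicant.
        scores = 200 + 600 * rng.binomial(n_items, ability, size=(100_000, k)) / n_items
        return scores.max(axis=1).mean()

    for k in (1, 2, 3, 4):
        print(k, round(mean_highest_of(k)))  # the mean highest score rises with k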

Finally, the current policy leads to a situation where the average applicant takes the test 2.3 times. By construction, this value is observed both in our simulated data and in our actual data on applicants to selective colleges.

C. Evaluating Alternative Score Ranking Policies

Table 8 goes on to present the results of additional simulations under different test score ranking policies. For each policy, it is necessary to repeat the simulation since changes in ranking policies, by altering the potential benefits from retaking the test, will influence applicant behavior. There are a total of eight proposed alternative policies to consider.
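Each alternative amounts to a different rule for collapsing an applicant's score history into a single point estimate. The compact sketch below (our own illustration) writes several of the rules for one section; the drift argument deflates each retake by an assumed per-administration shift, which is our reading of the drift correction.

    def highest(scores, drift=0.0):
        # Policies 1 and 2: take the best score, optionally deflating retakes.
        return max(s - drift * i for i, s in enumerate(scores))

    def first_only(scores):
        # Policy 3: ignore everything after the first administration.
        return scores[0]

    def average(scores, drift=0.0):
        # Policies 4 and 5: average all submitted scores, optionally deflating retakes.
        adjusted = [s - drift * i for i, s in enumerate(scores)]
        return sum(adjusted) / len(adjusted)

    def last_only(scores, drift=0.0):
        # Policies 6 and 7: use only the most recent score.
        return scores[-1] - drift * (len(scores) - 1)

    def mandatory_two(scores, drift=0.0):
        # Policies 8 and 9: average exactly the first two scores.
        return (scores[0] + scores[1] - drift) / 2

    # Example: highest([580, 610, 600], drift=10) returns 600 for a single section.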

Policy 2: Correct for upward drift in test scores

One alternative to current policy would be to deflate second and subsequent test scores to reflect the fact that scores tend to increase upon retaking. Before choosing an applicant’s highest math score and highest verbal score, second and subsequent test scores are corrected, so that they are drawn from the same distribution as the first submitted score.

Because this policy reduces the potential benefits of retaking the test, it is not surprising that the average number of test administrations per simulated applicant decreases. The typical applicant now takes the test 1.9 times, rather than 2.3. The policy also achieves a noteworthy reduction in bias—high-cost types are now at a seven-point rather than 27-point disadvantage. Accuracy improves somewhat, though estimates of the average applicant’s ability still exceed the true value by roughly 40 points. Precision improves slightly. Overall, this policy alternative ranks higher than the current policy on all dimensions.

Policy 3: Use only the first submitted score

This policy alternative would change the SAT to resemble the PSAT, a test administered once to each applicant, possibly on a uniform date. Applicants' incentives to retake the test would be eliminated under this policy, greatly reducing the cost in terms of test administrations per applicant. Since retaking is not an issue in this simulation, and an applicant's cost is uncorrelated with true ability, accuracy and bias would equal zero if the simulation's sample size were large enough. The principal disadvantage of this policy is a reduction in precision: since most applicants are taking the test fewer times, there is simply less information to use in the creation of a point estimate. Relative to previously considered policies, the standard deviation of ranking errors under this alternative is roughly 25 percent higher. The cost savings, accuracy improvement, and elimination of bias achieved under this policy must therefore be weighed against the loss in precision.

Policies 4 and 5: Use the average of all scores submitted.

Under this policy, an applicant’s decision to retake the test is based on a different calculation than that presented in Equation 2 above. Using the same notation as in Section III, the applicant deciding to take the test for the nth time must determine whether

(3)   \sum_{p_m} \sum_{p_v} V \left[ a\!\left( \frac{p_m + (n-1)\bar{p}_m}{n},\; \frac{p_v + (n-1)\bar{p}_v}{n} \right) - a(\bar{p}_m, \bar{p}_v) \right] f(p_m)\, f(p_v) > c,

where \bar{p}_m and \bar{p}_v are the average of all previous math and verbal scores, respectively. This is a fundamentally different situation than in the previous case. Under current policy, it is not possible for an applicant's ranking to decrease after retaking the test. Under this alternative, the applicant's ranking may decrease if either the math or verbal score on the final test is lower than the average of scores on all previous tests. The simulation procedure, altered to reflect the new calculation in Equation 3, suggests that applicants would respond to this increase in downside risk by dramatically reducing the frequency with which they retake the test. When scores are not corrected for upward drift, the average applicant takes the test 1.4 times. Bias and accuracy are also greatly improved under this policy, even without correction for upward drift. Policy 5, which combines this revision of policy with the correction for upward drift described above, reduces test administrations to 1.2 per applicant, virtually eliminates bias, and attains near-perfect accuracy. Both policies feature the same drawback: a decrease in precision associated with collecting less information on each applicant. As with Policy 3, it is necessary to weigh the accuracy, bias, and cost gains against the losses in precision when ranking these alternatives against current policy.
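A sketch of the retaking decision under the averaging rule in Equation 3, reusing the Bernoulli score model sketched earlier; admit_prob stands in for the admission-probability mapping a(·,·), and all names are ours:

    import numpy as np
    from scipy.stats import binom

    def expected_gain_averaging(mbar, vbar, n, ability_m, ability_v, admit_prob, n_items=60):
        # Expected increase in admission probability from an nth administration when the
        # college averages the new scores with the running averages mbar and vbar.
        gain = 0.0
        for sm in range(n_items + 1):
            for sv in range(n_items + 1):
                pm = 200 + 600 * sm / n_items          # possible new math score
                pv = 200 + 600 * sv / n_items          # possible new verbal score
                new_m = (pm + (n - 1) * mbar) / n      # post-retake averages
                new_v = (pv + (n - 1) * vbar) / n
                w = binom.pmf(sm, n_items, ability_m) * binom.pmf(sv, n_items, ability_v)
                gain += w * (admit_prob(new_m, new_v) - admit_prob(mbar, vbar))
        return gain

    # The applicant retakes the test an nth time if V * expected_gain_averaging(...) > c.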

The reduction in retaking under the average score policy can be empirically corroborated by comparing the behavior of SAT takers with that of Law School Admission Test (LSAT) takers. The most common test score ranking policy among law schools is to use the average of all submitted scores.27 As these simulation results would suggest, the rate of retaking among law school applicants is significantly less than that among college applicants nationwide. Fewer than 20 percent of law school applicants take the LSAT more than once.28

27. In a survey of admissions websites for the top 20 law schools as ranked in US News and World Report, nine schools explicitly stated policies for treating multiple scores. Of these, eight used the average score policy. The ninth school used the highest-score policy. The Law School Admissions Council advises member schools to use the average score policy.

Policies 6 and 7: Use only the last submitted score.

These policies change applicants' incentives once again. Rather than perform the cost-benefit comparison in Equation 2, applicants deciding whether to take the test an nth time will now determine whether

(4)   \sum_{p_m} \sum_{p_v} V \left[ a(p_m^{\,n}, p_v^{\,n}) - a(p_m^{\,n-1}, p_v^{\,n-1}) \right] f(p_m)\, f(p_v) > c,

where superscripts indicate the test administration from which scores are derived. As in the preceding case, applicants now face the possibility of a decrease in ranking upon retaking. Following the analogy to a search model, this policy would eliminate the possibility of recall.
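The analogous calculation under the last-score rule in Equation 4 drops the running average and compares the prospective new scores with the most recent ones only, so a bad draw cannot be recalled (again a sketch with our own names):

    from scipy.stats import binom

    def expected_gain_last_score(m_prev, v_prev, ability_m, ability_v, admit_prob, n_items=60):
        # Expected change in admission probability when only the newest scores count.
        gain = 0.0
        for sm in range(n_items + 1):
            for sv in range(n_items + 1):
                pm = 200 + 600 * sm / n_items
                pv = 200 + 600 * sv / n_items
                w = binom.pmf(sm, n_items, ability_m) * binom.pmf(sv, n_items, ability_v)
                gain += w * (admit_prob(pm, pv) - admit_prob(m_prev, v_prev))
        return gain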

It is interesting to compare the versions of this policy that omit or include the upward-drift correction to Policies 4 and 5 above. Although the last-score policies outperform current policy in terms of cost, accuracy and bias, they are strictly inferior to the test-score averaging policies on all measures. They share similar precision values with the score-averaging policies.

Policies 8 and 9: Use exactly two scores for each applicant.

The final alternative policy involves a ‘‘mandatory retake’’ for all applicants. Like the first-score-only policy, this one eliminates the role of the applicant in determining test-taking strategy. As with that earlier policy, this one eliminates the potential for bias; perfect accuracy is attained in the version of the policy that corrects the second score for upward drift. This policy, with or without upward drift correction, achieves the best precision of any alternative considered. Mandating additional retakes would further improve the precision of the final ranking.
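The precision gain from mandatory retaking is what one would expect from averaging independent measurements. As a back-of-the-envelope check (our own, not a simulation result), with a single-administration error standard deviation of roughly \sigma \approx 35 points,

\[
\operatorname{sd}\!\left(\frac{p^{1}+p^{2}}{2}\right) \;=\; \frac{\sigma}{\sqrt{2}} \;\approx\; \frac{35}{1.41} \;\approx\; 25 \text{ points},
\]

which is close to the precision values of 26m/24v reported for Policies 8 and 9, compared with 35m/35v under the first-score-only policy.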

Because no single policy offers the best combination of accuracy, precision, bias, and cost, ranking the score-ranking policies is inherently a subjective matter. The one clear comparison that can be made here involves the current policy, which is strictly dominated along all four measures by both Policy 8 and Policy 9, the mandatory-retake policies that average exactly two test scores from each applicant, with or without correction for upward drift in the second score. These alternatives are less costly, less biased, more accurate, and more precise than current policy.29 Other alternatives may in turn be preferred to these, depending on the degree to which policymakers are willing to exchange greater precision for lower costs.

D. Importance of Existing Bias

The simulation-based comparisons offered in the preceding section imply that certain SAT takers, those with high test-taking costs, are placed at a disadvantage by the common practice of colleges to use the highest submitted scores as point estimates of ability. Because test-taking costs are not directly observed, the question of how policy changes would affect particular observable groups, such as the socioeconomically disadvantaged or racial minorities, remains open. Disadvantaged groups may

29. The true cost of Policies 8 and 9 might be higher than the cost of Policy 1 if we weighted the number of takings by the cost incurred. Policies 8 and 9 require some individuals who would not have retaken the exam to do so. If these individuals have extraordinarily high costs of taking the exam, the extra costs associated with these individuals might more than offset the reductions achieved through reducing retaking among other applicants. Policies 8 and 9 (as well as Policy 3) can also be criticized on the grounds that they constrain individual choice.



Table 9
Effect of Current (Highest-Score) Policy on Acceptance Rates: Simulations for Three Universities (weighted average of acceptance rates, by category)

                                    Current Policy    First test    Percentage
                                    (highest score)   only          difference

Overall (imposed)                   6,675             6,675         0.0%
Race
  White                             5,637             5,657         0.4%
  Black                             174               187           7.5%
  Native American                   28                24            −14.3%
  Asian American                    609               567           −6.9%
  Hispanic                          106               117           10.4%
  Other                             121               123           1.7%
Income
  Less than $40,000                 655               673           2.7%
  $40,000–$100,000                  2,765             2,776         0.4%
  More than $100,000                1,950             1,908         −2.2%
Percent urban in home zip code
  Less than 80%                     2,229             2,238         0.4%
  80–100%                           4,446             4,437         −0.2%

number of individuals truly affected by the policy—those who are on the border between acceptance and rejection—makes up only a small component of the overall applicant pool. Nonetheless, it is important to consider the impact of the highest-score policy. In an era of decline in affirmative action policies, any policy which places African-American applicants at a disadvantage, even if only an 8 percent disadvantage, merits scrutiny. It is also conceivable that applicants occupying other segments of the test score distribution might be disadvantaged in decisions other than admission, especially financial aid rankings.

It is also worth emphasizing that this exercise was undertaken using data on applicants to three selective colleges. To the extent that changes in test score ranking policies lead to changes in individual decisions to apply to selective colleges, this exercise will understate the true effect of changing policies.

VII. Conclusion

As a concluding exercise, we recap this paper's central accomplishments and discuss some important issues that this paper does not address. We have shown that the practice of retaking the SAT tends to be highly concentrated within certain segments of the student population. Applicants to selective colleges such as the ones we examine appear to be significantly more likely to retake the test. Applicants with lower initial test scores take the test more frequently. Controlling for initial test scores, applicants with higher family incomes, higher measures of self-reported ability and class rank, and better educated parents retake more frequently. Black applicants are significantly less likely to retake the test, other things equal; Asian applicants are more likely to retake. To the extent that these discrepancies reflect differences in test-taking costs, it appears that the advantages conferred upon retakers widen disparities among applicants.

We have also examined the empirical regularity that SAT scores increase upon retaking. Our analysis suggests that a relatively small component of these increases arises from selection. The remainder reflects a true effect attributable to some combination of learning over time and familiarity with the test.

Finally, using both a numerical simulation and our applicant data, we have considered the impact of altering the most common current test score ranking policy, the practice of using only the highest math and highest verbal scores. Our analysis suggests that there is much to dislike about this policy, and that several alternatives would outperform it along multiple criteria. Most important, we have shown that this policy is not neutral: it confers upon certain applicants advantages that should be considered as colleges draft and amend their admissions policies.

Why, then, do colleges employ the highest-score ranking policy? One possibility is that they place great weight on precision, compared to the other criteria listed in Table 8. Certainly as compared to using the first score only, the highest-score policy yields a more precise estimate of true ability.31 Another possibility is that the highest-score approach serves to blunt criticism and discourage appeals from applicants who feel their first or second test did not fairly reflect their abilities. Or perhaps admissions officers take multiple tests as an indication of an applicant's effort. (In terms of our model above, colleges may be interested in selecting applicants with high values of V.) Finally, because the highest-score policy tends to raise the SAT scores it reports to ranking groups like U.S. News, a college may believe that following that policy will boost its own ranking.

This paper has not explicitly addressed many of the criticisms leveled at the SAT. Some have argued that the test is culturally biased. Any cultural bias effects that lead some applicants to score systematically lower exist over and above the potential bias we consider here, which arises solely because of retaking behavior. Some consider coaching to be a major problem. Amending test score ranking policies in response to concerns about retaking bias could conceivably lead some families to invest more resources in SAT preparation, or other activities designed to confer advantages in the college admissions process. Deprived of one opportunity to improve their ranking, applicants might well simply substitute other means. The effects of coaching on SAT scores and the influence of test score ranking policies on coaching and similar activities must be fully understood before any comprehensive policy is crafted.

31. Boldt, Centra, and Courtney (1986) test the predictive validity of various test score ranking policies using data on SAT scores and first-year grades from 87 colleges. They find that the average of all submitted scores correlates most highly with subsequent grades. They do not, however, address two serious concerns, one of which is raised in our study. First, the amount of information in the dataset is endogenous to the test score ranking policy, as Table 8 illustrates. Second, since enrolled students are presumably not representative of the overall applicant pool, it may not be appropriate to estimate the predictive validity of SAT scores without accounting for sample selection. For further discussion of this second point, see Rothstein (2002).

While arguments to abolish the SAT entirely take place on campuses throughout the country, we have identified one potential source of concern with the test that could be addressed at little cost through alteration of existing policy. Future research into the behavior of test-takers might provide further insight into the proper role of standardized tests in college admissions and other important decisions.

Appendix 1

Simulation Details

The simulation procedure described in Section VI provides the basic structure of the simulation. This appendix simply reports the specific parameter values used to conduct the exercise.

Distribution of ability parameters. For each of 1,000 simulated applicants, a ‘‘true’’ value of mathematical and verbal ability was drawn from the bivariate distribution of first-time test scores for observations in our dataset on applicants to three selective universities. Two random numbers were drawn from a uniform [0,1] distribution. A simulated applicant's math ability was simply the percentile implied by the first random number. Verbal ability was the percentile of the conditional distribution (conditioning on math ability) implied by the second random number. The distribution of ability parameters is presented as Appendix Figure A1.

Figure A1
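A sketch of this two-step percentile draw, assuming the first-time scores are available as paired arrays math1 and verbal1; the 25-point conditioning window is our own simplification of conditioning on math ability:

    import numpy as np

    def draw_abilities(math1, verbal1, n_draws=1000, band=25, rng=None):
        # Draw (math, verbal) ability pairs from the empirical joint distribution:
        # math is the marginal percentile implied by one uniform draw, verbal is the
        # percentile of verbal scores among applicants with similar math scores.
        if rng is None:
            rng = np.random.default_rng()
        math1 = np.asarray(math1, dtype=float)
        verbal1 = np.asarray(verbal1, dtype=float)
        out = np.empty((n_draws, 2))
        for i in range(n_draws):
            u1, u2 = rng.uniform(), rng.uniform()
            m = np.quantile(math1, u1)                      # marginal math draw
            near = verbal1[np.abs(math1 - m) <= band]       # condition on similar math
            if near.size == 0:
                near = verbal1
            out[i] = (m, np.quantile(near, u2))             # conditional verbal draw
        return out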

Distribution of cost-to-benefit ratios. Cost-to-benefit ratios took on one of two values in this simulation. The exact values of these ratios and the distribution of types in the population were calibrated to lead to a pattern of retaking similar to that observed in applicant data. Simulated applicants had an 85 percent chance of being assigned to the low-cost-to-benefit ratio type, with an initial assigned c/V ratio of 0.15. High cost-to-benefit ratio types were initially assigned a c/V ratio of 0.25. As mentioned in the text, the cost-to-benefit ratio for all simulated applicants increased linearly with the number of test administrations.
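Written out, the type assignment is a single random draw per applicant; the exact functional form of the linear increase is not spelled out in the text, so the slope below is our own placeholder.

    import numpy as np

    def assign_cost_ratios(n_applicants, rng=None):
        # 85 percent low-cost (initial c/V = 0.15), 15 percent high-cost (c/V = 0.25).
        if rng is None:
            rng = np.random.default_rng()
        high_cost = rng.random(n_applicants) < 0.15
        return np.where(high_cost, 0.25, 0.15), high_cost

    def cost_ratio_at(initial_ratio, administration):
        # Cost-to-benefit ratio rises linearly with the administration number
        # (slope assumed equal to the initial ratio for illustration).
        return initial_ratio * administration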

Probability of admission. To calculate the expected benefits of retaking the test, simulated applicants must have information on the probability of admission conditional on point estimates of ability. Appendix Figure A2 illustrates this probability mapping. This distribution is based on admission information provided by one of the three institutions in our sample. One limitation of this assumed distribution is that it does not hold other characteristics constant. Applicants with higher SAT scores probably have higher qualifications in other categories as well. A probability mapping that held other factors constant would be flatter than the one shown here. The net effect of using a more accurate mapping on our simulations of alternative test score ranking policies would probably be limited, however, as a flatter mapping would lead to a different choice of parameters in our calibration exercise. When simulating alternative test score ranking policies, we do not alter the shape of this distribution, even though the number of students admitted to a college would increase or decrease if the overall distribution of test scores changed. Effectively, we are assuming that any changes in the probability of admission that resulted from a change in test score ranking policy would be implemented by vertically shifting this distribution up or down. In practice, slight modifications of the shape of this distribution have relatively small influences on behavior.

Figure A2
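Because the schedule in Figure A2 is an empirical curve from one institution, any replication has to substitute an assumed shape. The sketch below interpolates an illustrative increasing schedule over combined scores and includes the vertical shift described above; every number in it is a placeholder, not the institution's data.

    import numpy as np

    # Illustrative admission-probability schedule over combined (math + verbal) scores.
    GRID = np.array([1000, 1200, 1300, 1400, 1500, 1600])
    PROB = np.array([0.02, 0.10, 0.25, 0.45, 0.70, 0.90])   # placeholder values

    def admit_prob(math_score, verbal_score, shift=0.0):
        # Interpolate the schedule; 'shift' mimics moving the whole curve up or down
        # to hold the number of admits roughly fixed under a different ranking policy.
        p = np.interp(math_score + verbal_score, GRID, PROB) + shift
        return float(np.clip(p, 0.0, 1.0))

    # admit_prob(600, 620) is roughly 0.13 under this placeholder schedule.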

References

Ballou, Dale, and Michael Podgursky. 1995. ‘‘Recruiting Smarter Teachers.’’ Journal of Human Resources 30(2):326–38.

Boldt, Robert F., John A. Centra, and Rosalea G. Courtney. 1986. ‘‘The Validity of Various Methods of Treating Multiple SAT Scores.’’ College Board Report No. 86-4, College Entrance Examination Board.

Bowen, William G., and Derek Bok. 1998. The Shape of the River: Long Term Consequences of Considering Race in College and University Admissions. Princeton: Princeton University Press.

Card, David, and A. Abigail Payne. 1998. ‘‘School Finance Reform, the Distribution of School Spending, and the Distribution of SAT Scores.’’ National Bureau of Economic Research Working Paper 6766, October.

DeGroot, Morris H. 1968. ‘‘Some Problems of Optimal Stopping.’’ Journal of the Royal Statistical Society B 30:108–22.

Dynarski, Mark R. 1985. ‘‘The Scholastic Aptitude Test: Participation and Performance.’’ UC Davis Economics Department Working Paper #258.

Dynarski, Mark, and P. Gleason. 1993. ‘‘Using Scholastic Aptitude Test Scores and Indicators of State Educational Performance.’’ Economics of Education Review 12(3):203–11.

Graham, Amy E., and Thomas A. Husted. 1993. ‘‘Understanding State Variations in SAT Scores.’’ Economics of Education Review 12(3):197–202.

Hanushek, Eric A., and Lori L. Taylor. 1990. ‘‘Alternative Assessments of the Performance of Schools: Measurement of State Variations in Achievement.’’ Journal of Human Resources 25(2):179–201.

Heckman, James. 1979. ‘‘Sample Selection Bias as a Specification Error.’’ Econometrica 47(1):153–61.

Lemann, Nicholas. 1999. The Big Test: The Secret History of the American Meritocracy. New York: Farrar, Straus and Giroux.

Nathan, Julie S., and Wayne J. Camara. 1998. ‘‘Score Change When Retaking the SAT I: Reasoning Test.’’ Research Notes, RN-05, College Entrance Examination Board, September.

Powers, Donald E., and Donald A. Rock. 1999. ‘‘Effects of Coaching on SAT I: Reasoning Test Scores.’’ Journal of Educational Measurement 36(2):93–118.

Rothschild, Michael. 1974. ‘‘Searching for the Lowest Price When the Distribution of Prices Is Unknown.’’ Journal of Political Economy 82(4):689–711.

Rothstein, Jesse M. 2002. ‘‘College Performance Predictions and the SAT.’’ Center for Labor Economics, University of California, Berkeley, Working Paper No. 45.

Schemo, Diana Jean. 2001. ‘‘Head of U. of California Seeks to End SAT Use in Admissions.’’ New York Times, February 17: A1.

Southwick, Lawrence, Jr., and Indermit S. Gill. 1997. ‘‘Unified Salary Schedule and Student SAT Scores: Adverse Effects of Adverse Selection in the Market for Secondary School Teachers.’’ Economics of Education Review 16(2):143–53.

Stigler, George J. 1961. ‘‘The Economics of Information.’’ Journal of Political Economy