Evaluating Alternative Score Ranking Policies

…ranking error,'' is approximately 30 points on the verbal scale and 30 points on the math scale. Because current policy effectively picks the most positive outlier among competing point estimates, it is not surprising that applicants are consistently rated above their true ability. The tendency for scores to drift upward upon retaking exacerbates ranking errors.

Ranking errors also differ appreciably among applicants under the current policy. The precision values convey this basic fact, and the bias values show one component of the variance in ranking errors across applicants. In our simulation, high-cost applicants, who were approximately 20 percent less likely to retake the test and 36 percent less likely to take the test a third time conditional on taking it twice, are consistently ranked lower than low-cost applicants of equal true ability. By taking the test more frequently, low-cost applicants receive more chances to draw a positive outlier from their distribution of possible test scores. These applicants are also more likely to benefit from upward drift in test scores. If applicants are ranked according to the sum of their math and verbal scores, the average high-cost applicant in this simulation is placed at a 27-point disadvantage relative to an equivalent low-cost rival.

Finally, the current policy leads to a situation where the average applicant takes the test 2.3 times. By construction, this value is observed both in our simulated data and in our actual data on applicants to selective colleges.
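To make the mechanics concrete, the short sketch below (in Python, and not drawn from the authors' simulation code) illustrates why reporting the highest of several noisy scores systematically overstates ability and rewards frequent retakers. The true ability level, the 30-point per-sitting error standard deviation, the assumed 10-point upward drift per retake, and the sitting counts are illustrative assumptions, not the paper's calibration.

```python
# Illustrative sketch only (not the authors' simulation): why the
# highest-score policy overstates ability and rewards frequent retakers.
import numpy as np

rng = np.random.default_rng(0)
true_ability = 600.0   # assumed true score on a single section
noise_sd = 30.0        # assumed per-sitting measurement error
drift = 10.0           # assumed upward drift per additional sitting

def mean_reported_score(n_sittings, n_applicants=100_000):
    """Average 'highest submitted score' for applicants who sit n_sittings times."""
    draws = true_ability + noise_sd * rng.standard_normal((n_applicants, n_sittings))
    draws += drift * np.arange(n_sittings)    # later sittings drift upward
    return draws.max(axis=1).mean()           # current policy: keep the best draw

for n in (1, 2, 3):
    print(f"{n} sitting(s): mean reported score {mean_reported_score(n):6.1f} "
          f"vs. true ability {true_ability:.0f}")
```

Because the reported score is the maximum of noisy draws, applicants who sit more often are rated higher even when true ability is identical, which is the mechanism behind the 27-point gap between low-cost and high-cost applicants described above.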

C. Evaluating Alternative Score Ranking Policies

Table 8 goes on to present the results of additional simulations under different test score ranking policies. For each policy, it is necessary to repeat the simulation, since changes in ranking policies, by altering the potential benefits from retaking the test, will influence applicant behavior. There are a total of eight proposed alternative policies to consider.

Policy 2: Correct for upward drift in test scores

One alternative to current policy would be to deflate second and subsequent test scores to reflect the fact that scores tend to increase upon retaking. Before choosing an applicant's highest math score and highest verbal score, second and subsequent test scores are corrected so that they are drawn from the same distribution as the first submitted score. Because this policy reduces the potential benefits of retaking the test, it is not surprising that the average number of test administrations per simulated applicant decreases. The typical applicant now takes the test 1.9 times, rather than 2.3. The policy also achieves a noteworthy reduction in bias—high-cost types are now at a seven-point rather than 27-point disadvantage. Accuracy improves somewhat, though estimates of the average applicant's ability still exceed the true value by roughly 40 points. Precision improves slightly. Overall, this policy alternative ranks higher than the current policy on all dimensions.

Policy 3: Use only the first submitted score

This policy alternative would change the SAT to resemble the PSAT—a test administered once to each applicant, possibly on a uniform date. Applicants' incentives to retake the test would be eliminated under this policy, greatly reducing the cost in terms of test administrations per applicant. Since retaking is not an issue in this simulation, and an applicant's cost is uncorrelated with true ability, accuracy and bias would equal zero if the simulation's sample size were large enough. The principal disadvantage of this policy is a reduction in precision: because applicants take the test fewer times, there is simply less information to use in the creation of a point estimate. Relative to previously considered policies, the standard deviation of ranking errors under this alternative is roughly 25 percent higher. The cost savings, accuracy improvement, and elimination of bias achieved under this policy must therefore be weighed against the loss in precision.

Policies 4 and 5: Use the average of all scores submitted

Under this policy, an applicant's decision to retake the test is based on a different calculation than that presented in Equation 2 above. Using the same notation as in Section III, the applicant deciding to take the test for the nth time must determine whether

(3)  \sum_{p_m} \sum_{p_v} V\left( a\left( \frac{p_m + (n-1)\bar{p}_m}{n},\ \frac{p_v + (n-1)\bar{p}_v}{n} \right) - a(\bar{p}_m, \bar{p}_v) \right) f(p_m) f(p_v) > c,

where p̄_m and p̄_v are the averages of all previous math and verbal scores, respectively. This is a fundamentally different situation from the previous case. Under current policy, it is not possible for an applicant's ranking to decrease after retaking the test. Under this alternative, the applicant's ranking may decrease if either the math or verbal score on the final test is lower than the average of scores on all previous tests. The simulation procedure, altered to reflect the new calculation in Equation 3, suggests that applicants would respond to this increase in downside risk by dramatically reducing the frequency with which they retake the test.
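As a rough illustration of the retaking calculus in Equation 3, the sketch below evaluates the expected gain from an nth sitting under the score-averaging policy by Monte Carlo. The functional forms are placeholders rather than the paper's: the ranking function a(·,·) is taken to be the simple sum of section scores, the value function V(·) is linear, and the noise and cost parameters are assumed values.

```python
# Sketch of the retake decision under the averaging policy (Equation 3),
# with placeholder functional forms; all parameter values are assumptions.
import numpy as np

rng = np.random.default_rng(1)
noise_sd = 30.0   # assumed per-sitting measurement error

def a(math, verbal):
    """Placeholder ranking function: the sum of the two section scores."""
    return math + verbal

def V(delta):
    """Placeholder valuation of a change in ranking: linear in the change."""
    return delta

def expected_gain_from_nth_sitting(pbar_m, pbar_v, true_m, true_v, n, draws=200_000):
    """Monte Carlo version of the left-hand side of Equation 3: the expected
    value of sitting an nth time when all submitted scores are averaged."""
    p_m = true_m + noise_sd * rng.standard_normal(draws)   # possible nth math scores
    p_v = true_v + noise_sd * rng.standard_normal(draws)   # possible nth verbal scores
    new_rank = a((p_m + (n - 1) * pbar_m) / n, (p_v + (n - 1) * pbar_v) / n)
    return V(new_rank - a(pbar_m, pbar_v)).mean()

# An applicant whose first sitting came in above true ability faces real
# downside risk from averaging in a second score.
gain = expected_gain_from_nth_sitting(pbar_m=640, pbar_v=640, true_m=600, true_v=600, n=2)
cost = 5.0   # hypothetical retaking cost, in the same units as V
print(f"Expected gain from a second sitting: {gain:+.1f}")
print("Retake" if gain > cost else "Do not retake")
```

Under the current highest-score policy the analogous expected gain is never negative; under averaging it frequently is, which is why simulated applicants cut back so sharply on retaking.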
In the simulation, when scores are not corrected for upward drift, the average applicant takes the test 1.4 times. Bias and accuracy are also greatly improved under this policy, even without correction for upward drift. Policy 5, which combines this revision of policy with the correction for upward drift described above, reduces test administrations to 1.2 per applicant, virtually eliminates bias, and attains near-perfect accuracy. Both policies feature the same drawback: a decrease in precision associated with collecting less information on each applicant. As with Policy 3, it is necessary to weigh the accuracy, bias, and cost gains against the losses in precision when ranking these alternatives against current policy.

The reduction in retaking under the average score policy can be empirically corroborated by comparing the behavior of SAT takers with that of Law School Admission Test (LSAT) takers. The most common test score ranking policy among law schools is to use the average of all submitted scores.27 As these simulation results would suggest, the rate of retaking among law school applicants is significantly less than that among college applicants nationwide. Fewer than 20 percent of law school applicants take the LSAT more than once.28

27. In a survey of admissions websites for the top 20 law schools as ranked in US News and World Report, nine schools explicitly stated policies for treating multiple scores. Of these, eight used the average score policy. The ninth school used the highest-score policy. The Law School Admissions Council advises member schools to use the average score policy.
28. This figure was quoted in personal communication with the LSAC statistical department.

Policies 6 and 7: Use only the last submitted score

These policies change applicants' incentives once again. Rather than perform the cost-benefit comparison in Equation 2, applicants deciding whether to take the test an nth time will now determine whether

(4)  \sum_{p_m} \sum_{p_v} V\left( a(p_m^n, p_v^n) - a(p_m^{n-1}, p_v^{n-1}) \right) f(p_m) f(p_v) > c,

where superscripts indicate the test administration from which scores are derived. As in the preceding case, applicants now face the possibility of a decrease in ranking upon retaking. Following the analogy to a search model, this policy would eliminate the possibility of recall. It is interesting to compare the versions of this policy that omit or include the upward-drift correction with Policies 4 and 5 above. Although the last-score policies outperform current policy in terms of cost, accuracy, and bias, they are strictly inferior to the test-score averaging policies on all measures. They share similar precision values with the score-averaging policies.

Policies 8 and 9: Use exactly two scores for each applicant

The final alternative policy involves a ''mandatory retake'' for all applicants. Like the first-score-only policy, this one eliminates the role of the applicant in determining test-taking strategy. As with that earlier policy, this one eliminates the potential for bias; perfect accuracy is attained in the version of the policy that corrects the second score for upward drift. This policy, with or without upward drift correction, achieves the best precision ranking of any of the alternatives. Mandating additional retakes would further improve the precision of the final ranking.

Because no single policy offers the best combination of accuracy, precision, bias, and cost, ranking the score-ranking policies is inherently a subjective matter.
The one clear comparison that can be made here involves the current policy, which is strictly dominated along all four measures by both Policy 8 and Policy 9—the mandatory-retake policies that average exactly two test scores from each applicant, with or without correction for upward drift in the second score. These alternatives are less costly, less biased, more accurate, and more precise than current policy.29 Other alternatives may in turn be preferred to these, depending on the degree to which policymakers are willing to exchange greater precision for lower costs.
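A compact way to see the accuracy, bias, and precision trade-offs among the ranking rules discussed above is to apply each rule to identical simulated score histories. The sketch below does this with retaking fixed at three sittings for every applicant, so the behavioral responses central to the paper's simulations are deliberately ignored, and with assumed values for the ability distribution, measurement error, and upward drift.

```python
# Illustrative comparison of ranking rules on identical simulated score
# histories; retaking behavior is held fixed, so this is not a substitute
# for the paper's simulations. All parameter values are assumptions.
import numpy as np

rng = np.random.default_rng(2)
n_applicants, n_sittings = 200_000, 3
noise_sd, drift = 30.0, 10.0

true = 600 + 60 * rng.standard_normal(n_applicants)   # assumed ability distribution
scores = true[:, None] + noise_sd * rng.standard_normal((n_applicants, n_sittings))
scores += drift * np.arange(n_sittings)               # upward drift on later sittings

policies = {
    "highest score (current policy)": scores.max(axis=1),
    "first score only":               scores[:, 0],
    "last score only":                scores[:, -1],
    "average of all scores":          scores.mean(axis=1),
    "average of exactly two scores":  scores[:, :2].mean(axis=1),
}

for name, estimate in policies.items():
    err = estimate - true
    # mean error reflects bias/accuracy; SD of error reflects (lack of) precision
    print(f"{name:32s} mean error {err.mean():+6.1f}   SD of error {err.std():5.1f}")
```

Even in this stripped-down setting, the qualitative pattern matches the text: keeping the highest score inflates the mean error most, the first-score rule is unbiased but the noisiest, and averaging two or more scores trades a small drift-induced bias (removable by the drift correction) for better precision.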

D. Importance of Existing Bias