W ∼ N W ∼ N ,

The key insight is that the mbtsls modification of the tsls estimator that makes it robust to the presence of many instruments and many exogenous variables is also robust to the presence of direct effects, provided these direct effects are uncorrelated with the effects of the instruments on the endogenous regressor. On the other hand, for liml to be consistent for all values of $\Sigma$, it has to be the case that $\Lambda_{11}$ is equal to zero, since $\min \operatorname{eig}(\Sigma^{-1}\Lambda) > 0$ otherwise.

To provide some intuition, consider the reduced form based on the model (3.1):

$$Y_i = Z_i'(\pi_1\beta + \gamma) + W_i'(\delta + \pi_2\beta) + (\nu_i\beta + \epsilon_i),$$

$$X_i = Z_i'\pi_1 + W_i'\pi_2 + \nu_i.$$

If the instruments are valid, so that $\gamma = 0$, then the vector of reduced-form coefficients on $Z_i$ in the first equation is proportional to $\pi_1$, the vector of reduced-form coefficients in the second equation. The liml estimator tries to impose this proportionality. This leads to efficiency if proportionality holds, under Normality and homoscedasticity (Chioda and Jansson 2009; Anderson, Kunitomo, and Matsushita 2010). However, if $\gamma \neq 0$, then the proportionality does not hold in the population, and liml loses consistency.

On the other hand, mbtsls, like tsls, can be thought of as a two-stage estimator. In the first stage, composite instruments are constructed, one for each regressor (endogenous or exogenous), based on the data on the endogenous regressor, the exogenous variables, and the instruments alone. These instruments are then used to estimate the parameters of interest using a method for just-identified settings, possibly with some adjustment. In this procedure proportionality of the reduced forms is never exploited. This explains why $\Lambda_{12} = 0$ is a sufficient condition for consistency, although it results in efficiency loss relative to liml when proportionality does hold.
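The proportionality argument can be illustrated with a small numerical sketch. The values of `pi1`, `beta`, and `gamma` below are hypothetical and chosen only for illustration; the helper names are not from the article. The sketch forms the reduced-form coefficients on the instruments, $\pi_1\beta + \gamma$, and checks whether they are collinear with $\pi_1$.

```python
def reduced_form_on_instruments(pi1, beta, gamma):
    """Coefficients on Z_i in the reduced form for Y_i: pi1 * beta + gamma."""
    return [p * beta + g for p, g in zip(pi1, gamma)]


def is_proportional(a, b, tol=1e-12):
    """True if vectors a and b are collinear, i.e. all 2x2 minors vanish."""
    n = len(a)
    return all(
        abs(a[j] * b[k] - a[k] * b[j]) <= tol
        for j in range(n)
        for k in range(j + 1, n)
    )


# Hypothetical parameter values (assumptions for illustration only).
pi1 = [0.5, 1.0, -0.8]   # first-stage coefficients on three instruments
beta = 2.0               # structural coefficient

# Valid instruments: gamma = 0, so the reduced forms are proportional.
valid = reduced_form_on_instruments(pi1, beta, [0.0, 0.0, 0.0])
# Invalid instruments: a direct effect gamma != 0 breaks proportionality.
invalid = reduced_form_on_instruments(pi1, beta, [0.3, -0.2, 0.1])

proportional_when_valid = is_proportional(valid, pi1)
proportional_when_invalid = is_proportional(invalid, pi1)
```

In this example the check passes only in the valid-instrument case, which is the restriction liml imposes and the one mbtsls never uses.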
Note also that the bias of the btsls estimator is relatively minor: it is essentially proportional to the product of $\alpha_K$ and $\alpha_L$, so that unless both are substantial, the bias will generally be small. However, the presence of many exogenous regressors might have a large effect on the probability limits of other estimators. For example, in a previous version of this article (Kolesár et al. 2011), we show that the jackknife instrumental variables estimator (Angrist, Imbens, and Krueger 1999) may exhibit substantial bias when the number of exogenous covariates is large.

Without the assumption that the direct effects are uncorrelated (Assumption 5), it follows from the proof of Theorem 1 that the probability limit of k-class estimators has an additional term that is proportional to $\Lambda_{12}$:

$$\hat{\beta}_{\hat{\kappa}} \xrightarrow{p} \beta + \frac{\Lambda_{12} + [1 - \alpha_L - (1 - \alpha_K - \alpha_L)\kappa]\,\Sigma_{12}}{\Lambda_{22} + [1 - \alpha_L - (1 - \alpha_K - \alpha_L)\kappa]\,\Sigma_{22}}. \tag{4.2}$$

In this case, all the k-class estimators are in general inconsistent, and in fact there are no estimators for $\beta$ that are consistent for all values of $\Lambda$. On the other hand, the asymptotic bias of mbtsls, $\Lambda_{12}/\Lambda_{22}$, will be small so long as the covariance between the effect of the instruments on the outcome and their effect on the endogenous regressor is small relative to the strength of the instruments, $\Lambda_{22}$.

The second main result concerns the asymptotic approximation to the distribution of the mbtsls estimator. We focus on the mbtsls estimator because that is the only estimator in the k-class that is consistent under the assumptions we consider. A complication arises because, except in the special case where the only nonzero element of $\Lambda_N$ is the (2, 2) element $\Lambda_{22,N}$ (the standard case with valid instruments, $\Lambda_{11} = 0$), the asymptotic distribution for $\hat{\beta}_{\text{mbtsls}}$ depends on the stochastic properties of $\Lambda_N - \Lambda$. To derive the asymptotic distribution of $\hat{\beta}_{\text{mbtsls}}$, we therefore make one additional assumption about the sequence of $\gamma_k$ and $\pi_{1,k}$.
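As a rough numerical illustration of Equation (4.2), the sketch below evaluates the asymptotic bias for three values of $\kappa$. All parameter values are hypothetical. The limiting $\kappa$ values used for btsls and mbtsls are assumptions not stated in this section: they are the values that make the bracketed term in (4.2) equal to $\alpha_K\alpha_L/(1-\alpha_K)$ and to zero, which matches the description of the two biases in the text.

```python
def kclass_plim_bias(kappa, alpha_K, alpha_L, Lam12, Lam22, Sig12, Sig22):
    """Asymptotic bias of a k-class estimator implied by Equation (4.2)."""
    c = 1.0 - alpha_L - (1.0 - alpha_K - alpha_L) * kappa
    return (Lam12 + c * Sig12) / (Lam22 + c * Sig22)


# Hypothetical values (assumptions for illustration only).
alpha_K, alpha_L = 0.2, 0.1   # limits of K_N / N and L_N / N
Lam12, Lam22 = 0.05, 1.0      # elements of the augmented concentration parameter
Sig12, Sig22 = 0.5, 1.0       # elements of the error covariance matrix

# Assumed limiting kappa values for each estimator.
kappas = {
    "tsls": 1.0,
    "btsls": 1.0 / (1.0 - alpha_K),
    "mbtsls": (1.0 - alpha_L) / (1.0 - alpha_K - alpha_L),
}

bias = {
    name: kclass_plim_bias(k, alpha_K, alpha_L, Lam12, Lam22, Sig12, Sig22)
    for name, k in kappas.items()
}
```

With these inputs the mbtsls bias reduces to `Lam12 / Lam22`, while the tsls and btsls biases also pick up a term driven by `Sig12`.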
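Specifically, similar to corresponding assumptions in the clustering literature, we assume that the parameters $\gamma_k$ and $\pi_{1,k}$ are random and make assumptions regarding their stochastic properties. First, we redefine the parameters by orthogonalizing them with respect to $Z_\perp$ as

$$(\tilde{\gamma} \;\; \tilde{\pi}_1) = (\alpha_K Z_\perp' Z_\perp)^{1/2}\, (\gamma \;\; \pi_1).$$

Assumption 6 (Incidental parameters). The pairs $(\tilde{\gamma}_k, \tilde{\pi}_{1,k})$, for $k = 1, 2, \ldots, K_N$, are iid with distribution

$$\begin{pmatrix} \tilde{\gamma}_k \\ \tilde{\pi}_{1,k} \end{pmatrix} \Bigm| Z, W \sim \mathcal{N}\left( \begin{pmatrix} \mu_\gamma \\ \mu_\pi \end{pmatrix}, \Xi \right).$$

The motivation for formulating the random effects assumption in terms of the orthogonalized parameters, rather than in terms of the original parameters, comes from the cluster structure in our example in Section 2. Exploiting that special structure, the augmented concentration parameter can be written as the sample covariance matrix of $(\gamma_k, \pi_{1,k})$:

$$\Lambda_N = \frac{N}{N_G} \sum_{k=1}^{N_G - 1} \begin{pmatrix} (\gamma_k - \bar{\gamma})^2 & (\gamma_k - \bar{\gamma})(\pi_{1,k} - \bar{\pi}_1) \\ (\gamma_k - \bar{\gamma})(\pi_{1,k} - \bar{\pi}_1) & (\pi_{1,k} - \bar{\pi}_1)^2 \end{pmatrix},$$

where $\bar{\gamma} = \frac{1}{N_G} \sum_{k=1}^{N_G - 1} \gamma_k$ and $\bar{\pi}_1 = \frac{1}{N_G} \sum_{k=1}^{N_G - 1} \pi_{1,k}$.

Now let us consider Assumption 6 and interpret it in this context. Suppose we have a large population of clusters. Let $\delta + U_k$ and $\pi_2 + V_k$ be the population means of $Y_i - \beta X_i$ and $X_i$ in cluster $k$, and let $\delta$ and $\pi_2$ be the population averages of the cluster means. In terms of the original parameterization, we have $\pi_{1,k} = V_k$ and $\gamma_k = U_k$. The natural way to impose a random effects structure on the parameters would be to assume that the cluster means $(\delta + U_k, \pi_2 + V_k)$ are independent and Normally distributed:

$$\begin{pmatrix} \delta + U_k \\ \pi_2 + V_k \end{pmatrix} \sim \mathcal{N}\left( \begin{pmatrix} \delta \\ \pi_2 \end{pmatrix}, \tilde{\Xi} \right). \tag{4.3}$$

This implies

$$(\tilde{\gamma} \;\; \tilde{\pi}_1) = \sqrt{\frac{N_G - 1}{N_G}}\; B \begin{pmatrix} \delta + U_1 & \pi_2 + V_1 \\ \vdots & \vdots \\ \delta + U_{N_G} & \pi_2 + V_{N_G} \end{pmatrix}, \qquad B = \left( I_{N_G - 1} - \frac{1 - 1/\sqrt{N_G}}{N_G - 1}\, \iota_{N_G - 1} \iota_{N_G - 1}' \quad -\frac{1}{\sqrt{N_G}}\, \iota_{N_G - 1} \right),$$

where the $(N_G - 1) \times N_G$ matrix $B$ satisfies $B \iota_{N_G} = 0$ and $BB' = I_{N_G - 1}$. Thus, a random effects specification on $(\delta + U_k, \pi_2 + V_k)$ as in (4.3) implies a random effects specification on $(\tilde{\gamma}, \tilde{\pi}_1)$, namely,

$$\begin{pmatrix} \tilde{\gamma}_k \\ \tilde{\pi}_{1,k} \end{pmatrix} \Bigm| Z, W \sim \mathcal{N}(0, \Xi), \quad \text{with } \Xi = \frac{N_G - 1}{N_G} \cdot \tilde{\Xi}. \tag{4.4}$$

Downloaded by [Universitas Maritim Raja Ali Haji] at 18:46 11 January 2016

Given Assumption 6, it follows that the augmented concentration parameter satisfies

$$\Lambda = \operatorname{plim} \frac{\Lambda_N}{N} = \operatorname{plim} \frac{1}{N} (\gamma \;\; \pi_1)' Z_\perp' Z_\perp (\gamma \;\; \pi_1) = \operatorname{plim} \frac{1}{K_N} \sum_{k=1}^{K_N} \begin{pmatrix} \tilde{\gamma}_k \\ \tilde{\pi}_{1,k} \end{pmatrix} \begin{pmatrix} \tilde{\gamma}_k \\ \tilde{\pi}_{1,k} \end{pmatrix}' = \begin{pmatrix} \mu_\gamma \\ \mu_\pi \end{pmatrix} \begin{pmatrix} \mu_\gamma \\ \mu_\pi \end{pmatrix}' + \Xi.$$

Now we can state the second main result of the article.

Theorem 2 (Asymptotic Normality with many invalid instruments). Suppose that Assumptions 1–6 hold. Suppose, in addition, that $\alpha_K > 0$ and that $\Xi_{12} = 0$. Then

$$\sqrt{N} (\hat{\beta}_{\text{mbtsls}} - \beta) \xrightarrow{d} \mathcal{N}\left( 0,\; \Lambda_{22}^{-2} \left[ \Sigma_{11} \Lambda_{22} + \frac{\alpha_K (1 - \alpha_L)}{1 - \alpha_K - \alpha_L} \left( \Sigma_{11} \Sigma_{22} + \Sigma_{12}^2 \right) + \Lambda_{11} \left( \Sigma_{22} + \frac{\Lambda_{22}}{\alpha_K} \right) \right] \right). \tag{4.5}$$

If in addition $\Lambda_{11} = 0$ (corresponding to the conventional many-instrument case), the distribution for $\hat{\beta}_{\text{mbtsls}}$ is the special case of (4.5) with $\Lambda_{11} = 0$:

$$\sqrt{N} (\hat{\beta}_{\text{mbtsls}} - \beta) \mid Z \xrightarrow{d} \mathcal{N}\left( 0,\; \Lambda_{22}^{-2} \left[ \Sigma_{11} \Lambda_{22} + \frac{\alpha_K (1 - \alpha_L)}{1 - \alpha_K - \alpha_L} \left( \Sigma_{11} \Sigma_{22} + \Sigma_{12}^2 \right) \right] \right). \tag{4.6}$$

In this case imposing Assumption 6 has no effect on the asymptotic distribution. This result obtains because under the standard valid many-instrument asymptotic sequence, the Normal prior on the incidental parameters gets dominated, and the Bernstein–von Mises theorem applies (see Kolesár 2012).

The theorem assumes that the random effects are uncorrelated, $\Xi_{12} = 0$, to rule out the case in which $\tilde{\gamma}$ and $\tilde{\pi}_1$ are correlated but their second moment is zero, $\Lambda_{12} = \Xi_{12} + \mu_\gamma \mu_\pi = 0$, because the correlation happens to be exactly offset by the means. This assumption is moot when the instruments are cluster indicators, because then the means $\mu_\gamma$ and $\mu_\pi$ can be normalized to zero, with the mean effect of the instruments being captured by the terms in $\pi_2$ and $\delta$ that correspond to the intercept, as in Equation (4.4).

The theorem also assumes Normality (Assumption 2(ii)). It is possible to relax this assumption and instead only assume finite fourth moments (Assumption 2(i)). Then the expression for the asymptotic variance would have additional terms (Hansen, Hausman, and Newey 2008; van Hasselt 2010).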
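The two properties of the matrix $B$ used in the cluster example, $B\iota_{N_G} = 0$ and $BB' = I_{N_G-1}$, can be checked numerically. The sketch below is a minimal pure-Python check for one hypothetical number of clusters, assuming that $B$ is the block matrix whose left block is $I_{N_G-1} - \frac{1 - 1/\sqrt{N_G}}{N_G-1}\iota\iota'$ and whose last column is $-\iota/\sqrt{N_G}$; the function names are illustrative only.

```python
import math


def build_B(n_groups):
    """(N_G - 1) x N_G matrix [I - a * ones * ones', -ones / sqrt(N_G)]."""
    m = n_groups - 1
    a = (1.0 - 1.0 / math.sqrt(n_groups)) / m
    last = -1.0 / math.sqrt(n_groups)
    return [
        [(1.0 if i == j else 0.0) - a for j in range(m)] + [last]
        for i in range(m)
    ]


def max_abs_row_sum(B):
    """Largest absolute entry of B * ones (each entry is a row sum)."""
    return max(abs(sum(row)) for row in B)


def max_dev_from_identity(B):
    """Largest absolute entry of B B' - I."""
    m = len(B)
    dev = 0.0
    for i in range(m):
        for j in range(m):
            dot = sum(x * y for x, y in zip(B[i], B[j]))
            dev = max(dev, abs(dot - (1.0 if i == j else 0.0)))
    return dev


B = build_B(6)  # hypothetical number of clusters, N_G = 6
row_sum_error = max_abs_row_sum(B)
orthonormality_error = max_dev_from_identity(B)
```

Both errors are zero up to floating-point rounding. Orthonormal rows are what carry the variance in (4.3) over to the orthogonalized parameters, up to the factor $(N_G - 1)/N_G$, and $B\iota_{N_G} = 0$ is what removes the common means $\delta$ and $\pi_2$.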
These additional terms will be small unless the distribution of the error terms displays substantial skewness or kurtosis and the design matrix of the instruments is unbalanced, so that there are observations with high leverage, $(P_{Z_\perp})_{ii}$ (Kolesár 2012). We focus on the Normal case here to better highlight the substantive effect of relaxing the standard assumption that $\gamma = 0$.

The asymptotic variance of $\hat{\beta}_{\text{mbtsls}}$ is strictly larger if $\Lambda_{11} > 0$ than if $\Lambda_{11} = 0$. The additional term in the variance, $\Lambda_{11}(\Sigma_{22} + \Lambda_{22}/\alpha_K)$, diverges if $\alpha_K$ goes to zero. In contrast with much of the many-instruments literature, where the presence of many instruments is a nuisance, the number of instruments needs to increase with the sample size ($\alpha_K > 0$) for convergence of the estimator to be at the $\sqrt{N}$ rate. The large number of instruments is required so that $\Lambda_{12,N}/N$ converges to zero at the $\sqrt{N}$ rate. This is similar to the clustering literature, in which the number of clusters needs to increase in proportion with the sample size. One may be tempted to avoid this problem by not scaling by $\sqrt{\alpha_K}$ in the definition of $(\tilde{\gamma}, \tilde{\pi}_1)$. However, such a scaling would be rather unusual. For instance, in the clustering example above, it follows from Equations (4.3) and (4.4) that for Assumption 6 to hold, the variance of the cluster effects $U_k$ and $V_k$ would need to be proportional to $K_N/N$: if $K_N/N \to 0$, this implies that in the limit, the cluster effects are exactly the same in each cluster, contradicting the assumptions about $\Lambda_N/N$ (Assumptions 4 and 5).

For comparison, the asymptotic distribution of liml given $\Lambda_{11} = 0$ is

$$\sqrt{N} (\hat{\beta}_{\text{liml}} - \beta) \mid Z \xrightarrow{d} \mathcal{N}\left( 0,\; \Lambda_{22}^{-2} \left[ \Sigma_{11} \Lambda_{22} + \frac{\alpha_K (1 - \alpha_L)}{1 - \alpha_K - \alpha_L} \left( \Sigma_{11} \Sigma_{22} - \Sigma_{12}^2 \right) \right] \right), \tag{4.7}$$

with a smaller variance than the mbtsls estimator under the same assumptions (comparing (4.6) with (4.7)), consistent with the efficiency of liml under those conditions.
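The comparison between (4.5), (4.6), and (4.7) can be made concrete with a short sketch. It assumes the reading of (4.5) in which the extra term is $\Lambda_{11}(\Sigma_{22} + \Lambda_{22}/\alpha_K)$ and uses hypothetical parameter values; the function and argument names are illustrative, not the article's.

```python
def many_instrument_factor(alpha_K, alpha_L):
    """The factor alpha_K * (1 - alpha_L) / (1 - alpha_K - alpha_L)."""
    return alpha_K * (1.0 - alpha_L) / (1.0 - alpha_K - alpha_L)


def avar_mbtsls(alpha_K, alpha_L, Lam11, Lam22, Sig11, Sig12, Sig22):
    """Asymptotic variance as read from (4.5); Lam11 = 0 gives (4.6)."""
    core = Sig11 * Lam22 + many_instrument_factor(alpha_K, alpha_L) * (
        Sig11 * Sig22 + Sig12 ** 2
    )
    direct = Lam11 * (Sig22 + Lam22 / alpha_K)  # term due to direct effects
    return (core + direct) / Lam22 ** 2


def avar_liml(alpha_K, alpha_L, Lam22, Sig11, Sig12, Sig22):
    """Asymptotic variance as read from (4.7), which requires Lam11 = 0."""
    core = Sig11 * Lam22 + many_instrument_factor(alpha_K, alpha_L) * (
        Sig11 * Sig22 - Sig12 ** 2
    )
    return core / Lam22 ** 2


# Hypothetical values (assumptions for illustration only).
aK, aL = 0.2, 0.1
Lam22, Sig11, Sig12, Sig22 = 1.0, 1.0, 0.5, 1.0

v_liml = avar_liml(aK, aL, Lam22, Sig11, Sig12, Sig22)
v_mbtsls_valid = avar_mbtsls(aK, aL, 0.0, Lam22, Sig11, Sig12, Sig22)
v_mbtsls_direct = avar_mbtsls(aK, aL, 0.1, Lam22, Sig11, Sig12, Sig22)
v_mbtsls_few_instruments = avar_mbtsls(0.01, aL, 0.1, Lam22, Sig11, Sig12, Sig22)
```

Under these assumptions the efficiency gap between mbtsls and liml with valid instruments equals twice the many-instrument factor times $\Sigma_{12}^2/\Lambda_{22}^2$, and the variance with direct effects grows as $\alpha_K$ shrinks.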
There is therefore a trade-off between the robustness of the mbtsls estimator to the presence of direct uncorrelated effects and the efficiency of liml in the absence of such effects under Normality and homoscedasticity.

5. TESTING

The assumption that the instruments are valid (i.e., that $\gamma = 0$) is equivalent to restricting the $\Lambda_{11,N}$, and thus the $\Lambda_{12,N}$, elements of the augmented concentration matrix to zero. Several tests of this restriction have been proposed in the literature, most of them in the setting with a fixed number of instruments, but some designed to be robust to the presence of many instruments. The most popular test, due to Sargan (1958), is based on the statistic

$$J_{\text{Sargan}} = \frac{(Y - X\hat{\beta}_{\text{liml}})' P_{Z_\perp} (Y - X\hat{\beta}_{\text{liml}})}{(Y - X\hat{\beta}_{\text{liml}})' M_W (Y - X\hat{\beta}_{\text{liml}}) / N} = N \left( 1 - \hat{\kappa}_{\text{liml}}^{-1} \right).$$

This statistic can easily be computed as the $N \cdot R^2$ from regressing the estimated residuals in the structural equation on instruments and exogenous regressors. Sargan (1958) showed that under the standard strong instrument asymptotic sequence, which keeps the number of instruments and exogenous regressors fixed so that $K_N = K$ and $L_N = L$, this statistic satisfies $J_{\text{Sargan}} \xrightarrow{d} \chi^2_{K-1}$. Anatolyev and Gospodinov (2011) showed, however, that if the number of instruments is allowed to grow with the sample size, the limiting distribution is Normal, and using a critical value based on the $\chi^2$ distribution with $K_N - 1$ degrees of freedom yields an asymptotically conservative test. Anatolyev and Gospodinov (2011) therefore proposed an adjustment to the critical value. Unfortunately, if the number of exogenous regressors is allowed to grow with the sample size as well, the original as well as the adjusted Sargan test have asymptotic size equal to one (Anatolyev 2011). We therefore propose to use a test statistic suggested by Cragg and Donald (1993):

$$J_{\text{Cragg–Donald}} = (N - K_N - L_N)(\hat{\kappa}_{\text{liml}} - 1).$$
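As a minimal sketch, the two statistics can be computed directly from $\hat{\kappa}_{\text{liml}}$, together with the quantile level used in this section's size adjustment, $\Phi\bigl(\sqrt{(1-\alpha_L)/(1-\alpha_K-\alpha_L)}\,\Phi^{-1}(1-\tilde{\alpha})\bigr)$. The sample size and the value of $\hat{\kappa}_{\text{liml}}$ below are hypothetical, and the function names are illustrative. The $\chi^2$ quantile itself is not computed here because the Python standard library does not provide one.

```python
from statistics import NormalDist


def sargan_stat(kappa_liml, N):
    """J_Sargan = N * (1 - 1 / kappa_liml)."""
    return N * (1.0 - 1.0 / kappa_liml)


def cragg_donald_stat(kappa_liml, N, K_N, L_N):
    """J_Cragg-Donald = (N - K_N - L_N) * (kappa_liml - 1)."""
    return (N - K_N - L_N) * (kappa_liml - 1.0)


def adjusted_quantile_level(size, alpha_K, alpha_L):
    """Phi(sqrt((1 - a_L) / (1 - a_K - a_L)) * Phi^{-1}(1 - size))."""
    std_normal = NormalDist()
    scale = ((1.0 - alpha_L) / (1.0 - alpha_K - alpha_L)) ** 0.5
    return std_normal.cdf(scale * std_normal.inv_cdf(1.0 - size))


# Hypothetical inputs (assumptions for illustration only).
N, K_N, L_N = 4000, 238, 79
kappa_hat = 1.1

j_sargan = sargan_stat(kappa_hat, N)
j_cd = cragg_donald_stat(kappa_hat, N, K_N, L_N)

# Quantile level of the chi-square(K_N - 1) distribution for a 5% test.
level = adjusted_quantile_level(0.05, K_N / N, L_N / N)
```

Both statistics are increasing in `kappa_hat` and equal zero at `kappa_hat = 1`. The adjusted level reduces to `1 - size` when `alpha_K` is zero and exceeds it otherwise for nominal sizes below one half.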
Like the Sargan statistic, this statistic depends on the data only through $\hat{\kappa}_{\text{liml}}$. Both tests reject for large values of $\hat{\kappa}_{\text{liml}}$, so their power properties are identical; the only difference between them is in how well they control size. Under the standard strong instrument asymptotics, this statistic, like the Sargan statistic, is also distributed according to $\chi^2_{K-1}$. However, under many-instrument asymptotics, using the $1 - \tilde{\alpha}$ quantile of the $\chi^2$ distribution with $K_N - 1$ degrees of freedom for a test with nominal size $\tilde{\alpha}$ results in asymptotic size distortions. We therefore compare $J_{\text{Cragg–Donald}}$ against the

$$\Phi\left( \sqrt{\frac{1 - \alpha_L}{1 - \alpha_K - \alpha_L}}\; \Phi^{-1}(1 - \tilde{\alpha}) \right)$$

quantile of $\chi^2_{K_N - 1}$, where $\Phi$ is the cdf of a standard Normal distribution. Kolesár (2012) showed that this adjusted Cragg–Donald test controls size under strong, as well as many-instrument, asymptotics.

6. TWO APPLICATIONS

In this section, we discuss two applications. These will serve to provide further context for the empirical content of the assumptions, and in particular the zero correlation assumption (Assumption 5).

6.1 Application I

The first application is based on Chetty et al. (2011), first introduced in Section 2. The interest in Chetty et al. (2011) is in the effect of kindergarten performance on later outcomes. Here, we focus on first-, second-, and third-grade performance as the outcome of interest. The outcome equation is

$$Y_i = \beta X_i + \sum_{\ell=1}^{L_N} \delta_\ell W_{i\ell} + \sum_{k=1}^{K_N} \gamma_k Z_{ik} + \epsilon_i. \tag{6.1}$$

Here, the outcome $Y_i$ is first-, second-, or third-grade performance. The endogenous regressor $X_i$ is kindergarten performance. The exogenous regressors $W_{i\ell}$ include 76 school indicators and three demographic variables (female, black, and being on subsidized lunches), for a total of $L_N = 79$ exogenous variables. The instruments are $K_N = 238$ classroom indicators. The first stage is

$$X_i = \sum_{k=1}^{K_N} \pi_{1,k} Z_{ik} + \sum_{\ell=1}^{L_N} \pi_{2,\ell} W_{i\ell} + \nu_i. \tag{6.2}$$

Table 1. Estimates for Chetty et al. (2011) data

| Estimator | $\hat{\beta}$ | SE: classic | SE: Bekker | SE: many exo | SE: $\Lambda_{11} > 0$ |
|---|---|---|---|---|---|
| Panel I: Grade 1 test scores | | | | | |
| tsls | 0.380 | 0.038 | | | |
| liml | 0.014 | 0.047 | 0.052 | 0.052 | |
| btsls | 0.221 | 0.041 | 0.052 | | |
| mbtsls | 0.215 | 0.041 | 0.052 | 0.052 | 0.066 |
| Panel II: Grade 2 test scores | | | | | |
| tsls | 0.389 | 0.044 | | | |
| liml | 0.108 | 0.049 | 0.057 | 0.057 | |
| btsls | 0.234 | 0.047 | 0.059 | | |
| mbtsls | 0.226 | 0.047 | 0.059 | 0.059 | 0.070 |
| Panel III: Grade 3 test scores | | | | | |
| tsls | 0.385 | 0.048 | | | |
| liml | 0.175 | 0.052 | 0.061 | 0.061 | |
| btsls | 0.238 | 0.051 | 0.064 | | |
| mbtsls | 0.230 | 0.051 | 0.064 | 0.064 | 0.070 |

NOTE: "classic" refers to conventional standard errors that assume a fixed number of instruments, "Bekker" refers to standard errors based on Bekker (1994) that allow for the number of instruments to increase with the sample size, "many exo" refers to standard errors that also allow for many exogenous covariates, and "$\Lambda_{11} > 0$" refers to standard errors based on Theorem 2 that allow for direct effects of the instruments on the outcome.

The motivation for the zero correlation assumption is that the $\gamma_k$ represent the effects of the first- or subsequent-grade teachers. Because the classes largely stay the same from year to year, children with the same kindergarten teacher would have the same first-, second-, and third-grade teacher. However, by design the subsequent teachers were assigned randomly, independently of the kindergarten teachers, and so the $\gamma_k$ would be independent of the $\pi_{1,k}$ if the only direct effect of the kindergarten classroom/teacher assignment was through the subsequent teacher. Finally, we impose a random effects structure on the effects of the instruments on outcomes and endogenous regressors: $(\tilde{\gamma}_k, \tilde{\pi}_{1,k})'$

Z, W ∼ N ,