
Efficiency results of MLE and GMM estimation with sampling weights

J.S. Butler*

Department of Economics, Vanderbilt University, Nashville, TN 37235, USA

Received 1 May 1996; received in revised form 1 June 1998

Abstract

This paper examines GMM and ML estimation of econometric models and the theory of Hausman tests with sampling weights. Weighted conditional GMM can be more efficient than weighted conditional MLE, an inefficient alternative to full information MLE under choice-based sampling, unless regressions have homoscedastic additive disturbances or sampling weights are independent of exogenous variables. GMM variances are necessarily smaller without sampling weights if GMM is the same as MLE or disturbances are homoscedastic, but not in general. Taking into account the dependence of sampling weights on parameters improves the efficiency of estimation. © 2000 Elsevier Science S.A. All rights reserved.

JEL classification: C90; C42; C25

Keywords: GMM; Heteroscedasticity; MLE; Sampling weights

1. Introduction

Econometrics texts rarely refer to sampling weights, and sample design texts rarely refer to econometric models. Thus, neither has considered special properties of weighted estimators of econometric models. Based on choice-based sampling discussed in Manski and Lerman (1977) and Manski and McFadden

*Corresponding author. Tel.: +1-615-322-2871; fax: +1-615-343-8495. E-mail address: [email protected] (J.S. Butler).

0304-4076/00/$ - see front matter © 2000 Elsevier Science S.A. All rights reserved. PII: S0304-4076(99)00049-4



(1981), this paper proves several propositions about weighted estimation. The analysis is in the spirit of sample design (Cochran, 1977) in studying the effects of weights on variances and comparisons of conditional ML and GMM estimation. The variance of a generalized method of moments (GMM) estimator can be smaller when sampling weights are employed in the estimation than when they are not employed. When sampling weights are employed in the estimation, weighted conditional GMM estimation can be more efficient than weighted conditional maximum likelihood estimation (MLE), even if the weights are measured without error, and the Hausman test may be theoretically invalid. This contrasts with the standard result that without sampling weights, MLE is always at least as efficient as GMM estimation.

The models in this paper include linear and nonlinear regression with additive, possibly heteroscedastic disturbances and limited-dependent variable models in which the probability of the discrete outcomes is known and the distribution of the exogenous variables is known and either discrete or continuous ("estimation with p [pdf of the exogenous variables] and Q [probabilities of discrete outcomes] both known", Manski and McFadden, 1981, p. 13). Taking into account any dependence of sampling weights on the parameters which appear in the score function improves the efficiency of the estimation, absent a specific linear dependency. Recent papers examining the use of ancillary data to provide weights and moment conditions to improve on the efficiency of the Manski and Lerman (1977) estimator include Imbens (1992), Lancaster and Imbens (1996), and Imbens and Hellerstein (1996).

Section 2 summarizes established results concerning sampling weights. Section 3 derives, in estimation using sampling weights, the covariance between weighted conditional ML and GMM estimates and the associated Hausman test; Section 4 considers linear and nonlinear regressions; Section 5 examines probit models specifically; and Section 6 shows the efficiency gains from modelling the sampling weights parametrically. Section 7 concludes the paper.

2. Sampling weights

Sample design shows that in the absence of additional information simple random samples and the resulting unweighted estimation minimize the variances of estimated means, but simple random samples are not used for various reasons. The cost of collecting observations from different strata (subpopulations) may be different; the variances within strata may be different; response rates may be different; or a sample may be collected to allow comparisons of two different groups, for which equal samples in both strata are optimal, or to emphasize one group, but then used to estimate population relationships. The question then arises whether sampling weights are needed in an analysis. If sampling is based on exogenous variables and interest is in the parameters of the



distribution of the endogenous variables conditional on the exogenous variables, then sampling weights are not needed and generally, but not always, reduce the efficiency of estimation if they are used. If sampling is based on endogenous variables or both exogenous and endogenous variables, then sampling weights, a priori or estimated consistently, are required to achieve consistency. The results of Manski and Lerman (1977) on choice-based sampling, endogenous sampling usually but not necessarily based on discrete outcomes (Hausman and Wise (1981) and Imbens and Hellerstein (1996) consider continuous outcomes), show that the probability limit of the log likelihood function with choice-based sampling is H(i)/Q(i) times the population log likelihood function, where H(i) is the sample proportion and Q(i) is the population proportion in discrete outcome i. Q(i) may be known a priori or estimated as a function of the parameters. Q(i) must be a function of the parameters of the model, but H(i) and the ratio H(i)/Q(i) may or may not be.

In practice, research is likely to use weighted conditional MLE, a choice-based sampling correction suggested by Manski and Lerman (1977). This multiplies the log likelihood for the sample actually drawn by Q(i)/H(i), which is estimated outside the sample, perhaps based on a census or other large data set, to adjust for the implicit factor H(i)/Q(i). The sampling weights are not explicitly written as functions of the parameters of the model, generally reducing the efficiency of the estimator. Taking into account the way in which weights depend on parameters is full information maximum likelihood (Cosslett, 1981), but it is also difficult to apply, as it may require estimating the joint distribution of the exogenous variables.
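The Q(i)/H(i) correction described above can be sketched numerically. The proportions below are invented for illustration and are not from any data set discussed in this paper:

```python
# Hypothetical illustration of the Manski-Lerman correction: a choice-based
# sample fixes the share of outcome 1 at H(1), and weighting each observation
# by Q(i)/H(i) recovers the population share Q(1). All numbers are invented.
Q1 = 0.2          # population proportion of outcome 1 (assumed known a priori)
H1 = 0.5          # sample proportion of outcome 1, fixed by the design
n = 1000

# A choice-based sample: exactly H1 * n observations have y = 1.
ys = [1] * int(H1 * n) + [0] * (n - int(H1 * n))

# Weight w_t = Q(i)/H(i) for the outcome drawn for observation t.
w = {1: Q1 / H1, 0: (1 - Q1) / (1 - H1)}

unweighted_mean = sum(ys) / n                  # estimates H(1), not Q(1)
weighted_mean = sum(w[y] * y for y in ys) / n  # the weights have mean 1 here

print(unweighted_mean)  # 0.5
print(weighted_mean)    # 0.2
```

The unweighted mean reproduces the design proportion H(1); the weighted mean recovers the population proportion Q(1).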

Even given error-free weights, weighted conditional MLE can be less efficient than weighted conditional GMM, and an attempt to define a Hausman test can be invalid, making specification testing more difficult. These anomalies are not universal, however, and do not appear in standard sample design mainly because they do not arise for weights stochastically independent of the exogenous variables, as, for example, when means are estimated (the only exogenous variable is a constant).

Weighted conditional MLE is consistent and asymptotically normal but inefficient. Imbens (1992), Imbens and Hellerstein (1996), and Lancaster and Imbens (1996) emphasize the use of additional data from a source such as a census to estimate marginal probabilities to reduce the inefficiency. The relationship between the full information procedure and the Manski-Lerman procedure is considered in more detail in Section 6.

3. A Hausman test derivation with sampling weights

The derivation of a Hausman test is also a proof that MLE is the most efficient asymptotically linear estimator not using sampling weights. This



section adapts the derivation to account for sampling weights. Note that expectations of score functions, orthogonality conditions, variances, and covariances are taken over the distribution of the variables induced by the sample design, not the population distribution.

The derivation requires the covariance between the MLE and an asymptotically linear competitor, e.g. a GMM estimator. Define the weighted conditional ML and GMM estimators using

$$L^*_\theta(\theta) = \frac{1}{n}\sum_{t=1}^{n} w_t s_t(\theta), \qquad (1)$$

$$m(\theta) = \frac{1}{n}\sum_{t=1}^{n} w_t \omega_t(\theta), \qquad (2)$$

where $s$ is the set of score functions, one for each of $r$ parameters, and $\omega$ (written $\omega$ here to distinguish the orthogonality conditions from the sampling weight $w_t$) is a set of $q$ orthogonality conditions, with $q \ge r$. Score functions and orthogonality conditions have mean zero at the true parameter vector $\theta_0$. Define $X = \partial s/\partial\theta$ and $\Delta = V(s)$. Let $\iota$ be a column of ones. Then the covariance between $m$ and $L^*_\theta$ is

$$E(w_t^2 s_t\omega_t') = \int w_t^2 s_t\omega_t' f_t\,\mathrm{d}z = \int w_t^2\,\frac{\partial f_t}{\partial\theta}\,\frac{1}{f_t}\,\omega_t' f_t\,\mathrm{d}z = \int w_t^2\,\frac{\partial f_t}{\partial\theta}\,\omega_t'\,\mathrm{d}z = w_t^2\,\iota f_t\omega_t' - \int w_t^2 f_t\,\frac{\partial\omega_t}{\partial\theta'}\,\mathrm{d}z. \qquad (3)$$

The term in $\iota$ is zero if the pdf is zero at the limits of the range of the data. The GMM first-derivatives matrix $D$ (Hansen, 1982), of dimensions $q$ by $r$, which involves linear weights, is called $D_w$ to distinguish it from $D_{ww}$, which involves squared weights; similarly, $V_{ww}(m)$ is the variance of the weighted orthogonality conditions, which involves squared weights. Previous analyses of GMM have not made this distinction based on sampling weights.

$$E(w_t^2 s_t\omega_t') = -D_{ww}', \quad r \times q. \qquad (4)$$

The covariance between the GMM estimator and the MLE is

$$\mathrm{Cov}(\hat\theta_{MLE}, \hat\theta_{GMM}) = X^{-1}D_{ww}'V_{ww}^{-1}D_w(D_w'V_{ww}^{-1}D_w)^{-1}. \qquad (5)$$

The resulting variance matrix of the GMM and ML estimators is

$$V\begin{pmatrix}\sqrt{n}(\hat\theta_{MLE}-\theta_0)\\[2pt] \sqrt{n}(\hat\theta_{GMM}-\theta_0)\end{pmatrix} = \begin{bmatrix} X^{-1}\Delta X^{-1} & X^{-1}D_{ww}'V_{ww}^{-1}(m)D_w(D_w'V_{ww}^{-1}(m)D_w)^{-1}\\[4pt] (D_w'V_{ww}^{-1}(m)D_w)^{-1}D_w'V_{ww}^{-1}(m)D_{ww}X^{-1} & (D_w'V_{ww}^{-1}(m)D_w)^{-1}\end{bmatrix}, \qquad (6)$$

which implies, in the special case of no overidentifying conditions ($r = q$),

$$V(\hat\theta_{MLE} - \hat\theta_{GMM}) = [D_w^{-1}V_{ww}(m)(D_w^{-1})' - D_w^{-1}D_{ww}X^{-1} - X^{-1}D_{ww}'(D_w^{-1})' + X^{-1}\Delta X^{-1}]/n, \qquad (7)$$

on which the Hausman test, if valid, would be based. The more general form in Eq. (6) is used when there are overidentifying conditions.

The attempt to derive the Cramér-Rao or efficiency bound is complicated by sampling weights. The resulting expression is not a variance bound defined without reference to an estimator but instead is a limit depending on the covariance of an alternative estimator with the MLE. The standard form of the Cramér-Rao bound is

$$V(\hat\theta_{GMM}) \ge \mathrm{Cov}(\hat\theta_{GMM}, L^*_\theta)\,(V(L^*_\theta))^{-1}\,\mathrm{Cov}(L^*_\theta, \hat\theta_{GMM}), \qquad (8)$$

and without sampling weights $\mathrm{Cov}(\hat\theta_{GMM}, L^*_\theta) = I$, but with sampling weights,

$$\mathrm{Cov}(\hat\theta_{GMM}, L^*_\theta) = (D_w'(V_{ww}(m))^{-1}D_w)^{-1}(D_w'(V_{ww}(m))^{-1}D_{ww}). \qquad (9)$$

A series of propositions follows. Whenever weighted and unweighted variances are compared, it is assumed that sampling weights would not be employed unless they are necessary for consistent estimation (e.g. choice-based sampling) or increase efficiency if they are used.

Proposition 1. Without sampling weights, for example in a simple random sample, MLE is at least as efficient as GMM.

Proof. $X = \Delta$, $D_w = D_{ww} = D$, and $V(\hat\theta_{MLE} - \hat\theta_{GMM}) = (D^{-1}V(m)(D^{-1})' - X^{-1})/n \ge 0$, so $D^{-1}V(m)(D^{-1})' \ge X^{-1}$ and MLE is at least as efficient as GMM. This is well known but does not extend to weighted estimation. □

The following matrix algebraic results are used to prove several propositions. Given $n \times k$ matrices $A$ and $B$ with nonsingular $A'A$, $A'B$, and $B'B$, $(B'A)^{-1}B'B(A'B)^{-1} \ge (A'A)^{-1}$, but $(B'A)^{-1}B'HB(A'B)^{-1} - (A'A)^{-1}A'HA(A'A)^{-1}$ is positive semidefinite (psd) for all psd $n \times n$ matrices $H$ if and only if the columns of $B$ are linear functions of the columns of $A$: $B = AC$, $C = (A'A)^{-1}A'B$. For the proof, see the appendix.

Proposition 2. The variance of the weighted conditional MLE is at least as large as the variance of the unweighted conditional MLE.

Proof. Let row $t$ of $B$ be $s_t w_t$, the score function for person $t$ multiplied by the sampling weight for person $t$. Let row $t$ of $A$ be $s_t$. $(B'A)^{-1}B'B(A'B)^{-1}$ is the variance with sampling weights and $(A'A)^{-1}(A'A)(A'A)^{-1} = (A'A)^{-1}$ is the variance under a simple random sample. □
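In the scalar-parameter case the proposition reduces to a Cauchy-Schwarz inequality, which a quick numerical sketch confirms. The scores and weights below are invented for illustration:

```python
# Scalar sketch of Proposition 2: with rows s_t for A and w_t s_t for B,
# (B'A)^-1 B'B (A'B)^-1 >= (A'A)^-1 follows from the Cauchy-Schwarz
# inequality. Scores and weights are invented, not estimated from data.
s = [0.5, -1.2, 0.8, -0.1, 1.0]   # hypothetical scalar scores
w = [1.5, 0.7, 1.2, 0.9, 0.7]     # hypothetical sampling weights

unweighted_var = 1 / sum(st * st for st in s)                    # (A'A)^-1
weighted_var = (sum((wt * st) ** 2 for wt, st in zip(w, s))
                / sum(wt * st * st for wt, st in zip(w, s)) ** 2)

print(weighted_var >= unweighted_var)  # True
```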



The case of sampling weights which are stochastically independent of the exogenous variables has little relevance to econometrics, where weights always depend on explanatory variables or choices, but this special case explains basic results in sample design, because a sample mean is the fitted value from regression on a constant term alone, which is independent of the weights.

Proposition 3. Weighted conditional MLE attains the Cramér-Rao bound and is at least as efficient as weighted conditional GMM if the sampling weights are stochastically independent of the variables in the model.

Proof. When the weights are independent of the exogenous variables, the weights are separable from all expectations, and the bound simplifies to

$$V(\hat\theta) \ge \left(\sum w_t^2/n^2\right)(X^{-1}) \ge 0, \qquad (10)$$

which the Manski-Lerman MLE attains, and

$$V(\hat\theta_{MLE} - \hat\theta_{GMM}) = \left(\sum w_t^2/n^2\right)(D^{-1}V(m)(D^{-1})' - X^{-1}) \ge 0. \qquad (11)$$

The average squared weight is the design effect in sample design (Cochran, 1977). Note that heteroscedasticity is irrelevant in this case. □
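The design effect in the proof can be checked directly: for mean-one weights independent of the data, the variance of a weighted mean inflates by exactly the average squared weight. The weights below are invented:

```python
# Design-effect sketch: for weights with mean 1 that are independent of the
# data, the variance of a weighted mean is (sum w_t^2 / n^2) * sigma^2,
# i.e. sigma^2/n times the average squared weight. Weights are invented.
w = [0.5, 1.5, 0.8, 1.2, 1.0]
n = len(w)

design_effect = n * sum(wt * wt for wt in w) / sum(w) ** 2
avg_squared_weight = sum(wt * wt for wt in w) / n

print(design_effect)        # about 1.116 here, and never below 1
print(avg_squared_weight)   # the same number when the mean weight is 1
```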

4. Regressions with sampling weights

The following proposition covers linear or nonlinear regression with additive disturbances with known but not necessarily normal distributions.

Proposition 4. In estimating a linear or nonlinear regression, unweighted estimation produces a smaller variance than weighted estimation using the optimal GMM estimator or whenever disturbances are homoscedastic, but does not necessarily do so for the GMM estimator based on the orthogonality of regressors and disturbances otherwise.

Proof. The model is $y_t = X_t'\beta + \varepsilon_t$, $\varepsilon_t \sim$ pdf $f_t$, and $V(\varepsilon)$ is unrestricted. MLE is based on $X_t$ being orthogonal to $f_t'/f_t$ (which is proportional to $-\varepsilon_t$ if $\varepsilon_t$ is distributed N(0, $\sigma^2$)). For GMM thus defined, Proposition 2 applies. GMM is often based on $X'\varepsilon = 0$. Let row $t$ of $A$ be $X_t$, row $t$ of $B$ be $X_t w_t$, and $H$ be a diagonal matrix with $V(\varepsilon_t)$ on the main diagonal. Then $(B'A)^{-1}B'HB(A'B)^{-1}$ is the weighted GMM variance and $(A'A)^{-1}A'HA(A'A)^{-1}$ is the unweighted GMM variance. Note that homoscedastic disturbances make $H = I$ ($A$ and $B$ are divided by $\sigma$), and variances rise when sampling weights are employed.



In nonlinear models, a common GMM framework assumes $y_t = E(y_t|X_t'\beta) + \varepsilon_t$, where $X_t$ is uncorrelated with usually heteroscedastic $\varepsilon_t$. In this case, MLE requires $X_t$ to be orthogonal to $(f_t'/f_t)(\mathrm{d}E(y_t|X_t'\beta)/\mathrm{d}(X_t'\beta))$, and GMM based on $X'\varepsilon = 0$ makes row $t$ of $A$ be $X_t(\mathrm{d}E(y_t|X_t'\beta)/\mathrm{d}(X_t'\beta))$, row $t$ of $B$ be $X_t w_t$, and $H$ a diagonal matrix with $V(\varepsilon_t)$ on the main diagonal. $(B'A)^{-1}B'HB(A'B)^{-1}$ is the weighted GMM variance, and $(A'A)^{-1}A'HA(A'A)^{-1}$ is the unweighted GMM variance.

The analysis of sample means, homoscedastic errors, normal distributions, and MLE may obscure the fact that weighted GMM variances may be smaller than unweighted GMM variances, for example, in a simple random sample. This has the interesting implication that efficiency of GMM estimation might improve if sampling weights were applied even if the sample is a simple random sample; this cannot happen under MLE. Weighted GMM variances cannot be smaller than unweighted GMM variances if disturbances are homoscedastic, but that is rare in nonlinear models.
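The possibility that weighting lowers the GMM sandwich variance can be seen in a minimal scalar example. The regressor values, disturbance variances, and weights below are all invented; the weights happen to downweight the high-variance observation:

```python
# Sketch of the claim above: under heteroscedasticity, the weighted GMM
# sandwich (B'A)^-1 B'HB (A'B)^-1 can be SMALLER than the unweighted
# (A'A)^-1 A'HA (A'A)^-1. All numbers are invented for illustration.
x = [1.0, 1.0]      # scalar regressor values
h = [1.0, 100.0]    # V(e_t): strongly heteroscedastic disturbances
w = [1.6, 0.4]      # mean-one weights, small where V(e_t) is large

def sandwich(b_rows, a_rows, h_diag):
    """Scalar version of (B'A)^-1 B'HB (A'B)^-1."""
    ba = sum(b * a for b, a in zip(b_rows, a_rows))
    bhb = sum(b * b * ht for b, ht in zip(b_rows, h_diag))
    return bhb / ba ** 2

weighted = sandwich([wt * xt for wt, xt in zip(w, x)], x, h)
unweighted = sandwich(x, x, h)

print(weighted)     # about 4.64
print(unweighted)   # 25.25
```

Here weighting cuts the variance because the weights act like a rough feasible-GLS reweighting; with homoscedastic h the comparison reverses, as the proposition states.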

The next section works out exact results for probit models.

5. Variances of probit models estimated by weighted conditional MLE and GMM

The probit model is very familiar, but the results in this paper are not, so the derivation is fairly explicit.

The probit model is given by the following expressions, where $w_t$ is a sampling weight with mean unity, $y_t^*$ is a latent, normally distributed random variable, $y_t$ is an observable indicator variable equal to 0 or 1, $X_t$ is a vector of exogenous variables, $\beta$ is a vector of parameters to be estimated, and $\varepsilon_t$ is a disturbance distributed N(0, 1):

$$y_t^* = X_t'\beta + \varepsilon_t, \qquad y_t = 1 \text{ if and only if } y_t^* > 0, \quad y_t = 0 \text{ if and only if } y_t^* \le 0. \qquad (12)$$

LH"1

n

n

+

t/1

w

tlnU((2yt!1)Xt@b) (13) with"rst derivatives

LH b" 1 n n + t/1 [w

tst]" 1 n n + t/1

A

w

tXt(2yt!1) /((2y

t!1)Xt@b) U((2y

t!1)Xt@b)

B

. (14)

wheres

tis the set of score functions, Hessian X"!1

n

n

+

t/1

w

tXtXt@

A

(2y

t!1)Xt@b/((2yt!1)Xt@b) U((2y

t!1)Xt@b)

#/2((2yt!1)

X@

tb) U2((2y

t!1)Xt@b)

B


(8)

and outer product"rst derivatives D"1

n

n

+

t/1

A

w2tX tXt@

/2((2y

t!1)Xt@b) U2((2y

t!1)Xt@b)

B

. (16)

In this analysis, the expectation over the distribution of $\varepsilon_t$ is substituted whenever $y_t$ appears, by assuming that the sample size is large for every value of $X_t$. Each formula takes on a value for $y_t = 0$ with probability $\Phi(-X_t'\beta)$ and a different value for $y_t = 1$ with probability $\Phi(X_t'\beta)$:

$$E\left(\frac{(2y_t - 1)X_t'\beta\,\phi((2y_t - 1)X_t'\beta)}{\Phi((2y_t - 1)X_t'\beta)}\right) = 0, \qquad (17)$$

$$E\left(\frac{\phi^2((2y_t - 1)X_t'\beta)}{\Phi^2((2y_t - 1)X_t'\beta)}\right) = \frac{\phi^2(X_t'\beta)}{\Phi(X_t'\beta)(1 - \Phi(X_t'\beta))}. \qquad (18)$$

The variance of weighted probit MLE is (Manski and Lerman, 1977)

$$V(\hat\beta_{MLE}) = X^{-1}\Delta X^{-1}, \qquad (19)$$

where $X$ and $\Delta$ differ only in the use of $w_t$ and $w_t^2$.

Turning to GMM, one possible set of orthogonality conditions for this problem (Avery et al., 1983) is

$$m = \frac{1}{n}\sum_{t=1}^{n} [w_t \omega_t] = \frac{1}{n}\sum_{t=1}^{n} [w_t X_t (y_t - \Phi(X_t'\beta))]. \qquad (20)$$

Other orthogonality conditions are possible, such as those which define the MLE, but the ones used here, based on orthogonality of $X_t$ and the prediction errors, are a convenient set frequently used in practice that can generate a more efficient estimator than weighted conditional MLE. Adding more orthogonality conditions could make GMM even more efficient.

$$D_w = \frac{\mathrm{d}m}{\mathrm{d}\beta} = -\frac{1}{n}\sum_{t=1}^{n} [w_t X_t X_t'\,\phi(X_t'\beta)], \qquad (21)$$

which does not involve $y_t$ and thus requires no expectation to be taken. The variance of the orthogonality conditions is

$$V(m) = \frac{1}{n}\sum_{t=1}^{n} [w_t^2 X_t X_t' (y_t - \Phi(X_t'\beta))^2], \qquad (22)$$

with expectation over the distribution of $\varepsilon_t$ and thus $y_t$

$$V(m) = \frac{1}{n}\sum_{t=1}^{n} [w_t^2 X_t X_t'\,\Phi(X_t'\beta)(1 - \Phi(X_t'\beta))]. \qquad (23)$$



The variance of the GMM estimator is

$$V(\hat\beta_{GMM}) = D_w^{-1}V(m)(D_w^{-1})'. \qquad (24)$$

The GMM variance in Eq. (24) is considered in Proposition 5 and compared to the MLE variance in Eq. (19) in Proposition 6.

Proposition 5. The variance of the unweighted probit GMM estimator can be greater than the variance of the weighted probit GMM estimator.

Proof. Let row $t$ of $B$ be $\phi_t w_t X_t'$, row $t$ of $A$ be $\phi_t X_t'$, and $H$ be a diagonal matrix with $H_{tt} = \Phi_t(1 - \Phi_t)/\phi_t^2$, the inverse residual variance. The weighted variance is $(B'A)^{-1}B'HB(A'B)^{-1}$, and the unweighted variance is $(A'A)^{-1}A'HA(A'A)^{-1}$. □

Proposition 6. In weighted probit estimation, the conditional GMM estimator can be more efficient than the conditional MLE.

Proof. Let row $t$ of $B$ be $X_t'\sqrt{w_t\Phi_t(1 - \Phi_t)}$, row $t$ of $A$ be $X_t'\phi_t\sqrt{w_t/(\Phi_t(1 - \Phi_t))}$, and $H$ be a diagonal matrix with $H_{tt} = w_t$. Then $(B'A)^{-1}B'HB(A'B)^{-1}$ is the variance of the conditional GMM estimator and $(A'A)^{-1}A'HA(A'A)^{-1}$ is the variance of the conditional MLE. Unweighted estimation makes $H = I$ and makes the conditional MLE at least as efficient as the conditional GMM estimator, by Proposition 1. □
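Propositions 1 and 6 can be illustrated with a small numerical sketch for a scalar-regressor probit, specializing the weighted MLE variance in Eq. (19) and the GMM variance in Eq. (24). The two regressor values and the mean-one weights below are invented, and the parameter is fixed at $\beta = 1$; the sketch drops the common $1/n$ normalization, which cancels in each comparison:

```python
# Scalar-regressor probit sketch (beta = 1, invented design):
#   V_MLE ~ (sum w^2 g)/(sum w g)^2,  g_t = x_t^2 phi_t^2 / (Phi_t (1 - Phi_t))
#   V_GMM ~ (sum w^2 v)/(sum w k)^2,  k_t = x_t^2 phi_t,  v_t = x_t^2 Phi_t (1 - Phi_t)
import math

def phi(z):   # standard normal pdf
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def Phi(z):   # standard normal cdf
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def probit_variances(x, w, beta=1.0):
    wg = w2g = wk = w2v = 0.0
    for xt, wt in zip(x, w):
        z = xt * beta
        p, P = phi(z), Phi(z)
        g = xt * xt * p * p / (P * (1.0 - P))      # information term
        wg += wt * g
        w2g += wt * wt * g
        wk += wt * xt * xt * p                     # D_w term
        w2v += wt * wt * xt * xt * P * (1.0 - P)   # V(m) term
    return w2g / wg ** 2, w2v / wk ** 2            # (V_MLE, V_GMM)

x = [1.0, 2.5]
mle_u, gmm_u = probit_variances(x, [1.0, 1.0])   # unweighted
mle_w, gmm_w = probit_variances(x, [0.6, 1.4])   # weighted

print(mle_u <= gmm_u)   # True: Proposition 1, MLE no worse without weights
print(gmm_w < mle_w)    # True: Proposition 6, the ordering reverses
```

With these invented numbers the unweighted MLE beats the unweighted GMM estimator, while the weighted GMM estimator beats the weighted MLE, exactly the reversal the propositions allow.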

6. Efficiency gains from modelling the weights parametrically

Weighted conditional GMM can be more efficient than weighted conditional MLE because of the inefficiency considered in Proposition 7. This is a problem of the second best: given one deviation from full efficiency, another deviation can be an improvement.

Proposition 7. Taking any dependence of the sampling weights on the parameters which appear in the score function into account improves the efficiency of the weighted conditional MLE.

Maximizing $\frac{1}{n}\sum_{t=1}^{n} w_t(\theta)\ln(f_t(\theta))$ implies that

$$E\left(\frac{1}{n}\sum_{t=1}^{n}\left(w_t s_t + (\ln f_t)\,\frac{\mathrm{d}w_t}{\mathrm{d}\theta}\right)\right) = 0. \qquad (25)$$



The first part of the sum in Eq. (25) is the Manski-Lerman weighted conditional MLE, which has mean zero,

$$E\left(\frac{1}{n}\sum_{t=1}^{n}(w_t s_t)\right) = 0, \qquad (26)$$

and by itself leads to a consistent estimator because the weighted score function has mean zero. The second part of the sum also has mean zero,

$$E\left(\frac{1}{n}\sum_{t=1}^{n}\left((\ln f_t)\,\frac{\mathrm{d}w_t}{\mathrm{d}\theta}\right)\right) = 0, \qquad (27)$$

mathematically because subtracting Eq. (26) from Eq. (25) says so, and logically because when the log likelihood is specified correctly, the weights, which have the property that $\sum_{t=1}^{n} w_t = n$ and thus $\sum_{t=1}^{n}(\mathrm{d}w_t/\mathrm{d}\theta) = 0$, are functions of the parameters such that the weighted sum of log likelihoods is maximized. Estimation could be based on Eq. (26) alone, that is, weighted conditional MLE, or Eq. (27) alone, but combining Eqs. (26) and (27) is more efficient. Let

$$a = \frac{1}{n}\sum_{t=1}^{n}(w_t s_t), \qquad b = \frac{1}{n}\sum_{t=1}^{n}\left((\ln f_t)\,\frac{\mathrm{d}w_t}{\mathrm{d}\theta}\right),$$

$V_{11} = V(a)$, $V_{12} = \mathrm{Cov}(a, b)$, and $V_{22} = V(b)$. A standard $m$-estimation analysis of the two sets of equations implies that

$$V(\sqrt{n}(\hat\theta - \theta_0)) = \left[\frac{\mathrm{d}a}{\mathrm{d}\theta}V_{11}^{-1}\frac{\mathrm{d}a}{\mathrm{d}\theta} + \left(\frac{\mathrm{d}a}{\mathrm{d}\theta}V_{11}^{-1}V_{12} - \frac{\mathrm{d}b}{\mathrm{d}\theta}\right)(V_{22} - V_{21}V_{11}^{-1}V_{12})^{-1}\left(V_{21}V_{11}^{-1}\frac{\mathrm{d}a}{\mathrm{d}\theta} - \frac{\mathrm{d}b}{\mathrm{d}\theta}\right)\right]^{-1}. \qquad (28)$$

If $b$ is written as the best multivariate linear approximation to $a$ plus a nonlinear component $c(\theta)$ and a residual vector $e$, then

$$b_t' = a_t'V_{11}^{-1}V_{12} + c_t'(\theta) + e_t' \qquad (29)$$

and $V_{22} - V_{21}V_{11}^{-1}V_{12} = V(c) + V(e)$, so that by differentiating Eq. (29), substituting into Eq. (28), and noting that $V_{11} = \Delta$ and $E(\mathrm{d}a/\mathrm{d}\theta) = X$,

$$V(\sqrt{n}(\hat\theta - \theta_0)) = \left[X\Delta^{-1}X + \frac{\mathrm{d}c}{\mathrm{d}\theta}(V(c) + V(e))^{-1}\left(\frac{\mathrm{d}c}{\mathrm{d}\theta}\right)'\right]^{-1}. \qquad (30)$$



Several conclusions follow from Proposition 7. Weighted estimation is, in general, more efficient when the dependence of the weights on the parameters is taken into account, but there is no gain if $c \equiv 0$ and $e \equiv 0$, i.e. the terms in Eq. (27) are linearly dependent on the terms in Eq. (26). Improving the modelling of the weights by increasing $V(c)$ and $\mathrm{d}c/\mathrm{d}\theta$ and decreasing $V(e)$ increases the efficiency of the weighted estimation, as for example in the papers by Imbens (1992), Lancaster and Imbens (1996), and Imbens and Hellerstein (1996).
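The efficiency gain from stacking the two sets of moment conditions, Eqs. (26) and (27), is the usual gain from optimally combining unbiased estimating equations. A generic two-moment sketch makes the point; the covariance entries below are invented and are not derived from the model in this paper:

```python
# Generic two-moment sketch: optimally combining two unbiased estimating
# equations gives variance (1' V^-1 1)^-1 for a 2x2 covariance matrix V,
# never larger than the variance from either equation alone.
# The covariance entries are invented for illustration.
v11, v22, v12 = 1.0, 2.0, 0.5   # V(a), V(b), Cov(a, b)

det = v11 * v22 - v12 * v12                    # determinant of V
combined_var = det / (v11 - 2.0 * v12 + v22)   # (1' V^-1 1)^-1

print(combined_var)   # 0.875: better than 1.0 (a alone) or 2.0 (b alone)
```

The improvement vanishes only when one moment is an exact linear function of the other, mirroring the $c \equiv 0$, $e \equiv 0$ case above.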

7. Conclusions

Weighted conditional maximum likelihood estimation as proposed by Manski and Lerman (1977) is not full information MLE taking into account the dependence of the population proportion of outcomes on the estimated parameters, so one might conclude that all efficiency claims of MLE are lost. On the other hand, one might think sampling weights do not affect efficiency results because of sample design results based on sample means and regressions with homoscedastic disturbances. This paper compares the variances of weighted conditional MLE and GMM estimation and generalizes the theory of Hausman tests to include sampling weights. Weighted conditional MLE is at least as efficient as weighted conditional GMM in unweighted linear or nonlinear regressions with additive, homoscedastic disturbances, or in models with sampling weights stochastically independent of exogenous variables. In general, weighted conditional GMM can be more efficient.

The effect of sampling weights on GMM variances is as follows: simple random samples minimize GMM variances when GMM estimation is the same as MLE or when residuals are homoscedastic. Otherwise, variance comparisons under simple and stratified random samples depend on the values of the weights and the data, and unweighted estimation can lead to larger variances.

Adding terms representing the parametric dependence of the sampling weights on the parameters appearing in the score function generally increases the efficiency of the estimation unless a specific linear dependency exists in the model.

Acknowledgements

The author thanks Amy Crews, participants in a seminar at the 1997 Econometric Society Australasian Meetings, River Huang, and an anonymous referee for comments on this paper. All errors are the responsibility of the author.



Appendix A. Matrix algebraic results

Given $n \times k$ matrices $A$ and $B$ with nonsingular $A'A$, $A'B$, and $B'B$, $(B'A)^{-1}B'B(A'B)^{-1} \ge (A'A)^{-1}$, but $(B'A)^{-1}B'HB(A'B)^{-1} - (A'A)^{-1}A'HA(A'A)^{-1}$ is positive semidefinite (psd) for all psd $n \times n$ matrices $H$ if and only if the columns of $B$ are linear functions of the columns of $A$: $B = AC$, $C = (A'A)^{-1}A'B$.

Proof. The first result uses the fact that $B'(I - A(A'A)^{-1}A')B \ge 0$ and shows that homoscedastic OLS is at least as efficient as an IV estimator. It follows both that $(A'A)^{-1}A'HA(A'A)^{-1} \ge (A'H^{-1}A)^{-1}$, the 'OLS' form, with equality when $H = I$, and $(B'A)^{-1}B'HB(A'B)^{-1} \ge (A'H^{-1}A)^{-1}$, the 'IV' form, with equality when $B = AC$ and $H = I$. Despite the extra condition on the 'IV' form relative to the 'OLS' form, no definite relationship exists in general between them. Assume that $F'HF - G'HG$ is psd. Then restate

$$F'HF - G'HG = \sum_{r=1}^{n}\sum_{c=1}^{n} H_{rc}(F_r F_c' - G_r G_c'), \qquad (A.1)$$

and choose $H$ to be zero everywhere except element $H_{rr}$ and pre- and post-multiply $F'HF - G'HG$ by a $1 \times k$ vector with 1 in position $i$, a nonzero real number $a$ in position $j$, and 0 elsewhere, to obtain

$$H_{rr}((F_{ri} + aF_{rj})^2 - (G_{ri} + aG_{rj})^2) \ge 0. \qquad (A.2)$$

Choosing $a = -F_{ri}/F_{rj}$ implies $G_{ri} + aG_{rj} = 0$, so $a = -G_{ri}/G_{rj}$. Since this is true for all $r$, $i$, and $j$, $1 \le r \le n$, $1 \le i \le k$ and $1 \le j \le k$, $F$ must be proportional to $G$, so letting the factor of proportionality be $b$,

$$(B'A)^{-1}B' = b(A'A)^{-1}A'. \qquad (A.3)$$

Post-multiplying by $A$ shows that $b = 1$, and taking the outer product of each side of Eq. (A.3), with some algebra, shows that $B = AC$, $C = (A'A)^{-1}A'B$. This result shows that an IV estimator can be more efficient than OLS under heteroscedasticity, along with other results in the paper.

Here is an example. If $A'$ is a vector, $(+1\ \ +1)$, $B'$ is a vector, $(+1\ \ c)$, and $H$ has elements 1 and 3 on the main diagonal and 0 off it, then $A'A = 2$, $A'B = 1 + c$, $A'HA = 4$, $B'HB = 1 + 3c^2$, $(B'A)^{-1}B'HB(A'B)^{-1} = (1 + 3c^2)/(1 + c)^2$ and $(A'A)^{-1}A'HA(A'A)^{-1} = 1$, and $(B'A)^{-1}B'HB(A'B)^{-1} < (A'A)^{-1}A'HA(A'A)^{-1}$ if $0 < c < 1$. For $H$ with 1 and $d > 1$ on the main diagonal in this problem, $(3 - d)/(3d - 1) < c < 1$ makes 'IV' estimation more efficient than 'OLS' estimation.
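The closing example can be verified with a few lines of arithmetic; `d` here generalizes the second diagonal element of H, as in the text:

```python
# Numerical check of the appendix example: A' = (1, 1), B' = (1, c),
# H = diag(1, d). The 'IV' form beats the 'OLS' form exactly when
# (3 - d)/(3d - 1) < c < 1, for d > 1.
def iv_form(c, d):
    return (1.0 + d * c * c) / (1.0 + c) ** 2   # (B'A)^-1 B'HB (A'B)^-1

def ols_form(d):
    return (1.0 + d) / 4.0                      # (A'A)^-1 A'HA (A'A)^-1

# d = 3 reproduces the text: IV beats OLS for any 0 < c < 1.
assert iv_form(0.5, 3.0) < ols_form(3.0)
assert iv_form(1.5, 3.0) > ols_form(3.0)

# General d: for d = 2 the favourable interval is (1/5, 1).
assert iv_form(0.5, 2.0) < ols_form(2.0)   # c = 0.5 is inside (0.2, 1)
assert iv_form(0.1, 2.0) > ols_form(2.0)   # c = 0.1 is outside
```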



References

Avery, R., Hansen, L., Hotz, J., 1983. Multiperiod probit models and orthogonality condition estimation. International Economic Review 24, 21–35.

Cochran, W.G., 1977. Sampling Techniques (3rd Edition). Wiley, New York.

Cosslett, S.R., 1981. Maximum likelihood estimation for choice-based samples. Econometrica 49, 1289–1316.

Hansen, L.P., 1982. Large sample properties of generalized method of moments estimators. Econometrica 50, 1029–1054.

Hausman, J., Wise, D.A., 1981. Stratification on endogenous variables and estimation: the Gary income maintenance experiment. In: Manski, C., McFadden, D. (Eds.), Structural Analysis of Discrete Data with Econometric Applications. MIT Press, Cambridge, pp. 365–391.

Imbens, G., 1992. An efficient method of moments estimator for discrete choice models with choice-based sampling. Econometrica 60, 1187–1214.

Imbens, G., Hellerstein, J.K., 1996. Imposing moment restrictions from auxiliary data by weighting. Working Paper, Harvard University.

Lancaster, T., Imbens, G., 1996. Case-control studies with contaminated controls. Journal of Econometrics 71, 145–160.

Manski, C., Lerman, R., 1977. The estimation of choice probabilities from choice-based samples. Econometrica 45, 1977–1988.

Manski, C.F., McFadden, D., 1981. Alternative estimators of sample design for discrete choice analysis. In: Manski, C.F., McFadden, D. (Eds.), Structural Analysis of Discrete Data with Econometric Applications. MIT Press, Cambridge, MA (Chapter 1).


(1)

and outer product"rst derivatives D"1

n

n

+

t/1

A

w2tX

tXt@

/2((2y

t!1)Xt@b)

U2((2y

t!1)Xt@b)B

. (16)

In this analysis, the expectation over the distribution ofe

tis substituted

when-every

tappears, by assuming that the sample size is large for every value ofXt.

Each formula takes on a value for y

t"0 with probability U(!Xt@b) and

a di!erent value fory

t"1 with probabilityU(Xt@b).

E

A

(2yt!1)Xt@b/((2yt!1)Xt@b) U((2y

t!1)Xtb)

B

"0, (17)

E

A

/2((2yt!1)Xt@b) U2((2y

t!1)Xt@b)B

" /2(

X@

tb)

U(X@

tb)(1!U(Xt@b))

. (18)

The variance of weighted probit MLE is (Manski and Lerman, 1977) <(bK

MLE)"X~1*X~1, (19)

whereXand*di!er only in the use ofw

tandw2t.

Turning to GMM, one possible set of orthogonality conditions for this problem (Avery et al., 1983) is

m"1 n n + t/1 [w

twt]"

1 n n + t/1 [w

tXt(yt!/(Xt@b))]. (20)

Other orthogonality conditions are possible, such as those which de"ne the MLE, but the ones used here, based on orthgonality ofX

tand the prediction

errors, are a convenient set frequently used in practice that can generate a more e$cient estimator than weighted conditional MLE. Adding more orthogonality conditions could make GMM even more e$cient.

D 8" dm db"! 1 n n + t/1 [w

tXtXt@/(Xt@b)], (21)

which does not involve y

t and thus requires no expectation to be taken. The

variance of the orthogonality conditions is <(m)"1

n

n

+

t/1 [w2tX

tXt@(yt!U(Xt@b))2], (22)

with expectation over the distribution ofe

tand thusyt

<(m)"1 n

n

+

t/1 [w2tX


(2)

The variance of the GMM estimator is <(bK

GMM)"D~18 <(m)(D~18 )@. (24) The GMM variance in Eq. (24) is considered in Proposition 5 and compared to the MLE variance in Eq. (19) in Proposition 6.

Proposition 5. The variance of the unweighted probit GMM estimator can be greater than thevariance of the weighted probit GMM estimator.

Proof. Let rowtofBbe/

twtXt@, rowtofAbe/tXt@, andHbe a diagonal matrix

withH

tt"/t(1!/t)/U2t, the inverse residual variance. The weighted variance is

(B@A)~1B@HB(A@B)~1, and the unweighted variance is (A@A)~1A@HA(A@A)~1. h Proposition 6. In weighted probit estimation, the conditional GMM estimator can be more ezcient than the conditional MLE.

Proof. Let row t of B be X@

tJ[wtUt(1!Ut)], row t of A be Xt@/t

J[w

t/(Ut(1!Ut))], and H be a diagonal matrix with Htt"wt. Then

(B@A)~1B@HB(A@B)~1 is the variance of the conditional GMM estimator and (A@A)~1A@HA(A@A)~1 is the variance of the conditional MLE. Unweighted estimation makesH"Iand makes the conditional MLE at least as e$cient as the conditional GMM estimator, by Proposition 1. h

6. E7ciency gains from modelling the weights parametrically

Weighted conditional GMM can be more e$cient than weighted conditional MLE, because of the ine$ciency considered in Proposition 7. This is a problem of the second best, and one deviation from full e$ciency can make another one be an improvement.

Proposition 7. Taking any dependence of the sampling weights on the parameters which appear in the score function into account improves the ezciency of the weighted conditional MLE.

Maximizing1 n

n

+

t/1 w

t(h) ln(ft(h)) implies that

E

A

1 n

n

+

t/1

A

w

tst#(lnft)

dw

t


(3)

The"rst part of the sum in Eq. (25) is the Manski}Lerman weighted conditional MLE, which has mean zero,

E

A

1 n

n

+

t/1 (w

tst)

B

"0, (26)

and by itself leads to a consistent estimator because the weighted score function has mean zero. The second part of the sum also has mean zero,

E

A

1 n

n

+

t/1

A

(lnf

t)

dw

t

dh

BB

"0, (27)

mathematically because subtracting Eq. (26) from Eq. (25) says so, and logically because when the log likelihood is speci"ed correctly, the weights, which have the property that+nt/1w

t"nand thus+nt/1(dwt/dh)"0, are functions of the

parameters such that the weighted sum of log likelihoods is maximized. Estima-tion could be based on Eq. (26) alone, that is, weighted condiEstima-tional MLE, or Eq. (27) alone, but combining Eqs. (26) and (27) is more e$cient. Let

a"

A

1 n

n

+

t/1 (w

tst)

B

, b"

1 n

n

+

t/1

A

(lnf

t)

dw

t

dh

B

, <

11"<(a),<12"Cov(a,b), and<22"<(b). A standardm-estimation analysis of the two sets of equations implies that

<(Jn(hK!h 0))"

C

da dh<~111

da dh #

A

d

a

dh<~111<12! db dhB(<

22!<21<~111<12)~1

A

<

21<~111 da dh!

A

db

dhBBD. (28)

Ifbis written as the best multivariate linear approximation toaplus a nonlinear componentc(h) and a residual vectorethen

b@

t"a@t<~111<12#c@t(h)#e@t (29)

and <

22!<21<~111<12"<(c)#<(e), so that by di!erentiating Eq. (29) and substituting into Eq. (28), and nothing that<

11"Dand E(da/dh)"X. <(Jn(hK!h

0))"

C

X*~1X# dc

dh(<(c)#<(e))~1

A

dc dhBD~1


(4)

Several conclusions follow from Proposition 7. Weighted estimation is, in general, more e$cient when the dependence of the weights on the parameters is taken into account, but there is no gain if c50 and e50, i.e. the terms in Eq. (27) are linearly dependent on the terms in Eq. (26). Improving the model-ling of the weights by increasing <(c) and dc/dh and decreasing <(e) increases the e$ciency of the weighted estimation, as for example in the papers by Imbens (1992), Imbens and Lancaster (1996), and Imbens and Hellerstein (1996).

7. Conclusions

Weighted conditional maximum likelihood estimation as proposed by Manski and Lerman (1977) is not full information MLE taking into account the dependence of the population proportion of outcomes on the estimated para-meters, so one might conclude that all e$ciency claims of MLE are lost. On the other hand, one might think sampling weights do not a!ect e$ciency results because of sample design results based on sample means and regressions with homoscedastic disturbances. This paper compares the variances of weighted conditional MLE and GMM estimation and generalizes the theory of Hausman tests to include sampling weights. Weighted conditional MLE is at least as e$cient as weighted conditional GMM in unweighted linear or nonlinear regressions with additive, homoscedastic disturbances, or in models with samp-ling weights stochastically independent of exogenous variables. In general, weighted conditional GMM can be more e$cient.

The effect of sampling weights on GMM variances is as follows: simple random samples minimize GMM variances when GMM estimation is the same as MLE or when residuals are homoscedastic. Otherwise, variance comparisons under simple and stratified random samples depend on the values of the weights and the data, and unweighted estimation can lead to larger variances.

Adding terms representing the parametric dependence of the sampling weights on the parameters appearing in the score function generally increases the efficiency of the estimation unless a specific linear dependency exists in the model.

Acknowledgements

The author thanks Amy Crews, participants in a seminar at the 1997 Econometric Society Australasian Meetings, River Huang, and an anonymous referee for comments on this paper. All errors are the responsibility of the author.



Appendix A. Matrix algebraic results

Given n×k matrices A and B with nonsingular A'A, A'B, and B'B, (B'A)⁻¹B'B(A'B)⁻¹ ≥ (A'A)⁻¹; but (B'A)⁻¹B'HB(A'B)⁻¹ − (A'A)⁻¹A'HA(A'A)⁻¹ is positive semidefinite (psd) for all psd n×n matrices H if and only if the columns of B are linear functions of the columns of A: B = AC, C = (A'A)⁻¹A'B.

Proof. The first result uses the fact that B'(I − A(A'A)⁻¹A')B ≥ 0 and shows that homoscedastic OLS is at least as efficient as an IV estimator. It follows both that (A'A)⁻¹A'HA(A'A)⁻¹ ≥ (A'H⁻¹A)⁻¹, the 'OLS' form, with equality when H = I, and that (B'A)⁻¹B'HB(A'B)⁻¹ ≥ (A'H⁻¹A)⁻¹, the 'IV' form, with equality when B = AC and H = I. Despite the extra condition on the 'IV' form relative to the 'OLS' form, no definite relationship exists in general between them. Assume that F'HF − G'HG is psd. Then restate

$$
F'HF - G'HG = \sum_{r=1}^{n} \sum_{c=1}^{n} H_{rc} \left( F_r F_c' - G_r G_c' \right), \tag{A.1}
$$

and choose H to be zero everywhere except element H_rr, and pre- and post-multiply F'HF − G'HG by a 1×k vector with 1 in position i, a nonzero real number a in position j, and 0 elsewhere, to obtain

$$
H_{rr} \left( (F_{ri} + a F_{rj})^2 - (G_{ri} + a G_{rj})^2 \right) \geq 0. \tag{A.2}
$$

Choosing a = −F_ri/F_rj implies G_ri + aG_rj = 0, so a = −G_ri/G_rj. Since this is true for all r, i, and j, 1 ≤ r ≤ n, 1 ≤ i ≤ k, and 1 ≤ j ≤ k, F must be proportional to G; letting the factor of proportionality be b,

$$
(B'A)^{-1} B' = b (A'A)^{-1} A'. \tag{A.3}
$$

Post-multiplying by A shows that b = 1, and taking the outer product of each side of Eq. (A.3), with some algebra, shows that B = AC, C = (A'A)⁻¹A'B. This result, along with other results in the paper, shows that an IV estimator can be more efficient than OLS under heteroscedasticity.
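The 'if' direction of this result can be checked numerically: when B = AC with C nonsingular, the 'IV' and 'OLS' sandwich forms coincide for every psd H. A minimal sketch with randomly generated matrices (all names illustrative, not from the paper):

```python
# When B = AC, (B'A)^{-1} B'HB (A'B)^{-1} = (A'A)^{-1} A'HA (A'A)^{-1}
# for any psd H, because the nonsingular C cancels on both sides.
import numpy as np

rng = np.random.default_rng(1)
n, k = 8, 3
A = rng.standard_normal((n, k))
C = rng.standard_normal((k, k)) + 3 * np.eye(k)   # generically nonsingular
B = A @ C                                         # columns of B linear in A
M = rng.standard_normal((n, n))
H = M @ M.T                                       # arbitrary psd weight matrix

ols = np.linalg.inv(A.T @ A) @ (A.T @ H @ A) @ np.linalg.inv(A.T @ A)
iv = np.linalg.inv(B.T @ A) @ (B.T @ H @ B) @ np.linalg.inv(A.T @ B)
assert np.allclose(ols, iv)
```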

Here is an example. If A' is the vector (+1 +1), B' is the vector (+1 c), and H has elements 1 and 3 on the main diagonal and 0 off it, then A'A = 2, A'B = 1 + c, A'HA = 4, B'HB = 1 + 3c², (B'A)⁻¹B'HB(A'B)⁻¹ = (1 + 3c²)/(1 + c)², and (A'A)⁻¹A'HA(A'A)⁻¹ = 1, so (B'A)⁻¹B'HB(A'B)⁻¹ < (A'A)⁻¹A'HA(A'A)⁻¹ if 0 < c < 1. For H with 1 and d > 1 on the main diagonal in this problem, (3 − d)/(3d − 1) < c < 1 makes 'IV' estimation more efficient than 'OLS' estimation.
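The example's arithmetic can be checked directly; a minimal sketch (the `sandwiches` helper and the particular c, d values are ours, chosen to illustrate the stated interval):

```python
# The appendix example, computed directly. With A' = (1, 1), B' = (1, c), and
# H = diag(1, d), both sandwich forms are scalars.
def sandwiches(c: float, d: float) -> tuple[float, float]:
    AtA = 2.0                 # A'A
    AtB = 1.0 + c             # A'B = B'A
    AtHA = 1.0 + d            # A'HA
    BtHB = 1.0 + d * c ** 2   # B'HB
    ols = AtHA / AtA ** 2     # (A'A)^-1 A'HA (A'A)^-1
    iv = BtHB / AtB ** 2      # (B'A)^-1 B'HB (A'B)^-1
    return ols, iv

# d = 3 reproduces the first case in the text: 'IV' beats 'OLS' for 0 < c < 1 ...
ols, iv = sandwiches(c=0.5, d=3.0)
assert iv < ols
# ... and the threshold (3 - d)/(3d - 1) governs the general d > 1 case.
d = 2.0
ols, iv = sandwiches(c=(3 - d) / (3 * d - 1) + 0.1, d=d)
assert iv < ols
```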



References

Avery, R., Hansen, L., Hotz, J., 1983. Multiperiod probit models and orthogonality condition estimation. International Economic Review 24, 21–35.

Cochran, W.G., 1977. Sampling Techniques, 3rd Edition. Wiley, New York.

Cosslett, S.R., 1981. Maximum likelihood estimation for choice-based samples. Econometrica 49, 1289–1316.

Hansen, L.P., 1982. Large sample properties of generalized method of moments estimators. Econometrica 50, 1029–1054.

Hausman, J., Wise, D.A., 1981. Stratification on endogenous variables and estimation: the Gary income maintenance experiment. In: Manski, C., McFadden, D. (Eds.), Structural Analysis of Discrete Data with Econometric Applications. MIT Press, Cambridge, pp. 365–391.

Imbens, G., 1992. An efficient method of moments estimator for discrete choice models with choice-based sampling. Econometrica 60, 1187–1214.

Imbens, G., Hellerstein, J.K., 1996. Imposing moment restrictions from auxiliary data by weighting. Working Paper, Harvard University.

Lancaster, T., Imbens, G., 1996. Case-control studies with contaminated controls. Journal of Econometrics 71, 145–160.

Manski, C., Lerman, R., 1977. The estimation of choice probabilities from choice-based samples. Econometrica 45, 1977–1988.

Manski, C.F., McFadden, D., 1981. Alternative estimators of sample design for discrete choice analysis. In: Manski, C.F., McFadden, D. (Eds.), Structural Analysis of Discrete Data with Econometric Applications. MIT Press, Cambridge, MA (Chapter 1).