
Journal of Business & Economic Statistics

ISSN: 0735-0015 (Print) 1537-2707 (Online) Journal homepage: http://www.tandfonline.com/loi/ubes20

Iterative and Recursive Estimation in Structural Nonadaptive Models

Sergio Pastorello, Valentin Patilea & Eric Renault

To cite this article: Sergio Pastorello, Valentin Patilea & Eric Renault (2003) Iterative and Recursive Estimation in Structural Nonadaptive Models, Journal of Business & Economic Statistics, 21:4, 449-509, DOI: 10.1198/073500103288619124

To link to this article: http://dx.doi.org/10.1198/073500103288619124

Published online: 01 Jan 2012.




Iterative and Recursive Estimation in Structural Nonadaptive Models

Sergio PASTORELLO
Dipartimento di Scienze Economiche, Università di Bologna, 40126 Bologna, Italy

Valentin PATILEA
LEO, Université d'Orléans, 3100 Tours, France

Eric RENAULT
CIRANO and CIREQ, Université de Montréal, Montréal, QC, H3C 3J7, Canada (Eric.Renault@UMontreal.CA)

An inference method, called latent backfitting, is proposed. This method appears well suited for econometric models where the structural relationships of interest define the observed endogenous variables as a known function of unobserved state variables and unknown parameters. This nonlinear state-space specification paves the way for iterative or recursive EM-like strategies. In the E steps, the state variables are forecasted given the observations and a value of the parameters. In the M steps, these forecasts are used to deduce estimators of the unknown parameters from the statistical model of latent variables. The proposed iterative/recursive estimation is particularly useful for latent regression models and for dynamic equilibrium models involving latent state variables. Practical implementation issues are discussed through the example of term structure models of interest rates.

1. INTRODUCTION

The goal of this article is to provide a unified theory for a variety of estimation algorithms that can be used as part of a likelihood analysis of a nonlinear state-space model or as part of a conditional moment-based inference in a structural econometric model. Structural econometric models are challenging because the quantities of interest are not directly related to sample moments of the observations. Typically, these quantities may be not only some unknown parameters, but also latent Markov states that enter the observation in a nonanalytical form, as the solution to differential equations.

Although this econometric setting arises in many empirical applications, we consider as a leading example for this article the econometrics of financial market models. Beginning with the generalized method of moments (GMM) as applied by Hansen and Singleton (1982), econometric analysis of asset pricing models is dedicated mainly to an inversion problem: recovering the pricing kernel or stochastic discount factor (SDF) from the observed asset prices.

Empirical asset pricing theory takes as given a set of $J$ discretely observed asset prices $Y_t$ ($Y_t \in \mathbb{R}^J$) and tries to recover the SDF $m_{t+1}$, which relates them to the vector $G_{t+1}$ of their payoffs by the asset pricing model,
$$Y_t = E[m_{t+1} \cdot G_{t+1} \mid I_t], \qquad (1.1)$$
where $I_t$ is the relevant conditioning information at time $t$. Note that (1.1) implicitly assumes that prices and payoffs are observed at equally spaced intervals, although this assumption is not important.
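A small numerical illustration may help fix ideas for (1.1): with a finite conditional state space, the price is just the probability-weighted sum of discounted payoffs. All numbers below are hypothetical.

```python
import numpy as np

# Two-state illustration of the pricing equation (1.1):
# the price Y_t is the conditional expectation of SDF times payoff.
prob = np.array([0.5, 0.5])        # conditional state probabilities
m_next = np.array([0.97, 0.99])    # stochastic discount factor m_{t+1}
g_next = np.array([105.0, 95.0])   # payoff G_{t+1} in each state

y_t = float(prob @ (m_next * g_next))   # Y_t = E[m_{t+1} * G_{t+1} | I_t]
print(round(y_t, 2))                    # 97.95
```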

In the simplest cases, the asset pricing model (1.1) is fully specified by the definition of the joint conditional probability distribution of $(m_{t+1}, G_{t+1})$ given the observed values at time $t$ of some state variables (which summarize the relevant conditioning information $I_t$) and the value of a vector $\lambda$ of $q$ unknown parameters. The two most famous examples of such asset pricing models are the Sharpe–Lintner CAPM (Sharpe 1964; Lintner 1965) and the Black–Scholes–Merton option pricing model (Black and Scholes 1973).

It has since been widely documented that both the standard CAPM and Black–Scholes (BS) models generate more often than not empirically untenable results. Typically, there is no hope that a fixed volatility parameter in the BS option pricing formula or some fixed shares of the wealth invested in the CAPM market portfolio characterize the SDF in a satisfactory way. This has been acknowledged since the mid-1970s through both the Roll critique and the practitioners' rather bizarre uses of the BS formula. First, the Roll critique pointed out that the wealth portfolio needed for CAPM pricing might not be observable. Second, practitioners have considered the BS pricing formula as a measurement equation to recover an implied volatility parameter. In complete contradiction to the original BS model, this BS implied volatility is seen as a stochastic and time-varying summary of the relevant information conveyed by option prices. Similarly, the Roll critique, revisited as the Hansen–Richard critique (see Cochrane 2001), stresses that the conditioning information of agents may not be observable, and that one cannot omit it for inference regarding asset pricing models.
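The practitioners' use of the BS formula as a measurement equation can be sketched in a few lines: price a call under hypothetical inputs, then invert the formula for the volatility that reproduces the observed price. Bisection suffices because the BS call price is strictly increasing in volatility.

```python
import math

# Black-Scholes call price as a function of volatility (no dividends).
def bs_call(s, k, t, r, sigma):
    d1 = (math.log(s / k) + (r + 0.5 * sigma**2) * t) / (sigma * math.sqrt(t))
    d2 = d1 - sigma * math.sqrt(t)
    n = lambda x: 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))  # standard normal CDF
    return s * n(d1) - k * math.exp(-r * t) * n(d2)

# Treat the formula as a measurement equation: given an observed option
# price, recover the implied volatility by bisection.
def implied_vol(price, s, k, t, r, lo=1e-6, hi=5.0, tol=1e-10):
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if bs_call(s, k, t, r, mid) < price:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

p_obs = bs_call(100.0, 100.0, 0.5, 0.02, 0.25)   # "observed" price (hypothetical inputs)
print(round(implied_vol(p_obs, 100.0, 100.0, 0.5, 0.02), 4))  # recovers 0.25
```

Repeating this inversion day by day yields exactly the stochastic, time-varying implied volatility series that contradicts the constant-volatility assumption of the original model.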

This observation does not invalidate the methodology of statistical inference in the context of structural models of financial markets equilibrium. On the contrary, when one imagines that agents on the market observe some state variables that are not those that an econometrician can measure directly but the values of which are precisely incorporated in the observed asset prices, the addition of state variables is an elegant way of reconciling the equilibrium paradigm and the statistical data on market prices. Although the use of the BS option pricing formula will surely lead to a logical contradiction (i.e., different values of a volatility parameter are used, whereas its constancy is a maintained assumption), the introduction of a convenient number of state variables will avoid this logical contradiction. This strategy of structural econometric modeling of option pricing errors (see Renault 1997 for a survey) maintains the no-arbitrage principle, whereas even very small pricing errors would open the door (up to transaction costs) for huge arbitrage opportunities.

© 2003 American Statistical Association, Journal of Business & Economic Statistics, October 2003, Vol. 21, No. 4, DOI 10.1198/073500103288619124

The resulting statistical specification of the economic model (1.1) appears as a relationship,
$$Y_t = g[Y_t^*, \lambda], \qquad (1.2)$$
where $g$ is a given function completely defined by the economic model, possibly in a nonanalytic form, and $Y_t^*$ is a vector of possibly latent state variables that summarize the conditioning information, $I_t$, needed to compute the conditional expectation (1.1) when the relevant distributional characteristics are specified up to a vector, $\lambda$, of unknown parameters. Typically, the stochastic process $(m, G) = \{m_{t+1}, G_{t+1}\}_{t=1}^{T}$ will be seen as a function of the path over the lifetime of the asset of the possibly continuous-time stochastic process $Y^*$ of state variables, the probability distribution of which is described by a vector $\theta$ of $p$ unknown parameters. Then, the joint definition of the SDF and the vector of payoffs will characterize the vector $\lambda$ of "pricing parameters" as a given function of the vector $\theta$ of statistical parameters,
$$\lambda = \lambda(\theta), \qquad \theta \in \Theta \subset \mathbb{R}^p. \qquad (1.3)$$
This "data augmentation" modeling strategy is often considered a difficult challenge for empirical finance and has generated a variety of new simulation-based minimum chi-squared methods (surveyed in Tauchen 1997). However, these methods often use a fully parametric model as a simulator, whereas neither the parametric efficiency nor the semiparametric robustness is guaranteed (see Dridi and Renault 2001 for a discussion). Nonetheless, the Bayesian literature, in particular the explosion of papers in the area of Markov chain Monte Carlo (MCMC) over the past 10 years, told us that data augmentation should not amount to "difficulty augmentation."

As nicely summarized by Tanner (1996, chap. 4), "all these data augmentation algorithms share a common approach to problems: rather than performing a complicated maximization or simulation, one augments the observed data with 'stuff' (latent data) that simplifies the calculation and subsequently performs a series of simple maximizations or simulations. This 'stuff' can be the 'missing' data or parameter values." Typically, in our case, this will include both the data augmentation from $Y$ to $Y^*$ and the parameter augmentation from $\lambda$ to $\theta$. And, instead of increasing the difficulty, it will allow for simpler maximization-based estimation procedures (M estimation or minimum distance) as applied to latent data.

The basic idea of this article can be described by analogy with the MCMC methodology. Consider the following algorithm: Given an initial estimator $\theta^{(1)}$ of the parameters of interest, compute a proxy $Y_t^{*(1)}$ of the latent variables $Y_t^*$ from the structural relationships: $Y_t = g[Y_t^{*(1)}, \lambda(\theta^{(1)})]$, $t = 1, \ldots, T$. This is the analog of the MCMC step of drawing latent variables given the observables and an initial draw of the parameters. Then using the simplicity of the latent model enables one to compute an estimator $\theta^{(2)}$ of the parameters from the "draw" $Y_t^{*(1)}$, $t = 1, \ldots, T$, of the latent data. This is the analog of the posterior mean (or a random draw) of the augmented posterior distribution that has been precisely used to obtain simplicity. Typically, the latent statistical model that characterizes the stochastic latent data generator is much simpler than the statistical model that characterizes the law of motion of the observed process $Y$. Continuing in this fashion, the algorithm generates a sequence of random variables $[Y_t^{*(p)}, \theta^{(p)}]$, which are consistent with both the observed data and the structural model,
$$Y_t = g[Y_t^{*(p)}, \lambda(\theta^{(p)})], \qquad t = 1, \ldots, T. \qquad (1.4)$$
Moreover, in the same way that a fixed-point argument implies the convergence of the Markov chain produced by an MCMC algorithm, a fixed-point argument is going to ensure that the limit $[Y_t^{*(\infty)}, \theta^{(\infty)}] = \lim_{p \to \infty} [Y_t^{*(p)}, \theta^{(p)}]$ of the foregoing algorithm will exist. Then $\theta^{(\infty)}$ will be a consistent estimator (for a sample size $T$ going to infinity) of the true value $\theta^0$ of the parameters, whereas $Y_t^{*(\infty)}$ consistently "estimates" the best proxy of $Y_t^*$ that one could deduce from the exact pricing relationship,
$$Y_t = g[Y_t^*, \lambda(\theta^0)]. \qquad (1.5)$$
We will show that it is even not necessary to iterate the foregoing algorithm to infinity. A number $p(T)$ of iterations going to infinity with $T$ at a sufficient rate will ensure the consistency and root-$T$ normality of the resulting estimators.
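The alternation just described can be sketched on a deliberately simple toy model (ours, not one from the article): latent $Y_t^* \sim N(\theta^0, 1)$ iid, measurement $Y_t = Y_t^* + \lambda(\theta^0)$ with $\lambda(\theta) = \theta^2$. Each E-like step inverts the measurement equation at the current parameter value; each M-like step re-estimates $\theta$ from the implied states by the latent-model estimator (here the sample mean).

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy nonadaptive model (hypothetical, for illustration only):
#   latent:      Y*_t ~ N(theta0, 1), iid
#   measurement: Y_t = Y*_t + lambda(theta0),  with lambda(theta) = theta**2
theta0 = 0.3
T = 100_000
y_star = theta0 + rng.standard_normal(T)
y = y_star + theta0**2            # observed data

theta = 1.0                        # crude initial guess theta^(1)
for p in range(50):
    y_star_implied = y - theta**2  # E-like step: implied states
    theta = y_star_implied.mean()  # M-like step: latent-model estimator

print(round(theta, 2))             # close to theta0 = 0.3
```

The fixed-point map here is $\theta \mapsto \bar{Y} - \theta^2$, which is locally contracting around $\theta^0 = 0.3$ (derivative $-2\theta^0$), so the iterates settle quickly.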

Therefore, although in a recent survey of MCMC methods for financial econometrics, Johannes and Polson (2001) claimed that "in contrast to all other estimation methods, MCMC is a unified estimation procedure, estimating both parameters and state variables simultaneously," the iterative methodology put forward in this article can be seen as the frequentist analog of MCMC and shares most of its advantages. Among the three main advantages of MCMC methods highlighted by Johannes and Polson (2001), our iterative procedure shares two. Besides the aforementioned simultaneous estimation of parameters and state variables, it is also true that it "exclusively uses conditional simulation, and it therefore avoids numerical maximization and long unconditional state variable simulation. Because of this, ... estimation is typically extremely fast in terms of computing time." Of course, when one reads "conditional simulation," one should understand in the context of classic statistics "estimation" of $Y_t^*$ and $\theta^0$, irrespective of the fact that this estimation is simulation-based or not. The important point, particularly for computational time, is that it is "conditional"; that is, it takes advantage of a previous estimator of the parameters or of the state variables to work directly with the "stuff" (latent data) that simplifies the calculation.

Of course, the third advantage of MCMC methods, namely to "deliver exact finite sample inference," is not shared by our frequentist methodology, which replaces the Bayesian paradigm by an asymptotic one. However, the alleged exact finite-sample performance of the Bayesian approach rests on the validity of the specification of a prior, whereas our asymptotic approach may be much less demanding in terms of parametric specification. The likelihood-based inference closest to our methodology is the EM algorithm (Dempster, Laird, and Rubin 1977). In some particular cases, our approach can be interpreted as an EM algorithm. However, it is more general for at least two reasons. First, it is not limited to a likelihood framework and can be useful in a number of conditional moment-based inference procedures that are popular in modern semiparametric structural econometrics. The analog of the M step of the EM algorithm (i.e., computation of the "estimator" $\theta^{(p+1)}$ from the "data" $Y_t^{*(p)}$, $t = 1, \ldots, T$) will then be performed by any extremum or minimum-distance principle.

Second, even in its maximum likelihood (ML) type applications, our approach does not resort to EM for the main examples of this article. In these examples, EM theory cannot be applied, because the support of the conditional probability distribution of the latent variables given the observable ones depends on the unknown parameters via the "inversion" of the measurement equation (1.5). The analog of the E step of the EM algorithm (i.e., computation of the "data" $Y_t^{*(p+1)}$ from the new estimator $\theta^{(p+1)}$) will resort more to an "implied-state" approach as popularized in finance by the practitioners' use of BS implied volatility parameters.

Typically, as explained earlier, it is often the case that a number of state variables have been introduced only to get rid of observed pricing errors with respect to the theoretical asset pricing model. In this case, the measurement equation (1.5) will provide, for a given value of the parameters, a one-to-one relationship between observed prices or returns $Y_t$ and latent state variables $Y_t^*$. Then implied states $Y_t^*$ can simply be recovered from observations by

$$Y_t = g[Y_t^*, \lambda] \iff Y_t^* = g^{-1}[Y_t, \lambda]. \qquad (1.6)$$
Of course, this does not make the inference issue as trivial as it may appear at first sight, because implied states can be recovered only for a given value of the unknown parameters. Even in the simplest case of a nonlinear state-space model defined by the measurement equation (1.5) [conformable to (1.6) with $\lambda = \lambda(\theta)$] jointly with a transition equation that specifies the dynamics of the latent process $Y^*$ as a parametric model indexed by the vector $\theta$ of unknown parameters, it turns out that the identification and efficiency issues for asymptotic statistical inference about $\theta$ have been largely ignored by the literature until now.

In the context of multifactor equilibrium models of the term structure of interest rates, Chen and Scott (1993), Pearson and Sun (1994), and Duan (1994) were the first to propose maximum likelihood estimation using price data of the derivative contract. These articles stressed in particular that because observed data are the transformed values of some underlying state variables, and "since the transformations in finance generally involve unknown parameters and the inverse transformations may not have analytical solutions, the likelihood functions can be difficult to derive" (Duan 1994), and in particular "it will be better to avoid the direct computation of the Jacobian for the inverse transformation" (Duan 1994).

Affine term structure models (Duffie and Kan 1996; Dai and Singleton 2000) are very popular nowadays, precisely because the bond prices can be calculated quickly as a function of state variables solving a system of ordinary differential equations. It is usually argued that, given the closed-form expressions for bond prices, one can invert any $n$ bond prices into the $n$ state variables and use the implied state variables in the estimation as if they were directly observable.
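As a concrete sketch of this inversion, consider a one-factor Vasicek model, whose zero-coupon bond price is exponential-affine in the short rate, so one observed price maps one-to-one to the state for known parameters. The parameter values below are hypothetical.

```python
import numpy as np

# One-factor Vasicek model: dr = kappa*(mu - r)dt + sigma*dW.
# Zero-coupon bond price is exponential-affine in the short rate r:
#   P(tau, r) = exp(a(tau) - b(tau) * r),
# so one observed price maps one-to-one to r when parameters are known.
kappa, mu, sigma = 0.5, 0.04, 0.01   # hypothetical parameter values

def ab(tau):
    b = (1.0 - np.exp(-kappa * tau)) / kappa
    a = (mu - sigma**2 / (2 * kappa**2)) * (b - tau) - sigma**2 * b**2 / (4 * kappa)
    return a, b

def price(tau, r):
    a, b = ab(tau)
    return np.exp(a - b * r)

def implied_state(tau, p):
    # invert the pricing map: r = (a(tau) - log P) / b(tau)
    a, b = ab(tau)
    return (a - np.log(p)) / b

r_true = 0.03
p_obs = price(5.0, r_true)          # "observed" 5-year bond price
print(implied_state(5.0, p_obs))    # recovers r_true = 0.03
```

The point of the next paragraph is precisely that this inversion, innocuous as it looks, makes the unknown parameters enter the likelihood a second time, through `implied_state` and its Jacobian.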

This common belief is actually wrong on theoretical grounds. The key point is that, because transformations in finance generally involve unknown parameters, these unknown parameters will appear in the likelihood function, not only through the parametric model of the latent state variables, but also as inputs of the inverse transformation and its Jacobian. Besides the aforementioned computational difficulties, the main consequence of that is a modification of the statistical information, as measured by the Fisher information matrix in the likelihood setting.

The first contribution of this article is a theoretical study of the difference between the statistical information conveyed by the state variables and that conveyed by observed data. This study will be carried out not only in the fully parametric setting (implied-states maximum likelihood and Fisher information matrix), but also in more general semiparametric contexts defined either by a set of conditional moment restrictions (implied-states GMM and semiparametric efficiency bound) or by some quasi-maximum likelihood or any extremum estimation (robustified Fisher information matrix).

Besides its theoretical drawbacks, the aforementioned confusion between infeasible efficient estimation with hypothetically observed state variables (hereafter "latent" estimation) and actual "implied-states" estimation is misleading for the definition of well-suited estimation algorithms. The other contribution of this article is various algorithmic estimation procedures, first with an iterative viewpoint, and second with a recursive approach. A general asymptotic theory of the proposed estimators is developed.

As far as iterative estimation is concerned, the asymptotic properties of our EM-type estimator cannot be properly assessed without a clear characterization of its two main ingredients: the asymptotic probability distribution of the infeasible latent estimator on the one hand, and the asymptotic contracting feature of the mapping $\bar\theta_T$ that generates the sequences of random variables $\theta^{(p)}$ on the other hand:
$$\theta^{(p+1)} = \bar\theta_T(\theta^{(p)}), \qquad p = 1, \ldots, p(T). \qquad (1.7)$$
The general asymptotic theory of the estimator $\theta^{(p(T))}$ proposed in this article nests several estimators previously proposed in the literature. Renault and Touzi (1996) considered, for the purpose of option pricing with stochastic volatility, the case of ML with $p(T) = \infty$. Renault (1997) sketched the case of more general extremum estimators and other fields of applications. Pan (2002) focused on an implied-states method of moments that is implicitly considered with $p(T) = \infty$.

Moreover, the structural econometric model of interest, (1.5), may be not only any asset pricing model, but also other equilibrium models, as produced in particular by game theory. Florens, Protopopescu, and Richard (2001) considered the case of auction markets. Our estimator also nests the iterative least squares (ILS) estimator as proposed by Gouriéroux, Monfort, Renault, and Trognon (1987) and studied extensively by Dominitz and Sherman (2001) in the context of semiparametric binary choice models. We show that also in this latter context, it is worth disentangling the semiparametric efficiency concepts associated with latent estimation and implied-states estimation.



It is often the case that the filtering of the latent variables (i.e., computation of the implied states from a given value of the parameters) involves computationally intensive calculations and/or simulations, as in the E step of simulated EM (SEM) algorithms (see, e.g., Ruud 1991). Therefore, we also propose some recursive procedures that, contrary to the iterative ones, involve not the batch processing of the full block of data at each step of the algorithm, but rather only a simple updating of the estimator each time that new data arrive. Moreover, the updating scheme does not imply an optimization procedure (e.g., Young 1985; Kuan and White 1994a, b; Kushner and Yin 1997). Generally speaking, the advantages of a recursive approach are twofold. First, as long as the guess on the parameters is far from the true value, the updating of the guess may be performed using only a small part of the sample, which allows much faster nonlinear filtering procedures. Once the sequence of recursive guesses stabilizes, one may use these values as starting values for an iterative algorithm. Second, the quick and simple updating formula provided by the recursive approach is well suited for online estimation.

The price to pay for time savings through recursive procedures that are directly focused on the implied-states latent moment conditions is, in general, some efficiency loss with respect to iterative procedures. We also characterize this efficiency loss in terms of the asymptotic contracting feature of the mapping $\bar\theta_T$. This efficiency loss should be much smaller than the one produced by less-specific recursive procedures, like the Kalman filter, which do not take advantage of the exact nonlinear structure of the model and introduce excessive error. For the simplest examples of affine term structure models, we put forward some Monte Carlo results that demonstrate the better computational and statistical performance of our approach in comparison with traditional Kalman filter ones (Duan 1994; De Jong 2000). The superior performance of the exact implied-states recursive procedures proposed here should be even more striking in more nonlinear asset pricing models where the Kalman filter no longer admits any theoretical justification.
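The flavor of such a recursive update can be sketched on a toy measurement equation of our own making (not one from the article): $Y_t = Y_t^* + \theta_0^2$ with latent $Y_t^* \sim N(\theta_0, 1)$ iid. Each new observation is inverted with the current parameter guess, and the guess is updated with a decreasing Robbins–Monro-type gain, so no batch of data is ever reprocessed. The projection interval is an ad hoc stabilization device.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy model (hypothetical): Y_t = Y*_t + theta0**2, Y*_t ~ N(theta0, 1) iid.
# Recursive scheme: invert each new datum at the CURRENT guess, then
# update the guess with gain 1/t (projected on a compact set for stability).
theta0 = 0.3
theta = 1.0                                    # initial guess
for t in range(1, 200_001):
    y_t = theta0 + rng.standard_normal() + theta0**2
    y_star_implied = y_t - theta**2            # implied state, one datum
    theta = np.clip(theta + (y_star_implied - theta) / t, -0.5, 1.5)

print(round(theta, 2))                         # drifts toward theta0 = 0.3
```

Each pass costs one inversion and one addition, which is the computational appeal; the cost, as discussed above, is some asymptotic efficiency loss relative to iterating on the full sample.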

The article is organized as follows. In Section 2 we study the general identification problem for our implied-states approach and compare it with some classical issues of nonadaptivity in econometrics. The regression example allows us to interpret our iterative approach as an extension of classical backfitting. The identification condition is further interpreted in the framework of latent regression models and ILS through generalized residuals. In Section 3 we specialize the implied-states approach to the case (1.6) where the observed variables are one-to-one functions of the latent state variables. We show how our approach can apply to GMM and likelihood-type criteria. We devote Section 4 to asymptotic theory of the iterative estimators and extend this theory in Section 5 to a recursive version of the same algorithms. The efficiency loss of the recursive implied-states estimator with respect to the iterative one is also analyzed. The practical issues associated with the implementation of the various competing approaches are sketched in Section 6 through a short empirical illustration in the context of a model of the term structure of interest rates. Miscellaneous proposals for further empirical research and concluding remarks appear in Section 7. Technical proofs are gathered in the Appendix.

2. IDENTIFICATION AND NONADAPTIVITY

In this section we motivate our iterative approach as a natural way to address an issue of nonadaptivity, that is, the occurrence of some nuisance parameters that prevent one from directly obtaining a consistent estimator of the parameters of interest. The price to pay for the proposed iterative strategy is that the required identification condition may be stronger than the usual one, although we argue that this strengthening is rather weak.

To support this claim, we consider some benchmark examples of nonadaptivity in econometrics and show that our iterative strategy nests some already well-documented procedures. The first example is the partially parametric regression model, in which only some parameters of the parametric part are of interest. Our iterative estimation procedure is then tantamount to standard backfitting, and the required identification condition appears to be, at least locally, no stronger than the identification condition of the regression model.

The second example is ILS through generalized residuals of a latent regression model. In this case, the iterative procedure can be interpreted as an extension of standard backfitting, which we term "latent backfitting." The relevant identification conditions have already been extensively characterized by Dominitz and Sherman (2001), and we just revisit their argument to illustrate that the necessary identification condition is not much stronger than the usual one.

2.1 The General Framework

The focus of interest is a class of inference problems that can be described via the more general issue of nonadaptivity. We consider a semiparametric statistical model specified by a family, $\mathcal{P}$, of probability measures, $P$, on the sampling space and two mappings, $\theta(\cdot): \mathcal{P} \to \Theta \subset \mathbb{R}^p$ and $\lambda(\cdot): \mathcal{P} \to \Lambda \subset \mathbb{R}^q$. The vector $\theta = \theta(P)$ contains the $p$ parameters representing the features of interest of the probability $P$, which has hypothetically governed the sampling of the observations. The vector $\lambda = \lambda(P)$ contains $q$ nuisance parameters. The model is assumed to be well specified in the sense that there exists a true unknown probability, $P^0 \in \mathcal{P}$, which defines the data-generating process (DGP), and $\theta^0 = \theta(P^0)$ and $\lambda^0 = \lambda(P^0)$ are the corresponding true unknown values of the parameters of interest and of the nuisance parameters.

We are interested in an extremum estimator of $\theta$ deduced from a sample-based criterion (objective function)
$$Q_T[\theta, \lambda] = Q_T\big[\theta, \lambda; \{Y_t\}_{1 \le t \le T}\big],$$
which depends on vectors $\theta \in \Theta$ and $\lambda \in \Lambda$. We consider a first set of assumptions similar to those usually imposed for extremum estimation in the presence of nuisance parameters (see, e.g., Wooldridge 1994, sec. 4).

Assumption 1. a. For any $T \ge 1$, $Q_T[\cdot,\cdot]$ satisfies the standard measurability and continuity conditions; that is, it is measurable as a function of the observations and continuous as a function of the parameters $(\theta, \lambda)$.

b. There exists a limit criterion $Q_\infty[\cdot,\cdot] = Q_{\infty,P^0}[\cdot,\cdot]$ such that
$$\operatorname{plim}_{T \to \infty} Q_T[\theta, \lambda] = Q_\infty[\theta, \lambda], \qquad \forall (\theta, \lambda) \in \Theta \times \Lambda.$$



The specificity of the general framework considered in this article is the mode of occurrence of the nuisance parameters $\lambda$ in the objective function $Q_T$. As noted in Section 1, $\theta$ is associated with the probability distribution of some state variables process $Y^*$, whereas $\lambda$ appears in the measurement equation
$$Y_t = g(Y_t^*, \lambda). \qquad (2.1)$$
In other words, $Q_T[\theta, \lambda^0]$ will typically be an objective function provided by some standard latent extremum estimation principle, whereas $\lambda$ appears due to the need to recover implied states from the observations $\{Y_t\}_{1 \le t \le T}$ and the measurement equation (2.1).

Because this measurement equation is defined by an equilibrium model stemming from, for example, option pricing or game theory, it is itself tightly related to the DGP of the state variables. Therefore, the true unknown value $\lambda^0$ of the nuisance parameters is a function of the true unknown value $\theta^0$ of the parameters of interest. To highlight this point, we write $\lambda(P) = \lambda(\theta(P))$. Note that $\lambda(P)$ may also depend on some other nuisance parameters insofar as we get a consistent estimator of them. The important point is that there generally is no hope of obtaining a preliminary consistent estimator of $\lambda^0$ to be able to deduce a consistent estimator of $\theta^0$, because the former is a function of the latter.

In other words, we consider that there is no reason to assume that an extremum estimator $\bar\theta_T(\lambda)$ built as
$$\bar\theta_T(\lambda) = \arg\max_{\theta} Q_T[\theta, \lambda] \qquad (2.2)$$
from an arbitrarily fixed value $\lambda$ of the nuisance parameters will provide a consistent estimation of the true value $\theta^0$ of the parameters of interest. Typically, only $\bar\theta_T(\lambda^*)$ computed from some particular value $\lambda^*$ of the nuisance parameters will provide a consistent estimator. This is the general issue of nonadaptivity.

Assumption 2. a. For any $\lambda \in \Lambda$, the function $\theta \mapsto Q_\infty[\theta, \lambda]$ admits a unique maximizer $\bar\theta(P^0, \lambda)$, where $P^0$ is the probability measure governing the observations.

b. There exists some $\lambda^* \in \Lambda$ such that $\bar\theta(P^0, \lambda^*) = \theta^0$.

Perhaps the best-known example of nonadaptivity in econometrics is the linear regression model with two sets of explanatory variables, which we now consider for illustration purposes. With a slight change of notation to introduce exogenous variables, let $(Y_t, X_t)$, $t = 1, \ldots, T$, be iid random vectors such that $Y_t \in \mathbb{R}$, $X_t = (X_{1t}', X_{2t}')' \in \mathbb{R}^p \times \mathbb{R}^q$ and
$$\begin{cases} E_P(X_t X_t') \text{ is a nonsingular matrix,} \\ E_P[Y_t - X_{1t}'\theta(P) - X_{2t}'\lambda(P) \mid X_{1t}, X_{2t}] = 0, \\ \theta(P) \in \Theta = \mathbb{R}^p \text{ and } \lambda(P) \in \Lambda = \mathbb{R}^q, \end{cases} \qquad (2.3)$$
where $E_P$ denotes the expectation with respect to the probability $P$; in particular, $E_{P^0} = E^0$ stands for the expectation with respect to the DGP.

The ordinary least squares principle leads to the maximization criterion
$$Q_\infty[\theta, \lambda] = -E^0\big(Y_t - X_{1t}'\theta - X_{2t}'\lambda\big)^2. \qquad (2.4)$$

Therefore, the set of values $\lambda^*$ of the nuisance parameters $\lambda$ such that
$$\bar\theta(P^0, \lambda^*) = \theta^0$$
is characterized by a translation of the kernel space of the matrix $E^0(X_{1t}X_{2t}')$,
$$\lambda^* - \lambda^0 \in \operatorname{Ker} E^0(X_{1t}X_{2t}').$$
Of course, we are faced with the nonadaptivity problem when the kernel subspace does not coincide with the whole space $\mathbb{R}^q$. Therefore, except for specific choices of the function $\lambda(\cdot)$, we have

$$\theta^0 \ne \arg\max_{\theta} Q_\infty[\theta, \lambda(\theta)]. \qquad (2.5)$$
In the simple linear model (2.3), the Frisch–Waugh theorem (see Frisch and Waugh 1933; Davidson and MacKinnon 1993, p. 19) provides a particular function, $\lambda^0(\cdot)$, to avoid the "problem" (2.5). Indeed, if we define $\lambda^0(\cdot)$ by
$$E^0[X_{2t}X_{2t}'\lambda^0(\theta)] = E^0[X_{2t}(Y_t - X_{1t}'\theta)], \qquad (2.6)$$
then
$$\theta^0 = \arg\max_{\theta} Q_\infty[\theta, \lambda^0(\theta)]. \qquad (2.7)$$

In practice, it may happen that the function $\lambda^0(\cdot)$ is unknown but can be consistently estimated by, say, $\lambda_T(\cdot)$. In the Frisch–Waugh example,
$$\lambda_T(\theta) = (X_2'X_2)^{-1}X_2'[Y - X_1\theta]$$
is the empirical counterpart of (2.6); $Y$, $X_1$, and $X_2$ are the matrices corresponding to the first $T$ observations of the variables $Y$, $X_1$, and $X_2$. The finite-sample criterion associated with (2.4), where $\lambda$ is replaced by the function $\lambda_T$, equals
$$Q_T[\theta, \lambda_T(\theta_1)] = -T^{-1}\sum_{t=1}^{T}\big[Y_t - X_{1t}'\theta - X_{2t}'\lambda_T(\theta_1)\big]^2 = -T^{-1}\big\| (I_d - P_{X_2})Y - X_1\theta + P_{X_2}X_1\theta_1 \big\|^2,$$
where $P_{X_2} = X_2(X_2'X_2)^{-1}X_2'$ is the orthogonal projection matrix onto the subspace of $\mathbb{R}^T$ spanned by the columns of $X_2$.
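In this Frisch–Waugh setting, iterating between the nuisance step $\lambda_T(\theta^{(p)})$ and the maximization of $Q_T$ over $\theta$ reduces to alternating least squares regressions, i.e., standard backfitting, and the iterates converge to the joint OLS coefficients. A minimal simulated sketch (the design and coefficient values are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(2)

# Two blocks of regressors: theta is of interest, lambda is a nuisance.
T, p, q = 500, 2, 2
X1 = rng.standard_normal((T, p))
# X2 correlated with X1, so dropping it would bias theta (nonadaptivity)
X2 = X1 @ np.array([[0.5, 0.2], [0.1, 0.4]]) + rng.standard_normal((T, q))
theta_true = np.array([1.0, -2.0])
lam_true = np.array([0.5, 0.3])
Y = X1 @ theta_true + X2 @ lam_true + 0.1 * rng.standard_normal(T)

def ols(X, y):
    return np.linalg.lstsq(X, y, rcond=None)[0]

theta = np.zeros(p)
for _ in range(200):
    lam = ols(X2, Y - X1 @ theta)      # nuisance step: lambda_T(theta^(p))
    theta = ols(X1, Y - X2 @ lam)      # update theta given the nuisance fit

theta_joint = ols(np.hstack([X1, X2]), Y)[:p]   # joint OLS for comparison
print(np.allclose(theta, theta_joint, atol=1e-6))
```

The iteration contracts at a rate governed by the canonical correlations between the two regressor blocks, which is the finite-sample face of the contraction condition discussed later in the section.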

The focus of interest of this article is a class of structural econometric models where only a function $\lambda(\cdot)$ such that
$$\theta^0 = \arg\max_{\theta} Q_\infty[\theta, \lambda(\theta^0)] \qquad (2.8)$$
is known or is available by estimation, but (2.5) may happen. Therefore, by analogy with standard backfitting, natural estimation strategies based on the criterion $Q_T$ should distinguish the two occurrences of $\theta$. The criterion is then written as $Q_T[\theta, \lambda_T(\theta_1)]$, and the estimations are defined through iterative steps,
$$\theta^{(p+1)} = \arg\max_{\theta} Q_T\big[\theta, \lambda(\theta^{(p)})\big], \qquad p = 1, 2, \ldots. \qquad (2.9)$$
Note that for the sake of notational simplicity, we always let $Q_T[\theta, \lambda(\theta_1)]$ denote the finite-sample criterion, whereas in some cases, $\lambda(\theta_1)$ should be replaced by a consistent estimator $\lambda_T(\theta_1)$. In other words, by a slight abuse, all of the data dependence is summarized by the notation $Q_T$.



It is important to keep in mind in this respect that all of the iterative/recursive estimation strategies considered in this article do not change when the criterion $Q_T$ is replaced by $\tilde Q_T$ conformable to
$$\tilde Q_T\big[\theta, \lambda(\theta^{(p)})\big] = Q_T\big[\theta, \lambda(\theta^{(p)})\big] + \varphi_T(\theta^{(p)}),$$
for an arbitrary numerical function $\varphi_T(\cdot)$ defined on $\Theta$. Thus we are definitely unable to check a condition like (2.7).

Moreover, this "limited information" use of the criterion Q_T requires that we strengthen the standard identification condition. Because by (2.8) we have in mind an infeasible extremum estimator that would be defined from the criterion Q_T[θ; λ(θ⁰)], a basic identification condition to be imposed should be that θ⁰ is the unique maximizer of the limit criterion θ → Q_∞[θ; λ(θ⁰)]; that is,

θ̄[P⁰; λ(θ⁰)] = θ⁰.   (2.10)

This condition would be quite similar to the usual identification condition that would have been imposed if the limit criterion were considered as a function (θ, λ) → Q_∞[θ; λ] and if λ were nuisance parameters for which a consistent estimator is available (see, e.g., Wooldridge 1994, thm. 4.3). But, given the nonadaptivity problem and the need to build an estimation procedure from the steps in (2.9), we need to maintain a stronger identification condition imposing that the only fixed point of the function θ̄[P⁰; λ(·)] is θ⁰. Otherwise, it may happen that the statistician will not be able to reject a "bad guess" θ^(p) ≠ θ⁰ used to build the criterion Q_T[·; λ(θ^(p))]. This remark leads to the following identification condition required by our estimation strategy.
Assumption 3. θ⁰ = θ(P⁰) is the unique fixed point of the map θ̄[P⁰; λ(·)].

Note that this assumption implies that θ⁰ is well identified from the observations as the unique fixed point of the function θ̄[P⁰; λ(·)], where θ̄[P⁰; λ] is the unique maximizer of plim Q_T[·; λ].
We now analyze some conditions ensuring Assumption 3. Consider that the function θ̄[P⁰; λ(·)] is differentiable with continuous partial derivatives. If θ̄[P⁰; λ(θ⁰)] = θ⁰, then by a Taylor expansion argument we can deduce that (at least locally) θ⁰ is the unique fixed point of θ_1 → θ̄[P⁰; λ(θ_1)] provided that the matrix

I_p − (∂θ̄/∂θ'_1)[P⁰; λ(θ⁰)]

is invertible. In terms of the first derivatives of the function θ̄[P⁰; λ(·)], this appears to be the minimal condition for ensuring the unique fixed point property locally. However, θ⁰ will be estimated by iterations, which requires a stronger condition for ensuring the convergence of the estimator.
Suppose that the function θ̄[P⁰; λ(·)] is known. Then its unique fixed point could be approximated through the iterations θ^(k) = θ̄[P⁰; λ(θ^(k−1))], k = 1, 2, . . . . Note that, in fact, the iterative estimation methodology proposes to replace θ̄[P⁰; λ(·)] with its sample counterpart (2.2) and to perform the corresponding iterations. A suitable way to ensure the convergence of the iterations θ^(k) is to assume that θ̄[P⁰; λ(·)] is a contracting mapping; that is, there exists c ∈ [0, 1) such that

‖θ̄[P⁰; λ(θ′)] − θ̄[P⁰; λ(θ″)]‖ ≤ c ‖θ′ − θ″‖,   θ′, θ″ ∈ Θ.

When θ̄[P⁰; λ(·)] admits continuous first-order derivatives and θ⁰ is an interior point of Θ, the contracting property is tantamount to the condition

‖(∂θ̄/∂θ'_1)[P⁰; λ(θ⁰)]‖ < 1,   (2.11)

at least after reducing Θ. Here and throughout the rest of the article, ‖A‖ denotes the spectral norm of the matrix A and is defined by ‖A‖² = ρ(A'A), where ρ(B) stands for the spectral radius of the square matrix B, that is, its largest eigenvalue in absolute value. Clearly, if (2.11) holds, then I_p − (∂θ̄/∂θ'_1)[P⁰; λ(θ⁰)] is invertible, and thus the unique fixed point property is guaranteed locally.
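As a toy numerical check of the contraction condition (2.11), consider the invented affine map below (it stands in for θ̄[P⁰; λ(·)] and is not taken from the article): the spectral norm of its Jacobian is below 1, so successive approximations converge to the unique fixed point.

```python
import numpy as np

# Invented affine fixed-point map on R^2: thetabar(theta) = A @ theta + b
A = np.array([[0.4, 0.1],
              [0.0, 0.3]])
b = np.array([1.0, -0.5])

def thetabar(theta):
    return A @ theta + b

# spectral norm ||A|| = sqrt(rho(A'A)) = largest singular value; (2.11) requires < 1
spec_norm = np.linalg.norm(A, 2)

# method of successive approximations, theta^(k) = thetabar(theta^(k-1))
theta = np.zeros(2)
for k in range(200):
    theta = thetabar(theta)

# for an affine map, the unique fixed point solves (I_2 - A) theta = b
theta_star = np.linalg.solve(np.eye(2) - A, b)
print(spec_norm, theta, theta_star)
```

For this linear map the Jacobian is exactly A, so (2.11) and the invertibility of I_2 − A can both be read off directly.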

In conclusion, the contracting condition, or its local version (2.11), will represent a key assumption for proving convergence results for our estimators in a general framework. However, as far as identification is concerned, the relevant condition is the weaker Assumption 3.
We now illustrate the hierarchy of the identification hypotheses (the basic condition θ̄[P⁰; λ(θ⁰)] = θ⁰, the unique fixed-point property, and the model identification condition) in the context of the standard backfitting algorithm.

2.2 Classical Backfitting Revisited

Consider the backfitting algorithm (see, e.g., Hastie and Tibshirani 1990) in the context of a general partially parametric regression (PPR) model. Such a model is characterized by a regression function split into two parts: a parametric part, possibly nonlinear, and a nonparametric part (see Andrews 1994a, p. 52). In general, one writes

Y_t = h(X_{1t}, θ) + λ(X_{2t}) + u_t,   θ ∈ Θ ⊂ R^p, λ ∈ G,
E(u_t | X_t) = 0,   X_t = (X'_{1t}, X'_{2t})',   t = 1, . . . , T,

with h(·, ·) known and G an infinite-dimensional set of real-valued functions. For the sake of simplicity, in this example we assume that the process {u_t, X'_t} is stationary and ergodic. The quantities of interest, namely the true unknown values θ⁰ and λ⁰ of the Euclidean parameter θ and the function λ, are defined as the unique minimizers of the mean squared error, that is, the unique maximizers of

Q_∞[θ; λ] = −E⁰[Y_t − h(X_{1t}, θ) − λ(X_{2t})]².
The identification assumption expressed by this condition can be made even more precise by using the following decomposition:

E[Y_t − h(X_{1t}, θ) − λ(X_{2t})]²
  = E[Y_t − h(X_{1t}, θ) − E(Y_t − h(X_{1t}, θ) | X_{2t})]²
  + E[E(Y_t − h(X_{1t}, θ) | X_{2t}) − λ(X_{2t})]².   (2.12)

Indeed, because for any value of θ one can define a function λ(·) that renders the second term of the right side of (2.12) equal to 0, the identification condition for the Euclidean parameter amounts to

h(X_{1t}, θ_1) − E⁰(h(X_{1t}, θ_1) | X_{2t}) = h(X_{1t}, θ⁰) − E⁰(h(X_{1t}, θ⁰) | X_{2t})  ⇒  θ_1 = θ⁰.

Given a guess λ_T^(1), a natural way to estimate θ is to consider the empirical counterpart θ_T^(2) of

θ̄(P⁰, λ) = arg max_{θ∈Θ} −E⁰[Y_t − h(X_{1t}, θ) − λ(X_{2t})]²,

with λ replaced by λ_T^(1). With this estimate θ_T^(2), one can get an improved guess, λ_T^(2), of the function λ by smoothing Y_t − h(X_{1t}, θ_T^(2)) on X_{2t}. We can continue this iterative smoothing process, thus obtaining an example of classical backfitting. If it exists, the limit of the backfitting algorithm can be explicitly characterized as the solution of the following system of equations:

λ̂_T = K[Y − h(X_1, θ̂_T)],
θ̂_T = arg min_{θ∈Θ} Σ_{t=1}^T [Y_t − h(X_{1t}, θ) − λ̂_T(X_{2t})]²,   (2.13)

where h(X_1, θ) = (h(X_{11}, θ), . . . , h(X_{1T}, θ))' and K is the smoothing matrix used to estimate the conditional expectation of Y − h(X_1, θ) given X_2. Such a fixed point may not exist in finite samples, but may exist only asymptotically when T goes to infinity. We address this issue later in our general theory (see Sec. 4); it is omitted here for the sake of expositional simplicity. Hence θ̂_T is obtained by solving the first-order conditions
Σ_{t=1}^T (∂h/∂θ)(X_{1t}, θ̂_T) [Y_t − h(X_{1t}, θ̂_T) − λ̂_T(X_{2t})] = 0,

which corresponds to the limit first-order conditions

E⁰[(∂h/∂θ)(X_{1t}, θ^(p+1)) (Y_t − h(X_{1t}, θ^(p+1)) − λ^(p)(X_{2t}))] = 0,   (2.14)

where the function λ^(p)(·) is defined by

λ^(p)(X_{2t}) = λ(θ^(p))(X_{2t}) = E⁰[Y_t − h(X_{1t}, θ^(p)) | X_{2t}].
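A minimal numerical sketch of this classical backfitting loop, assuming a Gaussian Nadaraya–Watson smoothing matrix for K and a made-up parametric part h(x, θ) = exp(θx) (neither choice is specified in the article; all numbers below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
T = 400
X1 = rng.uniform(-2, 2, size=T)
X2 = rng.uniform(-2, 2, size=T)

# toy PPR model: h(x, theta) = exp(theta * x), lambda0(x) = sin(x)
theta_true = 0.7
Y = np.exp(theta_true * X1) + np.sin(X2) + 0.3 * rng.normal(size=T)

# Nadaraya-Watson smoothing matrix K on X2 (Gaussian kernel, fixed bandwidth)
h_bw = 0.3
W = np.exp(-0.5 * ((X2[:, None] - X2[None, :]) / h_bw) ** 2)
K = W / W.sum(axis=1, keepdims=True)

grid = np.linspace(-2.0, 2.0, 2001)           # crude grid search for the NLS step
theta = 0.0                                    # starting guess
for p in range(50):
    lam_hat = K @ (Y - np.exp(theta * X1))     # smooth the partial residuals on X2
    sse = [((Y - lam_hat - np.exp(g * X1)) ** 2).sum() for g in grid]
    theta_new = grid[int(np.argmin(sse))]      # NLS of Y - lam_hat on h(X1, .)
    if theta_new == theta:                     # fixed point on the grid reached
        break
    theta = theta_new

print(theta)
```

The two updates inside the loop are exactly the two equations of (2.13): the smoother step plays the role of λ̂_T = K[Y − h(X_1, θ̂_T)], and the grid minimization is the nonlinear least squares step.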

Therefore, the backfitting algorithm will possibly provide a consistent estimator of θ⁰ insofar as the following identification condition is fulfilled:

E⁰[(∂h/∂θ)(X_{1t}, θ_1) (Y_t − h(X_{1t}, θ_1) − E⁰[Y_t − h(X_{1t}, θ_1) | X_{2t}])] = 0  ⇒  θ_1 = θ⁰.

When maintaining the assumption that the first-order conditions characterize a unique maximizer, the backfitting identification condition coincides with the unique fixed-point condition of Assumption 3, where the function θ̄ is defined by

θ̄[P⁰; λ(θ_1)] = arg max_θ −E⁰[Y_t − h(X_{1t}, θ) − λ(θ_1)(X_{2t})]².

Using the model equations

Y_t = h(X_{1t}, θ⁰) + λ⁰(X_{2t}) + u_t  and  E⁰(u_t | X_t) = 0,

we rewrite this backfitting identification condition as:
Condition C1 (backfitting identification):

E⁰[(∂h/∂θ)(X_{1t}, θ_1) (h(X_{1t}, θ_1) − E⁰[h(X_{1t}, θ_1) | X_{2t}])]
  = E⁰[(∂h/∂θ)(X_{1t}, θ_1) (h(X_{1t}, θ⁰) − E⁰[h(X_{1t}, θ⁰) | X_{2t}])]  ⇒  θ_1 = θ⁰.
As already noted, the PPR model identification condition can be written in the following way:

Condition C2 (identification in the PPR model):

h(X_{1t}, θ_1) − E⁰(h(X_{1t}, θ_1) | X_{2t}) = h(X_{1t}, θ⁰) − E⁰(h(X_{1t}, θ⁰) | X_{2t})  ⇒  θ_1 = θ⁰.

Finally, we note that the basic identification condition

θ̄[P⁰; λ(θ⁰)] = θ⁰

is tantamount to the following standard identification condition in the "latent" regression model:

Condition C3 (identification in the latent regression model):

h(X_{1t}, θ_1) = h(X_{1t}, θ⁰)  ⇒  θ_1 = θ⁰.

It is clear that

C1 ⇒ C2 ⇒ C3.
As previously mentioned, the difficulty in PPR models appears when considering nonlinear regression functions h. If h is linear (the case of the partially linear regression model), then the conditions C1 and C2 are equivalent and mean that the residual variance Var⁰(X_{1t} − E[X_{1t} | X_{2t}]) is positive definite. C3 means that E⁰[X_{1t}X'_{1t}] is positive definite.
Let us now study to what extent, in the general PPR model, the backfitting identification condition C1 is more restrictive than the standard identification condition C2. Define

φ(X_t, Y_t, θ, λ(θ_1)) = (∂h/∂θ)(X_{1t}, θ) (Y_t − h(X_{1t}, θ) − E⁰[Y_t − h(X_{1t}, θ_1) | X_{2t}])

and assume that the function θ̄[P⁰; λ(·)] can also be defined as the implicit solution of

E⁰[φ(X_t, Y_t, θ̄[P⁰; λ(θ_1)], λ(θ_1))] = 0,   θ_1 ∈ Θ.

Assuming the needed regularity conditions, differentiate this identity with respect to θ_1, take θ_1 = θ⁰ (recall that θ̄[P⁰; λ(θ⁰)] = θ⁰), and deduce that

(∂θ̄/∂θ'_1)[P⁰; λ(θ⁰)] = M⁻¹N,   (2.15)

where

M = −E⁰[(∂φ/∂θ')(X_t, Y_t, θ, λ(θ⁰))]|_{θ=θ⁰} = E⁰[(∂h/∂θ)(X_{1t}, θ⁰)(∂h/∂θ')(X_{1t}, θ⁰)]   (2.16)

The sample counterpart of λ(θ) provides a root-T consistent estimator,

λ_T(θ) = (X'_2X_2)⁻¹X'_2(Y − X_1θ),

whereas the sample counterpart of Q_∞[θ; λ(θ_1)] is

Q_T[θ; λ(θ_1)] = −(1/T) Σ_{t=1}^T [Y_t − X'_{1t}θ − X'_{2t}λ(θ_1)]².

It is mentioned just after (2.9): "Note that for the sake of notational simplicity we always denote by Q_T[θ; λ(θ_1)] the finite sample criterion while, in some cases, λ(θ_1) should be replaced by a consistent estimator λ_T(θ_1). In other words, by a slight abuse, all the data dependence is summarized by the notation Q_T." This abuse of notation is innocuous insofar as it does not conceal the restrictive content of the assumptions maintained about the asymptotic behavior of the function Q_T. These assumptions are:

Assumption 1 (for consistency of the backfitting estimator). Uniform convergence for θ, θ_1 ∈ Θ of Q_T[θ; λ(θ_1)] toward Q_∞[θ; λ(θ_1)].

This basically means two things: (1) uniform convergence for all θ, λ of a sample counterpart of Q_∞[θ; λ] and (2) uniform convergence for all θ_1 of λ_T(θ_1) toward λ(θ_1).

Assumption 5 (for deriving the root-T asymptotic probability distribution of the backfitting estimator). Root-T asymptotic normality of ∂Q_T/∂θ[θ; λ(θ⁰)]|_{θ=θ⁰}.

This comes from two primitive assumptions: (1) root-T asymptotic normality of a sample counterpart of ∂Q_∞/∂θ[θ; λ(θ⁰)]|_{θ=θ⁰} and (2) root-T asymptotic normality of the estimator λ_T(θ⁰) of λ(θ⁰).

Of course, the expression of the matrix B(θ⁰) must take into account these two sources of randomness whenever λ(θ⁰) must be estimated. This point can be easily checked in the Frisch–Waugh example.
b. In the classical backfitting example, the nonparametric part of the model will introduce asymptotic normality with nonparametric rates that are beyond the scope of the article. This is the reason that Section 4.1, about consistency of the backfitting estimator, states "for a general parameter space" (see Prop. 4.2), whereas Section 4.2, about asymptotic probability distributions, focuses on a parameter space "assumed to be a subset of some Euclidean space R^p." In other words, the classical backfitting example must be considered for identification and consistency issues, but not for asymptotic probability distributions. It plays a crucial role in the article, however, because it has motivated the backfitting terminology.

In this example, when the nuisance part of the regression model is really nonparametric, λ(θ) is a functional defined by (2.14):

λ(θ)(X_{2t}) = E[Y_t − h(X_{1t}, θ) | X_{2t}].

Of course, this functional should be estimated with a nonparametric rate of convergence from a smoother K as defined in (2.13):

λ_T = K[Y − h(X_1, θ_T)].

X. Chen's words of caution regarding the additional issues that should be addressed for extending the article to functional estimation are appreciated. The idea of recursive nonparametric learning, following Chen and White (1998), is also appealing.
REPLY TO M. CHERNOV

M. Chernov's discussion interestingly focuses on the concept of nuisance parameters and gives us the opportunity to clarify some notational issues that seem to be responsible for the trouble of some discussants (see also the replies to Q. Dai and J. Pan). Just to give a funny example of the difficulty of remaining true to constant notation when several different issues are addressed in the same article, we note that M. Chernov starts his discussion with a notation that contradicts that adopted in the article: "Pastorello, Patilea, and Renault (PPR henceforth)," whereas we write in Section 2.2, "a general partially parametric regression model (PPR hereafter)"; X. Chen's discussion remains true to our notation.
More seriously, M. Chernov's introductory reference to what he calls the "fundamental theorem of asset pricing" [his formula (1)] allows him to classify the issues of interest but, unfortunately, with some misleading notation. We actually write in the Introduction that "the joint conditional probability distribution of (m_{t+1}, g_{t+1}) given ... the relevant conditioning information I_t" involves the "value of a vector λ of q unknown parameters" in such a way that the fundamental theorem of asset pricing,

Y_t = E[m_{t+1} · g_{t+1} | I_t],

is encapsulated in a pricing relationship,

Y_t = g(Y*_t, λ),

where Y_t is a vector of "J discretely observed asset prices," whereas "Y*_t is a vector of possibly latent state variables which summarize the conditioning information." Moreover, the probability distribution of the variables Y*_t is assumed to be described by a vector θ of p unknown parameters in such a way that λ appears to be a given function of θ:

λ = λ(θ).

It is important to realize at this stage that:
1. The variables Y*_t are "possibly latent," but nothing precludes that some of them are observed (as is the case in Sec. 6) to ensure identification.

2. Because the variables of interest include a pricing kernel, nothing precludes that some components of the vector θ of statistical parameters are related to preferences, in particular, to the risk premium. Such preference parameters are actually relevant statistical characteristics of the pricing kernel dynamics.

Indeed, this convention is followed in the application of Section 6, where it is announced, from the beginning, that although "the pricing kernel is not explicitly included in the list of state variables ... the dynamics of the latent risk factors X_t only identify a set θ_1 of unknown statistical parameters while the risk premium parameters θ_2 must be added to define the complete

vector θ of structural parameters of interest for asset pricing: θ = [θ'_1, θ'_2]'."
Thus, preference (risk premium) parameters are definitely not considered as nuisance parameters, but as part of the vector θ of structural parameters of interest. The unfortunate confusion of some readers seems to come from the conflict between two common but contradictory notational uses:

1. λ is a standard notation in statistical theory for nuisance parameters. This is the reason that, in the spirit of backfitting, we write Y_t = g(Y*_t, λ) and, as a consequence, Q_T[θ; λ]. Nonadaptivity in statistical estimation is produced by some perverse interactions between structural parameters θ and nuisance parameters λ, which precludes consistent estimation of the former when no consistent estimator of the latter is available. Backfitting, as advocated in the article, appears to be well suited when nuisance parameters are defined as a known function of structural parameters. In the financial applications we have in mind, we actually have λ(θ) = θ or part of θ. Therefore, the distinction between parameters of interest and nuisance parameters does not come from their actual value but from their role inside the criterion function used for M estimation. In this respect, it is true that "PPR simply merge λ_p and θ," but it is not true that "this clearly removes the nonadaptivity problem." We hoped that the introductory examples would help the reader to realize that the same unknown parameter value could be simultaneously a nuisance θ_1 [e.g., to recover Y*_t(θ_1) in the latent regression example] and an object of interest [e.g., estimated by nonlinear least squares of Y*_t(θ_1) on X_t].
2. Unfortunately, and we apologize for this, we have also followed, for the specific term structure example of Section 6, another notational use. This use is also common, but more in the finance literature: The risk premium parameter is denoted by λ. This is the reason that we complete the definitions (6.15) and (6.16) of the models of interest with the notation

θ_1 = (k, c, σ)'  and  θ_2 = λ.

We did not realize that this would introduce some confusion with the general notation λ(·) for the function defining nuisance parameters, because throughout Section 6 this function is nothing but the identity function,

λ(θ) = θ = [θ'_1, θ'_2]',

as is apparent from the beginning [see (6.1) and (6.2)]. Note that the measurement equation actually depends only on a not one-to-one function of θ defined by the risk-neutral parameters, but there is no need to introduce a specific notation for this; it is implicitly encapsulated in the definition of the function h in (6.1).
We hope that this reply will now make clear that it is not fair to say that "at the implementation stage, PPR are confronted with two problems, which were not addressed in their theoretical developments."

First, about blurring "the separation between statistical and risk-premium parameters raising the question of whether such a separation was necessary in the first place," the answer is that this separation was simply not made in the first place.
Second, it is right to say that "the nuisance parameter λ_n has no explicit dependence on θ" or, more precisely, that there is no adaptivity problem with respect to this parameter, called η in Section 6 of the article. But the case of such a nuisance parameter without an adaptivity problem can easily be accommodated within the general framework. When η is replaced by a sample counterpart η_T, this defines a criterion Q_T that converges toward some criterion Q_∞. Because there is no adaptivity problem, the resulting estimator θ_T of the structural parameters θ will be consistent, irrespective of the value of η_∞ = plim η_T. Of course Q_∞, and in turn the asymptotic variance of θ_T, will depend on η_∞. This is the reason that, insofar as there is some hope that the nuisance part of the model is correctly specified with a true unknown value η⁰, it may make sense to resort to what we call "extended backfitting." The idea is to introduce within the backfitting algorithm a flavor of quasi-generalized least squares to ensure η_∞ = η⁰. But, contrary to what the discussant suggests, we are not worried by the fact that the "convergence criterion has poor properties in λ_n." Even if these properties were good, we would likely prefer extended backfitting, because there is no reason to include in a backfitting algorithm some parameters that do not raise any adaptivity issue.
Let us just briey answer two additional interesting questions of M. Chernov’s discussion. “The backŽtting procedure effec-tively allows to operate on the likelihood of the state variables rather than observables, which allows to avoid computation of the Jacobian” andalso to avoid maximizationwith respect to the value of the implied states. The latter point should not be for-gotten because, as mentioned in the article, this allows us, for instance, to use Aït-Sahalia’s likelihood expansions, whereas they diverge when one considers that implied states depend on unknown parameter values. We acknowledge that the Žrst point may be a matter of taste [the discussant considers that “once one hasg.¢/, computing Jacobian is not difŽcult”] but we ob-serve that, for instance, Pan (2003) has also renounced comput-ing some Jacobians needed to deŽne the optimal instruments. It might be worth remembering that these Jacobians should be involved in a sophisticated optimization procedure (and not just computed once) and that a proxy of the derivative of the func-tiongis much less stable than the proxy of the functiongitself. We do like the discussant’s remark about the continuously updated GMM of Hansen, Heaton, and Yaron (1996). Because it is an excellent illustration of cases without an adaptivity prob-lem, it was actually an introductory example in some trans-parencies used to present this article in various seminars. Even though we Žnally chose not to include it in the article for length reasons, we are glad to briey summarize it in this reply. Let us consider the following asymptotic criterion function in the context of GMM with standard assumptions:

Q1[µ ; ¸.µ1/]D ¡E[Ã .Yt; µ /0]W[¸.µ1/]E[Ã .Yt; µ /];

where W[¸.µ1/] is deŽned as the optimal weighting matrix whenµ1is the true unknown value of the parameters. Then sev-eral questions arise and should not be confused:

Question 1. Is there an adaptivity problem? In other words, is it the case that the use of a wrong value of θ_1 may prevent us from getting a consistent estimator of θ? The answer is "no." One way to see this is to check that the cross-derivative ∂²Q_∞/∂θ∂θ_1 is 0 when θ = θ⁰, irrespective of the value of θ_1.
Question 2. However, θ_1 is "a nuisance parameter because it determines the distribution of the estimates." This is the reason that people usually do iterated GMM (in this respect, a particular case of backfitting) to be able to replace θ_1 by a consistent estimator of θ⁰, which in turn ensures getting an asymptotically efficient estimator of θ. But note that, because there is no adaptivity problem, there is no need to iterate a large number of times as for common backfitting; a simple two-step GMM estimator is asymptotically efficient and asymptotically equivalent to the oracle estimator that would be obtained with θ_1 = θ⁰.
Question 3. Is it legitimate to maximize directly with respect to θ the objective function Q_∞[θ; λ(θ)]? The answer is "yes," and this actually corresponds to continuously updated GMM as proposed by Hansen et al. (1996). The reason it works is a property even stronger than adaptivity: The derivative ∂Q_∞/∂θ_1 is 0 when θ = θ⁰, irrespective of the value of θ_1.
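The distinction among these questions can be checked in a small simulation. The overidentified moment model below (two moments for a scalar mean parameter) is made up for illustration and does not come from the article:

```python
import numpy as np

rng = np.random.default_rng(2)
theta_true = 1.5
Y = rng.normal(theta_true, 1.0, size=2000)   # toy data: mean theta, known variance 1

def psi(theta):
    # two moment conditions overidentifying the scalar theta:
    # E[Y - theta] = 0 and E[Y^2 - theta^2 - 1] = 0
    return np.column_stack([Y - theta, Y ** 2 - theta ** 2 - 1.0])

def gbar(theta):
    return psi(theta).mean(axis=0)

def S(theta):
    p = psi(theta)
    return (p.T @ p) / len(Y)                # moment covariance evaluated at theta

def Q(theta, W):
    g = gbar(theta)
    return g @ W @ g                         # (minus the) GMM criterion

grid = np.linspace(0.5, 2.5, 4001)           # crude grid search for the minimizers

# two-step GMM: identity weight first, then the optimal weight held fixed
step1 = grid[int(np.argmin([Q(t, np.eye(2)) for t in grid]))]
W_opt = np.linalg.inv(S(step1))
step2 = grid[int(np.argmin([Q(t, W_opt) for t in grid]))]

# continuously updated GMM: the weighting matrix moves with theta
cue = grid[int(np.argmin([Q(t, np.linalg.inv(S(t))) for t in grid]))]

print(step1, step2, cue)
```

Consistent with Questions 1-3, the first-step estimator is already consistent despite the suboptimal weight, and the two-step and continuously updated estimators are close to each other.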

REPLY TO Q. DAI

Q. Dai presents an interesting decomposition of the likelihood, which is relevant for understanding the informational content of the different blocks. However, as already explained in our reply to M. Chernov, if Λ is a "vector of parameters typically associated with the market prices of risk," it is definitely not to "take a short cut" to consider, with our notation, that "there exists a deterministic relationship between Λ and θ."

Once more, we consider for all financial applications that our λ(θ), corresponding to (1.2)/(1.3) and not only to "market prices of risk," is such that λ(θ) = θ = (θ_1, θ_2), where θ_2, the vector of preference parameters, corresponds to what Q. Dai denotes by Λ. With such notation, and without any "short cut," it is true that Λ is a known deterministic function of θ. We apologize if, by saying in the Introduction that λ is the vector of "pricing parameters," we have been responsible for some confusion. We used this terminology to stress that the knowledge of λ is needed to compute the current asset prices from the current value of the state variables. If you think, for instance, of a Heston kind of option pricing formula with stochastic volatility, the option price is a known function of the underlying spot volatility, but this function depends on all the risk-neutral parameters and not only on the market price of risk.

In the application of Section 6, the identification restriction for θ_2 is that some yields are observed without measurement errors. With respect to Q. Dai's comment on these identification restrictions (see, in particular, his footnote 2), several remarks are in order. We thank him for stressing that such restrictions are crucial in identifying some parameters. However, we would not say "zero-mean restriction" but rather "no measurement error" assumption, because the measurement errors are in any case zero mean. If a measurement error might have a nonzero mean, nothing could be done without additional structure. Moreover, it is worth remembering that one can imagine some identifying restrictions more general than the "no measurement error" one. First, one can consider a full vector of measurement errors but with a singular covariance matrix. Second, even without singularity, the mere fact that the measurement errors are assumed to be i.i.d. paves the way for dynamic identification, as a Kalman filter does. Finally, one can even imagine some specific dynamic structure on the measurement errors. This should be an even more favorable situation to enhance the advantages of latent backfitting.
REPLY TO G. DURHAM AND J. GEWEKE

We gladly welcome Durham and Geweke's comments, which, through some neat geometric interpretations, have improved our own understanding of the issue of nonadaptivity and of the backfitting kind of solution to this issue as well.
Let us first focus on the specific issue of implied states GMM, corresponding to a criterion function

Q_T[θ; λ] = −[(1/T) Σ_{t=1}^T φ(Y_t, θ, λ)]' W_T [(1/T) Σ_{t=1}^T φ(Y_t, θ, λ)].

When Pan (2002) considered implied states GMM, the unconditional moment restrictions φ(Y_t, θ, λ) were built with "optimal" instruments and were just identified by construction. Hence, for all practical purposes, we can assume that, for any given λ, the maximum of Q_T[θ; λ] is the value 0; θ_T(λ) is defined as a solution of

(1/T) Σ_{t=1}^T φ(Y_t, θ_T(λ), λ) = 0.
Whereas the IS-GMM of Pan (2002) estimates θ by the solution θ_T^IS of

(1/T) Σ_{t=1}^T φ(Y_t, θ_T^IS, λ(θ_T^IS)) = 0,

implied states backfitting estimates θ by the limit θ_T of the iteration

(1/T) Σ_{t=1}^T φ(Y_t, θ^(p+1), λ(θ^(p))) = 0.

Therefore, we maintain that "typically, if the function θ_T(λ(·)) admits a unique fixed point, the IS-GMM estimator coincides with the limit of iterations." We do not understand the discussants' point "if that were the case then these estimators would have the same asymptotic variance, but the article demonstrates that these variances are different." This point may actually be well taken, not for the specific case of IS-GMM but for the general issue of comparison between implied states backfitting and the one-step estimator (see the second part of this reply): arg max_θ Q_T[θ; λ(θ)].
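In the just-identified case, the coincidence of the IS-GMM solution and the backfitting limit can be verified on a deliberately simple invented model (none of it is from the article): latent states Y*_t ~ N(θ, 1), a pricing map Y_t = Y*_t + λ(θ) with a made-up link λ(θ) = θ/2, and the single latent moment E[Y*_t − θ] = 0.

```python
import numpy as np

rng = np.random.default_rng(3)
theta_true = 2.0
lam = lambda theta: theta / 2.0              # invented link lambda(theta)

Ystar = rng.normal(theta_true, 1.0, size=1000)
Y = Ystar + lam(theta_true)                  # observed "prices"

def theta_T(lam_val):
    # for a given lambda, invert the pricing map to get implied states,
    # then solve the just-identified latent moment mean(Ystar) - theta = 0
    implied = Y - lam_val
    return implied.mean()

# implied-states backfitting: theta^(p+1) = theta_T(lambda(theta^(p)))
theta = 0.0
for p in range(100):
    theta = theta_T(lam(theta))

# IS-GMM solves theta = theta_T(lambda(theta)) directly; here the equation
# theta = mean(Y) - theta/2 is linear, so theta_IS = (2/3) * mean(Y)
theta_is = (2.0 / 3.0) * Y.mean()
print(theta, theta_is)
```

The iteration map θ → mean(Y) − θ/2 is a contraction with constant 1/2, so the backfitting limit and the IS-GMM solution are numerically identical, as claimed for the just-identified case.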

In the particular case of (just identified) IS-GMM, this distinction is irrelevant because, with respect to the discussants' geometric interpretation, the fact that

∀λ,  max_θ Q_T[θ; λ] = Q_T[θ(λ); λ] = 0

implies that, for any horizontal line, tangency is obtained with the same level curve. This is not consistent with the case portrayed in the discussants' Figure 1.
Although this case has not been studied in Pan's (2002) account of IS-GMM, the article also sheds some light on the case with overidentification. It is explicitly acknowledged, however, that this issue is only addressed under the restrictive Assumption 4. Owing to this assumption, it is actually possible to address the discussants' worries about the comparison of variances between IS-GMM and backfitting, respectively. Whereas the former will solve

max_θ −[(1/T) Σ_{t=1}^T φ(Y_t, θ, λ(θ))]' W_T [(1/T) Σ_{t=1}^T φ(Y_t, θ, λ(θ))],

the latter will solve

max_θ −[(1/T) Σ_{t=1}^T φ(Y_t, θ, λ(θ^(p(T))))]' W_T [(1/T) Σ_{t=1}^T φ(Y_t, θ, λ(θ^(p(T))))].

Therefore, it is clear that, even for large T and p(T), these two estimators correspond to two different selections of the estimating equations and are not numerically equal in general (in contrast to the just identified case). However, it is worth noticing that the consequence (3.5) of Assumption 4 in the article precisely states that, asymptotically, the two procedures will select (see the first-order conditions) the same linear combinations of estimating equations. In other words, the two estimators, IS-GMM and IS-backfitting, remain asymptotically equivalent in the overidentified case. We thank the discussants for giving us the opportunity to make this point more explicit than it was in the article.

One way to understand the discussants' point about the possible difference, even asymptotically, between what they denote θ_T^IS and the backfitting estimator θ_T, is to imagine that the discussants use the notation θ_T^IS, as well as the terminology implied states GMM, for the general estimation principle

θ_T^IS = arg max_θ Q_T[θ; λ(θ)].

By contrast, we have reserved in the article the terminology "implied states GMM" for Pan's GMM context, and this is the reason we have claimed that it typically coincides with the backfitting estimator for an infinite number of iterations. But the discussants are certainly right to claim that θ_T^IS = arg max_θ Q_T[θ; λ(θ)] and the backfitting estimator θ_T = arg max_θ Q_T[θ; λ(θ^(p(T)))] are, in general, different. This is actually the main focus of interest of this article. For instance, in the likelihood context, depending on the way the function Q is defined, θ_T^IS may be either the maximum likelihood estimator or, if the Jacobian part is not taken into account, a wrong estimator that does not converge in general toward the true unknown value of θ. This is the reason that, if we agree to say that (with the discussants' notation) there exist some situations where "differences between θ_T^IS and θ_T could be extremely great," we do not think that it is necessarily θ_T that "could be misleading."

The discussants' geometric interpretation illustrates the difference. It also shows, in a way familiar to economists, that the maintained contraction mapping assumption is akin to a comparison of the slopes of the functions λ(θ) and θ(λ). To say that the slope of the latter must be larger than the slope of the former is a neat way to explain what we pointed out in the article as a comment on (4.6): "the occurrence of θ in the latent model (the transition equation of the state variables) carries more information about the unknown parameters of interest than their occurrence in the measurement equation."
Of course, when the functions λ(θ) and θ(λ) are nonlinear, the contraction mapping property is only local, and this may create some pitfalls for backfitting. It is actually a general issue for the method of successive approximations: Its reliability highly depends on the choice of the starting point. As usual, it may be relevant to hedge against this risk by choosing as a starting point a value of θ suggested by some simpler estimation procedure. For instance, in the spirit of Pastorello, Renault, and Touzi (2000), Black–Scholes implied volatilities and the associated estimators of the parameters of the volatility process may provide a useful starting point for backfitting estimation of option pricing models with stochastic volatility.
REPLY TO M. JOHANNES AND N. POLSON

We are grateful to M. Johannes and N. Polson for having carefully read our article and made an insightful account of the inference issue, which is our main focus of interest. It is actually quite comforting to see that our distinction between two subsets of parameters, cleverly denoted by Θ_P and Θ_Q in their discussion (whereas we use in the article the less transparent notation θ_1 and θ_2), has been well understood. It may be that reading this discussion is the best way to clarify some discussants' confusion between "pricing parameters" λ(θ) = θ (or part of it), with θ = (θ_1, θ_2), and the market prices of risk θ_2.
Indeed, we agree with most of Johannes and Polson's discussion. We do think that our "approach is similar in spirit to MCMC algorithms," not only because "it is an iterative algorithm" but also, more deeply, because the mathematical foundations of the convergence argument for the iterative procedures are often very similar. To see this more clearly, it may be better to consider a class of models where, by contrast with most asset pricing applications, the conditional probability distribution of Y* given Y and θ is not degenerate. Let us consider the latent regression model example of Section 2.3 of the article. Then we claim that the iterative algorithm suggested by the "generalized residuals" of Gourieroux, Monfort, Renault, and Trognon (1987) (see also the "simulated residuals" article in the same issue of Journal of Econometrics) is quite similar in spirit to the MCMC approach to PROBIT and TOBIT kinds of models.
It is also a well-taken point to stress that the two approaches are "different along a number of dimensions." First, it is perfectly right to say that, when updating the structural parameters, MCMC uses the information in both the vector of states and all of the observed asset prices, whereas implied states backfitting focuses on the transition equation of the state variables; basically, in the M stage, neither option prices nor yields are directly used. The stochastic volatility model considered by the discussants is illuminating in this respect. However, we think that, at least in some circumstances, this loss in efficiency may be acceptable for several reasons.
First, as usual in statistics, we may imagine a trade-off in efficiency versus robustness to misspecification. Second, we have performed in Pastorello et al. (2000) some Monte Carlo comparisons between the informational content about volatility structural parameters of Black–Scholes implied volatilities and underlying asset price series, respectively. Because, on the one hand, the estimation procedure based on Black–Scholes implied volatilities was entirely based on the volatility transition equation and, on the other hand, the estimators based on returns time series were 10 times less accurate than the volatility-based ones, we wonder whether there is so much information to lose in focusing on the transition equation of the volatility process. The term structure example in this article leads to the same type of conclusion. Finally, if we do think, as the discussants do, that "the real challenge … is to provide computationally attractive tools that solves problems of interest for empirical asset pricers," we hope that the backfitting approach is in some respect closer to the practitioner's approach of computing implied parameters and, for this reason, a bit more user friendly in terms of interpretation than MCMC. This may justify paying a cost in terms of relative efficiency.

It is also well taken to notice that the backfitting approach "does not appear to be as general as MCMC in terms of estimating latent states." Indeed, the discussants raise an interesting issue about jumps that are just "integrated out of the option price." It appears at first sight that this issue is even more general and is about any factor that appears as a sequence of independent shocks rather than through a continuous-time Markov structure. We have no complete answer at this stage for this intriguing question.

REPLY TO J. PAN

J. Pan introduces her interesting set of comments as revolving "around one very basic question: as a practicing econometrician, under what situation will I choose the proposed latent backfitting method over existing approaches?" We do think that this is a nice way to address the issue, but under one condition: One should start from a clear definition of what the proposed backfitting is and what the existing approaches were before such a backfitting strategy had been proposed, first in the likelihood context of Renault and Touzi (1996) and second in the very general context of extremum estimators of Patilea and Renault (1997). In this respect, we will argue in this reply that, rather than saying "the idea of IS-GMM comes from the observation that for any given set of θ and λ, we can always try to invert X_t from Y_t using the pricing relation," it would be more appropriate to say that IS-GMM interestingly specializes to a GMM context the general idea of latent backfitting, which is precisely "the observation that for any given set of θ1 and θ2, we can always try to invert X_t from Y_t using the pricing relation."

Note, as a preliminary remark, that we have changed the discussant's notation of the unknown parameters, to avoid the confusion, already mentioned in the replies to Chernov and Dai, between nuisance parameters and market prices of risk. We found it quite puzzling in this respect that J. Pan considers that the fact that the set of parameters λ is a function of θ "excludes most of the motivating examples discussed in the paper." As already explained in the previous replies, most of the motivating examples in asset pricing, including Pan's (2002) option pricing model, are such that the set of parameters λ, as defined in the main part of the article starting with formula (1.2) in the Introduction, coincides with θ, or at least with the function λ(θ) defining the risk-neutral parameters. In this respect, the discussant is right "to give the authors the benefit of doubt and assume that their θ is big enough to include both" what she calls "the θ and λ" and what we prefer to call "the θ1 and θ2." Indeed, we thought it was clear that in most asset pricing models, the relation between observed prices and state variables depends not only on the market prices of risk but also on the dynamics of the state variables.

The specialization of latent backfitting, as put forward by Pan (2002), to a GMM context is actually illuminating for two reasons. First, as explained in the reply to Durham and Geweke, GMM is a specific case where, by contrast with the general case, it is not wrong to perform estimation in one step as

    θ_T^IS = Arg Max_θ Q_T[θ, λ(θ)]

instead of the backfitting estimator

    θ_T = Arg Max_θ Q_T[θ, λ(θ^(p(T)))].

This is the reason that Pan (2002) proposes to solve

    (1/T) Σ_{t=1}^{T} φ(Y_t, θ_T^IS, λ(θ_T^IS)) = 0,

whereas implied states backfitting in this specific context proposes more generally to estimate θ by the solution θ_T of

    (1/T) Σ_{t=1}^{T} φ(Y_t, θ_T, λ(θ^(p(T)))) = 0

for a sufficiently large number p(T) of iterations. In this respect, a constructive discussion should not oppose the two strategies but, on the contrary, acknowledge that the former is a particular case of the latter.
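The contrast between the one-step and the iterated estimators can be sketched numerically. The following toy example is hypothetical (the moment function phi, the specification λ(θ) = θ, and all names are chosen only so that the fixed point is available in closed form): it iterates the M step with λ held at the previous parameter value and checks that, when the backfitting map is contracting, the limit of the iterations coincides with the one-step IS-GMM solution.

```python
import numpy as np

# Hypothetical toy sketch of the two estimators above, with
#   phi(y, theta, lam) = y - theta - 0.5 * lam   and   lambda(theta) = theta,
# so that both solutions are available in closed form.

rng = np.random.default_rng(0)
Y = 3.0 + rng.normal(size=500)          # observed sample

def m_step(Y, lam):
    # Solve (1/T) sum_t phi(Y_t, theta, lam) = 0 in theta, with lam held fixed.
    return Y.mean() - 0.5 * lam

# One-step IS-GMM: solve (1/T) sum_t phi(Y_t, theta, lambda(theta)) = 0,
# i.e. mean(Y) - 1.5 * theta = 0, directly in theta.
theta_is = Y.mean() / 1.5

# Implied states backfitting: theta^(p+1) solves the moment condition with
# lambda evaluated at the previous iterate theta^(p).
theta = 0.0                             # arbitrary starting value
for p in range(100):
    theta_new = m_step(Y, theta)        # lambda(theta) = theta here
    if abs(theta_new - theta) < 1e-12:  # fixed point reached
        break
    theta = theta_new

# The backfitting map theta -> mean(Y) - 0.5 * theta is a contraction
# (slope -0.5), so the iterates converge to the unique fixed point,
# which is the IS-GMM solution.
print(abs(theta - theta_is) < 1e-9)     # True
```

Here the map contracts at rate 0.5, so a few dozen iterations suffice for machine precision; with a non-contracting map the fixed point may exist without the iterations finding it.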

The second reason the specialization to GMM is illuminating is that there is, however, one specific situation where "the latent backfitting method does not converge to IS-GMM, which does not suffer from the identification problem of θ2," precisely because it follows a one-step approach. More precisely, when the latent moment conditions

    E[ψ(Y_t*, θ1)] = 0

are only able to identify θ1, it may be the case that, after computation of implied states, the resulting moment conditions

    E[ψ(g^{-1}(Y_t, θ), θ1)] = 0

are able to identify not only θ1 but also θ2, such that θ = (θ1, θ2).

Pan has to be congratulated for pointing out this special case in her discussion. In such a setting, θ2 typically corresponds to market prices of risk, and implied states backfitting will not work because θ_T(θ) will not depend on θ2. This is the reason we have written that it is only when "the function θ_T(λ(·)) admits a unique fixed point" that we are sure that "the IS-GMM estimator coincides with the limit of iterations" of backfitting. Note, however, that the special case where θ_T(θ) does not depend on θ2 and, therefore, does not admit a unique fixed point should not be overstated. The term structure application, which was performed in Section 6, shows that there are natural circumstances where even the market prices of risk are identified by the latent moment conditions, because some observed prices are included in the list of state variables.
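The degenerate case just described, where the backfitting map leaves θ2 untouched, can be illustrated with a deliberately simple sketch (the location model, the moment function psi, and all names below are hypothetical): the latent moment identifies only θ1, so the M step passes θ2 through unchanged and the iteration stalls at whatever starting value was chosen.

```python
import numpy as np

# Hypothetical sketch of the degenerate case above: a location model in which
# the latent moment psi(y, theta1) = y - theta1 identifies only theta1, while
# the pricing map g(y*, theta) = y* + theta2 makes the implied state
# g^{-1}(Y_t, theta) = Y_t - theta2 depend on theta2.

rng = np.random.default_rng(1)
theta1_true, theta2_true = 2.0, 1.5
Y = theta1_true + rng.normal(size=400) + theta2_true   # Y_t = Y_t* + theta2

def backfit_step(theta1, theta2):
    # M step: solve (1/T) sum_t psi(Y_t - theta2, theta1) = 0 for theta1.
    # Nothing in the latent moment updates theta2: it is passed through.
    return Y.mean() - theta2, theta2

t1, t2 = 0.0, 0.0                       # arbitrary starting values
for _ in range(20):
    t1, t2 = backfit_step(t1, t2)

# theta2 never moved, so every starting value yields its own "fixed point"
# and theta = (theta1, theta2) is not identified by the iteration; t1 absorbs
# the neglected theta2 and is biased (close to 3.5 rather than 2.0).
print(t2)                               # 0.0
```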

To conclude, we do not consider that it is appropriate to oppose IS-GMM and IS-backfitting. The former is basically a neat specialization of the latter in a GMM context. It is always possible to interpret IS-GMM as looking for the fixed point that IS-backfitting is tracking by successive approximations. Of course, one can always imagine situations where the fixed point exists but backfitting will fail because the function is not contracting. But we would surely not say that "the desired fixed point is ever so eluding when not much pressure is enforced." As explained in the article [see formula (3.6) and related comments], when Pan (2002) says "the efficiency loss of this optimal-instrument scheme is limited in that …," the only limit to this efficiency loss is actually characterized by the allegedly eluding contracting feature of the backfitting function.

REPLY TO R. P. SHERMAN

We are glad to welcome R. Sherman's comment. In an independent and simultaneous work, Dominitz and Sherman considered a closely related framework and provided some deep insights for the case of semiparametric single-index models for binary dependent variables. Their contribution is closer to the original EM paradigm. At each iteration they perform a genuine E step; that is, an expectation of the sample criterion over unobserved variables given observed variables and current parameter values is computed. In this respect, our framework is more general because it does not specify a particular way to recover a guess about the latent variables for a given value of the parameters. In particular, in many applications, such as empirical finance and applied game theory, the relationship between the latent and observable worlds is one to one. This case has some specific features (identification, information, …), which are the main focus of interest of our article.

Imposing conditions for asymptotic results on the sample criterion Q_T and its first- and second-order derivatives, rather than on the function θ_T (in general, implicit) and its first derivative, is a matter of convenience given the application at hand. In some cases the former is easier, whereas in other cases the latter is more appealing.

Moreover, writing our conditions in terms of Q_T entails that we distinguish between what is needed for identification (unique fixed point), for consistency of the proposed estimator (contraction mapping), and for the asymptotic distributional theory [‖Σ(θ⁰, θ⁰)^{-1} H(θ⁰)‖ < 1]. The latter inequality allows us to draw a direct relationship with the statistical literature on nonadaptivity. It allows us to compare directly the asymptotic properties of various competing estimators (see, e.g., the previous discussion about continuously updated GMM or our discussion in the article about implied states GMM), which are not, in general, defined through the function θ_T.
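The inequality ‖Σ(θ⁰, θ⁰)^{-1} H(θ⁰)‖ < 1 is easy to check numerically once the two derivative blocks have been estimated. A minimal sketch, with made-up matrices standing in for the article's Σ and H:

```python
import numpy as np

# Numerical check of the contraction condition discussed above. The matrices
# are made-up stand-ins: in practice, Sigma and H would be estimated
# second-derivative blocks of the sample criterion Q_T at theta0.
Sigma = np.array([[2.0, 0.3],
                  [0.3, 1.5]])          # stand-in for Sigma(theta0, theta0)
H = np.array([[0.4, 0.1],
              [0.0, 0.5]])              # stand-in for H(theta0)

M = np.linalg.solve(Sigma, H)           # Sigma^{-1} H, without explicit inverse
rho = np.linalg.norm(M, 2)              # spectral norm (largest singular value)
print(rho < 1.0)                        # True: contraction holds here
```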

REPLY TO C. A. SIMS

We are grateful to C. Sims for the set of comments on our work that he proposes under the title "the paper's ideas are not inherently non-Bayesian and not an alternative to Bayesian approaches." It was quite exciting for us to see that our approach could be reconsidered from a Bayesian viewpoint and seems to deal with a type of modeling situation in which standard Gibbs sampling might be inefficient. The tone of Sims' comments is more critical as regards the numerical issues and the term structure application.

As already acknowledged in our reply to Chernov's comments, it is always partially a matter of taste to decide whether some amount of computational work is a daunting task or not, and to what extent one is ready to accept some efficiency loss to look for more user-friendly procedures. Let us just stress in this respect that when resorting to the latent likelihood, owing to backfitting, allows one to work with computations available in closed form, there is a benefit, at least in terms of interpretation. Moreover, as is manifest with the example of Aït-Sahalia's likelihood expansions, there is a cost to maximizing a likelihood that involves "unnatural" occurrences of the unknown parameters.

About the term structure application, we actually share most of the discussant's sources of discomfort. But we do think that this "applies to all factor pricing model of this sort, not just to this paper." For instance, we acknowledge in the article that the results are sensitive to which K variables or linear combinations of variables are assumed to be measured without error. However, we do know that this problem would be at least as detrimental with standard maximum likelihood. Moreover, the idea of backfitting could accommodate an iterative search for the "least principal components." Last but not least, the partially semiparametric feature of what we called "extended backfitting" should make it robust, by contrast with standard maximum likelihood, with respect to some specification error in the disturbance process. Unless we are missing something important, we do think that this argument makes sense, even without Monte Carlo evidence.

Finally, we fully agree that more work should be done to say "how is it expected that a financial decision maker should use these models," if the "error" has "to be interpreted as reflecting failure of no-arbitrage conditions." Renault (1997) devoted a complete invited lecture at the 1995 World Congress of the Econometric Society to "Econometric Models of Option Pricing Errors," precisely because we felt uncomfortable with the way arbitrage-based option pricing models were, at the time of this lecture, poorly accommodated to statistical inference. Of course, financial econometrics has made some significant progress within the last 10 years, but the way estimation errors and model uncertainty should be taken into account for financial decision making is still a wide-open issue.