is the sum of an integrated component with drift and an occasional large transitory component will generate data in finite samples that is difficult to distinguish from a trend stationary
process with a trend break. They report that purely transitory disturbances can cause rejection of the unit root null hypothesis and spurious results indicating that there has been a
permanent shift in the level.
Third, it is also the case that outliers (Franses and Haldrup, 1994) or structural breaks (Leybourne, Mills and Newbold, 1998) can induce spurious rejections by Dickey-Fuller
tests. This issue was raised by Nelson and Murray (2000) in discussing tests for unit roots in RGNP. Lengthening the span of data increases the possibility of major structural changes.
Failure to account for such structural changes will bias unit root tests. For example, Leybourne, Mills and Newbold (1998) found results converse to those of Perron (1989).
They show that when a series is generated by a process that is I(1), with an early but not late break, application of the Dickey-Fuller test will lead to spurious rejection of the unit root null
hypothesis. These results suggest that Dickey-Fuller tests will yield misleading results when applied to data that have a structural break early in the series. One
important implication of these findings is that studies that justify using longer spans of data on the grounds of increased test power and more precise estimates can reach
erroneous findings if the early data contain a structural break.
In our view, statistical tests of unit roots will almost inevitably be based explicitly or implicitly on models that are drastic simplifications of the underlying generating process. It
is simply impossible, given available data, to simultaneously contemplate models that are more highly parameterized than the typical parsimonious forms, including possible moving
average terms, and allow for the possibility of both outliers and structural breaks of unknown form and type, occurring at unknown points in time. Whether the simplified structures on
which the tests are necessarily based are helpful or misleading is difficult to anticipate in any application.
3. US real GNP, 1875–1993
In the remainder of this paper we present statistical evidence against the TS hypothesis, while accepting that this does not imply DS as a likely alternative. To allow comparison with
earlier studies, we employ the same series as Diebold and Senhadji (1996) in our empirical analyses. These authors used four series (GNP-BG, GNP-R, GNP-BGPC and GNP-RPC), distinguished by
whether measures from Balke and Gordon (1989) or Romer (1989) were used and whether RGNP was expressed in per capita form. In all cases the logarithmic transformation was
applied.
Diebold and Senhadji (1996) showed that Dickey-Fuller tests applied to the four RGNP series over the whole period 1875–1993 yielded very strong rejections of the null hypothesis
of difference-stationarity. As mentioned above, Schwert (1989) documented severe size distortions of the Dickey-Fuller test for ARIMA(0,1,1) processes with large MA coefficients.
8
Could the rejection of the null of DS for RGNP by Diebold and Senhadji (1996) and others have been spuriously generated by a substantial moving average component? Despite
the fact that RGNP most likely does not have a large MA root, we explore the possibility
88 P. Newbold et al. Journal of Economics and Business 53 2001 85–102
by considering autoregressive-moving average, ARMA(p,q), models as possible generators of the series of first differences.
If $y_t$ denotes the logarithm of RGNP, these models are of the form

$$(1 - \phi_1 L - \dots - \phi_p L^p)(\Delta y_t - \beta) = (1 - \theta_1 L - \dots - \theta_q L^q)\,\varepsilon_t \qquad (1)$$

where $L$ is the lag operator and $\varepsilon_t$ is zero-mean white noise. To ensure replicability of our results, the orders (p,q) were chosen through the Schwarz Bayesian Criterion (SBC), rather
than judgmental methods. This criterion is known to yield consistent estimators of (p,q) when the true model is in the contemplated set (Hannan, 1982). The parameters of (1) were
estimated through full maximum likelihood, using a GAUSS subroutine, and all combinations of orders with p + q ≤ 7 were estimated. For all four time series, SBC selected a model
whose estimated parameters implied a unit moving average root, indicating overdifferencing in the model (1).
9
These results reinforce the Diebold-Senhadji evidence against difference-stationarity, and moreover do so with no prior assumption of a pure autoregressive generating model.
However, our evidence is not entirely conclusive, as it is well known [Cryer and Ledolter (1981), Shephard and Harvey (1990), and Davis and Dunsmuir (1996)] that maximum
likelihood estimates can fall on the boundary of the invertibility region in the absence of a unit moving average root. Nevertheless, it is difficult to see how an investigator can proceed
with a DS model of the form (1) in these circumstances. If that investigator is committed to linear models with fixed parameters, the next step would be to consider stationarity around
a linear trend. We applied the test of Leybourne and McCabe (1994, 1996), in which the null hypothesis is TS and the alternative is DS. In no case was TS rejected at the usual
significance levels [see Cheung and Chinn (1997) for a similar finding over almost the same period].
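The link between a unit moving average root and overdifferencing, noted above, can be made explicit with a standard argument. The sketch below assumes, in a form not spelled out in the text, that the trend-stationary alternative is a linear trend plus a stationary ARMA component $u_t$; the polynomials $\phi^*(L)$ and $\theta^*(L)$ are illustrative symbols introduced here.

```latex
\begin{aligned}
y_t &= \alpha + \beta t + u_t, \qquad \phi^*(L)\,u_t = \theta^*(L)\,\varepsilon_t \\
\Delta y_t - \beta &= (1 - L)\,u_t \\
\phi^*(L)\,(\Delta y_t - \beta) &= (1 - L)\,\theta^*(L)\,\varepsilon_t
\end{aligned}
```

Under this assumption the moving average polynomial of the differenced series contains the factor $(1 - L)$, that is, a root at $L = 1$. A fitted model of the form (1) with a unit moving average root is therefore what differencing a trend-stationary series would be expected to produce.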
The apparently strong evidence of TS over a period of 119 years is somewhat perplexing, as many analyses using post-World War II quarterly RGNP have supported a DS model.
Moreover, Dickey-Fuller tests applied to the annual series over the period 1950–1993 failed to yield strong evidence against DS.
10
The test statistics are the t-ratios associated with the least squares estimate of $\gamma$ in the regression

$$\Delta y_t = \alpha + \beta t + \gamma y_{t-1} + \sum_{j=1}^{k} \delta_j \Delta y_{t-j} + \varepsilon_t \qquad (2)$$

In line with the recommendation of Ng and Perron (1995), we chose k in (2) through general-to-specific testing at the 10%-level, with a maximum possible value of 5. The resulting test statistics are −1.993 for RGNP and −2.516 for RGNP per capita. Neither is
significant at the 10%-level. Nelson and Murray (1997) present further tests which, on balance, failed to provide strong evidence of trend-stationarity in the postwar period.
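As an illustration of how the t-ratio on the lagged level in regression (2) can be computed, the following is a minimal sketch using ordinary least squares in NumPy. It is not the authors' implementation: the function name `adf_t_stat` and the simulated input are assumptions made for the example, the lag order `k` is supplied by the caller rather than chosen by general-to-specific testing, and no critical values are computed.

```python
import numpy as np


def adf_t_stat(y, k):
    """t-ratio on gamma in the regression
    dy_t = alpha + beta*t + gamma*y_{t-1} + sum_{j=1..k} delta_j*dy_{t-j} + e_t.
    Hypothetical helper for illustration; k is fixed by the caller."""
    y = np.asarray(y, dtype=float)
    dy = np.diff(y)                       # dy[i] = y[i+1] - y[i]
    n = len(dy) - k                       # observations left after k lags
    target = dy[k:]                       # dependent variable dy_t
    cols = [
        np.ones(n),                                   # constant
        np.arange(k + 1, k + 1 + n, dtype=float),     # linear trend t
        y[k:-1],                                      # lagged level y_{t-1}
    ]
    for j in range(1, k + 1):
        cols.append(dy[k - j:len(dy) - j])            # lagged difference dy_{t-j}
    X = np.column_stack(cols)
    coef, *_ = np.linalg.lstsq(X, target, rcond=None)
    resid = target - X @ coef
    sigma2 = resid @ resid / (n - X.shape[1])         # residual variance
    cov = sigma2 * np.linalg.inv(X.T @ X)             # OLS covariance matrix
    return coef[2] / np.sqrt(cov[2, 2])               # t-ratio on gamma


if __name__ == "__main__":
    # Simulated trend-stationary series of 44 observations (illustrative only).
    rng = np.random.default_rng(0)
    t = np.arange(44, dtype=float)
    y = 0.03 * t + rng.normal(scale=0.05, size=44)
    print(adf_t_stat(y, k=1))
```

The statistic must be compared with Dickey-Fuller critical values for the constant-plus-trend case, not with standard t tables.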
One possible reaction to these results is that tests based on the subperiod will have considerably less power than tests applied to the full sample. That is, a longer span of data
may allow one to detect trend reversion more readily.
11
A second possibility is that there could have been a structural change over time in the process generating the data over the
longer period.
12
The true DGP may have changed between the postwar and prewar periods. A third possibility is that the existence of structural breaks leads to spurious results.
To explore these possibilities, and assist in sorting out some of the complexities mentioned above when attempting to test DS against TS specifications, we develop a new diagnostic
methodology. First, we applied the Dickey-Fuller test (lag = 5), implemented precisely as described in the previous paragraph, to all blocks of 44 consecutive observations, beginning
with 1875–1918, and ending with 1950–1993, giving 76 Dickey-Fuller statistics. The results for the four RGNP series are graphed in Fig. 1. Two notable features of these graphs are: (i)
two periods of extreme volatility and (ii) a very wide range for the test statistics. Intuition would suggest that patterns of this kind are very unlikely to be found for trend stationary
generating processes. Trend stationary processes would yield much larger and more stable Dickey-Fuller values. Indeed, thoughtful graphical inspection can often alert one to at least the most
extreme problems.
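The block construction described above can be sketched as follows. The helper names `moving_blocks` and `moving_statistic` are hypothetical, and the statistic is passed in as an arbitrary callable, so this shows only the windowing logic and not the Dickey-Fuller computation itself.

```python
def moving_blocks(first_year, last_year, width):
    """All (start, end) year pairs for blocks of `width` consecutive years."""
    return [(s, s + width - 1) for s in range(first_year, last_year - width + 2)]


def moving_statistic(series, width, stat):
    """Apply `stat` to every block of `width` consecutive observations."""
    return [stat(series[i:i + width]) for i in range(len(series) - width + 1)]


if __name__ == "__main__":
    blocks = moving_blocks(1875, 1993, 44)
    # 119 annual observations and a width of 44 give 119 - 44 + 1 = 76 blocks.
    print(len(blocks), blocks[0], blocks[-1])
```

Passing a Dickey-Fuller routine as `stat` would give the sequence of 76 statistics plotted for each series.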
To check this intuition, we carried out a simulation experiment, in which series of 119 observations were generated from the second order autoregressive TS models given in Table
1 of Diebold and Senhadji (1996).
13
Moving Dickey-Fuller statistics, based on 44 observations, precisely as in Fig. 1, were calculated. For each of the 2000 replications, the standard
deviations and ranges of the 76 Dickey-Fuller statistics were compared with the corresponding values from the actual time series. In the terminology of Tsay (1992), this can be viewed
as a parametric bootstrap specification test of the TS model. The p-values for these tests are shown in Table 1.
14
For example, in only 1.6% of replications was the standard deviation of the moving Dickey-Fuller statistics greater than that for the actual GNP-R data. Put differently,
98.4% of the 2000 replicated standard deviations of the 76 Dickey-Fuller statistics had values less than that of Fig. 1. Stated another way, if the actual data were TS then one
Fig. 1. Moving Dickey–Fuller statistics.
would obtain the actual moving Dickey-Fuller statistic pattern or one more extreme only 1.6% of the time for GNP-R. This means that one would reject the null of TS at the 1.6%
level with these large values. The moving Dickey-Fuller parametric bootstrap tests which we offer here, based on
standard deviations, suggest that the adequacy of the TS specifications can be rejected at significance levels between 1.6% for GNP-R and 3.5% for GNP-BGPC, while the tests
based on ranges generate rejections at levels of practically zero. It is for this reason that, though we concur with the conclusion of Diebold and Senhadji (1996) and others that DS
over the whole 119-year period is unlikely, we are even more skeptical about the hypothesis of TS over this period.
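The p-value calculation behind these rejection levels can be sketched as below. This is an assumed reading of the procedure, not the authors' code: the function names are hypothetical, the sample (n − 1) standard deviation is an assumption because the text does not say which form was used, and the simulated values in the demo are placeholders chosen so that 32 of 2000 replications exceed the observed value.

```python
import statistics


def spread_summaries(stats):
    """Standard deviation and range of one set of moving statistics.
    Uses the sample (n - 1) standard deviation, which is an assumption."""
    return statistics.stdev(stats), max(stats) - min(stats)


def bootstrap_p_value(observed, simulated):
    """Share of simulated summaries strictly greater than the observed one."""
    exceed = sum(1 for s in simulated if s > observed)
    return exceed / len(simulated)


if __name__ == "__main__":
    # Placeholder values: 32 of 2000 replications exceed the observed summary.
    simulated_sd = [2.0] * 32 + [0.5] * 1968
    print(bootstrap_p_value(1.0, simulated_sd))
```

In the experiment described above, `simulated` would hold the 2000 standard deviations (or ranges) from the TS replications, and `observed` the corresponding value from the actual series.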
Our test procedure indicates that it is very unlikely that the Dickey-Fuller statistics of Fig. 1 could have come from a TS AR(2) model. The implausibility of Fig. 1 under TS can be
seen graphically by comparison with Fig. 2, generated from a typical series simulated from the Diebold-Senhadji TS model for GNP-R. We chose a realization giving a standard
deviation at the median over all replications. The contrast with Fig. 1 is quite stark.
Table 1
p-values of parametric bootstrap tests of trend-stationary models, based on moving Dickey-Fuller tests

                 GNP-R    GNP-RPC   GNP-BG    GNP-BGPC
Std. deviation   0.016    0.028     0.025     0.035
Range            0.000    0.000     0.004     0.000
Fig. 2. Simulated moving Dickey–Fuller statistics.
This bootstrap method is not without limitations. One might argue that in order to test for power we should also reverse the null to be DS and redo the test. But issues of power are
more of a concern when one is not able to reject the null of TS when DS is true. In the case where β (i.e., the probability of not rejecting TS, given that DS is true) is large, the power of
the test will be low.
15
Reversing things and generating moving Dickey-Fuller tests from a DS series would yield p-values that are very large. That is, the likelihood of getting the observed
actual series would be very high in this case. This means that the probability of not rejecting TS, given that DS is true (β), would be small. With a small value of β the power of the test
would be high.
16
Regardless of the above, since our objective is to investigate the degree of evidence against TS, repeating our simulation with the current null of TS replaced by DS
would yield little, as Diebold and Senhadji (1996) have already shown there is evidence against the DS model.
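The relation between the type II error probability and power used in this argument is the standard one, written out here for reference. The symbol for the type II error probability is unrelated to the drift and trend coefficients in equations (1) and (2).

```latex
\beta = \Pr(\text{do not reject TS} \mid \text{DS is true}), \qquad \text{power} = 1 - \beta
```

A small value of this probability therefore corresponds directly to high power against the DS alternative.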
Although the earlier Dickey-Fuller tests and ARMA models clearly cast doubt on the DS specification over the 119 years of data, from the moving Dickey-Fuller tests it appears
equally unlikely that the series over 119 years was generated from a TS model.
17
The progression over time of the graphs of Fig. 1 is interesting and suggestive. As observations
from the early 1930s enter towards the end of a block of data, we find Dickey-Fuller statistics close to zero, suggesting, on the surface, virtually no evidence against DS. If the true process
were TS, we could view this as reflecting the finding of Perron (1989) of very low power of Dickey-Fuller tests in the presence of a structural break. By contrast, when this
possible break is early in the block, the Dickey-Fuller statistics are very far from zero. Leybourne, Mills and Newbold (1998) have shown that, if the true generating process is
DS with an early but not late break, spurious rejections by Dickey-Fuller tests will frequently occur.
Our own view is that the series of RGNP over the whole period 1875–1993 are neither trend-stationary nor difference-stationary. Indeed, a casual inspection of a graph of the data
would make this point transparent (see Figs. 5 and 6, discussed in the next section). The aberrant behavior of the time series over the period 1930–1949 is a strong factor in arriving
at this conclusion. The behavior of the series over a period of approximately twenty years, from 1930 to 1949, is quite different from anything observed previously or subsequently. It is simply
unreasonable to believe that the same generating regime operated over this period as elsewhere. As we have noted, it is precisely this period that generates the peculiar graphs of Fig. 1.
One possibility might be to attempt to model the entire 119-year series making special allowance for the 1930–1949 period. This might be achieved through allowing for outliers,
structural breaks, increased volatility, and so on. The possibilities are virtually endless, and given that the offending period covers just twenty observations, a thorough analysis would
quickly exhaust available degrees of freedom. Moreover, consideration of the history of the period, with depression ushered in by the Great Crash, the ensuing New Deal, the onset of
World War II, and the following recovery period, suggests no simple structure would be adequate. An extensive search might well reveal a model that reasonably fits this period, but
to adopt such a model would be unscientific, as there is no comparable subsequent period over which to verify it. Our preference, followed in the next section, is to ignore the years
1930–1949 in assessing trend-stationarity.
4. US real GNP, 1875–1929 and 1950–1993