Transportation Research Part B 34 (2000) 53–73
www.elsevier.com/locate/trb
Trip distribution forecasting with multilayer perceptron neural
networks: A critical evaluation
M. Mozolin a, J.-C. Thill b,*, E. Lynn Usery c
a ESRI, Inc., Redlands, CA, USA
b Department of Geography and National Center for Geographic Information and Analysis, State University of New York at Buffalo, Amherst, NY, USA
c Department of Geography, University of Georgia, Athens, GA, USA
Received 17 February 1998; received in revised form 15 April 1999; accepted 19 April 1999
Abstract
This study compares the performance of multilayer perceptron neural networks and maximum-likelihood doubly-constrained models for commuter trip distribution. Our experiments produce overwhelming
evidence at variance with the existing literature that the predictive accuracy of neural network spatial interaction models is inferior to that of maximum-likelihood doubly-constrained models with an exponential
function of distance decay. The study points to several likely causes of neural network underperformance,
including model non-transferability, insufficient ability to generalize, and reliance on sigmoid activation
functions, and their inductive nature. It is concluded that current perceptron neural networks do not
provide an appropriate modeling approach to forecasting trip distribution over a planning horizon for
which distribution predictors (number of workers, number of residents, commuting distance) are beyond
their base-year domain of definition. © 1999 Elsevier Science Ltd. All rights reserved.
1. Introduction
A number of modeling approaches have been put forward over the years to distribute trips,
freight or information among origins and destinations. One of the more successful ones is the
spatial interaction (or gravity) model (Ortuzar and Willumsen, 1994). This model relates the
matrix of flows to a matrix of interzonal impedance. Traditionally, the spatial interaction model is
calibrated by one of several well known techniques, including regression, maximum likelihood, or
by numerical heuristics. Several recent studies (Black, 1995; Fischer and Gopal, 1994; Gopal and
Fischer, 1996; Openshaw, 1993) have proposed the neural network architecture as a means to
* Corresponding author. Tel.: +716 645 2722; fax: +716 645 2329; e-mail: [email protected]alo.edu
0191-2615/99/$ - see front matter © 1999 Elsevier Science Ltd. All rights reserved.
PII: S0191-2615(99)00014-4
model the distributed complexity of spatial interaction. 1 This line of research has shown that
neural networks generally outperform classical calibration and estimation approaches.
At first, this conclusion should not come as much of a surprise to many modelers given the wide
success experienced by neural networks in pattern recognition and classification (Bishop, 1995;
Smetanin, 1995; Ripley, 1996), as well as in various application fields of transportation engineering and planning (Dougherty, 1995; Himanen et al., 1998; Hua and Faghri, 1994). After all,
neural networks impose fewer constraints on the form of the functional relationship between inputs
and outputs than conventional fitting techniques. This paper revisits this conclusion by paying
attention to several aspects of spatial interaction modeling that have not been addressed so far.
Our aim is to compare the performance of a perceptron neural network (NN) spatial interaction model to that of a baseline, conventionally estimated spatial interaction model beyond the
comparative work done previously. The comparison is conducted empirically on journey-to-work
patterns in the Atlanta metropolitan area. Our approach differs drastically from others in several
respects.
Firstly, we evaluate the models in a predictive mode. In other words, calibration is done on
observed, base-year data, while testing is conducted on data for the projection year. To the best of
our knowledge, all other NN studies of trip distribution have used the same origin-destination
matrix for training and testing, thus allowing the network to learn the noise in the training data
(Black, 1995). Incidentally, NN applications to traffic data and other transportation problems also
use hold-out samples for testing. Secondly, our baseline model is a doubly-constrained model
estimated by maximum likelihood. This is a departure from Fischer and Gopal (1994), and Gopal
and Fischer (1996) who chose the less accurate unconstrained spatial interaction model as a
benchmark, and estimated model parameters by ordinary least squares regression, a method
considered less precise than maximum likelihood (Fotheringham and O'Kelly, 1989).
Thirdly, we evaluate the models on origin-destination matrices of different sizes (from hundreds
of origin/destination zones down to a dozen) to test the sensitivity of our conclusions to the size of
the interaction system being modeled. Finally, we apply an adjustment factor to flows predicted
by the NN output to satisfy production and attraction constraints, and thus make it possible to
unambiguously interpret any discrepancy with flows predicted by the baseline doubly-constrained
model in terms of relative performance of the models.
The paper presents a case where a conventional spatial interaction model outperforms a
multilayer perceptron NN model of spatial interaction. The predictive mode of the analysis
replicates the process by which trip distribution is realized in transportation planning, and thus
helps to compare the merits of the conventional and NN approaches for practical applications of
spatial interaction modeling.
The remainder of the paper is organized as follows. The next two sections present an overview
of the conventional spatial interaction model of journey-to-work, and of the multilayer perceptron NN model. In the following section, we describe the setup of the empirical test of the
latter model against the former, as well as the data used in the test. Next, results under different
1 Some work has also been done on estimating origin-destination matrices from traffic counts or cordon intercept
survey data with neural networks (for instance, Kikushi et al., 1993). This type of problem is outside the scope of this
paper.
modeling configurations are detailed. We conclude with a discussion of possible explanations for
the better performance of the conventional spatial interaction model.
2. Journey to work problem and its conventional solution
Spatial interaction may be defined in general terms as any flow of commodity, people, capital,
or information over space resulting from some explicit or implicit decision process (Fotheringham
and O'Kelly, 1989). Journey to work is one kind of spatial interaction. Other kinds of spatial
interaction include journey to school, shopping trips, non-home based intraurban trips, intercity
population migration, choice of college or university by students, intercity freight movement,
telephone calls, Internet access, and many others.
Spatial interaction models are often classified on the basis of the number and character of
constraints imposed on the predicted trip matrix. Constraints represent a priori knowledge about
the total interaction flows entering and/or exiting a particular zone. For example, if the total
number of employed residents in each zone $i$ ($O_i$) is known exogenously, then the sum of predicted
flows leaving each zone is equal to $O_i$:
\[ \sum_j T_{ij} = O_i, \quad \forall i, \tag{1} \]
where $T_{ij}$ is the flow of commuters from zone $i$ to zone $j$ predicted by the model. Similarly, if one
knows total employment in each zone ($D_j$), one can impose that the sum of predicted commuter
flows ending in each zone is equal to $D_j$:
\[ \sum_i T_{ij} = D_j, \quad \forall j. \tag{2} \]
If Eq. (1) holds for each origin zone then the model is said to be production constrained; if Eq. (2)
holds for each destination zone then the model is referred to as attraction constrained. If both
Eqs. (1) and (2) hold, the model is doubly constrained, while if neither of the two holds, the model
is unconstrained.
Trip distribution may be modeled with any number of constraints. Implementing the additional
constraints requires more a priori information. In turn, the reduction of degrees of freedom leads
to a more accurately predicted flow matrix. It has been shown empirically that estimating spatial
interaction with a doubly-constrained model yields the most accurate results. See, for example,
Fotheringham and O'Kelly (1989) on interregional migration among the nine major census divisions in the United States, or Mozolin (1997) on commuting trips within metropolitan Atlanta.
Because of its higher accuracy in modeling trip distribution, the doubly-constrained model is a
proper baseline against which to evaluate neural networks.
The doubly-constrained spatial interaction model of journey to work can be formulated
mathematically as:
\[ T_{ij} = A_i O_i B_j D_j f(c_{ij}), \tag{3} \]
where $c_{ij}$ is the travel impedance (distance) from zone $i$ to zone $j$, $f(c_{ij})$ is a distance decay
function, $O_i$ is the number of workers resident in zone $i$ (production), $D_j$ is the number of workers
in zone $j$ (attraction), and $A_i$ and $B_j$ are balancing coefficients ensuring that Eqs. (1) and (2) are
satisfied:
\[ A_i = \frac{1}{\sum_j B_j D_j f(c_{ij})} \tag{4} \]
and
\[ B_j = \frac{1}{\sum_i A_i O_i f(c_{ij})}. \tag{5} \]
Two alternative specifications of the distance decay function $f(c_{ij})$ will be used here: the negative
power function $c_{ij}^{-\beta}$, $\beta \geq 0$, and the negative exponential function $\exp(-\beta c_{ij})$, $\beta \geq 0$.
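As an illustration of how Eqs. (3)–(5) fit together, the balancing coefficients can be obtained by alternately applying Eqs. (4) and (5) until they stabilize. The sketch below is not the SIMODEL code; it is a minimal pure-Python version that assumes the negative exponential decay function, and the function name `doubly_constrained`, the three-zone totals, the cost matrix, and the value of `beta` are all hypothetical.

```python
import math

def doubly_constrained(O, D, cost, beta, tol=1e-12, max_iter=1000):
    """Flows T[i][j] = A_i O_i B_j D_j exp(-beta * c_ij), per Eq. (3)."""
    n, m = len(O), len(D)
    f = [[math.exp(-beta * cost[i][j]) for j in range(m)] for i in range(n)]
    A = [1.0] * n
    B = [1.0] * m
    for _ in range(max_iter):
        # Eq. (4): A_i = 1 / sum_j B_j D_j f(c_ij)
        A_new = [1.0 / sum(B[j] * D[j] * f[i][j] for j in range(m))
                 for i in range(n)]
        # Eq. (5): B_j = 1 / sum_i A_i O_i f(c_ij)
        B_new = [1.0 / sum(A_new[i] * O[i] * f[i][j] for i in range(n))
                 for j in range(m)]
        # Largest relative change in any balancing coefficient
        change = max(abs(new - old) / old
                     for old, new in zip(A + B, A_new + B_new))
        A, B = A_new, B_new
        if change < tol:
            break
    return [[A[i] * O[i] * B[j] * D[j] * f[i][j] for j in range(m)]
            for i in range(n)]

# Hypothetical three-zone system (totals must agree: 600 workers, 600 jobs)
O = [100.0, 200.0, 300.0]
D = [250.0, 150.0, 200.0]
cost = [[1.0, 5.0, 8.0],
        [5.0, 1.0, 4.0],
        [8.0, 4.0, 1.0]]
T = doubly_constrained(O, D, cost, beta=0.3)
```

After convergence, the row sums of `T` reproduce `O` and the column sums reproduce `D`, which is exactly what Eqs. (1) and (2) require.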
Of all the methods suggested to calibrate spatial interaction models (see, for instance, Bacharach (1970), Batty (1976), Evans (1971), Fotheringham and O'Kelly (1989), Ortuzar and
Willumsen (1994), Wilson (1970)), we choose to use a maximum-likelihood estimation (MLE)
approach. Batty and Mackie (1972) have shown that likelihood maximization boils down to
solving a non-linear equation. With a power distance function, this equation is given by
\[ \sum_i \sum_j T_{ij} \ln c_{ij} = \sum_i \sum_j T^{*}_{ij} \ln c_{ij}, \tag{6} \]
where $T^{*}_{ij}$ is the actual number of commuters from zone $i$ to zone $j$, and $T_{ij}$ is estimated by
Eqs. (3)–(5). The same holds, mutatis mutandis, with an exponential distance function. The SIMODEL computer
code (Williams and Fotheringham, 1984) is used to derive parameter estimates.
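For the exponential decay function, the counterpart of the likelihood condition replaces the logarithm of distance with distance itself, so that predicted and observed total travel cost must match. The sketch below solves that condition by bisection on the decay parameter. It is an assumption-laden illustration, not the SIMODEL procedure: the helper names (`predict`, `total_cost`, `calibrate_beta`), the search bounds, and the synthetic three-zone data are all hypothetical.

```python
import math

def predict(O, D, cost, beta, iters=200):
    """Doubly-constrained flows with exponential decay (Eqs. (3)-(5))."""
    n, m = len(O), len(D)
    f = [[math.exp(-beta * cost[i][j]) for j in range(m)] for i in range(n)]
    B = [1.0] * m
    for _ in range(iters):
        A = [1.0 / sum(B[j] * D[j] * f[i][j] for j in range(m)) for i in range(n)]
        B = [1.0 / sum(A[i] * O[i] * f[i][j] for i in range(n)) for j in range(m)]
    return [[A[i] * O[i] * B[j] * D[j] * f[i][j] for j in range(m)]
            for i in range(n)]

def total_cost(T, cost):
    """Sum of flow times impedance over all origin-destination pairs."""
    return sum(T[i][j] * cost[i][j]
               for i in range(len(T)) for j in range(len(T[0])))

def calibrate_beta(T_obs, cost, lo=0.0, hi=2.0, tol=1e-8):
    """Bisection on beta until predicted total cost equals observed total cost."""
    O = [sum(row) for row in T_obs]
    D = [sum(col) for col in zip(*T_obs)]
    target = total_cost(T_obs, cost)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if total_cost(predict(O, D, cost, mid), cost) > target:
            lo = mid  # predicted trips are too long: decay must be steeper
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Synthetic check: generate flows with a known beta, then recover it
cost = [[1.0, 5.0, 8.0], [5.0, 1.0, 4.0], [8.0, 4.0, 1.0]]
T_obs = predict([100.0, 200.0, 300.0], [250.0, 150.0, 200.0], cost, beta=0.3)
beta_hat = calibrate_beta(T_obs, cost)
```

Bisection is used here only because predicted total cost falls monotonically as the decay parameter grows; any one-dimensional root finder would serve equally well.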
3. Multilayer perceptron neural networks applied to the journey to work problem
Background on multilayer perceptron neural networks is presented below before we proceed
with their application to the journey to work problem. The multilayer perceptron neural network
is one of a variety of parallel computing techniques that conceptually mimic structures and
functions of human central neural systems. The model used in this study is a three-layer fully-connected feedforward NN which consists of input nodes representing independent variables (the
productions, the attractions, and the travel impedances), hidden nodes, and one output node for
the dependent variable, namely the flow $T_{ij}$ from zone $i$ to zone $j$. See Fig. 1 for the architecture of
a NN with four hidden nodes. Each input node corresponds to an independent variable in the flow
model while the dependent variable $T_{ij}$ is the output node. The network output (activation) $z$ is
obtained by a double logistic transform of the weighted sum of inputs. The reader is referred to
Haykin (1998), Rojas (1996), or Smith (1993) for an in-depth coverage of the NN methodology.
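The double logistic transform can be made concrete with a short forward-pass sketch for a network with three inputs, four hidden nodes, and one output. This is an illustration only: the bias terms, the function names, the input values, and the seed are assumptions, and the code is unrelated to the software used in the study.

```python
import math
import random

def logistic(a):
    """Logistic (sigmoid) activation."""
    return 1.0 / (1.0 + math.exp(-a))

def forward(x, W_hidden, b_hidden, w_out, b_out):
    """Three-layer feedforward pass: inputs -> hidden logistic -> output logistic."""
    hidden = [logistic(sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(W_hidden, b_hidden)]
    return logistic(sum(w * h for w, h in zip(w_out, hidden)) + b_out)

# Small random weights (the range used here is an assumption of this sketch)
rng = random.Random(0)
n_inputs, n_hidden = 3, 4
W_hidden = [[rng.uniform(-0.01, 0.01) for _ in range(n_inputs)]
            for _ in range(n_hidden)]
b_hidden = [rng.uniform(-0.01, 0.01) for _ in range(n_hidden)]
w_out = [rng.uniform(-0.01, 0.01) for _ in range(n_hidden)]
b_out = rng.uniform(-0.01, 0.01)

# Hypothetical scaled inputs: production, attraction, impedance
z = forward([0.4, 0.7, 0.2], W_hidden, b_hidden, w_out, b_out)
```

Because the output node is logistic, `z` always lies strictly between 0 and 1, which is why the observed flows have to be rescaled before training.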
The most valuable property of multilayer feedforward NNs is their ability to approximate a
desired function from training examples. In fact, a three-layer fully connected feedforward NN
with n input nodes, a sufficient number of hidden nodes, and one output node can be trained to
approximate an n to 1 mapping function of arbitrary complexity (Kreinovich and Sirisaengtaskin,
1993). Learning of network weights often proceeds by backpropagation of errors (Rumelhart
et al., 1986) so as to minimize the total error for all examples in a training set.
The backward pass is a recursive procedure (Rumelhart et al., 1986) during which the partial
derivatives of the error with respect to all network weights are used to adjust the weights up or down
to reduce total error. In this research, we use off-line, or epoch-based, learning: network weights
are adjusted only after all examples in the training set have been processed. Several non-linear
optimization methods are available to find a set of weights that minimizes the error on all examples in the training set. This study uses the Quickprop algorithm developed by Fahlman (1989).
Though it does not guarantee a global optimum, its quick convergence dramatically increases the
speed of NN training. The gradient descent method (Rumelhart et al., 1986) is applied in those
rare instances where Quickprop cannot be used.
Fig. 1. Fully-connected three-layer perceptron neural network.
Backpropagation neural networks easily fit into the framework of the doubly-constrained
spatial interaction model. The network learns the mapping function that best fits the relationship
between independent variables (production, attraction, and travel impedance) and the dependent
variable (flows). Interestingly, the mapping function is no longer restricted to either power or
exponential functional form as in the conventional models. Nor is it explicitly specified as a linear
or non-linear regression model. The major advantage of the NN approach is that it is flexible
enough to model non-linear relationships of arbitrary complexity in an automated fashion.
As noted by Black (1995), Fischer and Gopal (1994), and Gopal and Fischer (1996), a NN may
perform well enough to estimate actual spatial interaction flows, but small deviations are bound
to remain. Furthermore, the network itself does not contain any mechanism to enforce the origin
and destination constraints. Consequently, the origin and destination totals derived by summing
the flows predicted by the model are usually not equal to the actual origin and destination totals.
We use a standard iterative proportional fitting procedure (Slater, 1976) to enforce these constraints. After this post-processing, NN flows give predictions comparable to those of a doubly-constrained model.
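Iterative proportional fitting can be sketched as alternately rescaling rows and columns of the raw network predictions until both sets of totals are matched. The code below is a generic illustration and not the authors' post-processing application; the function name `ipf`, the seed matrix standing in for raw NN output, and the marginal totals are hypothetical.

```python
def ipf(seed, row_totals, col_totals, tol=1e-9, max_iter=1000):
    """Rescale a positive seed matrix so its margins match the given totals."""
    T = [row[:] for row in seed]
    for _ in range(max_iter):
        # Row step: enforce production totals
        for i, target in enumerate(row_totals):
            s = sum(T[i])
            if s > 0:
                T[i] = [v * target / s for v in T[i]]
        # Column step: enforce attraction totals
        for j, target in enumerate(col_totals):
            s = sum(row[j] for row in T)
            if s > 0:
                for row in T:
                    row[j] *= target / s
        # Columns are now exact; stop once rows are also close enough
        row_err = max(abs(sum(T[i]) - r) for i, r in enumerate(row_totals))
        if row_err < tol:
            break
    return T

# Hypothetical raw predictions whose margins do not match the known totals
raw = [[40.0, 10.0, 5.0],
       [12.0, 90.0, 30.0],
       [8.0, 20.0, 150.0]]
productions = [60.0, 140.0, 200.0]
attractions = [70.0, 130.0, 200.0]
adjusted = ipf(raw, productions, attractions)
```

The adjustment preserves the interaction pattern of the seed matrix as far as the margins allow, so differences between adjusted NN flows and the baseline model reflect the models themselves rather than mismatched totals.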
Network training is realized with NevProp 1.16 (Goodman, 1996). The Quickprop algorithm is
embedded in NevProp, but pre- and post-processing (including scaling and enforcing production
and attraction constraints) are part of separate applications written by the authors.
4. Empirical analysis
4.1. Study Area and Data Sources
We use 1980 and 1990 journey-to-work commuter flows in the Atlanta Metropolitan Area.
Commuter flows among the 15 counties of Atlanta SMSA for 1980 are available from the 1980
U.S. Census (Bureau of the Census, 1983). Commuter flows among the 20 counties of the Atlanta
MSA (Fig. 2) for 1990 are available in the Census Transportation Planning Package (CTPP)
(Bureau of Transportation Statistics, 1993). Data sets on commuting flows between census tracts
in 1980 and 1990 were kindly made available to the authors by the Atlanta Regional Commission.
There were 345 census tracts in 1980, and 507 census tracts in 1990 in the study area.
Fig. 2. Counties of Atlanta Metropolitan Statistical Area (MSA) in 1990.
The logistics of spatial interaction modeling requires a clearly defined region with no, or small,
flows across its border. In the case of Atlanta, this assumption is not grossly violated. In 1980,
slightly more than 90% of working residents of the Atlanta SMSA also worked inside the SMSA.
In 1990, 95.3% of working residents of the Atlanta MSA worked inside the MSA. Also, 92.9% of
those employed in the Atlanta MSA lived within the region in 1990.
Spatial separation between commuting zones (counties or census tracts) is measured by the
straight-line (Euclidean) distance between zone centroids in the metropolitan area. Setting intrazone distance to zero is known to generate systematic measurement errors. It is common
practice to correct for this error by defining the distance from a zone to itself as a quarter of the
distance from the zone centroid to the centroid of its nearest neighbor (Thomas and Hugget,
1980). The Euclidean distance is only an approximation of the perceived impedance between
home and work locations. We recognize that road distance, travel time, or a generalized travel
cost function may differently affect the predictive accuracy of the MLE and NN models. However,
since the major thrust of this study is to compare and evaluate two forecasting techniques, the
relative accuracy of their estimates matters more than their absolute accuracy and the approximation given by Euclidean distance is acceptable. 2
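The impedance measure described above can be sketched in a few lines: Euclidean distance between centroids, with each intrazonal distance set to a quarter of the distance to the nearest neighboring centroid. The function name `distance_matrix` and the three centroid coordinates are hypothetical.

```python
import math

def distance_matrix(centroids):
    """Centroid-to-centroid Euclidean distances with corrected intrazonal terms."""
    n = len(centroids)
    d = [[math.dist(centroids[i], centroids[j]) for j in range(n)]
         for i in range(n)]
    for i in range(n):
        # Intrazonal distance: one quarter of the distance to the nearest neighbor
        nearest = min(d[i][j] for j in range(n) if j != i)
        d[i][i] = 0.25 * nearest
    return d

# Hypothetical zone centroids in arbitrary planar units
centroids = [(0.0, 0.0), (3.0, 4.0), (10.0, 0.0)]
dist = distance_matrix(centroids)
```

A strictly positive intrazonal distance also keeps the power decay function finite, since a zero distance raised to a negative exponent is undefined.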
4.2. Implementation issues
The journey-to-work analysis is structured as follows. First, two doubly-constrained spatial
interaction models are calibrated by MLE on 1980 travel data. One model uses county-level data,
the second uses census tract-level data. Each model is then employed to forecast interzonal
commuter flows (intercounty or intertract) for year 1990 with the calibrated distance decay parameter $\beta$ and 1990 working population and employment data at the corresponding geographic
resolution ($O_i$ and $D_j$ marginal totals) for production and attraction, respectively. Forecasted
flows are compared to actual 1990 trip data using four goodness-of-fit measures: the absolute
error (AE), the standardized root mean square error (SRMSE), Kullback's w statistic, and the
R-square. See Fotheringham and Knudsen (1987), Weiss (1995), and others for a description of
the statistics. In most cases, these measures are highly consistent. Hence, only the AE and SRMSE
measures are reported for the tested models hereunder.
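The two reported measures can be sketched as follows. The formulas are assumptions of this illustration rather than definitions taken from the text: AE is computed as the sum of absolute deviations expressed as a percentage of total observed flow, and SRMSE as the root mean square error divided by the mean observed flow. The function names and the 2 × 2 matrices are hypothetical.

```python
import math

def absolute_error_pct(observed, predicted):
    """Sum of |observed - predicted| as a percentage of total observed flow."""
    total = sum(sum(row) for row in observed)
    abs_diff = sum(abs(o - p)
                   for row_o, row_p in zip(observed, predicted)
                   for o, p in zip(row_o, row_p))
    return 100.0 * abs_diff / total

def srmse(observed, predicted):
    """Root mean square error standardized by the mean observed flow."""
    n_cells = sum(len(row) for row in observed)
    total = sum(sum(row) for row in observed)
    sq_diff = sum((o - p) ** 2
                  for row_o, row_p in zip(observed, predicted)
                  for o, p in zip(row_o, row_p))
    return math.sqrt(sq_diff / n_cells) / (total / n_cells)

# Hypothetical observed and predicted trip matrices
observed = [[50.0, 10.0], [20.0, 20.0]]
predicted = [[40.0, 20.0], [30.0, 10.0]]
ae = absolute_error_pct(observed, predicted)   # 40 misallocated out of 100 trips
s = srmse(observed, predicted)                 # RMSE of 10 over a mean flow of 25
```

Under these definitions a perfect forecast scores zero on both measures, and AE can exceed 100% when the summed deviations are larger than the total observed flow.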
In parallel, two sets of NN spatial interaction models are trained and validated on the same
1980 travel data: one on county-level flows, the other on census tract flows. With the network
weights for which the validation error is minimum and 1990 population and employment data, the
NNs predict 1990 interzonal commuter flows at the county and census tract levels respectively
(test sets). Goodness-of-fit of these forecasts to actual flows is once again measured. Finally, the
relative performance of MLE and NN spatial interaction models in predictive mode is assessed by
comparing goodness-of-fit measures.
2 In fact, preliminary test results in Mozolin (1997) indicate that the goodness-of-fit of an MLE doubly-constrained
model is enhanced by substituting average reported travel time for Euclidean distance, and vice versa for the NN
formulation. Consequently, if the use of the Euclidean distance as a proxy for spatial impedance biased our test, it is
most likely to be in favor of the NN model, which places our conclusions on the conservative side.
Selecting a NN configuration and parameters suitable for a certain problem is often a challenging task. The first stage of backpropagation feedforward NN model design entails setting the
topology of the model. A natural topology for a doubly-constrained spatial interaction problem
involves three inputs (the number of resident workers in the zone of origin, employment in the
destination zone, and the spatial impedance between the zones) and one output (the number of
commuters). It is common practice to proceed by trial and error to select the number of hidden
nodes, and to test networks with hidden layers of varying size. Networks with 5, 20, and 50 hidden
nodes are tested in this study. Networks of larger sizes are impractical due to the excessive
computational requirement of their training. Each network configuration is processed five times,
each run starting with a random set of initial weights and a training set drawn randomly from the
full data set. We report results of experiments with different partitions of the full data set into
training and validation sets.
The NN model is further specified as follows. Since, in most instances, weights are changed
according to the Quickprop rule, no momentum term is needed, and the learning rate must be
specified only for use with the gradient descent method. 3 A 0.1 learning rate is used throughout
the analysis. Experiments with different rates lead to remarkably similar weight estimates and
learning speeds. Initial weights are randomly drawn from a uniform distribution within the range
[−0.01, +0.01].
All three network inputs are scaled by dividing the value observed for each example by the
input's maximum value in the set. Whereas input scaling is optional, scaling of the output is
required for successful learning. Scaling to fit the output within the [0.1, 0.9] range is usually used.
However, because the networks are tested on data other than those used for training and validation, and because total flows have increased between base and prediction years, a narrower interval is
used. The output (the number of commuters) is therefore scaled using a linear transformation so that 1980 flows fit the [0.25, 0.75] range. This transformation is given by
\[ T_{ij}^{\mathrm{network}} = 0.25 + 0.5 \, \frac{T_{ij}}{T_{\max}^{1980}}, \tag{7} \]
where $T_{ij}^{\mathrm{network}}$ is the output as seen by the network, and $T_{\max}^{1980}$ is the maximum commuter flow in
1980. At the testing stage, Eq. (7) is used in reverse.
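Eq. (7) and its inverse are simple enough to state directly in code. In the sketch below the function names and the 1980 maximum flow of 1000 commuters are hypothetical values chosen for illustration.

```python
def scale_flow(t, t_max_1980):
    """Eq. (7): map a flow onto the network output scale."""
    return 0.25 + 0.5 * t / t_max_1980

def unscale_flow(z, t_max_1980):
    """Inverse of Eq. (7): map a network output back to a flow."""
    return (z - 0.25) * t_max_1980 / 0.5

t_max_1980 = 1000.0  # hypothetical maximum 1980 commuter flow

low = scale_flow(0.0, t_max_1980)          # zero flow maps to 0.25
high = scale_flow(1000.0, t_max_1980)      # the 1980 maximum maps to 0.75
ceiling = unscale_flow(1.0, t_max_1980)    # flow implied by an output of 1
round_trip = unscale_flow(scale_flow(640.0, t_max_1980), t_max_1980)
```

Since a logistic output stays below 1, the inverse transform implies that a network scaled this way cannot predict a flow as large as 1.5 times the 1980 maximum, and any larger 1990 flow is necessarily underestimated.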
All networks are trained for a maximum of 100 000 iterations. Many neural network practitioners allow for an early stopping of the feedforward backpropagation algorithm (Sarle, 1996) in
order to prevent overfitting. It is critical to realize that the error on the validation set is not a good
estimate of network generalization. A stopped network is tested on an independent test set that
has never been used for training to give an unbiased estimate on the network performance. We
train and validate all networks on the 1980 data, while the testing is accomplished on the 1990
data.
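The stopping rule can be sketched as a loop that trains for a fixed maximum number of epochs while remembering the weights that gave the lowest validation error. Everything in the sketch is hypothetical: `train_epoch` and `validation_error` stand in for whatever the training software provides, and the one-weight toy problem only demonstrates the bookkeeping.

```python
import copy

def train_with_early_stopping(weights, train_epoch, validation_error, max_epochs):
    """Return the weights, epoch, and error at the validation-error minimum."""
    best_err = validation_error(weights)
    best_weights, best_epoch = copy.deepcopy(weights), 0
    for epoch in range(1, max_epochs + 1):
        weights = train_epoch(weights)      # one pass over the training set
        err = validation_error(weights)     # error on held-out validation data
        if err < best_err:
            best_err = err
            best_weights, best_epoch = copy.deepcopy(weights), epoch
    return best_weights, best_epoch, best_err

# Toy problem: training pulls the weight toward 1.0, validation prefers 0.8
def toy_train_epoch(w):
    gradient = 2.0 * (w[0] - 1.0)           # derivative of (w - 1)^2
    return [w[0] - 0.1 * gradient]          # gradient step, learning rate 0.1

def toy_validation_error(w):
    return (w[0] - 0.8) ** 2

best_w, best_epoch, best_err = train_with_early_stopping(
    [0.0], toy_train_epoch, toy_validation_error, max_epochs=100)
```

In the toy run the training error keeps falling for all 100 epochs, but the retained weights come from epoch 7, after which the validation error starts to rise. The retained weights would then be scored on a separate test set, since the validation minimum itself is an optimistic estimate.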
At the county level, a total of 225 data vectors are available. For each network processed, the
training set is formed by randomly selecting 112 vectors without replacement, while the remaining
113 vectors are used for validation. In one experiment, the full set of vectors is used both for
3 Fahlman (1989) also suggests adding a small constant (known as the sigmoid prime offset) to the error derivative to
avoid very slow learning when z is close to 0 or 1. The recommended value is 0.1. The Quickprop Maximum Growth
Factor is set to 1.75, as suggested by Fahlman (1989).
training and validation. The network weights that minimize validation error serve to test the
model on the 400 interactions from the 1990 trip matrix.
At the census-tract level, the training set is selected by simple random sample without replacement of 200 examples from the 121 104 origin-destination pairs in the 1980 tract-to-tract trip
matrix. 4 The validation set is drawn in the same way. The optimal set of network weights is then tested on all
257 049 vectors from the 1990 tract-level trip matrix.
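The county-level partition can be sketched as a simple random sample without replacement over all origin-destination pairs. The function name `split_pairs` and the seed are hypothetical, and keeping the training and validation sets disjoint is an assumption of this sketch that matches the county-level 112/113 split.

```python
import random

def split_pairs(n_zones, n_train, n_valid, seed=0):
    """Draw disjoint training and validation sets of origin-destination pairs."""
    pairs = [(i, j) for i in range(n_zones) for j in range(n_zones)]
    rng = random.Random(seed)
    # random.sample draws without replacement, so no pair is used twice
    sample = rng.sample(pairs, n_train + n_valid)
    return sample[:n_train], sample[n_train:]

# County level: 15 x 15 = 225 interaction vectors, split 112 / 113
train_pairs, valid_pairs = split_pairs(15, 112, 113)
```

The same function would serve at the census-tract level by passing the number of tracts and a sample size of 200 for each set.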
5. Results of performance comparison
5.1. Baseline spatial interaction models
The results of calibrating and testing maximum likelihood doubly-constrained models of
journey to work at the county and census-tract levels are presented in Table 1. The overall performance of these models with an exponential function of spatial deterrence is sufficient to provide
a benchmark against which to evaluate the performance of NN models. 5
Of the two distance decay functions, better performance is exhibited by the negative exponential function in terms of all four goodness-of-fit measures, and at both aggregation levels
(county and census tract). This is consistent with the widespread consensus that the exponential
function is more appropriate for analyzing short distance interactions, such as those that take
place within an urban area, while the power function is more appropriate for analyzing longer
distance interactions such as interstate migration flows (Fotheringham and O'Kelly, 1989).
Table 1 clearly shows that county-level models are more accurate than models applied at the
census-tract level. Lower model performance or fit at a more detailed geographic scale is not
unusual. This phenomenon is known in the spatial-analytic literature as the modifiable areal unit
problem (Openshaw, 1984). Treatment of the issue in the context of spatial interaction modeling
can be found in Amrhein and Flowerdew (1989), Batty and Sikdar (1982), and others. In substance, if a simple functional relationship with a single parameter, like the doubly-constrained
spatial interaction model, presents some difficulties in accounting for all subtleties of 400 interactions in the 20-county trip matrix, these difficulties are exacerbated with the 507-tract trip matrix.
Consequently, if NNs truly outperform MLE spatial interaction models, one would anticipate the
advantage to be much larger with census tracts than with county data. Along the same line, in case
NNs were not to perform as well as MLE models with the latter data, the reverse statement could
reasonably be expected for commuter flows between census tracts.
5.2. Neural network models
The results of the NN training and testing on county-level data are presented in Table 2. All
five sets of networks exhibit good to very good ability to predict 1990 commuter flows in Atlanta.
4
Tests with training sets of 1000 cases do not show better predictive accuracy than with 200 cases, while
computational time becomes prohibitively high.
5 Similar, though slightly poorer, results are obtained with a power function of distance.
Table 1
Maximum likelihood doubly-constrained models calibrated using 1980 flows, and tested on 1990 commuter flows

                                         Distance decay parameter ($\beta$)   Absolute error (AE) (%)   SRMSE
County level
Exponential function of distance decay   $-8.43 \times 10^{-5}$               24.7                      0.866
Census-tract level
Exponential function of distance decay   $-12.17 \times 10^{-5}$              68.7                      2.492
Table 2
Neural network models trained using 1980 county-to-county commuter flows, and tested on 1990 commuter flows

Instance    Absolute error (%)    SRMSE    Epoch network was stopped
(a) Five-node networks
1           42.6                  1.877    2500
2           39.5                  1.182    1000
3           41.0                  1.466    3500
4           44.0                  1.762    13 500
5           40.2                  1.642    8000
Average     41.5                  1.586
(b) Twenty-node networks
1           47.0                  1.890    500
2           41.8                  1.806    4500
3           50.6                  1.788    21 000
4           32.7                  1.077    9500
5           52.6                  1.655    500
Average     44.9                  1.643
(c) Fifty-node networks
1           51.2                  1.936    31 500
2           50.5                  1.773    500
3           28.8                  0.847    1500
4           56.6                  2.396    40 500
5           42.1                  1.683    9000
Average     43.8                  1.722
(d) Twenty-node networks trained on full 1980 data
1           30.9                  0.920    100 000
2           30.1                  0.878    100 000
3           36.2                  0.962    100 000
4           41.9                  1.151    100 000
5           39.7                  1.080    100 000
Average     35.8                  0.998
Comparison of the first four sets of results reported in Table 2 indicates that, except for a slight
tendency for performance to drop as more hidden nodes are used, there is no significant impact of
the number of hidden nodes in a network on goodness-of-fit. It is also noteworthy that all better
performing networks were stopped after less than 10 000 epochs.
The bottom part of Table 2 displays the results of testing five 20-node NNs trained on the
entire 1980 data set (225 training cases). As expected, they perform rather consistently and
somewhat better than networks trained on half the available interaction pairs. Training a neural
network on the full set of cases is usually not a recommended practice because it promotes
overfitting, a point evidenced here by the failure of all five networks to converge before 100 000
training epochs. On the other hand, training data is less sparse than with a sample of cases, and
the network's power to generalize input±output relationships is enhanced.
Neural networks trained and tested on tract-level flows perform significantly worse than those
trained on county-level data (Table 3). The best NN model produces errors that are 83.6% of 1990
commuter flows, while the worst model hits a whopping 119.1% absolute error. The faint inverse
relationship between model performance and number of hidden nodes detected above is now
clearly marked. Average goodness-of-fit measures drop with the increase of hidden nodes from 5
to 50.
5.3. Neural Network versus MLE Models
None of the NN models tested on county commuter flows outperforms the corresponding MLE
doubly-constrained model. The only partial exception to this rule is the case of one 50-node NN
model (#3 in Table 2(c)) which shows a better performance as measured by SRMSE and
R-square, but not according to the other two statistics. Even more remarkable, the five NNs
trained on the entire 1980 data set (Table 2(d)) still fail to surpass the MLE model in any run,
though they come closer to challenging its superiority.

Table 3
Neural network models trained using 1980 tract-to-tract commuter flows, and tested on 1990 commuter flows with
training and validation sets selected randomly

Instance    Absolute error (%)    SRMSE    Epoch network was stopped
(a) Five-node networks
1           90.4                  3.265    12 000
2           83.6                  3.226    6500
3           89.0                  3.280    5500
4           108.3                 4.072    500
5           96.6                  3.701    1200
Average     93.6                  3.509
(b) Twenty-node networks
1           92.4                  3.451    3000
2           105.3                 4.000    1000
3           119.1                 5.328    10 500
4           105.9                 4.009    98 500
5           97.5                  3.760    4000
Average     104.0                 4.110
(c) Fifty-node networks
1           109.0                 4.048    500
2           109.8                 4.106    500
3           111.0                 4.218    79 000
4           90.2                  3.418    16 000
5           101.6                 3.845    4000
Average     104.3                 3.927
At the census-tract level, the comparison is even more favorable for the MLE model. All runs
of the NN models (Table 3) trail far behind the MLE model (Table 1). The best of twelve NN
models misallocates 83.6% of all commuter flows, while the conventional doubly-constrained
model with the negative exponential function of distance misallocates "only" 68.7% of flows.
It is appropriate to stress here again that model performance is evaluated in a predictive mode,
that is, by the capacity of a model to predict interaction flows for a horizon other than the base
year used in training and validation. In fact, performance measured on base-year data would lead
to opposite conclusions, thus supporting the existing literature in the matter. For information
purposes, performances of MLE and NN models trained and tested on the 1990 county-to-county
flows are reproduced in appendices A and B, respectively.
By all accounts, the evidence reviewed above that neural networks show predictive
performance inferior to that of conventional statistical models is quite puzzling and unexpected. Neural networks are indeed regarded as good approximators (Kreinovich and Sirisaengtaskin, 1993). The
data analysis calls for further research to pinpoint the causes of their underperformance. In order
to trace potential patterns of consistent underprediction or overprediction by NN models, we use
three-dimensional plots of observed and predicted data. Each plot displays flows originating from
a given county against distance and number of workers at destinations. Such plots for a sample of
four counties, namely Clayton, Cobb, DeKalb, and Fulton counties, are depicted in Fig. 3 for
1990. Corresponding flow surfaces generated by the five tested instances of 20-node neural networks (see Table 2(b)) are given in Fig. 4.
On examination, the predicted surfaces in Fig. 4 reveal unsuspected structures dominated by a
wavy pattern of troughs and ridges. These structures are particularly pronounced in instance three
(Fig. 4(c)), which also happens to be among the instances that predict 1990 flows with the least
overall accuracy. This pattern is often symptomatic of overfitting due to excessive training of the
network. That this network was trained longer than any other 20-node network suggests that it
learned the noise in the training set in addition to the underlying function we want it to find. As a
result, its ability to generalize is rather poor and its prediction accuracy is low, particularly where
training data are sparse (interpolation problem). Another feature common to several underperforming network instances in Fig. 4 is the consistent underestimation of the largest 1990 expected
flows (Figs. 3 and 4). 6 Networks fail to extrapolate around and beyond the limits of the training
sample. Possible explanations for interpolation and extrapolation errors are now pursued.
The spurious ridges and troughs that show throughout the predicted surfaces suggest that
overfitting may be occurring, in spite of the early stopping mechanism put in place to prevent it.
In our commuter trip distribution problem, we can postulate that the occurrence of overfitting is
tied to an excessive number of hidden nodes in the network. Our problem may in fact be simple
enough to require fewer than five hidden nodes. After all, conventional spatial interaction models
perform well with a single parameter. Such a neural network could be devoid of spurious ridges and
troughs, and generalize just right.
6 Most remarkable in this matter are instances 2 and 3, for which the training set does not include
the Fulton-Fulton flow.
Fig. 3. Actual number of commuters in 1990 as a function of distance between county of residence
and county of work, and the number of jobs in the county of work. All variables are measured as
proportions of maximum 1980 values. The "Number of workers" axis is scaled logarithmically. Four
counties of residence: a, Clayton County; b, Cobb County; c, DeKalb County; d, Fulton County.
To test the proposition that the predictive performance of NN models is improved by reducing
the number of hidden nodes, various neural networks with one to three hidden nodes are trained on
1980 county-level commuter flows and tested on 1990 flows. Results are summarized in
Table 4. Networks with fewer hidden nodes suffer less from spurious troughs and ridges on their
prediction plots and are therefore less prone to overfitting. In fact, surfaces generated by
one-hidden-node networks do not exhibit any spurious feature (Fig. 5). Such networks no longer
model the noise in the training data because they are unable to produce complex surfaces. This
does not, however, translate into unequivocally better goodness-of-fit with validation data for
sparser networks. Furthermore, none of the networks tested with a reduced number of hidden
nodes (Table 4) succeeds in outperforming the MLE doubly-constrained model with exponential
function of distance (Table 1). A straightforward consequence is that the lower performance of
neural networks cannot be imputed to overfitting and cannot be remedied easily by modifying the
topology of the networks.
Fig. 4. 1990 commuter flows from Fulton County predicted by the 20-node neural networks listed in
Table 2: a, first instance; b, second instance; c, third instance; d, fourth instance; e, fifth
instance. All variables are measured as proportions of their maximum 1980 values. The "Number of
workers" axis is scaled logarithmically.
Table 4
Neural network models with few hidden nodes. Trained using 1980 county-to-county commuter flows,
and tested using 1990 commuter flows
Instance   Absolute error (%)   SRMSE   Epoch network was stopped
(a) One-node networks
1          38.1                 1.592   2500
2          38.1                 1.245   1500
3          35.6                 1.666   27000
4          41.8                 1.434   500
5          42.9                 1.480   1000
Average    39.3                 1.483
(b) Two-node networks
1          32.5                 1.276   3000
2          27.7                 0.962   37000
3          64.8                 2.387   51000
4          46.0                 2.002   1000
5          60.6                 2.262   20000
Average    46.3                 1.778
(c) Three-node networks
1          31.1                 1.248   500
2          51.9                 2.143   500
3          41.9                 1.904   2000
4          43.6                 2.021   1000
5          38.0                 1.272   1000
Average    41.3                 1.718
The fact remains that neural networks have a limited ability to interpolate spatial interaction
data in a predictive mode. Paradoxically, the cause of this weakness may also be the essence of
their strength in validation on contemporary data, namely the inherent flexibility to approximate
complex data structures with great accuracy. In short, the poor fit of neural networks on
prediction-year data (1990) can be blamed on their unrivaled fit to base-year data (1980). According
to this view, neural networks are such good approximators that they model not only interaction
data structures, but also the context of the transportation systems within which commuter
patterns take place. By design, spatial interaction neural networks are context-dependent models
whose parameters do not transfer well to other contexts. The extent of NN context sensitivity
remains a subject for future study. A solution to this problem may come from the explicit
incorporation of context dependencies in the network representation. Evidence in Table 4 suggests
that model transferability is problematic even for sparse model topologies.
Fig. 5. 1990 commuter flows from Fulton County predicted by a single-node neural network. All
variables are measured as proportions of maximum 1980 values.
It is our contention that the sigmoid form of network output limits the ability of neural
networks to extrapolate interaction data in a meaningful way. Sigmoid output nodes indeed tend to
generate S-shaped predicted surfaces that are ill-suited to model spatial interaction behavior. For
illustration purposes, let us compare how flows predicted by the NN and conventional gravity
models respond to distance as the other two input variables are held constant. Most NN flow
surfaces (Figs. 4 and 5) have in common an S-shaped profile of dependence between flow volume
and distance. This profile implies that, all other things being equal, the marginal flow increase with
respect to distance is small and declining, sometimes even negative. On the contrary, observed
patterns (Fig. 3) show no tapering in the relationship between flow volume and distance.
Consequently, flow extrapolation on the S-shaped profile is highly inaccurate. Because enough of the
1990 flow data fall outside of the range of the 1980 training data, the overall performance of the
network is generally poor. A significant implication is that conventional feedforward
backpropagation NNs may not exhibit the right properties for use in the application domain of trip
distribution. Other NN models that do not assume a sigmoidal activation function, such as the
Gaussian Radial Basis Function model (Verleysen and Hlavackova, 1994), may prove better
suited for spatial interaction problems.
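The saturation argument can be made concrete with a small numerical sketch. It assumes a standard logistic output node and a flow target scaled as a proportion of the 1980 maximum, as the figure captions indicate; the target value of 1.3 and the list of weighted inputs are hypothetical and chosen only for illustration.

```python
import math

def logistic(net):
    """Standard logistic (sigmoid) activation, bounded in the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-net))

# Hypothetical scaled 1990 flow that exceeds the 1980 maximum (1.0 after scaling).
target = 1.3

# Hypothetical weighted inputs reaching the output node, in increasing order.
nets = (0.0, 2.0, 4.0, 8.0, 16.0)
outputs = [logistic(net) for net in nets]

# The output approaches but never reaches 1.0, so the shortfall stays above target - 1.
shortfalls = [target - z for z in outputs]

# The gain from each further step shrinks, even though the steps themselves get wider.
increments = [b - a for a, b in zip(outputs, outputs[1:])]

for net, z, s in zip(nets, outputs, shortfalls):
    print(f"net={net:5.1f}  output={z:.6f}  shortfall={s:.6f}")
```

Under these assumptions, no choice of weights lets the output node reproduce a flow larger than the largest value seen in the base year, which is one way to read the underestimation of the largest 1990 flows.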
In contrast to neural networks, the smooth surface generated by the MLE model with a negative
exponential function of distance decay provides a better fit to the empirical data. A good fit
is achieved not only for the data on which the model was calibrated, but also for the unseen data
beyond the training range. This indicates that the maximum-likelihood model is a better
extrapolator than the neural network, and a better tool for urban and regional planning. A
fundamental reason for the better performance of the maximum-likelihood model is that, being a
one-parameter model, it generalizes more than neural networks and, consequently, is more context
independent. Also contributing towards better performance is its derivation from first
principles, whereas the NN approach is purely data-driven. Wilson (1970) showed in his seminal work
how the exponential distance decay function is derived from the entropy principle, by finding the
most likely trip matrix given the origin and destination totals and the total distance traveled in the
system. The principle of maximum likelihood applies to all trip matrices, regardless of their use
for model calibration or model testing; hence the better extrapolation capability of the
maximum-likelihood model.
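The entropy argument credited to Wilson (1970) can be sketched in standard textbook form. The notation below is a reconstruction for illustration rather than a quotation: T is the total number of trips, C the total distance traveled in the system, and the Lagrange multipliers \lambda_i, \mu_j and \beta are symbols introduced here, with \beta playing the role of the distance decay parameter.

```latex
% Most likely trip matrix: maximize the number of microstates W
% subject to the origin, destination, and total-distance constraints.
\begin{align}
\max_{\{T_{ij}\}} \ \ln W
  &= \ln \frac{T!}{\prod_{i,j} T_{ij}!}
   \approx T \ln T - T - \sum_{i,j} \left( T_{ij} \ln T_{ij} - T_{ij} \right)
   && \text{(Stirling's approximation)} \\
\text{subject to} \quad
  \sum_{j} T_{ij} &= O_i \ \ \forall i, \qquad
  \sum_{i} T_{ij} = D_j \ \ \forall j, \qquad
  \sum_{i,j} T_{ij}\, c_{ij} = C .
\end{align}
% First-order condition of the Lagrangian with respect to T_{ij}:
\begin{align}
-\ln T_{ij} - \lambda_i - \mu_j - \beta c_{ij} &= 0 \\
\Longrightarrow \quad
T_{ij} &= \exp\left( -\lambda_i - \mu_j - \beta c_{ij} \right)
        = A_i O_i \, B_j D_j \, \exp\left( -\beta c_{ij} \right),
\end{align}
% with A_i O_i = \exp(-\lambda_i) and B_j D_j = \exp(-\mu_j).
```

The multipliers on the origin and destination constraints are absorbed into the balancing factors, so only the distance decay parameter remains to be estimated, which is consistent with the one-parameter character of the model noted above.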
5.4. Geographic scale problem
The dramatically lower performance of neural networks on tract-level data suggests that
additional factors are at work at this scale. The vast majority of commuter flows in tract-level trip
matrices are zero (82.9% of all flows in the 1980 matrix, and 82.5% in the 1990 matrix), while most
non-zero flows are fairly small. With only a small fraction of flows significantly larger than the
rest, small random samples of training examples have little chance to include large flows. As a
result, networks trained on random samples of examples primarily learn how to predict zero and
very small flows. Since we established earlier that neural networks are rather poor extrapolators of
spatial interaction flows, their predictions of larger flows are highly inaccurate. Hence the low
overall performance of neural networks on small analysis zones.
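The claim that small uniform samples rarely contain large flows can be checked with a back-of-the-envelope calculation. The sketch below computes the probability that a uniform random sample drawn without replacement contains at least one "large" flow. The population size of 121,104 interactions is the 1980 figure quoted in this section; the number of large flows (50) and the sample sizes are hypothetical values chosen only for illustration.

```python
def prob_at_least_one(population, n_large, sample_size):
    """Probability that a uniform sample without replacement holds >= 1 large flow."""
    # P(no large flow) = C(N - k, n) / C(N, n), accumulated as a running product
    # of per-draw probabilities to avoid very large binomial coefficients.
    p_none = 1.0
    for draw in range(sample_size):
        remaining_small = population - n_large - draw
        if remaining_small <= 0:
            return 1.0  # the sample is forced to include a large flow
        p_none *= remaining_small / (population - draw)
    return 1.0 - p_none

N_INTERACTIONS = 121_104  # 1980 tract-to-tract interactions (from the text)
N_LARGE = 50              # hypothetical count of "large" flows

for n in (100, 200, 1000):  # hypothetical training-sample sizes
    print(n, round(prob_at_least_one(N_INTERACTIONS, N_LARGE, n), 4))
```

Larger samples raise the probability, but at the cost of the longer training times discussed next.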
Resorting to larger samples (say, more than 1000 cases), or even to the entire population of
samples, is not a practical solution because it leads to unacceptably long training. An appealing
alternative consists in using stratified random sampling instead of uniform random sampling in
order to represent flows of all sizes in the training set. The effectiveness of this strategy is now
assessed with two distinct stratified sampling schemes.
Table 5
Neural network models trained using 1980 tract-to-tract commuter flows, and tested using 1990
commuter flows with training and validation sets selected using stratified random sampling
Instance   Absolute error (%)   SRMSE   Epoch network was stopped
Sampling strategy I; Five-node networks
1          90.5                 3.407   8000
2          91.0                 3.673   1000
3          94.7                 3.766   500
4          90.8                 3.416   1500
5          94.6                 3.502   2000
Average    92.3                 3.553
Sampling strategy II; Five-node networks
1          93.4                 3.611   1000
2          91.0                 3.565   500
3          90.3                 3.540   500
4          90.9                 3.564   500
5          93.4                 3.613   1000
Average    91.8                 3.578
In strategy I, 20 examples of zero flows and 180 examples of non-zero flows are selected
randomly without replacement from 121,104 interactions in 1980. In strategy II, we select 10
examples of zero flows, and 10 randomly-selected examples from each bin of origin-destination pairs
defined by 10-unit increments on the flows. In both strategies, validation sets are selected
similarly. The testing results for a 5-node network are presented in Table 5.
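The two stratified sampling schemes described above can be sketched as follows. This is an illustrative reading, not the authors' code: the function names, the dictionary representation of the trip matrix, the bin boundaries (1 to 10, 11 to 20, and so on, for integer flows), and the rule of taking every pair from a bin with fewer than 10 members are all assumptions.

```python
import random

def sample_strategy_one(flows, n_zero=20, n_nonzero=180, rng=None):
    """Strategy I: fixed numbers of zero and non-zero flows, without replacement."""
    rng = rng or random.Random()
    zero = [od for od, f in flows.items() if f == 0]
    nonzero = [od for od, f in flows.items() if f > 0]
    return (rng.sample(zero, min(n_zero, len(zero)))
            + rng.sample(nonzero, min(n_nonzero, len(nonzero))))

def sample_strategy_two(flows, per_bin=10, bin_width=10, rng=None):
    """Strategy II: the same number of examples from each flow-size bin."""
    rng = rng or random.Random()
    bins = {}
    for od, f in flows.items():
        # Bin 0 holds zero flows; bin b >= 1 holds integer flows in
        # ((b - 1) * bin_width, b * bin_width]  (assumed boundary convention).
        key = 0 if f == 0 else (f - 1) // bin_width + 1
        bins.setdefault(key, []).append(od)
    sample = []
    for key in sorted(bins):
        members = bins[key]
        # Assumption: a bin with fewer than per_bin pairs contributes all of them.
        sample.extend(rng.sample(members, min(per_bin, len(members))))
    return sample

# Hypothetical 30 x 30 trip matrix keyed by (origin, destination), integer flows.
flows = {(i, j): max(0, i + j - 30) for i in range(30) for j in range(30)}

training_one = sample_strategy_one(flows, rng=random.Random(0))
training_two = sample_strategy_two(flows, rng=random.Random(0))
print(len(training_one), len(training_two))
```

A validation set could be drawn the same way from the pairs not already selected for training, in line with the statement that validation sets are selected similarly.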
Comparison of these goodness-of-fit results to those of the five-node network trained on a
simple random sample (Table 3) reveals no significant improvement. The five-node networks with
training and validation sets selected using stratified random sampling have an average absolute
error of 92.3% for sampling strategy I, and 91.8% for sampling strategy II, against 93.6% with
uniform random sampling. This piece of evidence suggests that using stratified random sampling
instead of uniform random sampling to select the training set does not improve the accuracy of
NN spatial interaction models. More complex stratification strategies may produce better results,
but we leave this investigation for the future.
6. Conclusions
This study compared the performance of multilayer perceptron neural networks and
maximum-likelihood doubly-constrained models for commuter trip distribution. Our experiments
produced overwhelming evidence that NN models may fit data better but their predictive accuracy is
poor in comparison to that of maximum-likelihood doubly-constrained models. What our thorough study
failed to identify are perceptron model configurations that consistently exhibit a predictive
performance that surpasses that of maximum-likelihood doubly-constrained models. The study points to
several likely causes of neural network underperformance, including model non-transferability,
insufficient ability to generalize, reliance on sigmoid activation functions, and their essence as
data-driven techniques. An agenda for future research is also proposed to explore the potential for
other perceptron formulations (e.g., spatial structure as NN input) and other neural networks
(RBF, for instance) to predict spatial interaction flows with greater accuracy.
This conclusion is at variance with the existing literature, which has been overly optimistic
about the advantages of modeling trip distribution by spatial interaction with backpropagation
neural networks. While neural networks may perform better than conventional models in modeling
spatial interaction for the base year, they fail to outperform the MLE doubly-constrained
model for forecasting purposes, which is the motivation behind these modeling efforts in the first
place. Therefore, current perceptron neural networks do not provide an appropriate modeling
approach to forecasting trip distribution over a planning horizon for which distribution predictors
(number of workers, number of residents, commuting distance) are well beyond their base-year
domain of definition.
Acknowledgements
The authors are grateful to Dr. Frank Koppelman. His insightful comments on an earlier
version of the manuscript were instrumental in enhancing its quality.
Appendix A
Maximum likelihood doubly-constrained models calibrated and tested using 1990 commuter
flows among the counties of the Atlanta MSA.
Exponential function of distance decay
Distance decay parameter (b)   Absolute error (AE) (%)   SRMSE
-7.64 × 10^-5                  24.0                      0.728
Appendix B
Neural network models trained and tested using 1990 county-to-county commuter flows in
Atlanta
Instance   Absolute error (%)   SRMSE   Epoch network was stopped
Five-node networks
1          27.1                 0.723   100000
2          18.2                 0.470   100000
3          23.3                 0.634   100000
4          18.3                 0.463   100000
5          21.1                 0.520   100000
Average    21.6                 0.562
Twenty-node networks
1          24.3                 0.585   100000
2          15.2                 0.379   100000
3          21.4                 0.554   100000
4          27.3                 0.637   100000
5          24.1                 0.636   100000
Average    22.5                 0.558
Fifty-node networks
1          8.6                  0.169   100000
2          8.4                  0.168   100000
3          8.6                  0.166   100000
4          8.7                  0.168   100000
5          10.7                 0.212   100000
Average    9.0                  0.176
References
Amrhein, C.G., Flowerdew, R., 1989. The effect of data aggregation on a Poisson regression model of
Canadian migration. In: Goodchild, M., Gopal, S. (Eds.), The Accuracy of Spatial Databases. Taylor
and Francis, London, pp. 229-238.
Bacharach, M., 1970. Biproportional Matrices and Input-Output Change. Cambridge University Press, Cambridge.
Batty, M., 1976. Urban Modeling: Algorithms, Calibrations, Predictions. Cambridge University Press, Cambridge.
Batty, M., Mackie, S., 1972. The calibration of gravity, entropy, and related models of spatial
interaction. Environment and Planning A 4, 205-233.
Batty, M., Sikdar, P.K., 1982. Spatial aggregation in gravity models: 1. An information-theoretic
framework. Environment and Planning A 14, 377-405.
Bishop, C.M., 1995. Neural Networks for Pattern Recognition. Oxford University Press, Oxford.
Black, W.R., 1995. Spatial interaction modeling using artificial neural networks. J. Transport
Geography 3 (3), 159-166.
Bureau of the Census, 1983. 1980 Census of Population and Housing, Census Tracts, Atlanta, GA.
PHC80-2-77. US Department of Commerce, Bureau of the Census, Washington.
Bureau of Transportation Statistics, 1993. 1990 Census Transportation Planning Package. CD-ROM. US
Department of Transportation, Bureau of Transportation Statistics, Washington.
Dougherty, M., 1995. A review of neural networks applied to transport. Transportation Research C 3, 247-260.
Evans, A.W., 1971. The calibration of trip distribution models with exponential or similar cost
functions. Transportation Research 5, 15-38.
Fahlman, S.E., 1989. Faster-learning variations on back-propagation: An empirical study. In:
Touretzky, D., Hinton, G., Sejnowski, T. (Eds.), Proceedings of the 1988 Connectionist Models
Summer School. Morgan Kaufmann, San Mateo, pp. 38-51.
Fischer, M.M., Gopal, S., 1994. Artificial neural networks: A new approach to modeling interregional
telecommunication flows. J. Regional Science 34, 503-527.
Fotheringham, A.S., Knudsen, D.C., 1987. Goodness-of-fit Statistics. CATMOG series. Geo Abstracts, Norwich.
Fotheringham, A.S., O'Kelly, M.E., 1989. Spatial Interaction Models: Formulations and Applications.
Kluwer, London.
Goodman, P.H., 1996. NevProp software, ver. 3. University of Nevada, Reno, NV. URL:
http://www.scs.unr.edu/nevprop/.
Gopal, S., Fischer, M.M., 1996. Learning in single hidden-layer feedforward network: Backpropagation
in a spatial interaction modeling context. Geographical Analysis 28, 38-55.
Haykin, S.S., 1998. Neural Networks: A Comprehensive Foundation. Prentice Hall, Upper Saddle River.
Himanen, V., Nijkamp, P., Reggiani, A., 1998. Neural Networks in Transport Applications. Ashgate, Brookfield.
Hua, J., Faghri, A., 1994. Applications of artificial neural networks to intelligent vehicle-highway
systems. Transportation Research Record 1453, 83-90.
Kikushi, S., Nanda, R., Perincherry, V., 1993. A method to estimate trip-O-D patterns using a neural
network approach. Transportation Planning and Technology 17, 51-65.
Kreinovich, V., Sirisaengtaskin, O., 1993. Universal approximators for functions and for control
strategies. Neural, Parallel, and Scientific Computations 1, 325-346.
Mozolin, M.V., 1997. Spatial interaction modeling with an artificial neural network. Discussion Paper
Series 97-1, Department of Geography, University of Georgia, Athens, GA.
O
www.elsevier.com/locate/trb
Trip distribution forecasting with multilayer perceptron neural
networks: A critical evaluation
M. Mozolin a, J.-C. Thill
b,*
, E. Lynn Usery
c
a
b
ESRI, Inc. Redlands, CA, USA
Department of Geography and National Center for Geographic Information and Analysis, State University of New York
at Bualo, Amherst, NY, USA
c
Department of Geography, University of Georgia, Athens, GA, USA
Received 17 February 1998; received in revised form 15 April 1999; accepted 19 April 1999
Abstract
This study compares the performance of multilayer perceptron neural networks and maximum-likelihood doubly-constrained models for commuter trip distribution. Our experiments produce overwhelming
evidence at variance with the existing literature that the predictive accuracy of neural network spatial interaction models is inferior to that of maximum-likelihood doubly-constrained models with an exponential
function of distance decay. The study points to several likely causes of neural network underperformance,
including model non-transferability, insucient ability to generalize, and reliance on sigmoid activation
functions, and their inductive nature. It is concluded that current perceptron neural networks do not
provide an appropriate modeling approach to forecasting trip distribution over a planning horizon for
which distribution predictors (number of workers, number of residents, commuting distance) are beyond
their base-year domain of de®nition. Ó 1999 Elsevier Science Ltd. All rights reserved.
1. Introduction
A number of modeling approaches have been put forward over the years to distribute trips,
freight or information among origins and destinations. One of the more successful ones is the
spatial interaction (or gravity) model (Ortuzar and Willumsen, 1994). This model relates the
matrix of ¯ows to a matrix of interzonal impedance. Traditionally, the spatial interaction model is
calibrated by one of several well known techniques, including regression, maximum likelihood, or
by numerical heuristics. Several recent studies (Black, 1995; Fischer and Gopal, 1994; Gopal and
Fischer, 1996; Openshaw, 1993) have proposed the neural network architecture as a means to
*
Corresponding author. Tel.: +716 645 2722; fax: +716 645 2329; e-mail: [email protected]alo.edu
0191-2615/99/$ - see front matter Ó 1999 Elsevier Science Ltd. All rights reserved.
PII: S 0 1 9 1 - 2 6 1 5 ( 9 9 ) 0 0 0 1 4 - 4
54
M. Mozolin et al. / Transportation Research Part B 34 (2000) 53±73
model the distributed complexity of spatial interaction. 1 This line of research has shown that
neural networks generally outperform classical calibration and estimation approaches.
At ®rst, this conclusion should not come as much of a surprise to many modelers given the wide
success experienced by neural networks in pattern recognition and classi®cation (Bishop, 1995;
Smetanin, 1995; Ripley, 1996), as well as in various application ®elds of transportation engineering and planning (Dougherty, 1995; Himanen et al., 1998; Hua and Faghri, 1994). After all,
neural networks impose less constraints on the form of the functional relationship between inputs
and outputs than conventional ®tting techniques. This paper revisits this conclusion by paying
attention to several aspects of spatial interaction modeling that have not been addressed so far.
Our aim is to compare the performance of a perceptron neural network (NN) spatial interaction model to that of a baseline, conventionally estimated spatial interaction model beyond the
comparative work done previously. The comparison is conducted empirically on journey-to-work
patterns in the Atlanta metropolitan area. Our approach diers drastically from others in several
respects.
Firstly, we evaluate the models in a predictive mode. In other words, calibration is done on
observed, base-year data, while testing is conducted on data for the projection year. To the best of
our knowledge, all other NN studies of trip distribution have used the same origin-destination
matrix for training and testing, thus allowing the network to learn the noise in the training data
(Black, 1995). Incidently, NN applications to trac data and other transportation problems also
use hold out samples for testing. Secondly, our baseline model is a doubly-constrained model
estimated by maximum likelihood. This is a departure from Fischer and Gopal (1994), and Gopal
and Fischer (1996) who chose the less accurate unconstrained spatial interaction model as a
benchmark, and estimated model parameters by ordinary least squares regression, a method
considered less precise than maximum likelihood (Fotheringham and O'Kelly, 1989).
Thirdly, we evaluate the models on origin-destination matrices of dierent sizes (from hundreds
of origin/destination zones down to a dozen) to test the sensitivity of our conclusions to the size of
the interaction system being modeled. Finally, we apply an adjustment factor to ¯ows predicted
by the NN output to satisfy production and attraction constraints, and thus make it possible to
unambiguously interpret any discrepancy with ¯ows predicted by the baseline doubly-constrained
model in terms of relative performance of the models.
The paper presents a case where a conventional spatial interaction model outperforms a
multilayer perceptron NN model of spatial interaction. The predictive mode of the analysis
replicates the process by which trip distribution is realized in transportation planning, and thus
helps to compare the merits of the conventional and NN approaches for practical applications of
spatial interaction modeling.
The remainder of the paper is organized as follows. The next two sections present an overview
of the conventional spatial interaction model of journey-to-work, and of the multilayer perceptron NN model. In the following section, we describe the setup of the empirical test of the
latter model against the former, as well as the data used in the test. Next, results under dierent
1
Some work has also been done on estimating origin-destination matrices from trac counts or cordon intercept
survey data with neural networks (for instance, Kikushi et al., 1993). This type of problem is out of the scope of this
paper.
M. Mozolin et al. / Transportation Research Part B 34 (2000) 53±73
55
modeling con®gurations are detailed. We conclude with a discussion of possible explanations for
the better performance of the conventional spatial interaction model.
2. Journey to work problem and its conventional solution
Spatial interaction may be de®ned in general terms as any ¯ow of commodity, people, capital,
or information over space resulting from some explicit or implicit decision process (Fotheringham
and O'Kelly, 1989). Journey to work is one kind of spatial interaction. Other kinds of spatial
interaction include journey to school, shopping trips, non-home based intraurban trips, intercity
population migration, choice of college or university by students, intercity freight movement,
telephone calls, Internet access, and many others.
Spatial interaction models are often classi®ed on the basis of the number and character of
constraints imposed on the predicted trip matrix. Constraints represent a priori knowledge about
the total interaction ¯ows entering and/or exiting a particular zone. For example, if the total
number of employed residents in each zone i(Oi ) is known exogenously, then the sum of predicted
¯ows leaving each zone is equal to Oi
X
Tij Oi ; 8i
1
j
where Tij is the ¯ow of commuters from zone i to zone j predicted by the model. Similarly if one
knows total employment in each zone (Dj ), one can impose that the sum of predicted commuter
¯ows ending in each zone is equal to Dj :
X
Tij Dj ; 8j:
2
i
If Eq. (1) holds for each origin zone then the model is said to be production constrained; if Eq. (2)
holds for each destination zone then the model is referred to as attraction constrained. If both
Eqs. (1) and (2) hold, the model is doubly constrained, while if neither of the two holds, the model
is unconstrained.
Trip distribution may be modeled with any number of constraints. Implementing the additional
constraints requires more a priori information. In turn, the reduction of degrees of freedom leads
to a more accurately predicted ¯ow matrix. It has been shown empirically that estimating spatial
interaction with a doubly-constrained model yields the most accurate results. See, for example,
Fotheringham and O'Kelly (1989) on interregional migration among the nine major census divisions in the United States, or Mozolin (1997) on commuting trips within metropolitan Atlanta.
Because of its higher accuracy in modeling trip distribution, the doubly-constrained model is a
proper baseline against which to evaluate neural networks.
The doubly-constrained spatial interaction model of journey to work can be formulated
mathematically as:
Tij Ai Oi Bj Dj f cij ;
3
where cij is the travel impedance (distance) from zone i to zone j, f(cij ) is a distance decay
function, Oi is the number of workers resident in zone i (production), Dj is the number of workers
56
M. Mozolin et al. / Transportation Research Part B 34 (2000) 53±73
in zone j (attraction), and Ai and Bj are balancing coecients ensuring that Eqs. (1) and (2) are
satis®ed:
1
4
Ai P
Bj Dj f cij
j
and
1
Bj P
:
Ai Oi f cij
5
i
Two alternative speci®cations of the distance decay function f(cij ) will be used here: the negative
power function cÿb
ij b P 0, and the negative exponential function exp ÿbcij b P 0.
Of all the methods suggested to calibrate spatial interaction models (see, for instance, Bacharach (1970), Batty (1976), Evans (1971), Fotheringham and O'Kelly (1989), Ortuzar and
Willumsen (1994), Wilson (1970)), we choose to use a maximum-likelihood estimation (MLE)
approach. Batty and Mackie (1972) have shown that likelihood maximization boils down to
solving a non-linear equation. With a power distance function, this equation is given by
XX
XX
Tij ln cij
Tij ln cij
6
i
j
i
j
where Tij is the actual number of commuters from zone i to zone j, and Tij is estimated by
Eqs. (3)±(5). Mutatis mutandis with an exponential distance function. The SIMODEL computer
code (Williams and Fotheringham, 1984) is used to derive parameter estimates.
3. Multilayer perceptron neural networks applied to the journey to work problem
Background of multilayer perceptron neural networks is presented below before we proceed
with their application to the journey to work problem. The multilayer perceptron neural network
is one of a variety of parallel computing techniques that conceptually mimic structures and
functions of human central neural systems. The model used in this study is a three-layer fullyconnected feedforward NN which consists of input nodes representing independent variables (the
productions, the attractions, and the travel impedances), hidden nodes, and one output node for
the dependent variable, namely the ¯ow Tij from zone i to zone j. See Fig. 1 for the architecture of
a NN with four hidden nodes. Each input node corresponds to an independent variable in the ¯ow
model while the dependent variable Tij is the output node. The network output (activation) z is
obtained by a double logistic transform of the weighted sum of inputs. The reader is referred to
Haykin (1998), Rojas (1996), or Smith (1993) for an in-depth coverage of the NN methodology.
The most valuable property of multilayer feedforward NNs is their ability to approximate a
desired function from training examples. In fact, a three-layer fully connected feedforward NN
with n input nodes, a sucient number of hidden nodes, and one output node can be trained to
approximate an n to 1 mapping function of arbitrary complexity (Kreinovich and Sirisaengtaskin,
1993). Learning of network weights often proceeds by backpropagation of errors (Rumelhart
et al., 1986) so as to minimize the total error for all examples in a training set.
The backward pass is a recursive procedure (Rumelhart et al., 1986) during which the partial
derivatives of all network weights with respect to the error used to adjust the weights up or down
M. Mozolin et al. / Transportation Research Part B 34 (2000) 53±73
57
Fig. 1. Fully-connected three-layer perceptron neural network.
to reduce total error. In this research, we use an o-line, or epoch-based learning: network weights
are adjusted only after all examples in the training set have been processed. Several non-linear
optimization methods are available to ®nd a set of weights that minimizes the error on all examples in the training set. This study uses the Quickprop algorithm developed by Fahlman (1989).
Though it does not guarantee a global optimum, its quick convergence dramatically increases the
speed of NN training. The gradient descent method (Rumelhart et al., 1986) is applied in those
rare instances where Quickprop cannot be used.
Backpropagation neural networks easily ®t into the framework of the doubly-constrained
spatial interaction model. The network learns the mapping function that best ®ts the relationship
between dependent variables (production, attraction, and travel impedance) and the independent
variable (¯ows). Interestingly, the mapping function is no longer restricted to either power or
exponential functional form as in the conventional models. Nor is it explicitly speci®ed as a linear
or non-linear regression model. The major advantage of the NN approach is that it is ¯exible
enough to model non-linear relationships of arbitrary complexity in an automated fashion.
As noted by Black (1995), Fischer and Gopal (1994), and Gopal and Fischer (1996), a NN may
perform well enough to estimate actual spatial interaction ¯ows, but small deviations are bound
to remain. Furthermore, the network itself does not contain any mechanism to enforce the origin
and destination constraints. Consequently, the origin and destination totals derived by summing
the ¯ows predicted by the model are usually not equal to the actual origin and destination totals.
58
M. Mozolin et al. / Transportation Research Part B 34 (2000) 53±73
We use a standard iterative proportional ®tting procedure (Slater, 1976) to enforce these constraints. After this post-processing, NN ¯ows give predictions comparable to those of a doublyconstrained model.
Network training is realized with NevProp 1.16 (Goodman, 1996). The Quickprop algorithm is
embedded in NevProp, but pre- and post-processing (including scaling and enforcing production
and attraction constraints) are part of separate applications written by the authors.
4. Empirical analysis
4.1. Study Area and Data Sources
We use 1980 and 1990 journey-to-work commuter flows in the Atlanta Metropolitan Area.
Commuter flows among the 15 counties of the Atlanta SMSA for 1980 are available from the 1980
U.S. Census (Bureau of the Census, 1983). Commuter flows among the 20 counties of the Atlanta
MSA (Fig. 2) for 1990 are available in the Census Transportation Planning Package (CTPP)
(Bureau of Transportation Statistics, 1993). Data sets on commuting flows between census tracts
in 1980 and 1990 were kindly made available to the authors by the Atlanta Regional Commission.
There were 348 census tracts in 1980, and 507 census tracts in 1990 in the study area.
Fig. 2. Counties of Atlanta Metropolitan Statistical Area (MSA) in 1990.
The logistics of spatial interaction modeling requires a clearly defined region with no, or small,
flows across its border. In the case of Atlanta, this assumption is not grossly violated. In 1980,
slightly more than 90% of working residents of the Atlanta SMSA also worked inside the SMSA.
In 1990, 95.3% of working residents of the Atlanta MSA worked inside the MSA. Also, 92.9% of
those employed in the Atlanta MSA lived within the region in 1990.
Spatial separation between commuting zones (counties or census tracts) is measured by the
straight-line (Euclidean) distance between zone centroids in the metropolitan area. Setting intra-zone distance to zero is known to generate systematic measurement errors. It is common
practice to correct for this error by defining the distance from a zone to itself as a quarter of the
distance from the zone centroid to the centroid of its nearest neighbor (Thomas and Huggett,
1980). The Euclidean distance is only an approximation of the perceived impedance between
home and work locations. We recognize that road distance, travel time, or a generalized travel
cost function may differently affect the predictive accuracy of the MLE and NN models. However,
since the major thrust of this study is to compare and evaluate two forecasting techniques, the
relative accuracy of their estimates matters more than their absolute accuracy, and the approximation given by Euclidean distance is acceptable.²
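The distance measure described above can be sketched in a few lines. The function name and the representation of centroids as (x, y) pairs in a projected coordinate system are assumptions made for illustration.

```python
import math


def distance_matrix(centroids, intrazonal_fraction=0.25):
    """Euclidean centroid-to-centroid distances, with the distance from a zone
    to itself set to a fraction of the distance to its nearest neighbor."""
    n = len(centroids)
    d = [[math.hypot(centroids[i][0] - centroids[j][0],
                     centroids[i][1] - centroids[j][1])
          for j in range(n)] for i in range(n)]
    for i in range(n):
        # Nearest other zone; requires at least two zones.
        nearest = min(d[i][j] for j in range(n) if j != i)
        d[i][i] = intrazonal_fraction * nearest
    return d
```

For three hypothetical centroids at (0, 0), (4, 0) and (0, 3), the intra-zone distances come out as 0.75, 1.0 and 0.75, a quarter of 3, 4 and 3 respectively.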
4.2. Implementation issues
The journey-to-work analysis is structured as follows. First, two doubly-constrained spatial
interaction models are calibrated by MLE on 1980 travel data. One model uses county-level data,
the second uses census tract-level data. Each model is then employed to forecast interzonal
commuter flows (intercounty or intertract) for year 1990 with the calibrated distance decay parameter β and 1990 working population and employment data at the corresponding geographic
resolution (O_i and D_j marginal totals) for production and attraction, respectively. Forecasted
flows are compared to actual 1990 trip data using four goodness-of-fit measures: the absolute
error (AE), the standardized root mean square error (SRMSE), Kullback's w statistic, and the
R-square. See Fotheringham and Knudsen (1987), Weiss (1995), and others for a description of
the statistics. In most cases, these measures are highly consistent. Hence, only the AE and SRMSE
measures are reported for the tested models hereunder.
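The forecasting step can be illustrated with a short sketch. It assumes the usual doubly-constrained form T_ij = A_i O_i B_j D_j exp(β d_ij) with a negative β (consistent with the sign of the parameters in Table 1); the function name and the fixed number of balancing iterations are hypothetical, and the maximum-likelihood calibration of β itself is not shown.

```python
import math


def doubly_constrained(origins, destinations, dist, beta, n_iter=500):
    """Forecast flows with an exponential deterrence function, given a
    calibrated beta and marginal totals with equal sums."""
    n, m = len(origins), len(destinations)
    f = [[math.exp(beta * dist[i][j]) for j in range(m)] for i in range(n)]
    a = [1.0] * n
    b = [1.0] * m
    for _ in range(n_iter):
        # Alternate the two sets of balancing factors until they stabilize.
        a = [1.0 / sum(b[j] * destinations[j] * f[i][j] for j in range(m))
             for i in range(n)]
        b = [1.0 / sum(a[i] * origins[i] * f[i][j] for i in range(n))
             for j in range(m)]
    return [[a[i] * origins[i] * b[j] * destinations[j] * f[i][j]
             for j in range(m)] for i in range(n)]
```

In the setting of this study, `origins` and `destinations` would be the 1990 resident-worker and employment totals, while `beta` would be the value calibrated on 1980 flows.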
In parallel, two sets of NN spatial interaction models are trained and validated on the same
1980 travel data, one on county-level flows, the other on census tract flows. With the network
weights for which the validation error is minimum and 1990 population and employment data, the
NNs predict 1990 interzonal commuter flows at the county and census tract levels respectively
(test sets). Goodness-of-fit of these forecasts to actual flows is once again measured. Finally, the
relative performance of MLE and NN spatial interaction models in predictive mode is assessed by
comparing goodness-of-fit measures.
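The two reported measures can be computed as in the sketch below. The exact formulas are not restated in this section, so the definitions used here are assumptions: AE is taken as the sum of absolute deviations expressed as a percentage of total observed flows, and SRMSE as the root mean square error divided by the mean observed flow, in the spirit of Fotheringham and Knudsen (1987).

```python
import math


def absolute_error_pct(observed, predicted):
    """Sum of absolute deviations as a percentage of total observed flow."""
    total = sum(observed)
    return 100.0 * sum(abs(o - p) for o, p in zip(observed, predicted)) / total


def srmse(observed, predicted):
    """Root mean square error divided by the mean observed flow."""
    n = len(observed)
    mse = sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n
    return math.sqrt(mse) / (sum(observed) / n)
```

With observed flows [10, 20, 30, 40] and predictions [12, 18, 33, 37], the absolute deviations sum to 10 out of 100 observed trips, giving an AE of 10%, and the SRMSE is sqrt(6.5) / 25. Under this definition AE can exceed 100% when predictions are far off.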
² In fact, preliminary test results in Mozolin (1997) indicate that the goodness-of-fit of an MLE doubly-constrained
model is enhanced by substituting average reported travel time for Euclidean distance, and vice versa for the NN
formulation. Consequently, if the use of the Euclidean distance as a proxy for spatial impedance biased our test, it is
most likely to be in favor of the NN model, which places our conclusions on the conservative side.
Selecting a NN configuration and parameters suitable for a certain problem is often a challenging task. The first stage of backpropagation feedforward NN model design entails setting the
topology of the model. A natural topology for a doubly-constrained spatial interaction problem
involves three inputs (the number of resident workers in the zone of origin, employment in the
destination zone, and the spatial impedance between the zones) and one output (the number of
commuters). It is common practice to proceed by trial and error to select the number of hidden
nodes, and to test networks with hidden layers of varying size. Networks with 5, 20, and 50 hidden
nodes are tested in this study. Networks of larger sizes are impractical due to the excessive
computational requirement of their training. Each network configuration is processed five times,
each run starting with a random set of initial weights and a training set drawn randomly from the
full data set. We report results of experiments with different partitions of the full data set into
training and validation sets.
The NN model is further specified as follows. Since, in most instances, weights are changed
according to the Quickprop rule, no momentum term is needed, and the learning rate must be
specified only for use with the gradient descent method.³ A 0.1 learning rate is used throughout
the analysis. Experiments with different rates lead to remarkably similar weight estimates and
learning speeds. Initial weights are randomly drawn from a uniform distribution within the range
[−0.01, +0.01].
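A minimal sketch of the network topology and weight initialization described above is given here. The presence of bias terms, the function names and the use of Python's `random` module are assumptions; the three inputs, single hidden layer, sigmoid units and the [−0.01, +0.01] initialization range follow the text.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def init_network(n_hidden, n_inputs=3, rng=None, bound=0.01):
    """Draw all weights (and assumed bias terms, stored last) uniformly
    from [-bound, +bound]."""
    rng = rng or random.Random()
    hidden = [[rng.uniform(-bound, bound) for _ in range(n_inputs + 1)]
              for _ in range(n_hidden)]
    output = [rng.uniform(-bound, bound) for _ in range(n_hidden + 1)]
    return hidden, output


def forward(network, inputs):
    """Map scaled (production, attraction, impedance) to a scaled flow."""
    hidden, output = network
    # zip stops at the last input, so the trailing bias is added separately.
    h = [sigmoid(w[-1] + sum(wi * x for wi, x in zip(w, inputs)))
         for w in hidden]
    return sigmoid(output[-1] + sum(wo * hi for wo, hi in zip(output, h)))
```

With weights this small, an untrained network returns values very close to 0.5 for any scaled input, which is the neutral starting point from which training proceeds.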
All three network inputs are scaled by dividing the value observed for each example by the
input's maximum value in the set. Whereas input scaling is optional, scaling of the output is
required for successful learning. Scaling to fit the output within the [0.1, 0.9] range is usually used.
However, because the networks are tested on data other than those used for training and validation, and because total flows have increased between base and prediction years, a narrower interval
is used. The output (the number of commuters) is scaled using a linear transformation so that the 1980 flows fit the [0.25, 0.75] range. This transformation is given by

$$T_{ij}^{\mathrm{network}} = 0.25 + 0.5\,\frac{T_{ij}}{T_{\max 1980}}, \qquad (7)$$

where $T_{ij}^{\mathrm{network}}$ is the output as seen by the network, and $T_{\max 1980}$ is the maximum commuter flow in
1980. At the testing stage, Eq. (7) is used in reverse.
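The scaling of Eq. (7) and its inverse can be written as two small functions. The function and parameter names are illustrative assumptions.

```python
def scale_flow(flow, t_max_1980):
    """Eq. (7): map a flow to the network's output range."""
    return 0.25 + 0.5 * flow / t_max_1980


def unscale_flow(network_output, t_max_1980):
    """Eq. (7) in reverse: map a network output back to a number of commuters."""
    return (network_output - 0.25) * t_max_1980 / 0.5


def scale_inputs(values):
    """Divide each value of one input variable by that variable's maximum."""
    top = max(values)
    return [v / top for v in values]
```

A zero flow maps to 0.25 and the largest 1980 flow maps to 0.75, so flows somewhat larger than the 1980 maximum still fall inside the open interval (0, 1) that a sigmoid output can produce.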
All networks are trained for a maximum of 100 000 iterations. Many neural network practitioners allow for an early stopping of the feedforward backpropagation algorithm (Sarle, 1996) in
order to prevent overfitting. It is critical to realize that the error on the validation set is not a good
estimate of network generalization. A stopped network is tested on an independent test set that
has never been used for training to give an unbiased estimate of the network performance. We
train and validate all networks on the 1980 data, while the testing is accomplished on the 1990
data.
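The selection of weights by validation error can be sketched generically. The callback-based interface and the interval of 500 epochs between validation checks are assumptions made for illustration; only the 100 000-epoch cap and the principle of keeping the weights with the lowest validation error come from the text.

```python
import copy


def train_with_validation(train_one_epoch, validation_error, get_weights,
                          max_epochs=100_000, check_every=500):
    """Train up to max_epochs and remember the weights with the lowest
    validation error seen at the periodic checks."""
    best = {"error": float("inf"), "epoch": 0,
            "weights": copy.deepcopy(get_weights())}
    for epoch in range(1, max_epochs + 1):
        train_one_epoch()
        if epoch % check_every == 0:
            err = validation_error()
            if err < best["error"]:
                best = {"error": err, "epoch": epoch,
                        "weights": copy.deepcopy(get_weights())}
    return best
```

The returned weights are then applied once to the independent test set (here, the 1990 flows), which plays no part in choosing them.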
At the county level, a total of 225 data vectors are available. For each network processed, the
training set is formed by randomly selecting 112 vectors without replacement, while the remaining
113 vectors are used for validation. In one experiment, the full set of vectors is used both for
training and validation. The network weights that minimize validation error serve to test the
model on the 400 interactions from the 1990 trip matrix.

³ Fahlman (1989) also suggests adding a small constant (known as the sigmoid prime offset) to the error derivative to
avoid very slow learning when z is close to 0 or 1. The recommended value of 0.1 is used. The Quickprop maximum
growth factor is set to 1.75, as suggested by Fahlman (1989).
At the census-tract level, the training set is selected by simple random sampling without replacement of 200 examples from the 121 104 origin-destination pairs in the 1980 tract-to-tract trip
matrix.⁴ The validation set is selected in the same way. The optimal set of network weights is then tested on all
257 049 vectors from the 1990 tract-level trip matrix.
5. Results of performance comparison
5.1. Baseline spatial interaction models
The results of calibrating and testing maximum likelihood doubly-constrained models of
journey to work at the county and census-tract levels are presented in Table 1. The overall performance of these models with an exponential function of spatial deterrence is sufficient to provide
a benchmark against which to evaluate the performance of NN models.⁵
Of the two distance decay functions, better performance is exhibited by the negative exponential function in terms of all four goodness-of-fit measures, and at both aggregation levels
(county and census tract). This is consistent with the widespread consensus that the exponential
function is more appropriate for analyzing short distance interactions, such as those that take
place within an urban area, while the power function is more appropriate for analyzing longer
distance interactions such as interstate migration flows (Fotheringham and O'Kelly, 1989).
Table 1 clearly shows that county-level models are more accurate than models applied at the
census-tract level. Lower model performance or fit at a more detailed geographic scale is not
unusual. This phenomenon is known in the spatial-analytic literature as the modifiable areal unit
problem (Openshaw, 1984). Treatment of the issue in the context of spatial interaction modeling
can be found in Amrhein and Flowerdew (1989), Batty and Sikdar (1982), and others. In substance, if a simple functional relationship with a single parameter, like the doubly-constrained
spatial interaction model, presents some difficulties in accounting for all subtleties of 400 interactions in the 20-county trip matrix, these difficulties are exacerbated with the 507-tract trip matrix.
Consequently, if NNs truly outperform MLE spatial interaction models, one would anticipate the
advantage to be much larger with census tracts than with county data. Along the same line, in case
NNs were not to perform as well as MLE models with the latter data, the reverse statement could
reasonably be expected for commuter flows between census tracts.
5.2. Neural network models
The results of the NN training and testing on county-level data are presented in Table 2. All
four sets of networks exhibit good to very good ability to predict 1990 commuter flows in Atlanta.
⁴ Tests with training sets of 1000 cases do not show better predictive accuracy than with 200 cases, while
computational time becomes prohibitively high.
⁵ Similar, though slightly less good, results are obtained with a power function of distance.
Table 1
Maximum likelihood doubly-constrained models calibrated using 1980 flows, and tested on 1990 commuter flows

                                           Distance decay parameter (β)   Absolute error (AE) (%)   SRMSE
County level
  Exponential function of distance decay   −8.43 × 10^−5                  24.7                      0.866
Census-tract level
  Exponential function of distance decay   −12.17 × 10^−5                 68.7                      2.492
Table 2
Neural network models trained using 1980 county-to-county commuter flows, and tested on 1990 commuter flows

Instance    Absolute error (%)    SRMSE    Epoch network was stopped

(a) Five-node networks
1           42.6                  1.877    2500
2           39.5                  1.182    1000
3           41.0                  1.466    3500
4           44.0                  1.762    13 500
5           40.2                  1.642    8000
Average     41.5                  1.586

(b) Twenty-node networks
1           47.0                  1.890    500
2           41.8                  1.806    4500
3           50.6                  1.788    21 000
4           32.7                  1.077    9500
5           52.6                  1.655    500
Average     44.9                  1.643

(c) Fifty-node networks
1           51.2                  1.936    31 500
2           50.5                  1.773    500
3           28.8                  0.847    1500
4           56.6                  2.396    40 500
5           42.1                  1.683    9000
Average     43.8                  1.722

(d) Twenty-node networks trained on full 1980 data
1           30.9                  0.920    100 000
2           30.1                  0.878    100 000
3           36.2                  0.962    100 000
4           41.9                  1.151    100 000
5           39.7                  1.080    100 000
Average     35.8                  0.998
Comparison of the first three sets of results reported in Table 2 indicates that, except for a slight
tendency for performance to drop as more hidden nodes are used, there is no significant impact of
the number of hidden nodes in a network on goodness-of-fit. It is also noteworthy that all better
performing networks were stopped after fewer than 10 000 epochs.
The bottom part of Table 2 displays the results of testing five 20-node NNs trained on the
entire 1980 data set (225 training cases). As expected, they perform rather consistently and
somewhat better than networks trained on half the available interaction pairs. Training a neural
network on the full set of cases is usually not a recommended practice because it promotes
overfitting, a point evidenced here by the failure of all five networks to converge before 100 000
training epochs. On the other hand, training data is less sparse than with a sample of cases, and
the network's power to generalize input-output relationships is enhanced.
Neural networks trained and tested on tract-level flows perform significantly worse than those
trained on county-level data (Table 3). The best NN model produces errors that are 83.6% of 1990
commuter flows, while the worst model hits a whopping 119.1% absolute error. The faint inverse
relationship between model performance and number of hidden nodes detected above is now
clearly marked. Average goodness-of-fit measures drop with the increase of hidden nodes from 5
to 50.
5.3. Neural Network versus MLE Models
None of the NN models tested on county commuter flows outperforms the corresponding MLE
doubly-constrained model. The only partial exception to this rule is the case of one 50-node NN
model (#3 in Table 2(c)) which shows a better performance as measured by SRMSE and
R-square, but not according to the other two statistics. Even more remarkable, the five NNs
trained on the entire 1980 data set (Table 2(d)) still fail to surpass the MLE model in any run,
though they come closer to challenging its superiority.

Table 3
Neural network models trained using 1980 tract-to-tract commuter flows, and tested on 1990 commuter flows with
training and validation sets selected randomly

Instance    Absolute error (%)    SRMSE    Epoch network was stopped

(a) Five-node networks
1           90.4                  3.265    12 000
2           83.6                  3.226    6500
3           89.0                  3.280    5500
4           108.3                 4.072    500
5           96.6                  3.701    1200
Average     93.6                  3.509

(b) Twenty-node networks
1           92.4                  3.451    3000
2           105.3                 4.000    1000
3           119.1                 5.328    10 500
4           105.9                 4.009    98 500
5           97.5                  3.760    4000
Average     104.0                 4.110

(c) Fifty-node networks
1           109.0                 4.048    500
2           109.8                 4.106    500
3           111.0                 4.218    79 000
4           90.2                  3.418    16 000
5           101.6                 3.845    4000
Average     104.3                 3.927
At the census-tract level, the comparison is even more favorable for the MLE model. All runs
of the NN models (Table 3) trail far behind the MLE model (Table 1). The best of fifteen NN
models misallocates 83.6% of all commuter flows, while the conventional doubly-constrained
model with the negative exponential function of distance misallocates "only" 68.7% of flows.
It is appropriate to stress here again that model performance is evaluated in a predictive mode,
that is by the capacity of a model to predict interaction flows for a horizon other than the base
year used in training and validation. In fact, performance measured on base-year data would lead
to opposite conclusions, thus supporting the existing literature in the matter. For information
purposes, performances of MLE and NN models trained and tested on the 1990 county-to-county
flows are reproduced in Appendices A and B, respectively.
By all accounts, the evidence reviewed above that neural networks show inferior predictive
performance to conventional statistical models is quite puzzling and unexpected. Neural networks are indeed regarded as good approximators (Kreinovich and Sirisaengtaskin, 1993). The
data analysis calls for further research to pinpoint the causes of their underperformance. In order
to trace potential patterns of consistent underprediction or overprediction by NN models, we use
three-dimensional plots of observed and predicted data. Each plot displays flows originating from
a given county against distance and number of workers at destinations. Such plots for a sample of
four counties, namely Clayton, Cobb, DeKalb, and Fulton counties, are depicted in Fig. 3 for
1990. Corresponding flow surfaces generated by the five tested instances of 20-node neural networks (see Table 2(b)) are given in Fig. 4.
On examination, the predicted surfaces in Fig. 4 reveal unsuspected structures dominated by a
wavy pattern of troughs and ridges. These structures are particularly pronounced in instance three
(Fig. 4(c)), which also happens to be among the instances that predict 1990 flows with the least
overall accuracy. This pattern is often symptomatic of overfitting due to excessive training of the
network. That this network was trained longer than any other 20-node network suggests that it
learned the noise in the training set in addition to the underlying function we want it to find. As a
result, its ability to generalize is rather poor and its prediction accuracy is low, particularly where
training data are sparse (interpolation problem). Another feature common to several underperforming network instances in Fig. 4 is the consistent underestimation of the largest 1990 expected
flows (Figs. 3 and 4).⁶ Networks fail to extrapolate around and beyond the limits of the training
sample. Possible explanations for interpolation and extrapolation errors are now pursued.
The spurious ridges and troughs that show throughout predicted surfaces suggest that overfitting may be occurring, in spite of the early stopping mechanism put in place to prevent it. In our
commuter trip distribution problem, we can postulate that the occurrence of overfitting is tied to
an excessive number of hidden nodes in the network. Our problem may in fact be simple enough
to require fewer than five hidden nodes. After all, conventional spatial interaction models perform
well with a single parameter. Such a neural network could be devoid of spurious ridges and
troughs, and generalize just right.

⁶ Most remarkable in this matter are instances 2 and 3, for which the training set does not include the Fulton-Fulton
flow.

Fig. 3. Actual number of commuters in 1990 as a function of distance between county of residence and county of work,
and the number of jobs in the county of work. All variables are measured as proportions of maximum 1980 values. The
"Number of workers" axis is scaled logarithmically. Four counties of residence: a, Clayton County; b, Cobb County;
c, DeKalb County; d, Fulton County.
To test the proposition that the predictive performance of NN models is improved by reducing
the number of hidden nodes, various neural networks with one to three hidden nodes are trained on
1980 county-level commuter flows and tested on 1990 flows. Results are summarized in
Table 4. Networks with fewer hidden nodes suffer less from spurious troughs and ridges on their
prediction plots, and therefore, are less prone to overfitting. In fact, surfaces generated by one-hidden-node networks do not exhibit any spurious feature (Fig. 5). Such networks no longer
model the noise in the training data because they are unable to produce complex surfaces. This
does not translate, however, into unequivocally better goodness-of-fit with the 1990 data for
sparser networks. Furthermore, none of the networks tested with a reduced number of hidden
nodes (Table 4) succeeds in outperforming the MLE doubly-constrained model with exponential
function of distance (Table 1). A straightforward consequence is that lower performance of neural
networks cannot be imputed to overfitting and cannot be remediated easily by modifying the
topology of the networks.

Fig. 4. 1990 commuter flows from Fulton County predicted by the 20-node neural networks listed in Table 2; a, first
instance; b, second instance; c, third instance; d, fourth instance; e, fifth instance. All variables are measured as proportions of their maximum 1980 values. The "Number of workers" axis is scaled logarithmically.

Table 4
Neural network models with few hidden nodes. Trained using 1980 county-to-county commuter flows, and tested using
1990 commuter flows

Instance    Absolute error (%)    SRMSE    Epoch network was stopped

(a) One-node networks
1           38.1                  1.592    2500
2           38.1                  1.245    1500
3           35.6                  1.666    27 000
4           41.8                  1.434    500
5           42.9                  1.480    1000
Average     39.3                  1.483

(b) Two-node networks
1           32.5                  1.276    3000
2           27.7                  0.962    37 000
3           64.8                  2.387    51 000
4           46.0                  2.002    1000
5           60.6                  2.262    20 000
Average     46.3                  1.778

(c) Three-node networks
1           31.1                  1.248    500
2           51.9                  2.143    500
3           41.9                  1.904    2000
4           43.6                  2.021    1000
5           38.0                  1.272    1000
Average     41.3                  1.718
The fact remains that neural networks have a limited ability to interpolate spatial interaction
data in a predictive mode. Paradoxically, the cause of this weakness may also be the essence of their
strength in validation on contemporary data, namely the inherent flexibility to approximate
complex data structures with great accuracy. In short, the poor fit of neural networks on prediction-year data (1990) can be blamed on their unrivaled fit to base-year data (1980). According
to this view, neural networks are such good approximators that they model not only interaction
data structures, but also the context of the transportation systems within which commuter patterns take place. By design, spatial interaction neural networks are context-dependent models
whose parameters do not transfer well to other contexts. The extent of NN context sensitivity
remains a subject for future study. A solution to this problem may come from the explicit incorporation of context dependencies in the network representation. Evidence in Table 4 suggests
that model transferability is problematic even for sparse model topologies.
Fig. 5. 1990 commuter flows from Fulton County predicted by a single-node neural network. All variables are
measured as proportions of maximum 1980 values.

It is our contention that the sigmoid form of network output limits the ability of neural networks to extrapolate interaction data in a meaningful way. Sigmoid output nodes tend indeed to
generate S-shaped predicted surfaces that are ill-suited to model spatial interaction behavior. For
illustration purposes, let us compare how flows predicted by the NN and conventional gravity
models respond to distance as the other two input variables are held constant. Most NN flow
surfaces (Figs. 4 and 5) have in common an S-shaped profile of dependence between flow volume
and distance. This profile implies that, all other things being equal, the marginal flow increase with
respect to distance is small and declining, sometimes even negative. On the contrary, observed
patterns (Fig. 3) show no tapering in the relationship between flow volume and distance. Consequently, flow extrapolation on the S-shaped profile is highly inaccurate. Because enough of the
1990 flow data fall outside of the range of the 1980 training data, the overall performance of the
network is generally poor. A significant implication is that conventional feedforward backpropagation NNs may not exhibit the right properties for use in the application domain of trip distribution. Other NN models that do not assume a sigmoidal activation function, such as the
Gaussian Radial Basis Function model (Verleysen and Hlavackova, 1994), may prove better
suited for spatial interaction problems.
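A back-of-the-envelope bound makes the extrapolation limit explicit. It assumes that the output node is a logistic sigmoid taking values strictly between 0 and 1 and that Eq. (7) is inverted at the testing stage; it describes raw network outputs before any iterative proportional fitting and is not a result reported in the study.

```latex
% Largest flow a sigmoid output can represent once Eq. (7) is inverted
\hat{T}_{ij} = \frac{T_{ij}^{\mathrm{network}} - 0.25}{0.5}\, T_{\max 1980}
  \;<\; \frac{1 - 0.25}{0.5}\, T_{\max 1980} = 1.5\, T_{\max 1980},
  \qquad 0 < T_{ij}^{\mathrm{network}} < 1 .

% Sensitivity of the logistic function vanishes near saturation
\sigma(z) = \frac{1}{1 + e^{-z}}, \qquad
\sigma'(z) = \sigma(z)\,\bigl(1 - \sigma(z)\bigr) \to 0
  \quad \text{as } \sigma(z) \to 1 .
```

Under these assumptions no raw prediction can exceed 1.5 times the largest 1980 flow, and predictions approaching that ceiling respond ever more weakly to changes in the inputs, which is one way to read the tapering profiles discussed above.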
In contrast to neural networks, the smooth surface generated by the MLE model with a negative exponential function of distance decay provides a better fit to the empirical data. A good fit
is achieved not only for the data on which the model was calibrated, but also for the unseen data
beyond the training range. This indicates that the maximum likelihood model is a better
extrapolator than the neural network, and a better tool for urban and regional planning. A
fundamental reason for better performance of the maximum-likelihood model is that, being a one-parameter model, it generalizes more than neural networks and, consequently, is more context
independent. Also contributing towards better performance is its derivation from first principles, whereas the NN approach is purely data-driven. Wilson (1970) showed in his seminal work
how the exponential distance decay function is derived from the entropy principle, by finding the
most likely trip matrix given the origin and destination totals and the total distance traveled in the
system. The principle of maximum likelihood applies to all trip matrices, regardless of their use
for model calibration or model testing; hence the better extrapolation capability of the maximum-likelihood model.
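For readers who want the intermediate step, a compact sketch of the entropy-maximizing argument follows. The notation (T_ij for flows, O_i and D_j for origin and destination totals, d_ij for distance, C for total distance traveled, and the multipliers λ_i, μ_j and β) is assumed here for illustration, and the sign convention is chosen so that β is negative, as in Table 1. The objective is the Stirling approximation of the log-number of ways of arranging trips, up to a constant.

```latex
\max_{T_{ij}} \; -\sum_{i,j} \bigl( T_{ij} \ln T_{ij} - T_{ij} \bigr)
\quad \text{subject to} \quad
\sum_j T_{ij} = O_i, \qquad \sum_i T_{ij} = D_j, \qquad
\sum_{i,j} T_{ij}\, d_{ij} = C .

% Lagrangian and first-order condition
\mathcal{L} = -\sum_{i,j} \bigl( T_{ij} \ln T_{ij} - T_{ij} \bigr)
  + \sum_i \lambda_i \Bigl( O_i - \sum_j T_{ij} \Bigr)
  + \sum_j \mu_j \Bigl( D_j - \sum_i T_{ij} \Bigr)
  + \beta \Bigl( \sum_{i,j} T_{ij}\, d_{ij} - C \Bigr),

\frac{\partial \mathcal{L}}{\partial T_{ij}}
  = -\ln T_{ij} - \lambda_i - \mu_j + \beta\, d_{ij} = 0
\;\Longrightarrow\;
T_{ij} = e^{-\lambda_i}\, e^{-\mu_j}\, e^{\beta d_{ij}}
       = A_i O_i\, B_j D_j\, \exp(\beta\, d_{ij}),

% Balancing factors follow from the two marginal constraints
A_i = \Bigl[ \sum_j B_j D_j \exp(\beta\, d_{ij}) \Bigr]^{-1}, \qquad
B_j = \Bigl[ \sum_i A_i O_i \exp(\beta\, d_{ij}) \Bigr]^{-1} .
```

The exponential deterrence function thus emerges from the constraints themselves rather than from the calibration data, which is the sense in which the model is derived from first principles.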
5.4. Geographic scale problem
The dramatically lower performance of neural networks on tract-level data suggests that additional factors are at work at this scale. The vast majority of commuter flows in tract-level trip
matrices are zero (82.9% of all flows in the 1980 matrix, and 82.5% in the 1990 matrix), while most
non-zero flows are fairly small. With only a small fraction of flows significantly larger than the
rest, small random samples of training examples have little chance to include large flows. As a
result, networks trained on random samples of examples primarily learn how to predict zero and
very small flows. Since we established earlier that neural networks are rather poor extrapolators of
spatial interaction flows, their predictions of larger flows are highly inaccurate. Hence the low
overall performance of neural networks on small analysis zones.
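A little arithmetic shows how thin the coverage of large flows is in a uniform random sample. The sample size (200), the population size (121 104) and the share of zero flows (82.9%) are taken from the text; the count of 100 "large" flows is a purely hypothetical figure chosen for illustration.

```python
def prob_sample_misses_all(population, sample_size, n_special):
    """Probability that a simple random sample drawn without replacement
    contains none of n_special designated items (hypergeometric)."""
    p = 1.0
    for i in range(n_special):
        p *= (population - sample_size - i) / (population - i)
    return p


# Expected number of non-zero flows among 200 sampled 1980 tract pairs.
expected_nonzero = 200 * (1 - 0.829)

# Chance that the sample contains none of 100 hypothetical large flows.
p_no_large = prob_sample_misses_all(121_104, 200, 100)
```

Only about 34 of the 200 training examples are expected to carry a non-zero flow, and under the hypothetical count of 100 large flows the sample would miss every one of them roughly 85% of the time.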
Resorting to larger samples (say, more than 1000 cases), or even to the entire population of
samples, is not a practical solution because it leads to unacceptably long training. An appealing
alternative consists in using stratified random sampling instead of uniform random sampling in
order to represent flows of all sizes in the training set. The effectiveness of this strategy is now
assessed with two distinct stratified sampling schemes.
In strategy I, 20 examples of zero flows and 180 examples of non-zero flows are selected randomly without replacement from 121 104 interactions in 1980. In strategy II, we select 10 examples of zero flows, and 10 randomly-selected examples from each bin of origin-destination pairs
defined by 10-unit increments on the flows. In both strategies, validation sets are selected similarly. The testing results for a 5-node network are presented in Table 5.

Table 5
Neural network models trained using 1980 tract-to-tract commuter flows, and tested using 1990 commuter flows with
training and validation sets selected using stratified random sampling

Instance    Absolute error (%)    SRMSE    Epoch network was stopped

Sampling strategy I; Five-node networks
1           90.5                  3.407    8000
2           91.0                  3.673    1000
3           94.7                  3.766    500
4           90.8                  3.416    1500
5           94.6                  3.502    2000
Average     92.3                  3.553

Sampling strategy II; Five-node networks
1           93.4                  3.611    1000
2           91.0                  3.565    500
3           90.3                  3.540    500
4           90.9                  3.564    500
5           93.4                  3.613    1000
Average     91.8                  3.578
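The two stratified sampling strategies described above can be sketched as follows. The representation of the trip matrix as a dictionary keyed by origin-destination pair, the function names, the exact bin edges (1 to 10, 11 to 20, and so on) and the handling of bins with fewer than 10 pairs are all assumptions; the text does not specify them.

```python
import random


def sample_strategy_one(flows, rng, n_zero=20, n_nonzero=180):
    """Strategy I: fixed numbers of zero and non-zero flows."""
    zero = [pair for pair, v in flows.items() if v == 0]
    nonzero = [pair for pair, v in flows.items() if v > 0]
    return rng.sample(zero, n_zero) + rng.sample(nonzero, n_nonzero)


def sample_strategy_two(flows, rng, per_bin=10, bin_width=10):
    """Strategy II: the same number of examples from the zero flows and
    from each bin of non-zero flows (integer flows assumed)."""
    bins = {}
    for pair, v in flows.items():
        # Bin 0 holds zero flows; bin k holds flows in (10(k-1), 10k].
        key = 0 if v == 0 else 1 + (v - 1) // bin_width
        bins.setdefault(key, []).append(pair)
    sample = []
    for key in sorted(bins):
        members = bins[key]
        # Assumption: take the whole bin when it has fewer than per_bin pairs.
        sample += rng.sample(members, min(per_bin, len(members)))
    return sample
```

Either function would be called twice with independent draws to obtain a training set and a validation set.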
Comparison of these goodness-of-fit results to those of the five-node network trained on a
simple random sample (Table 3) reveals no significant improvement. The five-node networks with
training and validation sets selected using stratified random sampling have an average absolute
error of 92.3% for sampling strategy I, and 91.8% for sampling strategy II, against 93.6% with
uniform random sampling. This piece of evidence suggests that using stratified random sampling
instead of uniform random sampling to select the training set does not improve the accuracy of
NN spatial interaction models. More complex stratification strategies may produce better results,
but we leave this investigation for the future.
6. Conclusions
This study compared the performance of multilayer perceptron neural networks and maximum-likelihood doubly-constrained models for commuter trip distribution. Our experiments produced
overwhelming evidence that NN models may fit data better but their predictive accuracy is poor in
comparison to that of maximum-likelihood doubly-constrained models. What our thorough study
failed to identify are perceptron model configurations that consistently exhibit a predictive performance that surpasses that of maximum-likelihood doubly-constrained models. It points to
several likely causes of neural network underperformance, including model non-transferability,
insufficient ability to generalize, reliance on sigmoid activation functions, and their essence as
data-driven techniques. An agenda for future research is also proposed to explore the potential for
other perceptron formulations (i.e., spatial structure as NN input) and other neural networks
(RBF, for instance) to predict spatial interaction flows with greater accuracy.
This conclusion is at variance with the existing literature, which has been overly optimistic
about the advantages of modeling trip distribution by spatial interaction with backpropagation
neural networks. While neural networks may perform better than conventional models in modeling spatial interaction for the base year, they fail to outperform the MLE doubly-constrained
model for forecasting purposes, which is the motivation behind these modeling efforts in the first
place. Therefore, current perceptron neural networks do not provide an appropriate modeling
approach to forecasting trip distribution over a planning horizon for which distribution predictors
(number of workers, number of residents, commuting distance) are well beyond their base-year
domain of definition.
Acknowledgements
The authors are grateful to Dr. Frank Koppelman. His insightful comments on an earlier
version of the manuscript were instrumental in enhancing its quality.
Appendix A
Maximum likelihood doubly-constrained models calibrated and tested using 1990 commuter
flows among the counties of the Atlanta MSA.

                                         Distance decay parameter (β)   Absolute error (AE) (%)   SRMSE
Exponential function of distance decay   −7.64 × 10^−5                  24.0                      0.728
Appendix B
Neural network models trained and tested using 1990 county-to-county commuter flows in
Atlanta

Instance    Absolute error (%)    SRMSE    Epoch network was stopped

Five-node networks
1           27.1                  0.723    100 000
2           18.2                  0.470    100 000
3           23.3                  0.634    100 000
4           18.3                  0.463    100 000
5           21.1                  0.520    100 000
Average     21.6                  0.562

Twenty-node networks
1           24.3                  0.585    100 000
2           15.2                  0.379    100 000
3           21.4                  0.554    100 000
4           27.3                  0.637    100 000
5           24.1                  0.636    100 000
Average     22.5                  0.558

Fifty-node networks
1           8.6                   0.169    100 000
2           8.4                   0.168    100 000
3           8.6                   0.166    100 000
4           8.7                   0.168    100 000
5           10.7                  0.212    100 000
Average     9.0                   0.176
References
Amrhein, C.G., Flowerdew, R. (1989). The eect of data aggregation on a Poisson regression model of Canadian
migration. The Accuracy of Spatial Databases Goodchild, M., Gopal, S. pp. 229±238. Taylor and Francis, London.
Bacharach, M., 1970. Biproportional Matrices and Input-Output Change. Cambridge University Press, Cambridge.
Batty, M., 1976. Urban Modeling: Algorithms, Calibrations, Predictions. Cambridge University Press, Cambridge.
72
M. Mozolin et al. / Transportation Research Part B 34 (2000) 53±73
Batty, M., Mackie, S., 1972. The calibration of gravity, entropy, and related models of spatial interaction. Environment
and Planning A 4, 205–233.
Batty, M., Sikdar, P.K., 1982. Spatial aggregation in gravity models: 1. An information-theoretic framework.
Environment and Planning A 14, 377–405.
Bishop, C.M., 1995. Neural Networks for Pattern Recognition. Oxford University Press, Oxford.
Black, W.R., 1995. Spatial interaction modeling using artificial neural networks. J. Transport Geography 3 (3), 159–166.
Bureau of the Census, 1983. 1980 Census of Population and Housing, Census Tracts, Atlanta, GA. PHC80-2-77. US
Department of Commerce, Bureau of the Census, Washington.
Bureau of Transportation Statistics, 1993. 1990 Census Transportation Planning Package. US Department of
Transportation, Bureau of Transportation Statistics, Washington. CD-ROM.
Dougherty, M., 1995. A review of neural networks applied to transport. Transportation Research C 3, 247–260.
Evans, A.W., 1971. The calibration of trip distribution models with exponential or similar cost functions.
Transportation Research 5, 15–38.
Fahlman, S.E., 1989. Faster-learning variations on back-propagation: An empirical study. In: Touretzky, D., Hinton, G.,
Sejnowski, T. (Eds.), Proceedings of the 1988 Connectionist Models Summer School. Morgan Kaufmann, San Mateo,
pp. 38–51.
Fischer, M.M., Gopal, S., 1994. Artificial neural networks: A new approach to modeling interregional telecommunication flows. J. Regional Science 34, 503–527.
Fotheringham, A.S., Knudsen, D.C., 1987. Goodness-of-fit Statistics. CATMOG series. Geo Abstracts, Norwich.
Fotheringham, A.S., O'Kelly, M.E., 1989. Spatial Interaction Models: Formulations and Applications. Kluwer,
London.
Goodman, P.H., 1996. NevProp software, ver. 3. Reno, NV: University of Nevada, URL: http://www.scs.unr.edu/nevprop/.
Gopal, S., Fischer, M.M., 1996. Learning in single hidden-layer feedforward network: Backpropagation in a spatial
interaction modeling context. Geographical Analysis 28, 38–55.
Haykin, S.S., 1998. Neural Networks: A Comprehensive Foundation. Prentice Hall, Upper Saddle River.
Himanen, V., Nijkamp, P., Reggiani, A., 1998. Neural Networks in Transport Applications. Ashgate, Brookfield.
Hua, J., Faghri, A., 1994. Applications of artificial neural networks to intelligent vehicle-highway systems.
Transportation Research Record 1453, 83–90.
Kikushi, S., Nanda, R., Perincherry, V., 1993. A method to estimate trip-O-D patterns using a neural network
approach. Transportation Planning and Technology 17, 51–65.
Kreinovich, V., Sirisaengtaskin, O., 1993. Universal approximators for functions and for control strategies. Neural,
Parallel, and Scientific Computations 1, 325–346.
Mozolin, M.V., 1997. Spatial interaction modeling with an artificial neural network. Discussion Paper Series 97-1,
Department of Geography, University of Georgia, Athens, GA.
O