
Transportation Research Part E 36 (2000) 155–172
www.elsevier.com/locate/tre

A comparison of the predictive potential of artificial neural
networks and nested logit models for commuter mode choice
David A. Hensher *, Tu T. Ton
Institute of Transport Studies, University of Sydney, Sydney NSW 2006, Australia
Received 12 January 1998; received in revised form 18 October 1999; accepted 3 November 1999

Abstract
Research in the field of artificial intelligence systems has been exploring the use of artificial neural
networks (ANN) as a framework within which many traffic and transport problems can be studied. One
appeal of ANN is their use of pattern association and error correction to represent a problem. This
contrasts with the random utility maximisation rule in discrete choice modelling. ANN enables a full set of
human perceptions about a particular problem to be represented by artificial networks of neurons. A claim
of ANN is that it can tackle the problem of travel demand forecasting and modelling as well if not better
than the discrete choice approach. The use of such tools in studying individual traveller behaviour thus
opens up an opportunity to consider the extent to which there are representation frameworks which
complement or replace discrete choice methods. This paper explores the merits of neural networks by
comparing the predictive capability of ANN and nested logit models in the context of commuter mode
choice. © 2000 Elsevier Science Ltd. All rights reserved.


* Corresponding author. Tel.: +61-2-9351-0071; fax: +61-2-9351-0088.
E-mail addresses: davidh@its.usyd.edu.au (D.A. Hensher), Tut@its.usyd.edu.au (T.T. Ton).

1366-5545/00/$ - see front matter © 2000 Elsevier Science Ltd. All rights reserved.
PII: S1366-5545(99)00030-7

1. Introduction
Understanding and predicting traveller behaviour remains a complex activity. The set of tools
in common use by practitioners and many of the tools used by researchers exhibit complexity; yet
often this richness of detail is in methods of estimation rather than in representation of how
individuals actually evaluate alternatives and make decisions on a set of interrelated travel
choices.
Discrete choice methods, championed by the multinomial logit model and its variants such as
nested logit, heteroskedastic extreme value, mixed logit and multinomial probit, have added
substantial behavioural richness into statistical specification and estimation (Hensher et al., 1999),
as they seek to accommodate the role of both observed and unobserved influences on travel
choices. The search for behavioural and analytical enhancement continues.
Research in the field of intelligence systems has been exploring the use of artificial neural
networks (ANN) (e.g. Davalo and Naim, 1991; Faghri and Hua, 1991; Yang et al., 1993) as a
framework within which many traffic and transport problems can be studied. Notable applications
are in traffic control and scheduling of rail and air services. The use of such tools in studying
individual traveller behaviour opens up an opportunity to consider the extent to which there are
representation methods that complement or replace existing analytical approaches.
This paper explores the merits of neural networks as part of a revised framework within
which to explore the processes of traveller decision making, and how discrete choice methods
might be integrated within such a framework. The latter acknowledges the important role that
these tools have played in the last 25 years in the development of better practice in travel demand
modelling.
The paper is structured around six sections. Section 2 is an overview of the empirical setting of
the travel choice experiment followed by a description of common variables and data sets selected
for contrasting the two modelling approaches: choice and ANN models. Section 3 describes the
specific choice-based models (i.e. nested logit models) in estimating commuter mode choice for
selected studies. In Section 4, the basic ANN concepts are presented followed by a description of
the specific structure of ANN for representing the same variables and data sets as the choice
model. Section 5 presents the predictive performance comparison between the choice models and
neural network models. The paper concludes with comments on the merits of neural networks and
choice models in the prediction task of travel demand models.

2. Empirical setting
2.1. Background of commuter choice studies
The case studies used in this research were extracted from a stated choice experiment. This
experiment was part of a broader research effort examining the potential impacts of transport
policy instruments on reductions in greenhouse gas emissions in six Australian capital cities:
Sydney, Melbourne, Brisbane, Adelaide, Perth and Canberra (Hensher et al., 1995; Louviere et al.,
1994). The universal choice set comprised the currently available modes plus the two 'new' modes
of light rail and busway. Respondents evaluated scenarios describing ways to commute between
their current residence and workplace locations using different combinations of policy-sensitive
attributes and levels. The purpose of the exercise was to observe and model their observed coping
strategies in each scenario.
Four alternatives appeared in each travel choice scenario: (a) car (no toll), (b) car (toll), (c) bus
or busway, and (d) train or light rail. Twelve types of showcards described scenarios involving
combinations of trip length (3) and public transport pairs (4): bus vs. light rail, bus vs. train
(heavy rail), busway vs. light rail, and busway vs. train. The appearance of public transport pairs
in each card shown to respondents was based on an experimental design. Attribute levels are
summarised in Table 1 and an illustrative showcard is displayed in Table 2.

Table 1
The set of attributes and attribute levels in the travel choice experiment (all cost items are in Australian dollars, all time items are in minutes)

Car alternatives
Attribute                              Alternative    Short (<30 min)            Medium (30–45 min)   Long (>45 min)
Travel time to work                    Car no toll    15, 20, 25                 30, 37, 45           45, 55, 70
                                       Car toll rd    10, 12, 15                 20, 25, 30           30, 37, 45
Time variability                       Car no toll    0, ±4, ±6                                       0, ±11, ±17
                                       Car toll rd    0, ±1, ±2                                       0, ±7, ±11
Toll (one-way)                         Car no toll    None                                            None
                                       Car toll rd    1, 1.5, 2                                       3, 4.5, 6
Pay toll if you leave at this time     Car no toll    None                                            None
(otherwise free)                       Car toll rd    6–10, 6:30–8:30, 6:30–9                         6–10, 6:30–8:30, 6:30–9
Fuel cost (per day)                    Car no toll    3, 4, 5                                         9, 12, 15
                                       Car toll rd    1, 2, 3                                         3, 6, 9
Parking cost (per day)                 Car no toll    Free, $10, $20                                  Free, $10, $20
                                       Car toll rd    Free, $10, $20                                  Free, $10, $20

Public transport alternatives (the same levels apply to bus, train, busway and light rail)
Attribute                              Access mode    Short (<30 min)            Medium (30–45 min)   Long (>45 min)
Total time in the vehicle (one-way)                   10, 15, 20                 20, 25, 30           30, 35, 40
Frequency of service                                  Every 5, 15, 25            Every 5, 15, 25      Every 5, 15, 25
Time from home to closest stop         Walk           5, 15, 25                  5, 15, 25            5, 15, 25
                                       Car/bus        4, 6, 8                    4, 6, 8              4, 6, 8
Time to destination from closest stop  Walk           5, 15, 25                  5, 15, 25            5, 15, 25
                                       Bus            4, 6, 8                    4, 6, 8              4, 6, 8
Return fare                                           1, 3, 5                    2, 4, 6              3, 5, 7

Table 2
Example of the format of a travel choice experiment showcard (SA101)

                                                      1. Car, toll road            2. Car, non-toll road
Travel time to work                                   10 min                       15 min
Time variability                                      None                         None
Toll (one way)                                        $1.00                        Free
Pay toll if you leave at this time (otherwise free)   6–10 a.m.                    –
Fuel cost (per day)                                   $1.00                        $3.00
Parking cost (per day)                                Free                         Free

                                                      3. Bus                       4. Train
Total time in the vehicle (one-way)                   10 min                       10 min
Time from home to your closest stop                   Walk 5 min, Car/bus 4 min    Walk 5 min, Car/bus 4 min
Time to your destination from the closest stop        Walk 5 min, Bus 4 min        Walk 5 min, Bus 4 min
Frequency of service                                  Every 5 min                  Every 5 min
Return fare                                           $1.00                        $1.00

Five three-level attributes were used to describe public transport alternatives: (a) total in-vehicle
time, (b) frequency of service, (c) closest stop to home, (d) closest stop to destination, and (e) fare.
The attributes of the car alternatives were: (a) travel times, (b) fuel costs, (c) parking costs, (d)
travel time variability, and for toll roads (e) departure times and (f) toll charges. The design allows
orthogonal estimation of alternative-specific main effect models for each mode option: (a) car no
toll, (b) car toll road, (c) bus, (d) busway, (e) train, and (f) light rail.
The master design for the travel choice task was a $27 \times 3^{27}$ orthogonal fractional factorial,
which produced 81 scenarios or choice sets. The 27-level factor was used to block the design into
27 versions of three choice sets containing two alternatives. Versions were balanced such that each
respondent saw every level of each attribute exactly once. The $3^{27}$ portion of the master design is
an orthogonal main effects design, which permits independent estimation of all effects of interest.
Two 2-level attributes were used to describe bus/busway and train/light rail modes, such that bus/
train options appear in 36 scenarios and busway/light rail in 45.
2.2. Description of common variables and data sets selected for contrasting the choice and ANN
modelling approaches
Sydney, Melbourne and the pooled cities (combined Sydney and Melbourne) were selected for
the comparative studies. Each data source was split into two sub-data sets: training and testing
(see Table 3). The training data were fed into both the choice and ANN models for estimation.
The testing data were used to establish the generalisation, or predictive capability, of both models.
The arrangement of data sets for comparing choice and ANN models is shown in Table 4.
Three choice models and three ANN models were estimated for Sydney, Melbourne and combined
Sydney and Melbourne. Both choice and ANN models were trained/estimated with the
same associated data sets. For example, the Syd_Train data set was used by both choice and ANN
models in modelling travel behaviour for Sydney.
Table 5 provides a list of variables that were used as common variables by both the choice and
ANN models. The six alternatives in the universal choice set are drive alone (DA), ride share (RS),
bus (BS), busway (BW), train (TN) and light rail (LR).

Table 3
List of data sources, their type and sample sizes selected for contrasting the choice and ANN models
Data set   Code name      Data sources                     Type       Number of observations
1          Syd_Train      Sydney                           Training   329
2          Syd_Test       Sydney                           Testing    82
3          Mel_Train      Melbourne                        Training   312
4          Mel_Test       Melbourne                        Testing    78
5          SydMel_Train   Combined Sydney and Melbourne    Training   641
6          SydMel_Test    Combined Sydney and Melbourne    Testing    160

Table 4
Matrix of models and associated data sets used in estimating and testing of both choice and ANN models
Model training/estimation                        Model testing data sets
City            Data set       Model             Self           Sydney     Melbourne   Pooled cities
Sydney          Syd_Train      Choice model      Syd_Train      Syd_Test   Mel_Test    SydMel_Test
Sydney          Syd_Train      ANN model         Syd_Train      Syd_Test   Mel_Test    SydMel_Test
Melbourne       Mel_Train      Choice model      Mel_Train      Syd_Test   Mel_Test    SydMel_Test
Melbourne       Mel_Train      ANN model         Mel_Train      Syd_Test   Mel_Test    SydMel_Test
Pooled cities   SydMel_Train   Choice model      SydMel_Train   Syd_Test   Mel_Test    SydMel_Test
Pooled cities   SydMel_Train   ANN model         SydMel_Train   Syd_Test   Mel_Test    SydMel_Test

Table 5
Common variables and alternatives selected for contrasting the choice and ANN models
Variable                        Alternative
Cost ($)                        All
Linehaul time (min)             All
Parking cost ($)                DA, RS
Access and egress time (min)    BS, TN, LR, BW

3. Choice modelling approach to commuter choice
Nested logit models were estimated for Sydney, Melbourne and the pooled cities (see Fig. 1).
The nested structure for Sydney and for combined Sydney–Melbourne is DA versus the rest (see
Fig. 1a); the nested structure for Melbourne is DA, RS versus the rest (see Fig. 1b). These structures
evolved as the best-fit versions from a series of hierarchies.

Fig. 1. Nested logit model structures for Sydney, Melbourne and the combined Sydney and Melbourne cities. (a)
Structure of the Sydney and combined Sydney and Melbourne models. (b) Structure of Melbourne's model. (DA: Drive
alone; RS: Ride share; BS: Bus; TN: Train; BW: Busway; LR: Light rail.)

The results are summarised in Table 6. All three models provide statistically significant effects
for in-vehicle cost, parking cost, linehaul time and public transport access plus egress time.

Table 6
Summary of nested logit training models
Variable                        Alternative          Syd–Mel             Sydney              Melbourne
Cost ($)                        All                  −0.59985 (−7.08)    −0.52395 (−4.88)    −0.78084 (−4.75)
Linehaul time (min)             All                                                          −0.05858 (−3.0)
Linehaul time (min)             DA, RS               −0.07258 (−5.17)    −0.06759 (−4.06)
Parking cost ($)                DA, RS               −0.10589 (−6.11)    −0.08268 (−3.64)    −0.11552 (−3.65)
Linehaul time (min)             BS, TN, LR, BW       −0.06809 (−4.80)    −0.08708 (−4.29)
Access and egress time (min)    BS, TN, LR, BW       −0.04082 (−5.37)    −0.03625 (−3.52)    −0.03759 (−2.95)
Car drive alone constant        DA                   1.7512 (3.80)       0.6815 (1.34)       2.6909 (3.14)
Ride share constant             RS                   0.8273 (1.85)       0.02613 (0.05)      2.1230 (2.42)
Bus constant                    BS                   −0.11982 (−0.55)    −0.10222 (−0.35)    −0.11818 (−0.33)
Train constant                  TN                   0.24967 (1.16)      0.10207 (0.35)      0.47955 (1.44)
Light rail constant             LR                   0.38893 (2.23)      0.26275 (1.08)      0.55353 (2.12)
Inclusive value                 DA                   0.58122 (5.29)      0.83055 (4.57)
Inclusive value                 RS, BS, TN, LR, BW   0.39789 (2.13)      0.84010 (3.17)
Inclusive value                 DA, RS                                                       0.7958 (4.30)
Inclusive value                 BS, TN, LR, BW                                               0.5291 (2.02)
Sample size                                          641                 329                 312
Log-likelihood at convergence                        −1165.34            −367.81             −336.78
Adjusted pseudo-R2                                   0.389               0.371               0.382

In searching for an appropriate model for each market, we found some variations in the
specification of the taste weights; in particular, the Melbourne model treats linehaul time as generic
across all modes, whereas the other two markets distinguish car and public transport. The nested
structure is also different for Melbourne. We found that car drive alone is partitioned from the
other modes for Sydney and the combined cities, suggesting that the unobserved influences on
choice are more similar between all public transport modes including ride share, whereas for
Melbourne the unobserved effects are similar within the drive alone and ride share alternatives.
The taste weights for the inclusive value variables are all statistically significant and lie within the
0–1 range, the latter a requirement for the model form to be globally consistent with random
utility maximisation. The overall goodness-of-fit of the models is impressive, with pseudo-R2 values of
0.371–0.389. The implied behavioural values of travel time savings (VTTS) for car travel for
Syd–Mel, Sydney and Melbourne are, respectively, $7.26, $7.74 and $4.50 per person hour.
The latter is based on a generic taste weight across all modes, which tends to deflate the
car-specific value. The public transport linehaul VTTSs for Syd–Mel and Sydney are respectively $6.81
and $9.97; the equivalent access plus egress VTTSs for public transport are $4.08 and $4.15. The
Melbourne access plus egress VTTS is $2.89/person hour.
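
As an illustrative check (not part of the original analysis), the VTTS figures quoted above can be recovered from the Table 6 taste weights. The sketch below assumes that each VTTS is the ratio of a time weight to the cost weight, multiplied by 60 to convert dollars per minute into dollars per person hour; the names `TABLE6_WEIGHTS`, `vtts_per_hour` and `VTTS` are hypothetical labels introduced here.

```python
# Taste weights copied from Table 6: (market, time attribute, time weight, cost weight).
# Melbourne uses a single generic linehaul weight across all modes.
TABLE6_WEIGHTS = [
    ("Syd-Mel", "car linehaul", -0.07258, -0.59985),
    ("Syd-Mel", "public transport linehaul", -0.06809, -0.59985),
    ("Syd-Mel", "access plus egress", -0.04082, -0.59985),
    ("Sydney", "car linehaul", -0.06759, -0.52395),
    ("Sydney", "public transport linehaul", -0.08708, -0.52395),
    ("Sydney", "access plus egress", -0.03625, -0.52395),
    ("Melbourne", "generic linehaul", -0.05858, -0.78084),
    ("Melbourne", "access plus egress", -0.03759, -0.78084),
]


def vtts_per_hour(time_weight, cost_weight):
    """Assumed definition: time/cost weight ratio, rescaled from $/min to $/hour."""
    return 60.0 * time_weight / cost_weight


# Round to cents so the values can be compared with those quoted in the text.
VTTS = {
    (market, attribute): round(vtts_per_hour(time_w, cost_w), 2)
    for market, attribute, time_w, cost_w in TABLE6_WEIGHTS
}

for (market, attribute), value in VTTS.items():
    print(f"{market:10s} {attribute:28s} ${value:.2f}/person hour")
```

Under this assumption the ratios agree with the values reported in the text, which also shows why VTTS is comparable across models: the unknown scale parameter cancels in the ratio.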
Comparison of the taste weights is a meaningless exercise since each model has a different
scale parameter. Our preferred basis for comparison is the marginal effects and elasticities. To
demonstrate this, let us begin with the simple multinomial logit model with only the characteristics
of each sampled individual in the utility expression, and taste weights not associated
with any particular outcome. The notation $P_j$ is used for $\mathrm{Prob}(y = j)$. By differentiation, we find
that

$$\frac{\partial\,\mathrm{Prob}(y_q = j)}{\partial \beta_k} =
\begin{cases}
P_k (1 - P_k)\,x & \text{if } j = k, \\
-P_j P_k\,x & \text{if } j \neq k.
\end{cases} \tag{1}$$

That is, every taste weight vector enters every probability. The taste weights in the model are not
the marginal effects. Indeed these marginal effects need not even have the same sign as the taste
weights. Hence the statistical significance of a taste weight does not imply the same significance for
the marginal effect (denoted $\delta_j$ below):

$$\frac{\partial\,\mathrm{Prob}[y_q = j]}{\partial x} = P_j(\beta_j - \bar{\beta}), \qquad \bar{\beta} = \sum_{k} P_k \beta_k. \tag{2}$$

It follows that neither the sign nor the magnitude of $\delta_j$ need bear any relationship to those of $\beta_j$.
The asymptotic covariance matrix for an estimator of $\delta_j$ would be computed using

$$\mathrm{Asy.Var}[\hat{\delta}_j] = G_j\,\mathrm{Asy.Var}[\hat{\beta}]\,G_j',$$

where $\hat{\beta}$ is the full parameter vector. It can be shown that

$$\mathrm{Asy.Var}[\hat{\delta}_j] = \sum_{l}\sum_{m} V_{jl}\,\mathrm{Asy.Cov}[\hat{\beta}_l, \hat{\beta}_m']\,V_{jm}', \qquad j = 0, \ldots, J, \tag{3}$$

where

$$V_{jl} = [\mathbf{1}(j = l) - P_l]\{P_j I - \delta_j x'\} - P_j \delta_l x' \tag{4}$$

and $\mathbf{1}(j = l) = 1$ if $j = l$, and 0 otherwise.
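
The point that a marginal effect need not share the sign of its taste weight can be illustrated numerically. The following is a minimal sketch of Eq. (2) only, assuming a three-outcome multinomial logit with a single individual characteristic; the weights in `BETAS` and the value in `X` are hypothetical and are not estimates from this study.

```python
import math


def mnl_probabilities(betas, x):
    """P_j = exp(beta_j'x) / sum_k exp(beta_k'x), computed stably."""
    utilities = [sum(b * xi for b, xi in zip(beta_j, x)) for beta_j in betas]
    top = max(utilities)
    exp_u = [math.exp(u - top) for u in utilities]
    total = sum(exp_u)
    return [e / total for e in exp_u]


def marginal_effects(betas, x):
    """delta_j = P_j (beta_j - beta_bar), with beta_bar = sum_k P_k beta_k (Eq. (2))."""
    probs = mnl_probabilities(betas, x)
    n_vars = len(x)
    beta_bar = [sum(p * beta_k[v] for p, beta_k in zip(probs, betas)) for v in range(n_vars)]
    return [[p * (beta_j[v] - beta_bar[v]) for v in range(n_vars)]
            for p, beta_j in zip(probs, betas)]


# Hypothetical taste weights for outcomes 0, 1, 2 (outcome 0 normalised to zero).
BETAS = [[0.0], [0.5], [1.5]]
X = [1.0]

PROBS = mnl_probabilities(BETAS, X)
DELTAS = marginal_effects(BETAS, X)

for j, (p, d) in enumerate(zip(PROBS, DELTAS)):
    print(f"outcome {j}: beta = {BETAS[j][0]:+.2f}, P = {p:.4f}, delta = {d[0]:+.4f}")
```

In this hypothetical case outcome 1 has a positive taste weight but a negative marginal effect, because its weight lies below the probability-weighted average; the marginal effects also sum to zero across outcomes, as the probabilities must sum to one.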


Since $\beta_j = \partial \log(P_j/P_0)/\partial x$, this has been suggested as an interpretation of the taste weights. ``Logit''
is not a natural unit of measurement, and is definitely not an elasticity. Thus the taste weights in
the multinomial logit model are essentially uninformative. This is why marginal rates of substitution
(e.g. value of travel time savings), marginal effects and elasticities are the preferred
behavioural outputs for model comparison. For an MNL model in which attributes of alternatives
are included as well as characteristics of sampled individuals, the marginal effects defined as
derivatives of the probabilities are given as:

$$\delta_{jm} = \frac{\partial P_j}{\partial x_m} = [\mathbf{1}(j = m) - P_j P_m]\beta. \tag{5}$$

The presence of the IIA property produces identical cross-effects. The derivative above is one
input into the more general elasticity formula:

$$\eta_{jm} = \frac{\partial \log P_j}{\partial \log x_m} = \frac{x_m}{P_j}[\mathbf{1}(j = m) - P_j P_m]\beta. \tag{6}$$

To obtain an unweighted elasticity for the sample, the derivatives and elasticities are computed by
averaging sample values. The empirical estimate of the elasticity is

$$\hat{\eta}_{jm} = \left(\frac{1}{Q}\sum_{q=1}^{Q}\frac{1}{\hat{P}_j(q)}\left[\mathbf{1}(j = m) - \hat{P}_j(q)\hat{P}_m(q)\right]\right)\beta = \left(\sum_{q=1}^{Q} w(q)\,\hat{\theta}_{jm}(q)\right)\beta, \tag{7}$$

where $\hat{P}_j(q)$ indicates the probability estimate for the qth observation and $w(q) = 1/Q$. A problem
can arise if any single observation has a very small estimated probability, as it will blow up the
estimated elasticity. There is no corresponding effect to offset this. Thus, a single outlying estimate
of a probability can produce unreasonable estimates of elasticities. To deal with this common
problem, one should compute ``probability weighted'' elasticities, by replacing the common
weight $w(q) = 1/Q$ with

$$w_j(q) = \frac{\hat{P}_j(q)}{\sum_{q=1}^{Q}\hat{P}_j(q)}. \tag{8}$$

With this construction, the observation that would cause the outlying value of the elasticity
automatically receives a correspondingly small weight in the average.
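
The effect of the weighting in Eq. (8) can be seen in a small numerical sketch. It assumes, purely for illustration, that each observation's elasticity is inversely proportional to its estimated probability (a stylised stand-in for the $1/\hat{P}_j(q)$ factor in Eq. (7)); the probabilities in `PROBS` and the constant −0.5 are hypothetical.

```python
def unweighted_mean(values):
    """Average with the common weight w(q) = 1/Q."""
    return sum(values) / len(values)


def probability_weighted_mean(values, probs):
    """Average with w_j(q) = P_j(q) / sum_q P_j(q), as in Eq. (8)."""
    total = sum(probs)
    return sum((p / total) * v for v, p in zip(values, probs))


# Hypothetical estimated probabilities; the last observation is an outlier.
PROBS = [0.40, 0.35, 0.30, 0.001]
# Stylised per-observation elasticities that grow as the probability shrinks.
ELASTICITIES = [-0.5 / p for p in PROBS]

UNWEIGHTED = unweighted_mean(ELASTICITIES)
WEIGHTED = probability_weighted_mean(ELASTICITIES, PROBS)

print(f"unweighted mean elasticity:           {UNWEIGHTED:.2f}")
print(f"probability-weighted mean elasticity: {WEIGHTED:.2f}")
```

In this example the single low-probability observation dominates the unweighted mean, while the probability-weighted mean stays close to the values of the other three observations.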
The parameter(s) of inclusive value(s) provides the basis for differences in cross-substitution
elasticities as compared to the independently and identically distributed (IID) condition of the
multinomial logit (MNL) model. The elasticity formulae for a nested logit (NL) model vary depending
on whether an alternative (for a direct elasticity) or a pair of alternatives (for a cross-elasticity) is
associated with the same branch of a nested partition. The direct elasticity is identical to the
MNL formula for an alternative m which is not in a partitioned branch (e.g. it exists in a non-nested
partition of the tree). Where alternative m is in a partitioned part of the tree, the formula has to be
modified to accommodate the correlation between alternatives within the branch. The NL direct
elasticity for a partitioned alternative is

$$\left[(1 - P_m) + \left(\frac{1}{1 - \sigma_G}\right)(1 - P_{m|G})\right]\beta_k X_{mk}. \tag{9}$$


Table 7
Summary of nested logit training models marginal effects and direct elasticities (a)
Variable                  Alternative   Syd–Mel           Sydney            Melbourne
Cost ($)                  DA            −0.93 (−6.3)      −1.14 (−7.99)     −1.87 (−12.11)
                          RS            −1.73 (−5.68)     −1.59 (−5.77)     −2.20 (−9.42)
                          BS            −0.43 (−3.61)     −0.38 (−3.59)     −0.57 (−4.18)
                          TN            −0.43 (−3.73)     −0.42 (−3.42)     −0.48 (−5.58)
                          BW            −0.54 (−4.55)     −0.50 (−4.49)     −0.65 (−5.42)
                          LR            −0.48 (−5.93)     −0.46 (−4.67)     −0.55 (−5.97)
Linehaul time (min)       DA            −1.01 (−0.76)     −1.34 (−1.03)     −1.25 (−0.91)
                          RS            −1.89 (−0.69)     −1.88 (−0.74)     −1.48 (−0.71)
                          BS            −1.92 (−0.44)     −1.85 (−0.46)     −0.35 (−0.31)
                          TN            −0.45 (−0.42)     −0.65 (−0.57)     −0.32 (−0.34)
                          BW            −0.52 (−0.52)     −0.72 (−0.75)     −0.41 (−0.41)
                          LR            −0.52 (−0.56)     −0.74 (−0.78)     −0.39 (−0.45)
Access and egress time    BS            −0.74 (−0.41)     −0.97 (−0.60)     −0.62 (−0.31)
                          TN            −0.33 (−0.31)     −0.30 (−0.24)     −0.27 (−0.22)
                          BW            −0.37 (−0.34)     −0.33 (−0.31)     −0.56 (−0.41)
                          LR            −0.37 (−0.25)     −0.37 (−0.32)     −0.28 (−0.29)
Parking cost ($)          DA            −0.42 (−1.11)     −0.45 (−1.26)     −0.73 (−1.79)
                          RS            −0.80 (−1.00)     −0.63 (−0.91)     −0.86 (−1.39)
(a) Marginal effects are in brackets and multiplied by 100.

The NL cross-elasticity for alternatives $m$ and $m'$ in a partition of the nest is

$$-\left[P_m + \left(\frac{\sigma_G}{1 - \sigma_G}\right)P_{m|G}\right]\beta_k X_{mk}. \tag{10}$$

The direct elasticities and marginal effects are summarised in Table 7. The marginal effects,
which define the partial derivative of the probability of mode choice with respect to an attribute
of choice, suggest that price has a greater impact than travel time; however, when an
elasticity is calculated we find that linehaul travel time is slightly more elastic than cost for car
for Sydney and the combined cities but less elastic for Melbourne. There appears to be no
consistent trend in the ordering of direct elasticities between Sydney and Melbourne; for example,
Melbourne commuters appear to be more sensitive to in-vehicle and parking costs
compared to Sydney commuters, but the reverse applies for linehaul time except for drive
alone.

4. Artificial neural networks approach to commuter mode choice
What makes ANN different from discrete choice methods is its use of pattern association and
error correction as the underlying mechanisms to represent a problem, in contrast to the random
utility maximisation rule. It enables a full set of human perceptions about a particular problem to
be represented by a network of neurons. The network structure shown in Fig. 2 was used for
Sydney, Melbourne and the pooled cities. The network consists of a number of successive layers of
neurons. There are three types of layers: input, hidden and output. The two outermost layers
correspond in one case to the layer which receives inputs from the external world (input layer) and
in the other to the layer which outputs the results of processing (output layer). The intermediate
layer is called the hidden layer.

Fig. 2. Structure of ANN and a selected neuron.

The neuron is the basic processing element of neural networks (see Fig. 2). Each neuron has one
output, which is related to the state of the neuron ± its activation ± and which may fan out to
several other neurons. Each neuron receives several inputs over these connections, called synapses.
All of the ``knowledge'' that a neural network possesses in the pattern association process is stored
in the synapses, the weights of the connections between neurons.
Once the knowledge is present in the synaptic weights of the network, presenting a pattern for
input to the network will produce the least error output. To build up the knowledge for a neural
network, we need to train the network by using sample data for every city. The network is taught
to produce the expected output for a given set of input patterns. Input patterns consist of attributes in Table 5 and the expected output represents the observed mode choice among a choice set
of six modes (DA, RS, TN, BS, BW and LR).
The inputs to every neuron are the activations of the incoming neurons multiplied by the
weights of the synapses. The activation of the neuron is computed by applying a threshold
function to the sum of these products. The threshold function is generally some form of non-linear function.
Fig. 3 describes two typical threshold functions: a step function (discrete) and a logistic function
(continuous). The logistic function was used in this research.


Fig. 3. Two typical threshold functions of neural networks.

These functions can be represented mathematically as follows:

$$\text{Step function: } f(x) =
\begin{cases}
1 & \text{if } x > 0, \\
f'(x) & \text{if } x = 0, \\
-1 & \text{if } x < 0,
\end{cases}
\qquad
\text{Logistic function: } f(x) = \frac{1}{1 + e^{-x}}, \tag{11}$$

where $f'(x)$ refers to the previous value of $f(x)$ (that is, the activation of the neuron will not change),
and $x$ is the summation (over all the incoming neurons) of the product of the incoming neuron's
activation and the synaptic weight of the connection

$$x = \sum_{i=0}^{n} A_i w_i, \tag{12}$$

where n is the number of incoming neurons, A the vector of incoming neurons and w is the vector
of synaptic weights connecting the incoming neurons to the neuron under study.
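
The two threshold functions and the weighted sum of Eqs. (11) and (12) can be sketched in a few lines. This is an illustrative rendering only, not the software used in the study; the function names and the example activations and weights are hypothetical.

```python
import math


def net_input(activations, weights):
    """Eq. (12): sum over incoming neurons of activation times synaptic weight."""
    return sum(a * w for a, w in zip(activations, weights))


def step(x, previous):
    """Step function of Eq. (11); at x == 0 the neuron keeps its previous activation."""
    if x > 0:
        return 1.0
    if x < 0:
        return -1.0
    return previous


def logistic(x):
    """Logistic function of Eq. (11)."""
    return 1.0 / (1.0 + math.exp(-x))


# Hypothetical incoming activations and synaptic weights for one neuron.
ACTIVATIONS = [1.0, 0.5, 0.2]
WEIGHTS = [0.3, -0.4, 1.0]

X = net_input(ACTIVATIONS, WEIGHTS)  # 0.3 - 0.2 + 0.2
print(f"net input = {X:.2f}, step = {step(X, previous=-1.0)}, logistic = {logistic(X):.4f}")
```

The logistic function is continuous and bounded between 0 and 1, which is the form used for the networks in this research.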
During the process of training, pattern associations (between input and associated expected
output) are presented to the network in sequence, and the weights are adjusted to capture this
knowledge. The weight adjustment scheme is known as the learning law. One of the learning
methods formulated was Hebbian learning. Hebb formulated the concept of ``correlation learning''.
This is the idea that the weight of a connection is adjusted based on the values of the neurons
it connects

$$\Delta w_{ij} = \alpha a_i a_j, \tag{13}$$

where $\alpha$ is the learning rate, $a_i$ the activation of the ith neuron in one neuron layer, $a_j$ the
activation of the jth neuron in another layer, and $w_{ij}$ is the connection strength between the two
neurons. A variant of this learning rule is the signal Hebbian law

$$\Delta w_{ij} = -w_{ij} + S(a_i)S(a_j), \tag{14}$$

where S is a sigmoid or logistic function as presented above.
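
For concreteness, the two Hebbian rules in Eqs. (13) and (14) can be written as one-line weight updates. The sketch below is illustrative only; the function names and the numerical values are hypothetical.

```python
import math


def logistic(x):
    """Sigmoid S(.) used by the signal Hebbian law."""
    return 1.0 / (1.0 + math.exp(-x))


def hebbian_delta(alpha, a_i, a_j):
    """Eq. (13): weight change proportional to the product of the two activations."""
    return alpha * a_i * a_j


def signal_hebbian_delta(w_ij, a_i, a_j):
    """Eq. (14): decay of the current weight plus the product of the sigmoid signals."""
    return -w_ij + logistic(a_i) * logistic(a_j)


# Hypothetical values: learning rate 0.1, activations 1.0 and 0.5.
DW_HEBB = hebbian_delta(0.1, 1.0, 0.5)
# With both activations at zero the signals are 0.5 each, so a weight of 0.25 is unchanged.
DW_SIGNAL = signal_hebbian_delta(0.25, 0.0, 0.0)

print(f"Hebbian update: {DW_HEBB:+.3f}, signal Hebbian update: {DW_SIGNAL:+.3f}")
```

Neither rule refers to a target output, which is what distinguishes them from the error correction rule used for supervised learning.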


Since the learning method just described does not test the resultant weights to see if they yield
acceptable output(s), it is described as an unsupervised learning method. In general, an
unsupervised learning method is one in which weight adjustments are not made based on comparison
with some target output. There is no ``teaching signal'' fed into the weight adjustments.
This property is known as self-organisation. Another form of training neural networks which has
gained in popularity is supervised learning. Input–output patterns are presented one after the
other to the neural network. The presentation of every input–output pattern to the network is
called a training cycle. Each cycle might involve many iterations for the network to adjust its
weights in an effort to match the desired output. This error correction mechanism can be expressed
as follows:

$$\Delta w_{ij} = \alpha a_i [c_j - b_j], \tag{15}$$

where $w_{ij}$ is the connection strength between the two neurons, $\alpha$ the learning rate, $a_i$ the activation
of the ith neuron, $b_j$ the activation of the jth neuron in the recalled pattern and $c_j$ is the desired
activation of the jth neuron.
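
A minimal sketch of the error correction rule in Eq. (15) follows. It assumes, for illustration only, a single connection feeding a logistic output neuron, so that the recalled activation is $b_j = S(w_{ij} a_i)$; the function names, learning rate and number of cycles are hypothetical and do not describe the back-propagation software used in the study.

```python
import math


def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))


def error_correction_delta(alpha, a_i, c_j, b_j):
    """Eq. (15): weight change driven by the gap between desired and recalled activation."""
    return alpha * a_i * (c_j - b_j)


def train_connection(a_i, c_j, alpha=0.5, cycles=50):
    """Repeatedly apply Eq. (15) to one weight and record the absolute error per cycle."""
    w_ij = 0.0
    errors = []
    for _ in range(cycles):
        b_j = logistic(w_ij * a_i)           # recalled activation (assumed form)
        errors.append(abs(c_j - b_j))
        w_ij += error_correction_delta(alpha, a_i, c_j, b_j)
    return w_ij, errors


# Hypothetical pattern: input activation 1.0, desired output activation 1.0.
FINAL_WEIGHT, ERRORS = train_connection(a_i=1.0, c_j=1.0)
print(f"error after first cycle: {ERRORS[0]:.3f}, after last cycle: {ERRORS[-1]:.3f}")
```

Each update moves the weight in the direction that reduces the gap between the desired and recalled activations, so the recorded error shrinks from cycle to cycle in this example.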
The error correction mechanism is implemented using the back-propagation procedure, which
transfers the error for every case (i.e. the difference between the predicted and observed mode choice)
from the output layer back to the hidden and input layers. Back-propagation neural models are the
most popular networks in applications (Faghri and Hua, 1991).
Three specific networks were constructed for Sydney, Melbourne and the pooled cities by
sharing the same three specific training data sets used to construct the logit models (see Table 3). The
learning capacity of every network is built and evaluated during the training task. Every network
was tested to determine the ability of the network to generalise when presented with patterns on
which it was not explicitly trained. In other words, for a given input, the network is tested to see if
it can recall and/or generalise its knowledge of the associative network towards the estimation of an
accurate output within a specified tolerance.
The following summarises our experiences in the process of building neural networks for the
three cities (Sydney, Melbourne and the pooled cities). First, all three cities use the
same network structure (see Fig. 2) so that a comparison can be made when different training
and testing data are used (see Table 3). The common network consists of three layers: one input,
one hidden and one output layer. The input layer consists of 12 neurons representing the
attributes of the choice. Six neurons were used to represent the six-mode choice
decision vector, in which at least one mode was chosen (the neuron for the chosen mode is set to 1,
otherwise 0).
In searching for a suitable configuration for hidden layers (i.e. number of hidden layers and the
associated number of neurons) we took into account two aspects of network behaviour which
often work against one another: generalisation and convergence. Generalisation is the
ability of the network to produce reasonable results for novel or incomplete input data once
the training process has been completed. Convergence is the ability of the network to learn the
training data within the error tolerance specified for the problem. We found that the more hidden
neurons are present, the greater the likelihood that the network will converge. However, when too
many hidden units were used, the network generalised poorly: it memorised the
training data rather than focusing on significant patterns arising from the training data. We used as
many hidden neurons as necessary to ensure convergence without using so many as to inhibit
generalisation. In this study, a 30-neuron hidden layer was selected. This decision was taken after
a number of test runs to test the sensitivity of the network performance to changes in the
size of the hidden layer (from 20 to 40 neurons). The range from 20 to 40 neurons was selected due
to the size of the input and output layers, which have 12 and 6 neurons, respectively. It was found that
a 30-neuron hidden layer provided the best result without taking so much computing time as with
the 40-neuron hidden layer.

Fig. 4. Typical training curve for Melbourne neural network.

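
The 12–30–6 layout described above can be sketched as a single forward pass. This is an illustrative sketch under several assumptions that are not taken from the study: the weights are random and untrained, bias terms are omitted, the 12 input values are hypothetical attribute levels scaled to the 0–1 range, and the predicted mode is taken to be the output neuron with the largest activation.

```python
import math
import random

N_INPUT, N_HIDDEN, N_OUTPUT = 12, 30, 6
MODES = ["DA", "RS", "TN", "BS", "BW", "LR"]


def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))


def init_weights(n_from, n_to, rng):
    """One row of synaptic weights per receiving neuron (random, untrained)."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(n_from)] for _ in range(n_to)]


def layer(inputs, weights):
    """Logistic activation of the weighted sum for every neuron in the layer."""
    return [logistic(sum(w * a for w, a in zip(row, inputs))) for row in weights]


def forward(inputs, w_hidden, w_output):
    """Input layer -> 30-neuron hidden layer -> six output neurons."""
    return layer(layer(inputs, w_hidden), w_output)


def predicted_mode(outputs):
    """Assumed decision rule: the mode whose output neuron is most active."""
    return MODES[max(range(len(outputs)), key=lambda i: outputs[i])]


rng = random.Random(0)
W_HIDDEN = init_weights(N_INPUT, N_HIDDEN, rng)
W_OUTPUT = init_weights(N_HIDDEN, N_OUTPUT, rng)

# Hypothetical input pattern: 12 attribute values scaled to [0, 1].
PATTERN = [rng.random() for _ in range(N_INPUT)]
OUTPUTS = forward(PATTERN, W_HIDDEN, W_OUTPUT)
print([round(o, 3) for o in OUTPUTS], predicted_mode(OUTPUTS))
```

Changing `N_HIDDEN` between 20 and 40 alters only the sizes of the two weight matrices, which is the dimension explored in the sensitivity runs described in the text.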
In addressing the issue of an "overtrained" neural network, a particular neural network was
trained for different numbers of training cycles (called epochs). An overtrained network might lose
its generalisation capability. A sensitivity analysis of network performance in terms of mean
square error (MSE) with respect to the number of training cycles (from 100 to 10 000 epochs)
was carried out (see Fig. 4). It was found that the MSE tends to stabilise at 1000 epochs for all
three neural networks representing the three cities. In terms of computing time, 1000 epochs of training
on 641 records for the combined Sydney and Melbourne model took about two minutes on a
Pentium 150.
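One way to read a stabilisation point off a training curve is to look for the first checkpoint after which the MSE no longer changes by more than a chosen tolerance. The following is a minimal sketch of that idea, not the procedure used in the study: the function name, the 10% tolerance and the MSE values are all assumptions, and the curve is synthetic rather than the data behind Fig. 4.

```python
import math


def stabilisation_epoch(checkpoints, mse, rel_tol=0.1):
    """Return the first checkpoint after which every relative MSE change
    between consecutive checkpoints stays below rel_tol (None if never)."""
    changes = [abs(mse[i + 1] - mse[i]) / mse[i] for i in range(len(mse) - 1)]
    for i, epoch in enumerate(checkpoints[:-1]):
        if all(change < rel_tol for change in changes[i:]):
            return epoch
    return None


# Epoch counts spanning the range examined in the text (100 to 10 000).
checkpoints = [100, 200, 500, 1000, 2000, 5000, 10000]
# Synthetic, exponentially decaying curve used purely for illustration.
synthetic_mse = [0.05 + 0.2 * math.exp(-epoch / 250) for epoch in checkpoints]

stable_at = stabilisation_epoch(checkpoints, synthetic_mse)
print(stable_at)
```

A tighter tolerance moves the stabilisation point later, so the tolerance should be set with the acceptable training time in mind.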

5. Comparison of the predictive potential of neural networks and nested logit models for commuter
mode choice
The neural network's capacity for learning has been used to construct the models to represent
the three cities. In this section, the neural network's capacity for generalisation is compared with
that of nested logit. The following model building and testing procedure was used for every city
(Sydney, Melbourne and the pooled cities).
· The same training data set (described in Tables 3 and 4) was used to construct neural network
and nested logit models.
· Each model (neural network or choice model) was evaluated by four test cases, each with a different
data set: one from the training data set that was used to train
the network, one from the testing data set of the same city, and two others from the testing data
sets of the other two cities. The first and second runs tested the memory recall capability and
generalisation capability of both models, respectively. The third and fourth runs
tested the transferability of the models to different cities. For instance, as listed in Table 4,
Syd_Train, Syd_Test, Mel_Test and SydMel_Test are the four data sets used for testing both
neural network and choice models for Sydney.
A prediction success table is used as the format for comparing the prediction capability of the
choice and ANN models in the four test cases for every city. The evaluation measures are the
predicted share less observed share for every mode of travel, the weighted
percent correct and the weighted success index. Prediction success tables for both choice and ANN
models are shown for Sydney (see Table 8), Melbourne (see Table 9) and the pooled
cities (see Table 10).
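The three measures can be computed from a cross-tabulation of observed against predicted modes. The sketch below uses one common formulation of the prediction success table, in which the percent correct for a mode is the share of its predictions that are correct and the success index subtracts the predicted share from it; the exact definitions applied in this paper may differ. The function name and the three-mode counts are hypothetical.

```python
def prediction_success(table):
    """Summarise a prediction success table.

    table[i][j] = number of travellers observed on mode i and predicted on mode j.
    Definitions are an assumed, commonly used formulation of the measures.
    """
    n = len(table)
    total = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(table[i][j] for i in range(n)) for j in range(n)]
    observed_share = [r / total for r in row_totals]
    predicted_share = [c / total for c in col_totals]
    # Predicted share less observed share, per mode.
    share_gap = [p - o for p, o in zip(predicted_share, observed_share)]
    # Percent correct per mode: correct predictions over all predictions of that mode.
    pct_correct = [table[j][j] / col_totals[j] if col_totals[j] else 0.0
                   for j in range(n)]
    # Success index per mode: percent correct net of the predicted share.
    success_index = [pc - p for pc, p in zip(pct_correct, predicted_share)]
    return {
        "share_gap": share_gap,
        "weighted_pct_correct": sum(p * pc for p, pc in zip(predicted_share, pct_correct)),
        "weighted_success_index": sum(p * s for p, s in zip(predicted_share, success_index)),
    }


# Hypothetical counts for three modes (rows observed, columns predicted).
example_table = [
    [40, 8, 2],
    [10, 26, 4],
    [5, 1, 4],
]
result = prediction_success(example_table)
print(result)
```

In this formulation a model can score well on the share gap while scoring poorly on the success index, because the first depends only on the row and column totals and the second on the diagonal of the table.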
For Sydney (see Table 8), choice models outperform ANN models in terms of the predicted
share less observed share and weighted percent correct measures. In terms of the weighted success
index, the ANN models perform reasonably well except for Case 2 with the Sydney model and
Sydney testing data. The classification power of the ANN models is measured by their weighted
success indices. An interesting finding is in Case 3 with the Sydney model and Melbourne testing
data. In this case, although the choice model provides a better weighted percent correct it does not
produce a better weighted success index. In fact, the significant gain for the ANN model in terms
of the weighted success index comes from the contribution of the percent correct for BW (busway)
and LR (light rail), the two 'new modes'.
In comparing the Melbourne choice and Melbourne ANN models, the trend continues with the
choice models giving better predicted share less observed share and weighted percent correct
measures (see Table 9). Both the choice and ANN models are equal in the percent correct and the
weighted success index.

Table 8
Comparison between Sydney choice and Sydney ANN models
Columns: Case; Model (Choice, ANN); Predicted share less observed share (DA, RS, BS, BW, TN, LR); Weighted percent correct; Weighted success index
Cases:
1. Sydney model on combined Syd–Mel testing data
2. Sydney model on Syd testing data
3. Sydney model on Mel testing data
4. Sydney model on Syd training data
Cell entries: "Better" or "Same" for each model on each measure.

Table 9
Comparison between Melbourne choice and Melbourne ANN models
Columns: Case; Model (Choice, ANN); Predicted share less observed share (DA, RS, BS, BW, TN, LR); Weighted percent correct; Weighted success index
Cases:
5. Mel model on combined Syd–Mel testing data
6. Mel model on Syd testing data
7. Mel model on Mel testing data
8. Mel model on Mel training data
Cell entries: "Better" or "Same" for each model on each measure.

Table 10
Comparison between pooled cities choice and pooled cities ANN models
Columns: Case; Model (Choice, ANN); Predicted share less observed share (DA, RS, BS, BW, TN, LR); Weighted percent correct; Weighted success index
Cases:
9. Syd–Mel model on combined Syd–Mel testing data
10. Syd–Mel model on Syd testing data
11. Syd–Mel model on Mel testing data
12. Syd–Mel model on Syd–Mel training data
Cell entries: "Better" or "Same" for each model on each measure.

Table 10 confirms the finding from Tables 8 and 9. The strength of the choice model is clearly in
the area of matching the predicted share and observed share whereas the ANN models are good at
matching individual share.

6. Conclusions
With the capacity for learning driven by pattern association and error correction mechanisms,
the neural network method was used to construct three commuter mode choice neural network
models for Sydney, Melbourne and the pooled cities. Their capacity for generalisation was
compared with nested logit models. One important finding from this research is the confirmation
of the predictive power of the choice modelling approach in matching the overall market share;
however the ANN models offer comparative appeal in matching the market share of individuals.
There is no clear indication as to which approach is better.
An important issue relating to both ANN and choice models is that a behaviourally rich
teaching data profile is required to train or construct both models. One important property of
ANN that has not been utilised in this study is the capability of distributing memory. In
neural networks 'memory' corresponds to an activation map of the neurons; this map is in
some ways a coding of facts that are stored. Memory is thus distributed over many units,
giving a valuable property, resistance to noise. In the first place, the loss of one individual
component does not necessarily cause the loss of a stored data item. This is different for
discrete choice methods, in which missing or noisy data might have a significant impact on the
performance of the model. Preprocessing of data must therefore be used to eliminate the noise if a
statistical approach (e.g. choice models) is used. In a neural network the missing data or
destruction of one memory unit only marginally changes the activation map of the neurons.
This limitation is overcome in distributed memories such as neural networks, in which it is
possible to start with noisy data and seek to make the correct data appear from the network's activation map without noise. Further research will be carried out to test this feature of
neural networks.
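The idea of distributed memory can be illustrated with a small associative memory. The sketch below is a Hopfield-style network with Hebbian weights, which is not the feedforward architecture used for the mode choice models in this paper; it is chosen only because it shows noise resistance in a few lines. The stored patterns, the corrupted positions and the removed connections are all arbitrary assumptions.

```python
def train_hopfield(patterns):
    """Hebbian weights for a Hopfield-style associative memory (zero diagonal)."""
    n = len(patterns[0])
    weights = [[0] * n for _ in range(n)]
    for pattern in patterns:
        for i in range(n):
            for j in range(n):
                if i != j:
                    weights[i][j] += pattern[i] * pattern[j]
    return weights


def recall(weights, state):
    """One synchronous update: each unit takes the sign of its weighted input."""
    n = len(state)
    fields = [sum(weights[i][j] * state[j] for j in range(n)) for i in range(n)]
    return [1 if field >= 0 else -1 for field in fields]


# Two arbitrary 16-unit patterns (values +1/-1) stored in the same weight matrix.
stored_a = [1] * 8 + [-1] * 8
stored_b = [1, -1] * 8
weights = train_hopfield([stored_a, stored_b])

# Noisy input: the first pattern with three of its sixteen values flipped.
noisy = list(stored_a)
for idx in (2, 5, 11):
    noisy[idx] = -noisy[idx]

# Remove two connections of unit 0 to mimic the loss of part of the memory.
for i, j in ((0, 2), (0, 4)):
    weights[i][j] = weights[j][i] = 0

recovered = recall(weights, noisy)
print(recovered == stored_a)
```

Because each stored pattern is spread over all pairwise weights, neither the corrupted inputs nor the removed connections prevent recall in this small example; with many more stored patterns or heavier damage, recall would eventually fail.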

Acknowledgements
The research and development reported in this paper is supported under the Australian Research Council Research Centres program. An earlier version was presented at the Eighth World
Conference on Transport Research held in Belgium in July 1998. The detailed comments of two
referees are appreciated.

References
Davalo, E., Naim, P., 1991. Neural Networks, translated by A. Rawsthorne. Macmillan, New York.
Faghri, A., Hua, J., 1991. Evaluation of artificial neural network applications in transportation engineering.
Transportation Research Record 1358, 71–80.
Hensher, D.A., Louviere, J.J., Swait, J., 1999. Combining sources of preference data. Journal of Econometrics 89,
197–221.
Hensher, D.A., Milthorpe, F.W., Lowe, M., 1995. Greenhouse gas emissions and the demand for urban passenger
transport: final report: summary of approach and selective results from application of the ITS/BTCE simulator,
Report 8. Institute of Transport Studies, The University of Sydney, November.

Louviere, J.J., Hensher, D.A., Anderson, D.A., Raimond, T., Battellino, H., 1994. Greenhouse gas emissions and the
demand for urban passenger transport: design of the stated preference experiments, Report 3. Institute of Transport
Studies, The University of Sydney, March.
Yang, H., Kitamura, R., Jovanis, P.P., Vaugh, K.M., Abdel-Aty, M.A., 1993. Exploration of route choice behaviour
with advanced traveler information using neural network concepts. Transportation 20 (2), 199–223.