298 A.F. Shapiro, R. Paul Gorman / Insurance: Mathematics and Economics 26 (2000) 289–307
The technologies mentioned earlier in this section were listed in the order of the amount of manual heuristic knowledge inherent in each stage. Ideally, tasks are pushed down to where the development is automatic and the structure in the data is used to extract domain boundaries and information in the data is used to extract the interaction terms. Again, however, the process can be thwarted by small sample sizes and poor signal-to-noise ratios.
7.4. Model development

Once the best performance predictors have been identified, the next step is the development of the nonlinear model, and a considerable portion of this paper is devoted to that topic. Related issues that need to be reconciled are the advantages and disadvantages of both the linear and nonlinear paradigms, and the reasons for taking on the complexities of trying to extract nonlinearities.
7.5. Benchmarking and validation

The final step in the model development process is benchmarking and model validation. The latter is a part of comparative performance testing and is done iteratively during model development to verify that if the approach adds complexity, it also adds comparable value.[13]
It should be clear that the approach is very empirical and that the nature of the problem determines the approach. This is even more apparent in the remainder of the paper, where the details of each of these steps are discussed.
8. Data preprocessing
The primary considerations in data preprocessing are to reconcile disparate sources of data, to reduce or eliminate intrinsic data bias, and to aggregate variables, when appropriate. These issues are addressed in this section.
[13] The accounting profession refers to this consideration as the "materiality criterion".
8.1. Reconcile disparate sources of data

Generally, a number of sources of data are needed to develop the model. These might include insurer and agency data, household demographics, econometric data, and the client's internal transaction data. Consequently, reconciling disparate sources of data becomes critical.
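As a minimal illustration of the reconciliation step (the record layouts, keys, and figures below are invented for this sketch), disparate sources can be brought together by normalizing field names and joining on a shared key, flagging records that fail to match so coverage gaps between sources surface rather than being silently dropped:

```python
# Hypothetical sketch: reconciling a policy-administration extract with a
# household-demographics extract. All field names and values are illustrative.

policy_admin = [
    {"policy_id": "P-001", "premium": 480.0},
    {"policy_id": "P-002", "premium": 610.0},
]
demographics = [
    {"POLICY_ID": "P-001", "zip": "16802", "hh_income": 52000},
    {"POLICY_ID": "P-003", "zip": "10001", "hh_income": 71000},
]

# Step 1: normalize field names so the two sources agree on the join key.
demo_by_id = {
    rec["POLICY_ID"]: {"zip": rec["zip"], "hh_income": rec["hh_income"]}
    for rec in demographics
}

# Step 2: left-join on the shared key; the "matched" flag records whether a
# counterpart existed in the second source.
reconciled = []
for rec in policy_admin:
    match = demo_by_id.get(rec["policy_id"])
    merged = dict(rec, **(match or {}), matched=match is not None)
    reconciled.append(merged)
```

In practice, the matched/unmatched counts from such a join are themselves diagnostic, since a low match rate often reveals inconsistent key definitions across sources.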
8.2. Intrinsic data bias

Another challenge when dealing with data is to reduce some of its internal biases. In the area of consumer behavior, for example, where adverse selection is the issue, the insured database may provide limited guidance in some cases because it contains only insureds, and the people who need to be identified on the adverse side already have been selected away. So, strategies need to be developed to compensate for these biases.
8.3. Aggregate variables

One productive approach is to develop a set of aggregate variables that take the raw state of these sources of variables and bring them together into a concise set of aggregates. A common example of this is the use of residential areas as a proxy for socioeconomic characteristics.[14] Where this is done, that level will typically be used to begin the modeling process.
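A small sketch of this kind of aggregation (the zip codes and income figures are invented): area-level summaries are computed once and then attached to each individual record as a proxy for an unobserved individual characteristic:

```python
from statistics import median

# Hypothetical sketch: the median income of a residential area (here a zip
# code) used as an aggregate proxy for unobserved individual socioeconomic
# status. All figures are invented.

area_incomes = {
    "16802": [41000, 52000, 47000],
    "10001": [68000, 71000, 90000],
}

# Collapse each area to a single aggregate variable.
area_proxy = {z: median(v) for z, v in area_incomes.items()}

# Attach the area-level aggregate to each individual record.
individuals = [{"id": 1, "zip": "16802"}, {"id": 2, "zip": "10001"}]
for person in individuals:
    person["income_proxy"] = area_proxy[person["zip"]]
```

As the footnote on aggregate proxies cautions, such area-level substitutes can exaggerate effects and control poorly for confounding, so they are a starting point rather than a substitute for micro-level data.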
As discussed by Bishop (1995, Section 8.6.2), one might approach this issue using a kind of neural network architecture that is autoassociative,[15] which tries to predict at the output the same patterns that are at the input, through a narrow number of units. This results in compression and at the same time takes advantage of nonlinearities or interaction terms between the observables.
[14] The use of aggregate information as a proxy for the individual characteristics of interest has to be used with care because it can result in biases. The reason for this is that there is a tendency for aggregate proxies to exaggerate the effects of micro-level variables and to do more poorly than micro-level variables at controlling for confounding. This has been found, for example, when socioeconomic characteristics of residential areas, such as the median income associated with a zip code, are used to proxy for individual characteristics. See Geronimus et al. (1996).
[15] An autoassociative network is a network whose target data set is identical to the input data set.
Another approach would be to use nonlinear compression, which is a kind of nonlinear correlate to factor analysis[16] or principal components.[17] This can be accomplished, for example, with a four-layer autoassociative network, where the first and third hidden layers have sigmoidal nonlinear activation functions.
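The four-layer architecture can be sketched as follows. This is a minimal illustration, not the authors' implementation: the layer sizes, synthetic data, learning rate, and training loop are all invented for the example; only the structure (sigmoidal first and third hidden layers around a narrow linear bottleneck, with the input as its own target) follows the text and Bishop (1995, Section 8.6.2):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Synthetic observables: 4 inputs driven by 2 latent factors, with an
# interaction term, so there is genuine nonlinear structure to compress.
latent = rng.normal(size=(200, 2))
X = np.column_stack([latent[:, 0], latent[:, 1],
                     latent[:, 0] * latent[:, 1], latent[:, 0] ** 2])
X = (X - X.mean(axis=0)) / X.std(axis=0)

# Layer sizes: input, sigmoid hidden, linear bottleneck, sigmoid hidden, output.
sizes = [4, 3, 2, 3, 4]
W = [rng.normal(scale=0.5, size=(a, b)) for a, b in zip(sizes[:-1], sizes[1:])]
b = [np.zeros(n) for n in sizes[1:]]

def forward(X):
    h1 = sigmoid(X @ W[0] + b[0])   # first hidden layer (sigmoidal)
    z = h1 @ W[1] + b[1]            # narrow linear bottleneck: the compressed code
    h3 = sigmoid(z @ W[2] + b[2])   # third hidden layer (sigmoidal)
    out = h3 @ W[3] + b[3]          # linear reconstruction of the inputs
    return h1, z, h3, out

_, _, _, out0 = forward(X)
mse0 = np.mean((out0 - X) ** 2)     # reconstruction error before training

# Train by plain gradient descent; the target is the input itself.
lr = 0.1
for epoch in range(500):
    h1, z, h3, out = forward(X)
    g_out = (out - X) / len(X)
    g_h3 = (g_out @ W[3].T) * h3 * (1 - h3)
    g_z = g_h3 @ W[2].T
    g_h1 = (g_z @ W[1].T) * h1 * (1 - h1)
    grads = [(X.T @ g_h1, g_h1), (h1.T @ g_z, g_z),
             (z.T @ g_h3, g_h3), (h3.T @ g_out, g_out)]
    for i, (gW, gb) in enumerate(grads):
        W[i] -= lr * gW
        b[i] -= lr * gb.sum(axis=0)

_, code, _, recon = forward(X)
mse = np.mean((recon - X) ** 2)     # should fall below mse0
```

The two-unit bottleneck activations (`code`) play the role of the nonlinear factor scores; because the surrounding hidden layers are sigmoidal, the compression can exploit the interaction terms that a linear principal-components projection would miss.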
9. Domain segmentation