6.2. Data issues

There are many issues with the data if there is no theoretical framework to constrain the solution, since the resolution of the problem depends on, and is highly sensitive to, the nature of the sample data. As a consequence, considerable resources are devoted to processing the data, with an emphasis on missing and corrupted data and the removal of bias from the sample. Additionally, where multiple sources of data are involved, the consistency of the differential semantics across these sources has to be verified.
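As a concrete illustration of this kind of screening, the following is a minimal sketch, assuming a hypothetical pandas table of policy records; the column names and plausibility bounds are invented for illustration and are not from the paper.

```python
import numpy as np
import pandas as pd

# Hypothetical policy records with one missing and one corrupted entry.
df = pd.DataFrame({
    "age": [34, 51, np.nan, 47, 29, 620],   # 620 is a corrupted value
    "claims": [0, 2, 1, np.nan, 0, 3],
})

# Flag implausible values as missing (a simple range check).
df.loc[~df["age"].between(16, 110), "age"] = np.nan

# Impute missing values with the column median, a common neutral choice.
df = df.fillna(df.median(numeric_only=True))
print(df)
```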
6.3. Emphasis on nonlinear relationships

A distinguishing assumption of this approach is that there are important nonlinearities both between the observables (the independent variables) and the dependent variable, as well as nonlinearities among the observables themselves. The emphasis is on not making unjustified assumptions about the nature of those nonlinearities, and technologies are used that have the capacity, in theory at least, to extract the appropriate interaction terms adaptively.
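The point can be made concrete with a small numpy sketch: a purely linear fit misses an interaction between two observables, while a specification that includes the product term recovers it. The data-generating process here is an invented example, not one from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
x1, x2 = rng.normal(size=(2, 500))
y = 2.0 * x1 * x2 + rng.normal(scale=0.1, size=500)  # pure interaction effect

ones = np.ones_like(x1)
X_lin = np.column_stack([ones, x1, x2])           # linear terms only
X_int = np.column_stack([ones, x1, x2, x1 * x2])  # plus the interaction term

for name, X in [("linear", X_lin), ("with interaction", X_int)]:
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    print(f"{name:17s} residual variance: {resid.var():.3f}")
```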
6.4. Domain knowledge

As mentioned previously, the technologies do not always achieve their ends, because of the signal-to-noise ratios in the sample data. Of necessity, in these instances, the approach is to constrain the solution by introducing expert knowledge into the process. The result is not quite a theoretical framework, but rather a heuristic framework that helps constrain the solution space.
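One way such a heuristic constraint might look in practice is sketched below: the fitted coefficients are penalized for deviating from an expert-supplied prior vector. The penalty form and the prior values are illustrative assumptions, not the authors' method.

```python
import numpy as np

def constrained_fit(X, y, beta_prior, lam):
    """Minimize ||y - X b||^2 + lam * ||b - beta_prior||^2 (closed form)."""
    A = X.T @ X + lam * np.eye(X.shape[1])
    b = X.T @ y + lam * beta_prior
    return np.linalg.solve(A, b)

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 3))               # small, noisy sample
y = X @ np.array([1.0, 0.5, -0.5]) + rng.normal(scale=2.0, size=30)
prior = np.array([1.0, 0.5, -0.5])         # expert's belief about the effects
print(constrained_fit(X, y, prior, lam=5.0))
```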
7. The model development process
An overview of the key features of the general model development process is shown in Fig. 11⁹ and previewed in this section. The process involves data preprocessing, domain segmentation, variable selection, model development, and benchmarking and verification.

Fig. 11. Model development process.

⁹ Adopted from Gorman (1996, Slide 7).
7.1. Data preprocessing

The data preprocessing stage focuses on the reduction of inconsistencies and bias in the data, and on the development of aggregate information as a proxy for relevant individual characteristics.
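A minimal sketch of the aggregate-as-proxy idea follows: when an individual characteristic is unavailable, a group-level average can stand in for it. The grouping key and values are hypothetical.

```python
import pandas as pd

policies = pd.DataFrame({
    "region": ["A", "A", "B", "B", "B"],
    "claim":  [0, 1, 1, 0, 1],
})

# Region-level claim frequency used as a proxy feature for each policy.
policies["region_claim_rate"] = (
    policies.groupby("region")["claim"].transform("mean")
)
print(policies)
```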
7.2. Domain segmentation

Domain segmentation is actually a part of data preprocessing but, because of its importance, it is shown here as a separate step. In addition to rule induction, which was mentioned previously, it involves such technologies as supervised and unsupervised clustering and gated architectures, each of which is discussed below. Where appropriate, models are attempted within these domains.
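As a simple illustration of unsupervised segmentation, the sketch below clusters a synthetic two-feature sample into two segments with k-means; the sample and the number of segments are assumptions made for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)
# Two synthetic sub-populations with different centers.
X = np.vstack([rng.normal(0.0, 1.0, size=(100, 2)),
               rng.normal(5.0, 1.0, size=(100, 2))])

segments = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
for k in np.unique(segments):
    print(f"segment {k}: {np.sum(segments == k)} records")
    # ...a separate model would then be developed within each segment.
```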
7.3. Variable selection

One way to reduce the amount of variability in the model is to constrain the number of predictors. Of course, this process must be balanced against the need to preserve information, and this can be accomplished using traditional approaches like regression analysis and sensitivity analysis. The essence of this process was discussed by Brockett et al. (1994, pp. 411–412). In some cases, it is possible to use technologies that prune the parameters as the model learns the problem.¹⁰ Similarly, weight pruning¹¹ and decay¹² can be used.
¹⁰ An example of parameter pruning would be the discarding of subordinate solutions during the training stage of a GA.
¹¹ Weight pruning refers to the adjustment of weights in a weighted procedure. An example is the adjustment of weights that takes place through the back-propagation algorithm of NNs.
¹² Decay emulates the process of forgetting over time. If, for example, the model incorporated a series of patterns where the weight associated with the most recent pattern is 1, decay could be modeled by assigning pattern n a weight of wⁿ, 0 ≤ w ≤ 1.
The technologies mentioned earlier in this section were listed in the order of the amount of manual heuristic knowledge inherent in each stage. Ideally, tasks are pushed down to where the development is automatic, the structure in the data is used to extract domain boundaries, and the information in the data is used to extract the interaction terms. Again, however, the process can be thwarted by small sample size and poor signal-to-noise ratios.
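To make the sensitivity analysis mentioned above concrete, the following sketch fits a simple stand-in model, perturbs each predictor in turn, and ranks the predictors by the resulting shift in the fitted output; the data and the linear stand-in model are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 4))
y = 3.0 * X[:, 0] + 0.2 * X[:, 1] + rng.normal(size=200)  # x2, x3 irrelevant

design = np.column_stack([np.ones(200), X])
beta, *_ = np.linalg.lstsq(design, y, rcond=None)

def predict(Z):
    return beta[0] + Z @ beta[1:]

base = predict(X)
for j in range(X.shape[1]):
    Xp = X.copy()
    Xp[:, j] += X[:, j].std()        # perturb one predictor by one s.d.
    shift = np.abs(predict(Xp) - base).mean()
    print(f"x{j}: mean output shift {shift:.3f}")
# Predictors producing a negligible shift are candidates for removal.
```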
7.4. Model development

Once the best performance predictors have been identified, the next step is the development of the nonlinear model, and a considerable portion of this paper is devoted to that topic. Related issues that need to be reconciled are the advantages and disadvantages of the linear and nonlinear paradigms, and the reasons for taking on the complexities of trying to extract nonlinearities.
7.5. Benchmarking and validation

The final step in the model development process is benchmarking and model validation. The latter is a part of comparative performance testing and is done iteratively during model development to verify that if the approach adds complexity, it also adds comparable value.¹³
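A minimal sketch of this kind of comparative testing follows: a more complex model is compared against a linear benchmark on held-out data, and the added complexity is retained only if it pays for itself. The data split and the two candidate models are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.uniform(-2, 2, size=300)
y = np.sin(2 * x) + rng.normal(scale=0.2, size=300)   # nonlinear truth
train, test = slice(0, 200), slice(200, 300)

def holdout_mse(design):
    """Fit on the training rows, report squared error on held-out rows."""
    beta, *_ = np.linalg.lstsq(design[train], y[train], rcond=None)
    return np.mean((y[test] - design[test] @ beta) ** 2)

linear = np.column_stack([np.ones_like(x), x])
cubic = np.column_stack([np.ones_like(x), x, x**2, x**3])  # more complex
print("linear benchmark MSE:", holdout_mse(linear))
print("cubic model MSE:     ", holdout_mse(cubic))
```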
It should be clear that the approach is very empirical and that the nature of the problem determines the approach. This is even more apparent in the remainder of the paper, where the details of each of these steps are discussed.
8. Data preprocessing