The model development process

A.F. Shapiro, R. Paul Gorman / Insurance: Mathematics and Economics 26 (2000) 289–307

6.2. Data issues

If there is no theoretical framework to constrain the solution, many issues arise with the data, since the resolution of the problem depends on, and is highly sensitive to, the nature of the sample data. As a consequence, considerable resources are devoted to processing the data, with an emphasis on missing and corrupted data and the removal of bias from the sample. Additionally, where multiple sources of data are involved, the consistency of the differential semantics across these sources has to be verified.

6.3. Emphasis on nonlinear relationships

A distinguishing assumption of this approach is that there are important nonlinearities both between the observable (independent) variables and the dependent variable and among the observables themselves. The emphasis is on not making unjustified assumptions about the nature of those nonlinearities, and technologies are used that have the capacity, in theory at least, to extract the appropriate interaction terms adaptively.

6.4. Domain knowledge

As mentioned previously, the technologies do not always achieve their ends, because of the signal-to-noise ratios in the sample data. Of necessity, in these instances, the approach is to constrain the solution by introducing expert knowledge into the process. The result is not quite a theoretical framework, but rather a heuristic framework that helps constrain the solution space.
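The data-quality steps emphasized in Section 6.2 can be sketched as a simple preprocessing pass that discards corrupted records and mean-imputes missing values. The record layout and the rule for flagging a record as corrupted are hypothetical, chosen only for illustration:

```python
# Minimal sketch of the Section 6.2 cleaning pass: drop corrupted
# records, then impute missing entries with the column mean.
# The "negative value = corrupted" rule is a hypothetical stand-in
# for whatever domain-specific check the data actually requires.

def clean_records(records):
    """Drop corrupted records, then mean-impute missing (None) values."""
    # Hypothetical corruption rule: any negative value marks the record.
    valid = [r for r in records if all(v is None or v >= 0 for v in r)]

    # Column means computed over the observed (non-missing) entries.
    n_cols = len(valid[0])
    means = []
    for j in range(n_cols):
        observed = [r[j] for r in valid if r[j] is not None]
        means.append(sum(observed) / len(observed))

    # Replace each missing entry with its column mean.
    return [[means[j] if r[j] is None else r[j] for j in range(n_cols)]
            for r in valid]

sample = [[1.0, 2.0], [None, 4.0], [-1.0, 5.0]]  # one gap, one corrupted row
cleaned = clean_records(sample)
```

Mean imputation is only one of many possible choices; the point, in the spirit of the section, is that such decisions consume real resources and directly shape the solution when no theoretical framework constrains it.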

7. The model development process

An overview of the key features of the general model development process is shown in Fig. 11 [9] and previewed in this section. The process involves data preprocessing, domain segmentation, variable selection, model development, and benchmarking and verification.

Fig. 11. Model development process.

7.1. Data preprocessing

The data preprocessing stage focuses on the reduction of inconsistencies and bias in the data, and the development of aggregate information as a proxy for relevant individual characteristics.

7.2. Domain segmentation

Domain segmentation is actually a part of data preprocessing, but because of its importance, it is shown here as a separate step. In addition to rule induction, which was mentioned previously, it involves such technologies as supervised and unsupervised clustering and gated architectures, each of which is discussed below. Where appropriate, models are attempted within these domains.

7.3. Variable selection

One way to reduce the amount of variability in the model is to constrain the number of predictors. Of course, this process must be balanced with the need to preserve information, and this can be accomplished using traditional approaches like regression analysis and sensitivity analysis. The essence of this process was discussed by Brockett et al. (1994, pp. 411–412). In some cases, it is possible to use technologies that prune the parameters as the model learns the problem. [10] Similarly, weight pruning [11] and decay [12] can be used.

The technologies mentioned earlier in this section were listed in order of the amount of manual heuristic knowledge inherent in each stage. Ideally, tasks are pushed down to where the development is automatic: the structure in the data is used to extract domain boundaries, and information in the data is used to extract the interaction terms. Again, however, the process can be thwarted by small sample size and poor signal-to-noise ratios.

7.4. Model development

Once the best performance predictors have been identified, the next step is the development of the nonlinear model, and a considerable portion of this paper is devoted to that topic. Related issues that need to be reconciled are the advantages and disadvantages of both the linear and nonlinear paradigms, and the reasons for taking on the complexities of trying to extract nonlinearities.

7.5. Benchmarking and validation

The final step in the model development process is benchmarking and model validation. The latter is a part of comparative performance testing and is done iteratively during model development to verify that if the approach adds complexity, it also adds comparable value. [13]

It should be clear that the approach is very empirical and that the nature of the problem determines the approach. This is even more apparent in the remainder of the paper, where the details of each of these steps are discussed.

[9] Adopted from Gorman (1996), Slide 7.
[10] An example of parameter pruning would be the discarding of subordinate solutions during the training stage of a GA.
[11] Weight pruning refers to the adjustment of weights in a weighted procedure. An example is the adjustment of weights that takes place through the back-propagation algorithm of NNs.
[12] Decay emulates the process of forgetting over time. If, for example, the model incorporated a series of patterns where the weight associated with the most recent pattern is 1, decay could be modeled by assigning pattern n a weight of w^n, 0 ≤ w ≤ 1.
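The decay scheme in footnote 12 can be sketched directly: the most recent pattern (n = 0) receives weight 1, and the n-th most recent pattern receives weight w^n for a decay factor between 0 and 1. The function name and the sample decay factor below are illustrative, not from the paper:

```python
# Sketch of the footnote-12 decay weighting: pattern n (counting back
# from the most recent, n = 0) receives weight w**n, with 0 <= w <= 1,
# so older patterns are progressively "forgotten".

def decay_weights(n_patterns, w):
    """Return the weight w**n for each of n_patterns, newest first."""
    if not 0.0 <= w <= 1.0:
        raise ValueError("decay factor w must lie in [0, 1]")
    return [w ** n for n in range(n_patterns)]

# With a hypothetical decay factor of 0.5, four patterns receive
# weights 1, 0.5, 0.25, and 0.125, newest to oldest.
weights = decay_weights(4, 0.5)
```

Setting w = 1 recovers equal weighting (no forgetting), while w = 0 discards everything but the most recent pattern, so the single factor w spans the full range between the two extremes.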

8. Data preprocessing