
The technologies mentioned earlier in this section were listed in order of the amount of manual, heuristic knowledge inherent in each stage. Ideally, tasks are pushed down to where development is automatic, the structure in the data is used to extract domain boundaries, and the information in the data is used to extract the interaction terms. Again, however, the process can be thwarted by small sample sizes and poor signal-to-noise ratios.

7.4. Model development

Once the best performance predictors have been identified, the next step is the development of the nonlinear model, and a considerable portion of this paper is devoted to that topic. Related issues that need to be reconciled are the advantages and disadvantages of the linear and nonlinear paradigms, and the reasons for taking on the complexities of trying to extract nonlinearities.

7.5. Benchmarking and validation

The final step in the model development process is benchmarking and model validation. The latter is part of comparative performance testing and is done iteratively during model development to verify that if the approach adds complexity, it also adds comparable value. 13

It should be clear that the approach is very empirical and that the nature of the problem determines the approach. This is even more apparent in the remainder of the paper, where the details of each of these steps are discussed.

13 The accounting profession refers to this consideration as the "materiality criterion".

8. Data preprocessing

The primary considerations in data preprocessing are to reconcile disparate sources of data, to reduce or eliminate intrinsic data bias, and to aggregate variables, when appropriate. These issues are addressed in this section.

8.1. Reconcile disparate sources of data

Generally, a number of sources of data are needed to develop the model. These might include data from insurers and agencies, household demographics, econometric data, and the client's internal transaction data. Consequently, reconciling these disparate sources of data becomes critical.

8.2. Intrinsic data bias

Another challenge when dealing with data is to reduce some of its internal biases. In the area of consumer behavior, for example, where adverse selection is the issue, the insured database may provide limited guidance in some cases because it contains only insureds; the people that need to be identified on the adverse side already have been selected away. So, strategies need to be developed to compensate for these biases.

8.3. Aggregate variables

One productive approach is to develop a set of aggregate variables that take the raw variables from these sources and bring them together into a concise set of aggregates. A common example of this is the use of residential areas as a proxy for socioeconomic characteristics. 14 Where this is done, that level will typically be used to begin the modeling process.

As discussed by Bishop (1995, Section 8.6.2), one might approach this issue using a kind of neural network architecture that is autoassociative, 15 which tries to reproduce the patterns presented at the input at the output through a narrow layer of hidden units. This results in compression and, at the same time, takes advantage of nonlinearities or interaction terms between the observables.

Another approach would be to use nonlinear compression, which is in a sense the nonlinear correlate of factor analysis 16 or principal components. 17 This can be accomplished, for example, with a four-layer autoassociative network, where the first and third hidden layers have sigmoidal nonlinear activation functions, as sketched below.

14 The use of aggregate information as a proxy for the individual characteristics of interest has to be used with care because it can result in biases. The reason is that aggregate proxies tend to exaggerate the effects of micro-level variables and to do more poorly than micro-level variables at controlling for confounding. This has been found, for example, when socioeconomic characteristics of residential areas, such as the median income associated with a zip code, are used to proxy for individual characteristics. See Geronimus et al. (1996).

15 An autoassociative network is a network whose target data set is identical to its input data set.
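As an illustration of the four-layer autoassociative network just described, a minimal sketch in Python follows. It is not taken from the paper: the layer sizes, learning rate, number of training epochs, and synthetic data are assumptions chosen purely for exposition. Following Bishop's formulation, the first and third hidden layers are sigmoidal and the narrow middle (bottleneck) layer is linear.

```python
# Minimal sketch of a four-layer autoassociative network:
# input -> sigmoidal hidden layer -> narrow linear bottleneck
#       -> sigmoidal hidden layer -> linear reconstruction of the input.
# The bottleneck activations serve as nonlinear "aggregate variables".
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class Autoassociator:
    def __init__(self, n_in, n_hidden, n_bottleneck):
        init = lambda r, c: rng.normal(0.0, 0.1, (r, c))  # small random weights
        self.W1, self.b1 = init(n_in, n_hidden), np.zeros(n_hidden)
        self.W2, self.b2 = init(n_hidden, n_bottleneck), np.zeros(n_bottleneck)
        self.W3, self.b3 = init(n_bottleneck, n_hidden), np.zeros(n_hidden)
        self.W4, self.b4 = init(n_hidden, n_in), np.zeros(n_in)

    def forward(self, X):
        self.h1 = sigmoid(X @ self.W1 + self.b1)        # first hidden layer (sigmoidal)
        self.z = self.h1 @ self.W2 + self.b2            # linear bottleneck (the aggregates)
        self.h3 = sigmoid(self.z @ self.W3 + self.b3)   # third hidden layer (sigmoidal)
        return self.h3 @ self.W4 + self.b4              # linear reconstruction of the input

    def train_step(self, X, lr=0.2):
        Xhat = self.forward(X)
        n = X.shape[0]
        # Backpropagate the mean squared reconstruction error.
        d4 = (Xhat - X) / n
        d3 = (d4 @ self.W4.T) * self.h3 * (1 - self.h3)
        d2 = d3 @ self.W3.T                             # bottleneck is linear
        d1 = (d2 @ self.W2.T) * self.h1 * (1 - self.h1)
        for W, b, d, a in [(self.W4, self.b4, d4, self.h3),
                           (self.W3, self.b3, d3, self.z),
                           (self.W2, self.b2, d2, self.h1),
                           (self.W1, self.b1, d1, X)]:
            W -= lr * a.T @ d
            b -= lr * d.sum(axis=0)
        return float(np.mean((Xhat - X) ** 2))

# Illustrative data: six observed variables driven by two latent factors,
# standing in for raw source variables that are to be aggregated.
latent = rng.normal(size=(500, 2))
X = np.tanh(latent @ rng.normal(size=(2, 6))) + 0.05 * rng.normal(size=(500, 6))

net = Autoassociator(n_in=6, n_hidden=8, n_bottleneck=2)
for epoch in range(2000):
    mse = net.train_step(X)
net.forward(X)       # refresh stored activations on the full data set
codes = net.z        # the two nonlinear "aggregate variables"
print(f"final reconstruction MSE: {mse:.4f}")
```

The activations of the narrow middle layer, codes, play the role of the compressed aggregate variables. With linear units throughout, such a network can recover no more than the principal-component subspace; it is the sigmoidal hidden layers that allow it to exploit nonlinearities and interaction terms among the observables.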

9. Domain segmentation