
Fig. 15. Model performance: low signal-to-noise case.

very well. This is a consequence of the many assumptions made in a linear model about the underlying structure. In contrast, as the solution tends to the standard canonical nonlinear architectures, where fewer and fewer assumptions are made, the ability to capture the nonlinearities in the problem is improved. This was an important result, since relatively simple components can be pieced together to capture very nonlinear behavior.

11.5. Financial models

Fig. 15 portrays the complication that occurs when the foregoing technologies are applied to low signal-to-noise situations, such as those that often accompany financial modeling. Now the nonlinear model does not capture the nonlinear process (the solid line) very well. The reason is that, while NNs and architectures of that type have low bias, very few assumptions are made and there is a tendency to overfit. This, coupled with finite sample data, leads to significant problems with the variance. So, depending on the initial conditions and the sample drawn from the overall population, widely varying solutions can be obtained. In many cases, linear solutions are better in the sense that they capture the underlying structure, at least the first-order structure, much better than the high-dimensional, low-bias models. This follows because of the bias imposed on the solution by the linear technique.

The foregoing anomaly arose because of the enormous change in the underlying characteristics of the problem. Initially, the problem involved improving classification or decision performance from the 80% range up to the 95–98% range. When it came to financial issues, however, the problem became one of achieving one or two percentage points over chance, and it was clear that if this was to be accomplished, the high-variance issue had to be addressed. Part of the solution involved domain segmentation, variable selection, the use of aggregates, and so forth. In addition, embedded expert knowledge was used to impose constraints on the solutions of these low-bias models in order to avoid the problem of overfitting. This is a very heuristic and ad hoc approach but, to date, there is no satisfactory alternative.
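The bias-variance effect described above can be illustrated with a small simulation. The sketch below is not from the paper; the functions, noise level, and polynomial degrees are illustrative assumptions, with a high-degree polynomial standing in for a low-bias, unconstrained model. It fits a linear model and a flexible model to repeated small, noisy samples from the same nonlinear process and compares how much the fitted curves vary across samples.

import numpy as np

rng = np.random.default_rng(0)

def true_process(x):
    # Underlying nonlinear signal (illustrative choice, not from the paper)
    return np.sin(2.0 * np.pi * x)

def draw_sample(n=25, noise_sd=1.0):
    # Low signal-to-noise: noise comparable in size to the signal amplitude
    x = rng.uniform(0.0, 1.0, n)
    y = true_process(x) + rng.normal(0.0, noise_sd, n)
    return x, y

grid = np.linspace(0.0, 1.0, 200)
linear_fits, flexible_fits = [], []

for _ in range(200):  # repeated finite samples from the same population
    x, y = draw_sample()
    linear_fits.append(np.polyval(np.polyfit(x, y, 1), grid))    # high bias, low variance
    flexible_fits.append(np.polyval(np.polyfit(x, y, 9), grid))  # low bias, high variance

linear_fits = np.array(linear_fits)
flexible_fits = np.array(flexible_fits)

# Average pointwise variance of the fitted curves across resamples
print("variance, linear model:  ", linear_fits.var(axis=0).mean())
print("variance, flexible model:", flexible_fits.var(axis=0).mean())

Under noise of this magnitude the flexible fit typically shows a far larger variance across resamples than the linear fit, which is the sense in which the biased linear solution can capture the first-order structure better than a high-dimensional, low-bias model on finite, noisy data.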

12. Model constraints

Turning now to model constraints, Fig. 16 lists the types of networks that have been adopted, and their associated technologies, in order of the extent to which they can be constrained.

Fig. 16. Networks and their associated technologies.

The first set of technologies listed has the lowest bias, in the sense that it involves making the fewest assumptions, although it is the most sensitive when confronted with finite samples or low signal-to-noise. Further down the list are approaches that constrain the problem more and more, to the point where, if there are very few samples and the data are very noisy, some heuristic information has to be embedded into the problem to get it to converge properly.

12.1. Neural networks

The unconstrained, unbiased models include various types of networks: the traditional multilayer perceptron (Wang et al., 1995, p. 39); the finite impulse response (FIR) networks (Hayes, 1996, p. 12), used for capturing nonstationary temporal behavior; and the gated experts (Atiya et al., 1998), which attempt to dynamically determine the boundaries within the population while the models within those boundaries are being optimized [33].

Unsupervised and autoassociative networks are also used for clustering and compression. These are typically networks that do not use any kind of output in the determination of an optimal solution; they simply look at correlations within the data to determine groupings.

[33] Gated experts have a lot of promise but have been found to be very sensitive to sample size and noise. Essentially, there is a gate that learns to determine which one of these experts, all of whom are training on the same data, is doing the best job. This information is used to begin to cordon off the part of the population each expert focuses on. Thus, the result is the best of both worlds, in the sense that both domain segmentation and modeling are carried out at one and the same time. For certain problems this has been a very powerful approach; for many problems, however, it does not work.

12.2. Nonlinear kernel networks

The next level of technologies typically used consists of kernel techniques (Duflo, 1997, Chapter 7), which can be employed to impose a number of constraints on the convergence process and vastly reduce the number of parameters that have to be optimized. These include radial-basis functions [34] (Wang et al., 1995, p. 42), which are hypersphere-type functions; generalized regression neural networks (GRNNs) (Masters, 1995), which involve a three-layer network with one hidden neuron for each training pattern [35]; and Gabor networks (Feichtinger and Strohmer, 1997), which involve the simultaneous analysis of signals in time and frequency. Most commonly, there will be some form of the Gaussian kernel (Bishop, 1995, Section 2.5.3) that is centered (positioned at a centroid) within the sample space, and the variance associated with that kernel can be either fixed a priori or adapted, depending on the problem. This lends a kind of bias to the problem and allows small sample sizes and noisy situations to be accommodated.

[34] These are second-order nonlinear basis functions.

[35] The GRNN works by measuring how far a given sample pattern is from the patterns in the training set in N-dimensional space, where N is the number of inputs. It is often used for the estimation of continuous variables.
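The GRNN described in footnote 35 can be made concrete with a short sketch. The code below is an illustrative implementation, not taken from the paper; the function name, the single fixed bandwidth sigma, and the toy data are assumptions. Each training pattern contributes one Gaussian kernel (hence one hidden neuron per training pattern), and the prediction is the kernel-weighted average of the training targets.

import numpy as np

def grnn_predict(x_train, y_train, x_new, sigma=0.5):
    # Squared Euclidean distances from the new pattern to every training pattern
    d2 = np.sum((x_train - x_new) ** 2, axis=1)
    # Kernel activations: the hidden layer, one Gaussian neuron per training pattern
    w = np.exp(-d2 / (2.0 * sigma ** 2))
    # Output: kernel-weighted average of the training targets
    return np.dot(w, y_train) / (np.sum(w) + 1e-12)

# Toy illustration (assumed data): estimate a continuous variable from two inputs
rng = np.random.default_rng(1)
X = rng.uniform(-1.0, 1.0, size=(50, 2))
y = X[:, 0] ** 2 + 0.5 * X[:, 1] + rng.normal(0.0, 0.1, size=50)

print(grnn_predict(X, y, np.array([0.2, -0.3])))

The bandwidth sigma plays the role of the kernel variance mentioned above: fixing it a priori imposes more bias on the solution, while adapting it from the data relaxes that constraint.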
12.3. Neuro-fuzzy networks

The final set of technologies is neuro-fuzzy networks, which combine the architecture and learning properties of an NN with the representational advantages of a fuzzy system (Wang et al., 1995, p. 92). They include rule-induction technologies, bordering on rule-based technologies, where membership functions can be defined with more or less precision. The sample data are allowed to determine the boundaries and the extent of those membership functions. Again, these technologies are used where there is an enormous amount of noise and the samples are small.
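One way to picture letting the sample data determine the boundaries and extent of membership functions is sketched below. This is an illustrative assumption, not the authors' method: Gaussian membership functions for fuzzy sets such as "low", "medium" and "high" are fitted directly from sample statistics, which is the kind of data-driven membership tuning a neuro-fuzzy network performs through learning.

import numpy as np

def fit_membership_functions(samples, labels=("low", "medium", "high")):
    # Split the sample range into quantile bands and fit one Gaussian
    # membership function (centre, width) per band from the data in that band
    edges = np.quantile(samples, np.linspace(0.0, 1.0, len(labels) + 1))
    mfs = {}
    for i, name in enumerate(labels):
        band = samples[(samples >= edges[i]) & (samples <= edges[i + 1])]
        centre = band.mean()
        width = band.std() + 1e-6  # avoid zero width on degenerate bands
        mfs[name] = (centre, width)
    return mfs

def membership(x, centre, width):
    # Degree of membership of x in a fuzzy set with a Gaussian membership function
    return np.exp(-0.5 * ((x - centre) / width) ** 2)

# Toy illustration (assumed data) on noisy one-dimensional samples
rng = np.random.default_rng(2)
data = rng.normal(0.0, 1.0, 500)
mfs = fit_membership_functions(data)
for name, (c, w) in mfs.items():
    print(f"{name}: membership(0.0) = {membership(0.0, c, w):.3f}")

In a full neuro-fuzzy system these centres and widths would be further adjusted by NN-style learning rather than fixed from quantiles; the principle of letting the data shape the membership functions, rather than specifying them precisely in advance, is the same.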

13. Model parameter optimization