13. Model parameter optimization

Once the particular type of technology that will be used to model the problem is determined, the next step is to determine the parameters of that model. As mentioned previously, since there are no analytical closed-form solutions to many of these problems, an incremental numerical optimization technique must be used. Illustrative sketches of the techniques discussed below are collected at the end of this section.

13.1. Gradient descent — continuous optimization

The workhorse for networks is back-propagation, which simply measures the gradient of the error with respect to each one of the parameters and back-propagates that error through the nonlinearities, so that each parameter can be adjusted incrementally to reduce the error, the goal being the minimum at which the value of the error function is smallest, that is, the global minimum. In this context, the error minimization process can be conceptualized (Bishop, 1995, p. 254) by envisioning the error function as an error surface sitting above weight space. Second-order gradient technologies can be used, just as in any other optimization problem. 36

The gated expert, mentioned above, employs an expectation-maximization (EM) technique (Couvreur, 1997), which uses a maximum likelihood approach for optimizing two aspects of the problem at once: which expert to use for which domain and, simultaneously, the internal parameters associated with each expert network. The EM optimization technique works well in that context.

13.2. Genetic evolution — discontinuous optimization

Another class of optimization technologies, based on GAs, is well suited for error surfaces that are very convoluted. As described above, a GA typically sets up components of a model that compete on the basis of fitness and cooperate in accordance with genetic operations. A recombination technique is used to generate new generations of the components that perform well, while the ones that do not perform well fall by the wayside. The power here is that the optimization process starts from multiple points on the error surface and moves down the gradients; many of the components that get stuck in suboptimal local minima are eliminated, while the ones that achieve more global performance persist. So, it has considerable power when dealing with very noisy error surfaces.

13.3. Unsupervised clustering

As discussed previously, some of the technologies used for clustering and compressing multidimensional data into a lower-dimensional space use unsupervised learning, which attempts to cluster the underlying data adaptively. Fruitful methodologies include the Hebbian-covariance networks (Domany et al., 1996, p. 61), which are based on the proposition that synaptic change depends on the covariance of post-synaptic and presynaptic activity, and Kohonen's self-organizing map, whereby each input pattern leads to a single, localized cluster of activity.

36 Momentum is a network attempt at capturing second-order information about the gradient.
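As a concrete illustration of the gradient descent discussion in Section 13.1, the following Python/NumPy sketch trains a single-hidden-layer network by back-propagation with a momentum term. The network size, learning rate, momentum coefficient, and synthetic data are arbitrary choices made for illustration; they are not taken from the text.

import numpy as np

rng = np.random.default_rng(0)

# Synthetic regression data (illustrative only).
X = rng.uniform(-1.0, 1.0, size=(200, 2))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] ** 2

# One hidden layer with a tanh nonlinearity.
n_hidden = 8
W1 = rng.normal(scale=0.5, size=(2, n_hidden))
b1 = np.zeros(n_hidden)
W2 = rng.normal(scale=0.5, size=n_hidden)
b2 = 0.0

lr, mom = 0.05, 0.9                      # learning rate and momentum coefficient
vel = [np.zeros_like(W1), np.zeros_like(b1), np.zeros_like(W2), 0.0]

for epoch in range(2000):
    # Forward pass.
    h = np.tanh(X @ W1 + b1)             # hidden activations
    pred = h @ W2 + b2
    err = pred - y                       # derivative of 0.5*err^2 w.r.t. pred

    # Back-propagate the error through the nonlinearity.
    gW2 = h.T @ err / len(X)
    gb2 = err.mean()
    dh = np.outer(err, W2) * (1.0 - h ** 2)
    gW1 = X.T @ dh / len(X)
    gb1 = dh.mean(axis=0)

    # Momentum update: the velocity accumulates a running average of gradients,
    # a crude way of capturing second-order information (see footnote 36).
    grads = [gW1, gb1, gW2, gb2]
    vel = [mom * v - lr * g for v, g in zip(vel, grads)]
    W1, b1, W2, b2 = W1 + vel[0], b1 + vel[1], W2 + vel[2], b2 + vel[3]

print("final mean squared error:",
      np.mean((np.tanh(X @ W1 + b1) @ W2 + b2 - y) ** 2))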
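The EM approach to gated experts can be illustrated, in a much reduced form, by a mixture of two linear experts fitted to synthetic data. Here the gate is a simple set of mixing proportions rather than the input-dependent gating network of the actual gated-experts architecture, and the data, number of experts, and iteration count are assumptions made for the sketch.

import numpy as np

rng = np.random.default_rng(1)

# Synthetic data drawn from two different linear regimes (illustrative only).
x = rng.uniform(0.0, 1.0, 300)
regime = (x > 0.5).astype(int)
y = np.where(regime == 0, 2.0 * x, 1.0 - x) + rng.normal(scale=0.05, size=x.size)

K = 2                                        # number of experts (assumed)
a = rng.normal(size=K)                       # expert slopes
b = rng.normal(size=K)                       # expert intercepts
pi = np.full(K, 1.0 / K)                     # mixing proportions (the "gate")
sigma2 = 1.0                                 # shared noise variance

A = np.column_stack([x, np.ones_like(x)])    # design matrix for the linear experts

for it in range(50):
    # E-step: responsibility of each expert for each observation.
    resid = y[:, None] - (x[:, None] * a + b)                    # shape (n, K)
    log_p = (-0.5 * resid ** 2 / sigma2
             - 0.5 * np.log(2 * np.pi * sigma2) + np.log(pi))
    log_p -= log_p.max(axis=1, keepdims=True)                    # numerical stability
    r = np.exp(log_p)
    r /= r.sum(axis=1, keepdims=True)

    # M-step: weighted least squares per expert, then gate and variance updates.
    for k in range(K):
        sw = np.sqrt(r[:, k] + 1e-12)
        coef, *_ = np.linalg.lstsq(A * sw[:, None], y * sw, rcond=None)
        a[k], b[k] = coef
    pi = r.mean(axis=0)
    resid = y[:, None] - (x[:, None] * a + b)
    sigma2 = np.sum(r * resid ** 2) / x.size

print("slopes:", a, "intercepts:", b, "gate:", pi)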
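The genetic evolution approach of Section 13.2 is sketched below for a real-valued parameter vector evaluated on a deliberately multimodal error surface. The surface, population size, and the particular selection, crossover, and mutation operators are illustrative choices, not those of the authors.

import numpy as np

rng = np.random.default_rng(2)

def error_surface(w):
    # A convoluted, multimodal error surface (illustrative stand-in).
    return np.sum(w ** 2 - 3.0 * np.cos(3.0 * w), axis=-1)

pop_size, n_params, n_gens = 60, 5, 200
pop = rng.uniform(-4.0, 4.0, size=(pop_size, n_params))    # multiple starting points

for gen in range(n_gens):
    errors = error_surface(pop)
    fitness = -errors                                      # lower error = fitter

    # Selection: tournaments between random pairs, keeping the fitter member.
    i, j = rng.integers(pop_size, size=(2, pop_size))
    parents = np.where((fitness[i] > fitness[j])[:, None], pop[i], pop[j])

    # Recombination: uniform crossover between randomly paired parents.
    mates = parents[rng.permutation(pop_size)]
    mask = rng.random(parents.shape) < 0.5
    children = np.where(mask, parents, mates)

    # Mutation: small Gaussian perturbations.
    children += rng.normal(scale=0.1, size=children.shape)

    # Elitism: the best individual found so far persists.
    children[0] = pop[np.argmin(errors)]
    pop = children

best = pop[np.argmin(error_surface(pop))]
print("best parameters:", best, "error:", error_surface(best))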
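Finally, a minimal sketch of Kohonen's self-organizing map from Section 13.3: a small two-dimensional grid of codebook vectors is trained on synthetic three-dimensional inputs so that each input activates a single, localized neighborhood of the map. The grid size, learning-rate schedule, and data are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(3)

# Synthetic three-dimensional inputs to be compressed onto a 2-D map (illustrative).
data = rng.normal(size=(1000, 3))

grid_h, grid_w = 6, 6
weights = rng.normal(size=(grid_h, grid_w, 3))             # codebook vectors
rows, cols = np.indices((grid_h, grid_w))

n_steps = 5000
for t in range(n_steps):
    x = data[rng.integers(len(data))]

    # Best-matching unit: the node whose codebook vector is closest to the input.
    dists = np.sum((weights - x) ** 2, axis=-1)
    bmu = np.unravel_index(np.argmin(dists), dists.shape)

    # Learning rate and neighborhood radius both shrink over time.
    frac = 1.0 - t / n_steps
    lr = 0.5 * frac
    radius = 1.0 + 2.0 * frac

    # Gaussian neighborhood around the BMU: a single localized cluster of activity.
    grid_dist2 = (rows - bmu[0]) ** 2 + (cols - bmu[1]) ** 2
    influence = np.exp(-grid_dist2 / (2.0 * radius ** 2))

    # Move nearby codebook vectors toward the input.
    weights += lr * influence[..., None] * (x - weights)

# Each input can now be summarized by the grid coordinates of its BMU.
example = data[0]
bmu = np.unravel_index(np.argmin(np.sum((weights - example) ** 2, axis=-1)),
                       (grid_h, grid_w))
print("input", example, "maps to grid node", bmu)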

14. Benchmarking and model validation