
406 B.D. Baker, C.E. Richards / Economics of Education Review 18 (1999) 405–415

eral. Yet, future adoption of these methods in public finance depends on our ability to establish standards for neural network application, compared with conventional methods and applied to public finance forecasting problems of interest. This study explores the potential value of using neural network methods alongside the more conventional regression methods used by the National Center for Education Statistics for forecasting educational spending. A reasonable expectation, validated in other forecasting studies in public finance (Hansen & Nelson, 1997), is that neural networks, by way of flexible, non-linear estimation, are likely to reveal changes, or inflection points, in the general trend of education spending.

2. Neural networks

The primary objective of neural networks is predictive modeling: that is, the accurate prediction of non-sample data using models estimated to sample data. With cross-sectional data, this typically means the accurate prediction of outcome measures (the dependent variable) of one data set generated by a given process, by providing input measures (independent variables) to a network (a deterministic non-linear regression equation) trained (estimated) to a separate data set generated by the same process. With time-series data, the objective is typically forecasting, given a sample set of historical time-series realizations. This is a departure from traditional econometric modeling, where a theoretically appropriate model is specified, then estimated using the full sample for purposes of hypothesis testing, the primary objective being inference.

Identification of the best predicting model begins with subdividing the sample data set into two components: the training set and the test set, a hypothetical set of non-sample data extracted from the sample, against which prediction accuracy of preliminary models can be tested. Typically, the test set consists of up to 20% of the sample (Neuroshell 2 user's manual, WSG, 1995, p. 101). For time-series modeling, the test set consists of the most recent realizations. The objective is to identify the model which, when estimated to the training set, most accurately predicts the outcome measures of the test set, as measured by absolute error or prediction squared error. It is then expected that the same model will best predict non-sample data, sometimes referred to as the production set (WSG, 1995, p. 101).

Two methods are typically used for estimating the deterministic neural network model: (1) iterative convergent algorithms and (2) genetic algorithms.
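The model-selection procedure described above can be sketched as follows. The 20% holdout and the use of the most recent realizations as the test set follow the text; the `fit`/`predict` interface and the naive `DriftModel` candidate are hypothetical illustrations, not the authors' models.

```python
import numpy as np

def split_time_series(series, test_frac=0.20):
    """Hold out the most recent realizations as the test set."""
    n_test = max(1, int(len(series) * test_frac))
    return series[:-n_test], series[-n_test:]

class DriftModel:
    """Illustrative linear-trend forecaster; `damp` scales the trend."""
    def __init__(self, damp=1.0):
        self.damp = damp
    def fit(self, train):
        self.last = train[-1]
        self.slope = (train[-1] - train[0]) / (len(train) - 1)
    def predict(self, horizon):
        return self.last + self.damp * self.slope * np.arange(1, horizon + 1)

def select_model(candidates, train, test):
    """Estimate each candidate to the training set; keep the one with
    the lowest mean absolute error on the held-out test set."""
    best, best_mae = None, float("inf")
    for model in candidates:
        model.fit(train)
        mae = np.mean(np.abs(model.predict(len(test)) - test))
        if mae < best_mae:
            best, best_mae = model, mae
    return best, best_mae
```

The winning model would then be re-used unchanged to predict the production set.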
Superficially, the iterative, convergent method begins by randomly applying a matrix of coefficients (connection weights) to the relationships from each independent variable to the dependent variable of the training set. The weights are then used to predict the outcome measure of the test set. Prediction error is assessed, and either a new set of random weights is generated, or learning rate and momentum terms dictate that the network incrementally adjust the weights based on the direction of the error term from the previous iteration (WSG, 1995, pp. 8, 52, 119). The process continues until several iterations pass without further improvement of test set error.

The genetic algorithm approach begins by randomly generating pools of equations. Again superficially explained, initial equations are estimated to the training set, and prediction accuracy of the outcome measure is assessed using the test set to identify a pool of the "most fit" equations. These equations are then hybridized, or randomly recombined, to create the next generation of equations. That is, parameters from the surviving population of equations may be combined or excluded to form new equations, as if they were genetic traits. This process, like the iterative, convergent application of weights, continues until no further improvement in predicting the outcome measure of the test set can be achieved.

A common concern regarding flexible non-linear models is the tendency to "overfit" sample data (Murphy, Fogler & Koehler, 1994). It has been shown, however, that while iterative or genetic, selective methods can generate complex non-linear equations that asymptotically fit the training set, the prediction error curve with respect to non-linear complexity for the test set is U-shaped (Murphy et al., 1994) or V-shaped (Farlow, 1984); that is, beyond an identifiable point, additional complexity erodes, rather than improves, prediction accuracy of the test set.
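The iterative, convergent idea can be sketched minimally as gradient descent with learning rate and momentum terms on a simple linear model. The stopping rule (quit after several iterations without further improvement) follows the text; the specific update rule, constants, and function name are illustrative assumptions rather than the Neuroshell algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)

def iterative_fit(X, y, lr=0.1, momentum=0.9, patience=50):
    """Start from random weights; repeatedly nudge them against the
    direction of the error, with a momentum term carrying over part of
    the previous adjustment. Stop once `patience` consecutive
    iterations pass without improvement."""
    w = rng.normal(size=X.shape[1])
    velocity = np.zeros_like(w)
    best_err, stale = np.inf, 0
    while stale < patience:
        grad = X.T @ (X @ w - y) / len(y)   # direction of the error term
        velocity = momentum * velocity - lr * grad
        w = w + velocity                     # incremental weight adjustment
        err = np.mean((X @ w - y) ** 2)
        if err < best_err - 1e-12:
            best_err, stale = err, 0
        else:
            stale += 1
    return w, best_err
```

A genetic algorithm would replace the incremental adjustment with selection and recombination over a pool of candidate weight vectors, but the same test-set stopping criterion applies.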
Backpropagation algorithms are among the oldest, and until recently, most popular neural networks (Caudill, 1995, p. 5). Backpropagation primarily describes an estimation procedure. Apply the previously discussed iterative estimation method to the regression equation

Y = b_1 X_1 + e

Iterative estimation of the equation involves selecting the initial value for b at random, evaluating the prediction error of Y, and incrementally adjusting b until we have reduced prediction error to the greatest extent possible. This description represents a "single output, feed forward system with no hidden layer and with a linear activation function" (McMenamin, 1997, p. 17). The typical backpropagation neural network consists of three layers, and can be represented in regression terms as

Y = F[H_1(X), H_2(X), …, H_N(X)] + e

where Y is the output, which is a function of X that has been rescaled through a series of "hidden" layer functions H (McMenamin, 1997). In this specification the same X may or will be represented multiple times in the hidden layer, once in each hidden layer neuron (rescaled vectors of inputs). Where more than one X exists, the same applies for all Xs. While for inferential purposes this replication results in irresolvable collinearities, in the three-layer backpropagation network it allows alternate weighting schemes to be applied to the same inputs, creating the possibility of different sensitivities of the outcome measure at different levels of each input, resulting in heightened prediction accuracy. The rescaling procedure, the activation function or "hidden layer transfer function" (McMenamin, 1997), sometimes referred to as squashing (Rao & Rao, 1993), typically involves rescaling all inputs to a sigmoid distribution using either a logistic or hyperbolic tangent function.
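The three-layer specification above can be sketched as a single forward pass. The logistic squashing of the hidden layer follows the text; the weight shapes and function names are assumptions for illustration.

```python
import numpy as np

def logistic(z):
    """Sigmoid 'squashing' activation applied in the hidden layer."""
    return 1.0 / (1.0 + np.exp(-z))

def forward(X, W_hidden, b_hidden, w_out, b_out):
    """One pass through a three-layer network: each hidden neuron
    receives the same inputs X under its own weights (H_1(X) ... H_N(X)),
    and a single linear output combines the rescaled values."""
    H = logistic(X @ W_hidden + b_hidden)   # hidden layer: H_1(X), ..., H_N(X)
    return H @ w_out + b_out                # output: Y = F[H_1(X), ..., H_N(X)]
```

Because every hidden neuron sees the same inputs, the columns of `W_hidden` are exactly the "alternate weighting schemes" the text describes; backpropagation estimates them iteratively rather than in closed form.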
Backpropagation has proven an effective tool for both time-series prediction (Hansen & Nelson, 1997; Lachtermacher & Fuller, 1995) and cross-sectional prediction (Buchman et al., 1994; Odom & Sharda, 1994; Worzala et al., 1995).

Two alternatives used in addition to backpropagation in this study are (1) Generalized Regression neural networks (GRNN) (Specht, 1991) and (2) Group Method of Data Handling (GMDH) polynomial neural networks (Farlow, 1984). Both involve identifying best predicting non-linear regression models. An advantage of Specht's GRNN is removal of the necessity to specify a functional form, by using the observed probability density function (pdf) of the data (Caudill, 1995, p. 47). GRNN interpolates the relationships between inputs, and between inputs and outcomes, by applying smoothing parameters to moderate the degree of non-linearity in the relationships and to serve as a sensitivity measure of the non-linear response of the outcome to changes in the inputs. Smoothing parameters typically vary among model inputs, with the optimal combination of smoothing parameters being selected by (1) a holdout method^1 or (2) a genetic adaptive method.^2 GRNN has been shown effective for cross-sectional prediction of binary outcomes (Buchman et al., 1994) and recommended for time-series prediction, particularly for use with sparse data and data widely varying in scale (Caudill, 1995, p. 47).

A.G. Ivakhnenko (1966, in Farlow, 1984) proposed GMDH for identifying a best prediction polynomial via a Kolmogorov–Gabor specification.^3 GMDH polynomial fitting differs from backpropagation and GRNN in that no training set is specified. Rather, a measure referred to as FCPSE (Full Complexity Prediction Squared Error) is used. FCPSE consists of Training Squared Error^4 combined with an overfitting penalty similar to that used for the PSE (Prediction Squared Error),^5 but including additional penalty measures for model complexity.^6 Also unlike backpropagation, GMDH generally applies linear scaling to inputs.^7

^1 Described by Specht (1991) but not used in this study; due to space constraints we opt not to discuss this method further.
^2 Recommended for identifying best predicting models where "input variables are of different types and some may have more of an impact on predicting the output than others" using Neuroshell 2 (WSG, 1995, p. 138).
^3 y = a_0 + Σ_{i=1}^{M} a_i x_i + Σ_{i=1}^{M} Σ_{j=1}^{M} a_{ij} x_i x_j + Σ_{i=1}^{M} Σ_{j=1}^{M} Σ_{k=1}^{M} a_{ijk} x_i x_j x_k, where X = (x_1, x_2, …, x_m) is the vector of inputs and A = (a_1, a_2, …, a_m) is the vector of coefficients or weights (Liao, 1992).
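GRNN-style interpolation can be sketched under the standard Specht formulation, in which each prediction is a Gaussian-kernel weighted average of training outcomes and per-input smoothing parameters moderate the degree of non-linearity. The function and parameter names here are hypothetical; `sigma` denotes the smoothing parameters described in the text.

```python
import numpy as np

def grnn_predict(X_train, y_train, X_new, sigma):
    """Predict each new point as a kernel-weighted average of training
    outcomes. Small sigma -> highly local, non-linear interpolation;
    large sigma -> predictions smoothed toward the overall mean."""
    preds = []
    for x in np.atleast_2d(X_new):
        # squared distance per input, scaled by that input's smoothing parameter
        d2 = np.sum(((X_train - x) / sigma) ** 2, axis=1)
        w = np.exp(-d2 / 2.0)
        preds.append(np.sum(w * y_train) / np.sum(w))
    return np.array(preds)
```

Because `sigma` is a vector, inputs with more impact on the output can be given smaller smoothing parameters, which is what the holdout or genetic adaptive search over smoothing parameters selects.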

3. Methods