
Recall that only six variables were chosen in the final logistic regression model. These same six variables were used to build another PNN model. Table 4 shows the classification accuracy of this PNN model on the estimation and validation subsets.

4. Discussion

Results show that the PNN model using the full set of input explanatory variables has better predictive power than the final logistic regression model with six explanatory variables (Table 2). However, if the same six input variables as in the final logistic regression model are used, results from the PNN are worse than those of the logistic regression model (Table 3). With 62 more variables, the prediction accuracy of the full PNN model improves by only 8.13 percentage points in the estimation subset and 6.31 percentage points in the validation subset. One point that is often used to explain the better predictive power of PNN is its ability to detect all possible interactions between explanatory variables (Tu, 1996). It is interesting to note that forcing all input variables into the logistic regression model yields a prediction accuracy of 90.24% in the estimation subset and 74.77% in the validation subset. While the prediction accuracy for the estimation subset is very similar to that of the full PNN model including all input variables (90.24% vs. 91.06%), the prediction accuracy for the validation subset is significantly lower than that of the full PNN model (74.77% vs. 86.49%). In fact, the full logistic regression model including all explanatory variables performs even worse than the final logistic regression model with six input variables for the validation subset (74.77% vs. 80.18%). This is probably due to over-fitting of the full logistic regression model, whereby the prediction accuracy for the out-of-sample validation subset is significantly worse than that for the in-sample estimation subset. Furthermore, most of the estimated coefficients in this full regression model are not statistically significantly different from zero and exhibit a high degree of multicollinearity. Hence, the model would not provide meaningful parameter estimates.
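The over-fitting argument above rests on the gap between estimation and validation accuracy. As a minimal sketch, the following Python snippet recomputes those gaps from the accuracies quoted in this section. The dictionary layout and the function name `generalization_gap` are illustrative choices, not part of the original study.

```python
# Accuracies (%) as quoted in the text: (estimation subset, validation subset).
reported = {
    "full PNN": (91.06, 86.49),
    "full logistic regression": (90.24, 74.77),
}


def generalization_gap(estimation_acc, validation_acc):
    """In-sample minus out-of-sample accuracy, in percentage points."""
    return round(estimation_acc - validation_acc, 2)


gaps = {name: generalization_gap(*acc) for name, acc in reported.items()}

for name, gap in gaps.items():
    print(f"{name}: gap = {gap:.2f} percentage points")

# Validation accuracy of the final six-variable logistic model, as quoted.
final_logistic_validation = 80.18
pnn_gain_validation = round(reported["full PNN"][1] - final_logistic_validation, 2)
print(f"full PNN vs. final logistic (validation): +{pnn_gain_validation:.2f}")
```

The much larger gap for the full logistic regression model is what the text interprets as over-fitting.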
The better out-of-sample performance of the full PNN model over the full logistic regression model in this case may be explained by the fact that the disease prediction problem at hand exhibits some degree of nonlinearity when all variables are considered. PNN is often superior to conventional statistical tools in detecting complex nonlinear relationships between independent and dependent variables.
A benefit of the logistic regression model is that its parameters are transparent, which aids identification of the factors that affect the dependent variable. In this problem, all input variables retained in the logistic regression model appear to be explainable on a biological basis, although they were never tested statistically. For example, polyculture in shrimp farming reduces the chance of shrimp disease, whereas discharging water into the intake/drainage canal increases the chance of disease. Furthermore, from the parameter values in the logistic regression model, one can estimate how the probability of disease occurrence changes when an input variable increases or decreases by one unit. On the other hand, the PNN model is a black box. Although there are contribution factors (or weights) associated with each input variable in the PNN model, they are generally not very useful in explaining the level of contribution of each variable. In the optimization process of the PNN, several scale functions were usually tested in the input layer to choose the one giving the best prediction. In our case, prediction accuracy from models with different scale functions differs by only a couple of percentage points; the weights of input variables, however, change significantly from model to model. One input variable can have a very high weight in one model but a very low weight in another, suggesting that weights in PNN are not reliable in explaining the contribution of input variables.
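To illustrate how logistic regression parameters translate into changes in disease probability, the sketch below uses entirely hypothetical coefficients, since the fitted values are not given in this section. The variable names `polyculture` and `discharge_to_intake_canal` are placeholders echoing the examples above, with signs chosen only to match the directions described.

```python
import math


def disease_probability(intercept, coefs, x):
    """Logistic model: p = 1 / (1 + exp(-(b0 + sum(b_i * x_i))))."""
    z = intercept + sum(coefs[name] * x[name] for name in coefs)
    return 1.0 / (1.0 + math.exp(-z))


# HYPOTHETICAL values for illustration only; not the paper's estimates.
intercept = -0.5
coefs = {"polyculture": -0.8, "discharge_to_intake_canal": 1.1}

baseline = {"polyculture": 0, "discharge_to_intake_canal": 0}
with_polyculture = dict(baseline, polyculture=1)

p0 = disease_probability(intercept, coefs, baseline)
p1 = disease_probability(intercept, coefs, with_polyculture)

# exp(b_i) is the odds ratio for a one-unit increase in variable i.
odds_ratio = math.exp(coefs["polyculture"])

print(f"P(disease) baseline:         {p0:.3f}")
print(f"P(disease) with polyculture: {p1:.3f}")
print(f"odds ratio for polyculture:  {odds_ratio:.3f}")
```

A negative coefficient lowers the predicted probability and gives an odds ratio below one. No comparably direct reading is available from the PNN weights discussed above.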
Besides the methodological issues described above, PNN model development requires much more time and greater computational resources than conventional statistical models.
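The source does not describe its PNN implementation. As a rough sketch of the general technique only, a basic PNN can be written as a Gaussian Parzen-window classifier in which every stored training pattern is evaluated for each prediction. The function name `pnn_predict`, the smoothing parameter `sigma`, the equal class priors and the toy data below are all assumptions made for illustration. One plausible contributor to the cost noted above (an assumption, not a statement from the study) is that this per-pattern work is repeated for each candidate smoothing or scaling setting during model optimization.

```python
import math


def pnn_predict(train_x, train_y, x, sigma=0.5):
    """Classify x with a basic PNN (Gaussian Parzen-window estimate per class).

    Every stored training pattern contributes one kernel evaluation,
    so each prediction costs O(n_patterns * n_features).
    Equal class priors are assumed.
    """
    sums, counts = {}, {}
    for pattern, label in zip(train_x, train_y):
        sq_dist = sum((a - b) ** 2 for a, b in zip(pattern, x))
        kernel = math.exp(-sq_dist / (2.0 * sigma ** 2))
        sums[label] = sums.get(label, 0.0) + kernel
        counts[label] = counts.get(label, 0) + 1
    # Average kernel response per class; the largest average wins.
    scores = {label: sums[label] / counts[label] for label in sums}
    return max(scores, key=scores.get), scores


# Toy, made-up data: two features, two classes (0 = healthy, 1 = diseased).
train_x = [(0.1, 0.2), (0.2, 0.1), (0.9, 0.8), (0.8, 0.9)]
train_y = [0, 0, 1, 1]

label, scores = pnn_predict(train_x, train_y, (0.85, 0.85))
print(label, scores)
```

Unlike a fitted logistic regression, this classifier has no compact set of coefficients: the training patterns themselves are the model.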

5. Conclusion