Methods and data Directory UMM :Data Elmu:jurnal:A:Aquaculture:Vol187.Issue1-2.Jul2000:

regional study in 1990 which concluded that the diseases of aquatic animals and plants are closely linked to the environment and that environmental issues, including fish disease control, must be considered in the broader context of fish farming systems, design, site selection and management. The specific objective of the 1994 study was to assist governments in assessing policy options and formulating policies designed to improve the sustainability of the aquaculture industry. A detailed survey of almost 11,000 shrimp and carp farms was undertaken covering 16 countries and territories in the region. The shrimp survey extended to 2898 extensive, 1022 semi-intensive and 870 intensive shrimp farms.

The survey results show that shrimp disease contributed to significant regional losses. A conservative US$332.2 million per year was estimated as the total loss attributed to shrimp diseases: US$143.3 million to intensive farms, US$111.8 million to semi-intensive farms, and US$77.1 million to extensive farms. The countries involved in the survey all suffered in various degrees from disease problems. For example, the proportion of intensive farms affected by disease (defined as more than 20% stock losses) was high in most countries, ranging from 12% in Malaysia to 100% in China. Semi-intensive and extensive farms were also reporting significant losses due to disease problems (ADB–NACA, 1998).

The survey results also indicate that virtually all countries reported ‘unknown’ as the cause of the shrimp disease problems. As the causes of shrimp disease are poorly understood, research and improved extension activities are needed to properly identify shrimp disease problems, and their prevention and cure. In this paper, we attempt to predict the occurrence of shrimp diseases based on farm site selection, design, and farm management practices.
Prior research on disease prediction has essentially depended upon traditional statistical models with varying degrees of prediction accuracy. Furthermore, the application of these models in sustainable aquaculture development and in controlling environmental deterioration has been very limited. In an attempt to look for a more reliable model, we developed a probabilistic neural network (PNN) to predict shrimp disease outbreaks in Vietnam using the ADB–NACA farm-level data from 480 Vietnamese shrimp farms. We also compared the predictive performance of the PNN against the more traditional logistic regression approach on the same data set.

2. Methods and data

Statistical regression models are the most commonly used techniques for disease prediction. The logistic regression model has emerged as the technique of choice in predicting dichotomous medical outcomes (Tu, 1996). While disease prediction models have been widely used to predict incidence of either pests or pathogens in the field for crop protection and diseases in land animals, disease prediction models applied to aquaculture are almost non-existent. For predicting dichotomous outcomes such as the occurrence of disease, logistic regression has been the most appropriate technique. However, the recent development of artificial neural networks (ANNs) provides a new alternative to logistic regression, particularly in situations where the dependent and independent variables exhibit complex nonlinear relationships.

There are numerous applications of ANNs in the literature ranging from business and finance to agriculture and ecology. The potential of ANNs in predicting dichotomous outcomes compared to logistic regression has also been evaluated in several areas of application. All reported comparisons in the literature seem to show that ANNs outperformed the traditional logistic regression approach. Starrett et al. (1997) used both ANNs and logistic regression to predict the percentage of applied nitrogen leached under turfgrass. Paruelo and Tomasel (1997) compared the performance of ANNs and logistic regression models in predicting ecosystem attributes. Horimoto et al. (1997) evaluated the prediction performance of ANNs, logistic regression, and principal components in classifying microbial defects in milk. In the area of finance, Maher and Sen (1997) compared the prediction accuracy of ANNs and logistic regression in predicting bond ratings. All the cases cited above have demonstrated the superior predictability of ANN models over logistic regression models. A brief discussion of logistic regression and ANNs follows.

2.1. Logistic regression

Logistic regression or logit analysis is a popular statistical modeling technique in which the probability of a dichotomous outcome is related to a set of potential explanatory variables in the form:

$$\log\left[\frac{p}{1-p}\right] = b_0 + b_1 x_1 + b_2 x_2 + \cdots + b_n x_n$$

where $p$ is the probability of the outcome, $b_0$ is the intercept, and $b_1, \ldots, b_n$ are the coefficients associated with each explanatory variable $x_1, \ldots, x_n$. The dependent variable is the logarithm of the odds, which is the logarithm of the ratio of two probabilities: the probability that a disease outbreak will occur divided by the probability that it will not. The logarithm of the odds, $\log[p/(1-p)]$, is related in a linear manner to the potential explanatory variables.

Where there is no available theoretical model, explanatory variables are usually selected through some specific techniques such as backward or forward stepwise regression with different criteria to include or to reject an explanatory variable. Although different techniques might give different regression models, they are often very similar. The maximum-likelihood method is used to estimate the coefficients $b_1, \ldots, b_n$ in the logistic regression. The logistic regression procedure in the SPSS package was used in this analysis (SPSS, 1992).

2.2. Probabilistic neural network (PNN)

ANNs are algorithms patterned after the structure of the human brain (Harston, 1990). In an ANN, processing elements (units analogous to biological neurons) are organized into groups called layers. An ANN includes an input layer, one or more hidden layers between the input and output layers, and an output layer, interconnected in many different ways (Maren, 1990a,b). Data are introduced at the input layer and the ANN’s response to the input data is generated at the output layer. The hidden layers allow the network to generate numerous relationships (mapping functions) between the inputs and outputs so that the desired outputs can be produced using a given set of inputs. Interaction between processing elements occurs along connection paths at different connection strengths called weights. By changing the weight values (through training), an ANN can collectively reproduce the complex overall behavior of a system.

There are several different types of ANN based on their architectures and training (learning) algorithms. Since we have a classification problem, PNN is considered the most appropriate form of ANN and is used in this study (Specht, 1990; Ward Systems Group, 1995).

PNN is a feedforward neural network developed by Specht (1990), in which the response to an input pattern is processed from one layer to the next with no feedback paths to previous layer(s). To provide a general solution to pattern classification, PNN is based upon an approach developed in statistics called Bayesian classifiers. Bayesian classifiers take into account the relative likelihood of events and use a priori information to improve prediction (Specht, 1996). They provide an optimum approach to classification in terms of minimizing the expected risk of wrongly classifying an object, and the estimator gets closer to the true underlying class density functions as the number of training samples increases. Since the underlying class density function is unknown, PNN relies on a class of probability density function (PDF) estimators, developed by Parzen and extended by Cacoullos, which asymptotically approach the underlying class density as long as it is continuous (Specht, 1996).

A PNN often has three layers: input, pattern (hidden), and output layers. The number of elements in the input layer is equal to the number of separable parameters needed to describe the objects to be classified.
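
The Parzen-based Bayesian classification described above can be illustrated with a minimal Python sketch. It is not the NeuroShell2 implementation used in this study: the Gaussian kernel, the smoothing parameter `sigma`, the function names, and the two toy features are all assumptions made for illustration. Each stored training vector acts as one pattern unit, the kernel responses are averaged per class to estimate the class density, and the class with the largest prior-weighted density is chosen.

```python
import math


def parzen_density(x, patterns, sigma):
    """Parzen estimate of a class density at x, up to a constant factor.

    Averages Gaussian kernels centred on each training pattern. The
    normalising constant is omitted because it is identical for every
    class when all classes share the same sigma (an assumption here).
    """
    if not patterns:
        return 0.0
    total = 0.0
    for p in patterns:
        sq_dist = sum((xi - pi) ** 2 for xi, pi in zip(x, p))
        total += math.exp(-sq_dist / (2.0 * sigma ** 2))
    return total / len(patterns)


def pnn_classify(x, training, sigma=0.5, priors=None):
    """Return the class with the largest prior-weighted density at x.

    training: dict mapping class label -> list of feature vectors.
    priors:   optional dict of a priori class probabilities; equal
              priors are assumed when it is omitted.
    """
    scores = {}
    for label, patterns in training.items():
        prior = priors[label] if priors else 1.0
        scores[label] = prior * parzen_density(x, patterns, sigma)
    return max(scores, key=scores.get)


# Hypothetical, already-normalised toy data with two features per farm.
training = {
    "outbreak": [[0.9, 0.8], [0.8, 0.9], [1.0, 0.7]],
    "no_outbreak": [[0.1, 0.2], [0.2, 0.1], [0.0, 0.3]],
}

print(pnn_classify([0.85, 0.75], training))  # outbreak
print(pnn_classify([0.15, 0.25], training))  # no_outbreak
```

In this sketch, "training" amounts to storing the patterns; the only tunable quantity is `sigma`, which controls how smooth the estimated densities are.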
In our case, the number of input elements corresponds to the 68 variables describing the farm site selection and design, and farm management practices of the 480 Vietnamese shrimp farms. A scale function is often used to normalize the input vector, if the inputs are not already normalized before they enter the network. In the pattern layer, the training set is organized such that each input vector is represented by an individual processing element. The pattern layer essentially comprises the Bayesian classifier, in which the unknown underlying class density functions are estimated through a non-parametric approach using the PDF estimators described above. The output layer has as many processing elements as there are classes to be classified. In our case, the output layer has only two classes: farms with disease outbreak and farms with no disease outbreak. More details on PNN can be found in Specht (1990, 1996). The PNN used in this analysis is from the ‘‘NeuroShell2’’ package developed by the Ward Systems Group (1995).

2.3. Data

The data used in this paper are part of a large-scale survey of almost 5000 shrimp farms in 16 countries conducted in 1994/1995 by NACA and ADB under a regional technical assistance program. Detailed on-farm surveys were conducted in each country, assisted by a common questionnaire. The shrimp farming questionnaire has questions grouped into five sections: site description, farming system, problem analysis, economics, and social factors. The site description section includes information about age of farm, nature of aquaculture activities, type of land use, soil type, operation, water source, and site-selecting considerations. In the farming system section, information on shrimp species and farming techniques was collected. The third section identified problems related to water and sediment, diseases, and their consequences. The economic section gathered information about inputs, costs, revenue, and production and profit trends.
The final section identified social aspects of conflict and resolution.

In this analysis, 480 shrimp farms in Vietnam, including 86 semi-intensive and 394 extensive farms, are used. With the purpose of analyzing the cause–effect relationship of environmental and management factors tied to aquaculture disease outbreaks, only information in the first three sections of the questionnaire (site description, farming system, and problem analysis) is used. Data were randomly divided into two sets: an estimation set with 369 observations (about three-quarters of the whole data set) used to develop the logistic regression model and the PNN model, and a validation set with 111 observations. The partition of the data was arbitrary, balancing the need to have enough data for parameter estimation in the training data set while maintaining a reasonable number of observations for validation.
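
The random partition described above can be sketched in a few lines of Python. This is only an illustration of a 369/111 split: the farm identifiers are placeholders, and the function name and fixed seed are assumptions, since the paper does not state how the random division was carried out.

```python
import random


def split_estimation_validation(records, n_estimation, seed=0):
    """Randomly split records into an estimation set and a validation set.

    The seed is fixed only so that the example is reproducible; it is an
    assumption of this sketch, not a detail taken from the study.
    """
    shuffled = list(records)          # copy so the input is left untouched
    random.Random(seed).shuffle(shuffled)
    return shuffled[:n_estimation], shuffled[n_estimation:]


# Placeholder identifiers standing in for the 480 Vietnamese shrimp farms.
farm_ids = list(range(480))
estimation, validation = split_estimation_validation(farm_ids, 369)

print(len(estimation), len(validation))  # 369 111
```

Both models would then be fitted on `estimation` only, with `validation` held back so that the two approaches are compared on farms neither model has seen.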

3. Results