species, and farming techniques were collected. The third section identified problems related to water and sediment, diseases, and their consequences. The economic section
gathered information about inputs, costs, revenue, and production and profit trends. The final section identified social aspects of conflict and resolution. This analysis uses data from 480 shrimp farms in Vietnam, comprising 86 semi-intensive and 394 extensive farms. Because the purpose is to analyze the cause–effect relationship between environmental and management factors and aquaculture disease outbreaks, only information from the first three sections of the questionnaire (site description, farming system, and problem analysis) is used.
Data were randomly divided into two sets: an estimation set with 369 observations (about three-quarters of the whole data set), used to develop the logistic regression model and the PNN model, and a validation set with 111 observations. The partition of the data was arbitrary, balancing the need to have enough data for parameter estimation in the training data set while maintaining a reasonable number of observations for validation.
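The random partition described above can be sketched as follows. This is a minimal illustration only: the function name, the use of row indices in place of farm records, and the seed are assumptions, not the procedure actually used in the study.

```python
import random

def split_estimation_validation(n_total=480, n_estimation=369, seed=0):
    """Randomly split row indices into an estimation set and a validation set.

    The seed is a hypothetical choice made only so the sketch is repeatable.
    """
    indices = list(range(n_total))
    rng = random.Random(seed)
    rng.shuffle(indices)  # random order, so each farm is equally likely to land in either set
    estimation = sorted(indices[:n_estimation])
    validation = sorted(indices[n_estimation:])
    return estimation, validation

estimation_idx, validation_idx = split_estimation_validation()
print(len(estimation_idx), len(validation_idx))
```

With the sizes reported in the text, the estimation share is 369/480, or roughly 77% of the farms.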
3. Results
3.1. Logistic regression
The logistic regression was estimated using both forward and backward stepwise procedures with 68 variables, consisting of 16 continuous and 52 categorical variables (a complete listing of all the variables can be found in the Appendix). Each categorical variable with n attributes was converted into n − 1 binary variables for estimation, and these were forced into or out of the regression collectively in one step. The Wald statistic was used to select variables to enter and leave the regression. The significance level was set at 0.05 for entry and at 0.10 for deletion. The backward procedure is generally considered preferable, since the forward approach might exclude some important variables from the model. However, in our case, the results of the backward and forward approaches were similar in terms of the variables selected and the predictive accuracy. In the end, we decided to use the results of the forward approach, as several variables selected with the backward approach exhibited the wrong signs and were not easily interpretable. Six variables were chosen in the final model (Table 1). All of them are categorical variables.
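The conversion of a categorical variable with n attributes into n − 1 binary variables can be illustrated with WATER-SOURCE, whose five categories and reference category ('Saltwater creek') are taken from Table 1. The helper name `dummy_code` and the three example farms are hypothetical; this is a sketch of the coding scheme, not the authors' software.

```python
def dummy_code(values, categories, reference):
    """Return n - 1 indicator columns for a categorical variable with n categories.

    The reference category is coded as all zeros, so each coefficient is
    interpreted relative to it.
    """
    kept = [c for c in categories if c != reference]
    return [[1 if v == c else 0 for c in kept] for v in values]

WATER_SOURCES = ["Saltwater creek", "Estuary/River", "Direct-from-sea",
                 "Canal-from-sea", "Other"]

# Three hypothetical farms, used only to show the coding.
rows = dummy_code(["Estuary/River", "Saltwater creek", "Other"],
                  WATER_SOURCES, reference="Saltwater creek")
for row in rows:
    print(row)
```

Because the four indicators describe a single underlying variable, they enter or leave the stepwise regression together, as the text notes.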
The model χ² value of 204.42 is statistically significant (P = 0.0000), implying that the estimated model, containing the constant and the six explanatory variables, fits the data. In other words, there is a significant relationship between the logarithm of the odds of a disease occurrence and the explanatory variables. The coefficients of all six selected variables are significant at the 1% level, except for the variable WATER-SOURCE, for which there are no significant differences between water coming directly from the sea or through a canal and water from a saltwater creek. The parameter estimates also suggest that, as expected, the effects of POLYCULTURE, DRY POND, and SITE-SELECTION on the logarithm of the odds of a disease occurrence are negative, and the

Table 1
Results of the logistic regression model (a)

Variables (b)         Estimate of β   Standard error   P-value   Exp(β)   Probability
POLYCULTURE              -1.0961          0.3055        0.0003    0.3342     0.250
DRY POND                 -1.0393          0.3957        0.0086    0.3537     0.261
I/D-CANAL                 1.3984          0.3560        0.0001    4.0486     0.802
WATER-SOURCE:                                           0.0002
  Estuary/River          -2.1598          0.5288        0.0000    0.1154     0.103
  Direct-from-sea         0.0606          0.6094        0.9208    1.0625     0.515
  Canal-from-sea         -0.0479          0.3884        0.9018    0.9532     0.488
  Other                  -1.3885          0.5887        0.0184    0.2495     0.200
SITE-SELECTION           -1.6936          0.4026        0.0000    0.1839     0.155
SILT-DEPOSIT              1.0638          0.3001        0.0004    2.8975     0.743
Constant                  1.1428          0.5079        0.0244

(a) Model χ² = 204.42.
(b) Variables:
    POLYCULTURE: yes = 1, 0 otherwise.
    DRY POND: yes = 1, 0 otherwise.
    I/D-CANAL: water discharge into intake/drainage canal; yes = 1, 0 otherwise.
    WATER-SOURCE: the main salt/brackish water source. The effects of the four categories in the table are compared to the category 'Saltwater creek'.
    SITE-SELECTION: site selection to avoid impacts of other users; yes = 1, 0 otherwise.
    SILT-DEPOSIT: deposit silt on-farm; yes = 1, 0 otherwise.

effects of SILT-DEPOSIT and I/D-CANAL are positive. I/D-CANAL and SILT-DEPOSIT are the two most influential positive variables affecting the odds of a disease occurrence. After controlling for the effects of the other variables, the logarithm of the odds of a disease occurrence increases by 1.40 and 1.06 for farms that discharge water into an intake/drainage canal and deposit silt on-farm, respectively. Restated, after controlling for all other variables, the odds of a disease occurrence are 4.05 and 2.90 times as large for farms that discharge water into an intake/drainage canal and deposit silt on-farm, respectively. Table 1 also provides the estimated probability of disease occurrence for each explanatory variable when all the other variables are set at 0. For example, the chance of a disease occurrence for farms that discharge water into an intake or drainage canal is about 80% if the farms do not practice polyculture, do not dry ponds, do not exercise careful site selection, do not deposit silt on-farm, and obtain their water from a saltwater creek. The estimated probability will be higher or lower depending

Table 2
Classification accuracy of the logistic regression model
0 denotes "no disease occurrence", 1 denotes "disease occurrence".

            Estimation subset                           Validation subset
Observed    Predicted 0   Predicted 1   Percent correct   Predicted 0   Predicted 1   Percent correct
0               162            33            83.08             45            13            77.59
1                30           144            82.76              9            44            83.02
Overall                                      82.93                                         80.18

Table 3
Classification accuracy of the PNN model, using full set of input variables
0 denotes "no disease occurrence", 1 denotes "disease occurrence".

            Estimation subset                           Validation subset
Observed    Predicted 0   Predicted 1   Percent correct   Predicted 0   Predicted 1   Percent correct
0               179            16            91.79             50             8            86.21
1                17           157            90.23              7            46            86.79
Overall                                      91.06                                         86.49

on the combination of values of all the other explanatory variables. Similarly, the chance of a disease occurrence is about 74% for farms depositing silt on-farm. On the other hand, the chance of a disease occurrence is quite low (16%, 25%, and 26%) for farms that exercise careful site selection, practice polyculture, and dry ponds, respectively.
Farms that obtain their water from a river or estuary seem to have a lower chance of disease occurrence than those obtaining their water from a saltwater creek, directly from the sea, or through a canal from the sea.
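The link between the coefficients, the odds ratios, and the probabilities quoted above can be checked with a few lines of arithmetic. The coefficients below are copied from Table 1. The Probability column appears to be reproduced by exp(β) / (1 + exp(β)) applied to each coefficient on its own; that reading is an inference from the tabulated numbers (it leaves the constant term out), not a statement of the authors' procedure, and the function names are illustrative.

```python
import math

# Coefficients (β) copied from Table 1.
coefficients = {
    "POLYCULTURE": -1.0961,
    "DRY POND": -1.0393,
    "I/D-CANAL": 1.3984,
    "SITE-SELECTION": -1.6936,
    "SILT-DEPOSIT": 1.0638,
}

def odds_ratio(beta):
    """Multiplicative change in the odds when the indicator goes from 0 to 1."""
    return math.exp(beta)

def probability_from_coefficient(beta):
    """Logistic transform of a single coefficient: exp(β) / (1 + exp(β))."""
    return math.exp(beta) / (1.0 + math.exp(beta))

for name, beta in coefficients.items():
    print(f"{name:15s} Exp(β) = {odds_ratio(beta):.4f}  "
          f"probability = {probability_from_coefficient(beta):.3f}")
```

For I/D-CANAL this gives an odds ratio of about 4.05 and a probability of about 0.80, in line with the values discussed in the text.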
The estimated logistic regression model was then applied to the estimation and validation data sets. The predictive accuracy for each data set is shown in Table 2. The table shows the number of farms predicted to have a disease outbreak, i.e., farms with an estimated probability of disease occurrence of more than 0.5. The estimated model appears to have good predictive power, correctly classifying 82.93% and 80.18% of the observations in the estimation and validation subsets, respectively.
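The percent-correct figures follow directly from the classification counts in Table 2. The sketch below applies the 0.5 cutoff stated in the text and recomputes the per-class and overall accuracy; the function names are illustrative, and the counts are those reported in Table 2.

```python
def classify(probability, cutoff=0.5):
    """Predict a disease occurrence (1) when the estimated probability exceeds the cutoff."""
    return 1 if probability > cutoff else 0

def percent_correct(matrix):
    """matrix[i][j] = number of farms observed in class i and predicted in class j."""
    per_class = [100.0 * matrix[i][i] / sum(matrix[i]) for i in range(len(matrix))]
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return per_class, 100.0 * correct / total

# Counts from Table 2 (rows: observed 0, observed 1; columns: predicted 0, predicted 1).
estimation = [[162, 33], [30, 144]]
validation = [[45, 13], [9, 44]]

est_per_class, est_overall = percent_correct(estimation)
val_per_class, val_overall = percent_correct(validation)
print(est_per_class, est_overall)
print(val_per_class, val_overall)
```

The same function can be applied to the counts in the other classification tables to compare the models on equal terms.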
3.2. Probabilistic neural network (PNN)
First, we constructed a PNN model using the same estimation data set as in the logistic regression procedure. Then the PNN model was applied to the estimation and validation subsets. Its prediction accuracy is shown in Table 3.
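For readers unfamiliar with the technique, a basic PNN can be sketched as a Gaussian-kernel (Parzen window) classifier: a pattern layer with one kernel per training observation, a summation layer that averages the kernels within each class, and a decision layer that picks the class with the largest average. The sketch below assumes equal class priors and a single smoothing parameter `sigma`; both are assumptions made for illustration, as are the function name and the toy data, and none of them should be read as the configuration used in this study.

```python
import math

def pnn_predict(train_x, train_y, new_x, sigma=0.5):
    """Classify each row of new_x with a basic probabilistic neural network."""
    classes = sorted(set(train_y))
    predictions = []
    for x in new_x:
        scores = {}
        for c in classes:
            # Pattern layer: one Gaussian kernel per training pattern of class c.
            activations = [
                math.exp(-sum((a - b) ** 2 for a, b in zip(x, t)) / (2.0 * sigma ** 2))
                for t, label in zip(train_x, train_y) if label == c
            ]
            # Summation layer: average activation, a Parzen estimate of the class density.
            scores[c] = sum(activations) / len(activations)
        # Decision layer: choose the class with the largest estimated density.
        predictions.append(max(classes, key=lambda c: scores[c]))
    return predictions

# Toy binary indicators (hypothetical, not survey data).
toy_x = [(1, 0, 0), (1, 1, 0), (0, 0, 1), (0, 1, 1)]
toy_y = [0, 0, 1, 1]
print(pnn_predict(toy_x, toy_y, [(1, 0, 0), (0, 0, 1)]))
```

Because every training observation is stored as a kernel, a classifier of this kind tends to fit the estimation data closely, so accuracy on a separate validation subset is the more informative figure.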

Table 4
Classification accuracy of the PNN model, using the same six input variables as in the final logistic regression model
0 denotes "no disease occurrence", 1 denotes "disease occurrence".

            Estimation subset                           Validation subset
Observed    Predicted 0   Predicted 1   Percent correct   Predicted 0   Predicted 1   Percent correct
0               145            50            74.36             41            17            70.69
1                24           150            86.21             12            41            77.36
Overall                                      79.95                                         73.87

Recall that only six variables were chosen in the final logistic regression model. These same six variables were used to build another PNN model. Table 4 shows the
classification accuracy of this PNN model on the estimation and validation subsets.
4. Discussion