Directory UMM :Data Elmu:jurnal:A:Agricultural Systems:Vol64.Issue1.Apr2000:

Agricultural Systems 64 (2000) 37±53
www.elsevier.com/locate/agsy

Statistical methods for evaluating a crop
nitrogen simulation model, N_ABLE
J. Yang a,*, D.J. Greenwood b, D.L. Rowell a,
G.A. Wadsworth a, I.G. Burns b
a

Department of Soil Science, The University of Reading, Whiteknights, PO Box 233, Reading RG6 6DW, UK
b
Department of Soil and Environment Sciences, Horticulture Research International, Wellesbourne,
Warwick, CV35 9EF, UK

Abstract
Modelling nitrogen (N) dynamics is a valuable tool to predict the uptake, mobility and
leaching of mineral N in soil pro®les. Many crop/soil N simulation models have been developed in the last 20 years for this purpose. However, methods for the operational evaluation of
simulation models are not well established. Standard test statistics such as the F- and t-tests
are being questioned because they may give di€erent conclusions. Di€erence measures are
being used as alternatives but they have not been thoroughly investigated. This paper reviews
statistical methods that may be helpful in comparing simulations with measured variables.

They have been used to analyse comparisons of simulations produced by a nitrogen response
model, N_ABLE, with measurements made in a ®eld experiment with lettuce. The techniques
include: (1) data transformation, (2) regression analysis and (3) analysis of di€erence. They
show that accuracy of prediction for di€erent variables varied in di€erent growth periods.
Mean absolute errors (MAE) were within 0.5% and 35 kg haÿ1 of the measured values of percent-N and soil mineral N, respectively, over the whole growth period. They also demonstrate
systematic errors when crop weights exceeded 2 t haÿ1, with 0±0.38 t haÿ1 under-estimation of
dry weight and 3±16 kg haÿ1 under-estimation of N uptake when the harvest date was set at
values between 42 and 61 days. Use of regression analysis showed that the data sets all violated normality and equal variance assumptions and that choosing the right transformation
before analysis was crucially important. When di€erence measures were used, it was found
that there was strong correlation between the outcome of some, but not others. Overall, the
results suggest that the tests can be grouped according to the degree of correlation between
individual tests and that only one test needs to be made from each correlated group. We
suggest that two sets of four statistics can be used, each statistic explaining a special property
of the data. Each set leads to useful conclusions. The ®rst set is mean of error (E), root mean
square error, modi®ed forecasting eciency and paired t-statistic, and the second set is E,
MAE, forecasting coecient and F-ratio of lack of ®t over experimental error (F(Y=X)
). Either
LF
* Corresponding author.
0308-521X/00/$ - see front matter # 2000 Elsevier Science Ltd. All rights reserved.

PII: S0308-521X(00)00010-X

38

J. Yang et al. / Agricultural Systems 64 (2000) 37±53

set can give the same conclusions which could not be quantitatively detected by graphical
inspection of the experimental data. # 2000 Elsevier Science Ltd. All rights reserved.
Keywords: Statistical evaluation; Test statistics; Di€erence measures; Nitrogen simulation model,
N_ABLE; Nitrogen uptake; Soil mineral N

1. Introduction
Operational evaluation is the assessment of accuracy and precision of a simulation
model and methods vary from inspection and demonstration to analytical test
(Knepell and Arangno, 1993). In general, what we most require to know is what
proportion of the treatment variation, i.e. that excluding experimental error, can be
accounted for by the model. We also need to know about biases the model has over
the entire response surface, i.e. in this case yield against N fertiliser rate and time. At
a more detailed level it may be helpful for us to know: (1) In what regions of the
response surface are the discrepancies between simulated and measured variables

greatest and over what regions are they smallest? (2) Does the model generally give
the right shape of response surface? Do the simulated values simply di€er from the
measured values by a ®xed amount or are they a ®xed proportion of the measured
values? (3) Does the model give transformed or un-transformed values that are linearly related to the measured values?
The purpose of this paper is to discuss a range of statistical techniques for answering
these questions. Two types of statistical methods have commonly been used in model
evaluation: test statistics and di€erence measures. Test statistics (both parametric and
non-parametric) have been reviewed by Reckhow et al. (1990) and O'Leary and Connor (1996), and comprehensive discussions on di€erence measures have been given by
Willmott et al. (1985), Loague and Green (1991) and Kabat et al. (1995). It seems that
there is no robust statistic which can be used for all models because of the complexity of
the data structure of a model, e.g. continuous or discrete, error or error-free and one- or
two-dimensional. In this paper, we try to select some available statistical methods both
from test statistics and from di€erence measures for the evaluation of the model,
N_ABLE (Greenwood and Draycott, 1989a, b; Greenwood et al. 1996) using four
measured data sets from a nitrogen experiment with lettuce.

2. Data transformation
2.1. Experiment and data structure
A nitrogen experiment with lettuce was carried out on light sandy loam soil during
April±July 1997 in the South of England (51 270 N, 0 560 W). The details of the

experiment have been given by Yang et al. (1999) and a brief description of the data
structure follows. In the experiment, plant and soil were sampled after transplanting
in six N treatments each with three replicate plots to monitor plant growth and soil

J. Yang et al. / Agricultural Systems 64 (2000) 37±53

39

mineral N on a total of seven dates on Days 12, 19, 29, 42, 50, 57 and 61 in each plot
during the growth period. Four state variables were measured regularly during the
experiment in each of six N treatments with lettuce (Yang et al., 1999). The state
variables are: weight of dry matter excluding ®brous roots (WDM, t haÿ1), soil
mineral N in the 0±30 cm layer (NS, kg haÿ1), N uptake excluding ®brous roots (NU,
kg haÿ1) and percent N in dry matter (PN,%).
The above data were simulated with the model N_ABLE. The model calculates
the daily changes in the distributions of mineral-N and water down the soil pro®le
together with the increments in plant dry weight and N-uptake. The major inputs to
the model are level and timing of fertiliser application, the maximum potential yield
of dry matter, and the daily rainfall and potential evaporation together with the
initial soil characteristics. Detailed values of the input parameters were given by

Yang et al. (1999). It was found that output is sensitive to the harvest time parameter (Th), i.e. the date on which the simulation is stopped. So the e€ect of changing
Th was examined in the simulation by setting Th to the sample Days 42, 50, 57 or 61,
producing four data sets for each of the four variables.
The following notation is used in the data analysis later. A measured variable is
represented by Y and a simulated variable by X, while measured data are denoted by
yij (i=1, 2 . . . n represents the total measurements on each data set (see later example) and j=1, 2 . . . r represents replicates) and simulated data by xi, because the
model cannot simulate random experimental errors for each replicate. Then the
di€erence between simulation and measurement can be de®ned as:
DˆYÿX
dij ˆ yij ÿ xi

…1†
i ˆ 1; 2 . . . n;

j ˆ 1; 2 . . . r

where yij is the measured value for the ith measurement in the jth replicate and xi is
the simulated value for the ith measurement. The data set was prepared as follows
for WDM, NU, PN and NS. There were measured values and simulated values. At
each of the harvest dates there were six fertiliser treatments and three replicates. The

measured data for the ®rst harvest date were added to that for the second and so on.
As there were harvests at 12, 19, 29 and 42 days there are a total of 6 (treatments)3
(replicates)4 (measurements) measured values at Th=42 days. For each level of
fertiliser at 12, 19, 29 and 42 days, the simulated values were obtained by running
the model with the input value of the maximum yield Wmax=2.5 at Th=42 days.
This produced a total of 6 (treatments)4 (measurements) simulated values (i.e. no
simulated random errors for replicates). These measured values plus those made at
50 days (i.e. a total of 635 values) were compared with the model values (i.e. a
total of 65 values) simulated with the input value corresponding to the maximum
yield Wmax=3.0 at Th=50 days. Likewise the measured data sets at 57 days (i.e. a
total of 636 values) were prepared for the simulated values (i.e. a total of 66
values) simulated with the maximum yield Wmax=3.5 at Th=57. Finally, the measured data sets at 61 days (i.e. a total of 637 values) were prepared for the simulated values (i.e. a total of 67 values) simulated with the ®nal harvested maximum

40

J. Yang et al. / Agricultural Systems 64 (2000) 37±53

yield Wmax=4.3 at the ®nal harvest Th=61 days. Thus there is an overlap in the measured but not the simulated values for the di€erent times. The magnitude of variables
can be in¯uenced by N levels and by the duration of growth. Graphs of simulated NU
and PN and measured data against time of growth are shown in Figs. 1 and 2. Graphs

of measured WDM and NS were shown in a previous paper (Yang et al., 1999).
2.2. Test of normality and homoskedasticity
Regression methods are frequently used in model testing and validation processes
(Sutherland et al., 1986; Reckhow et al., 1990; O'Leary and Connor, 1996) and
provided one of the methods used in this paper (see later for details). In the linear
regression model:
Yi ˆ ‡ Xi ‡ "i

…2†

the t-test can be used to test H0: a=0, b=1 since ta ˆ …a ÿ †=Sa  t (Nÿ2) and
tb=(b ÿ )/Sbt (Nÿ2), where a and b are least squares measures of and
(Aigner, 1971). However, for the classical least squares estimation in Eq. (2), the
signi®cant test of the regression parameters assumes that the individual error terms
ei are normally distributed, are of equal variance, are independent of each other, and

Fig. 1. Simulated and measured above ground N uptake (means of three plots and the standard error) for
di€erent fertiliser-N treatments. Symbols (*) are experimental values; lines are simulated values. The four
lines for each treatment represent simulated values for Wmax=4.3, 3.5, 3.0 and 2.5 at Th=61, 57, 50 and 42
(from the bottom to the top), respectively. N0, N30/20, NA, 80/50/50, N180/60/70, NB, 180/60, NC, 70/120/50 refer to

di€erent N fertiliser treatments where the numbers in subscript refer to basal and top dressings (kg N haÿ1).

J. Yang et al. / Agricultural Systems 64 (2000) 37±53

41

Fig. 2. Simulated and measured percent-N in dry matter (means of three plots and the standard error)
with di€erent fertiliser-N treatments. Symbols (*) are experimental values; lines are simulated values.
The four lines for each treatment are simulated values for Wmax=4.3, 3.5, 3.0 and 2.5 at Th=61, 57, 50
and 42 (from the top to the bottom), respectively.

are not correlated with the independent variable, Xi. In our research, transformations are generally required because the range in response variables is large. For
example, the largest values of WDM, NU and NS are all 10 or 100 times more than
the smallest values. Diagnostic plots for WDM, NU, PN and NS produced evidence
that these data sets seriously violated the assumption of equal variance, and slightly
violated the assumption of normality (Yang, 1999). Violations of these two
assumptions may cause biased estimates of the parameter variances in the regression
analysis. Shapiro and Wilk's W-test is considered to be the best multi-functional test
of normality (Shapiro and Wilk, 1965; Royston, 1982), whereas White's w2-test was
thought to be a good statistic to check heteroskedasticity of ei since it was derived

from a null hypothesis that not only are the errors homoskedastic, but they are also
independent of the regressors (White, 1980).
2.3. Transformation for normality and homoskedasticity
Common assumptions for statistical analyses have been discussed in detail for
water quality models (Reckhow et al., 1990), in biological research (Bowker, 1993),
and in agricultural research (Whitmore, 1991; O'Leary and Connor, 1996).
Although the logarithmic transformation is the most frequently used method (Sokal
and Rohif, 1987), it assumes that the original relationship is Y=aebX which is not the
same as our assumption [Eq. (2)]. In order to get a better transformation to ®t the

42

J. Yang et al. / Agricultural Systems 64 (2000) 37±53

standard linear model shown in Eq. (2), there is a need to stabilise error variances.
Two frequently used transformation methods were chosen for use in this paper
based on the methods discussed by Pindyck (1981) and Bowker (1993). They are:
1. Model Yi ˆ ‡ Xi ‡ "i was transformed by 1/Si, and the model becomes:
Yi =Si ˆ …1=Si † ‡ …Xi =Si † ‡ "i =Si


…3†

p
where Si is the sample standard derivation, i.e. Si ˆ ‰…yij ÿ yi †2 =…r ÿ 1†Š, and yi is
the mean value of the ith measured data. Eq. (3) shows that Si will be measured with
experimental errors, meaning that the greater the number of the replicates, the more
stable the standard errors.
2. Model Yi ˆ ‡ Xi ‡ "i was transformed by 1=X=2
i , and the model becomes:
1ÿ=2
ˆ …1=X=2
† ‡ "i =X=2
Yi =X=2
i
i † ‡ …Xi
i

…4†

In ®tting Eq. (4), d/2 was set to 1.25, 1, 0.75, 0.5 and 0.25 based on our ®nding

from diagnostic plots that residual error ei is proportionally related to the independent variable X/2
i . Eqs. (3) and (4) were ®tted by the weighted regression method
using SAS software (SAS Institute, 1989, 1990a, b).
To select the best transformation, W and w2 statistics of random errors in Eqs. (3)
and (4) were ®rst calculated for each data set at Th=42, 50, 57 and 61 days individually, and then mean values of W and w2 for di€erent Th values were compared.
The ideal transformation is one where both the minimum w2 and the maximum W
occurred simultaneously, but this condition was only roughly met by the NS data set
(Table 1). When this condition did not hold, we identi®ed the minimum value of w2
as our selected transformation. Following this rule, we concluded that the 1/Xi
transformation gave the best correction for equal variance in WDM, NU and PN data
gave the best correction for both norsets with slight loss of normality, and 1/X0.5
i
mality and equal variance in NS data sets (Table 1).
2.4. Transformation of normality of data set D (YÿX)
When the paired t-statistic [Eq. (11)] is used to test the di€erence between measured and simulated data D, the data set D should be a normal variable with a null
hypothesis H0: d ˆ y ÿ x ˆ 0: Our research shows that all D data sets derived from
each variable WDM, NU, PN and NS violated normality. Again W statistics of data
sets D were calculated to select the best transformation from them. Following the
same rule as the method given earlier, we concluded that 1/Xi transformation gives
the best correction of normality of D data sets generated from WDM and NU, 1/X0.5
i
gives the best correction for PN, and 1/X0.25
gives the best correction for NS data
i
(data not shown).

J. Yang et al. / Agricultural Systems 64 (2000) 37±53

43

Table 1
Comparison of di€erent transformations of data for testing normality and homoskedasticity using mean
values (Th=42, 50, 57 and 61) of W and White-w2 statistics for each state variablea
State
variables

Transfer
variables

W (P>W) H0; normal
of residual

w2 (P>w2) H0; homoskedastic
of residual

WDM

No-transfer
1/Si
1/X1.25
i
1/Xi
1/X0.75
i
1/X0.5
i
1/X0.25
i

0.9118 (0.00)
0.8838 (0.00)
0.8835 (0.00)
0.8722 (0.00)
0.9152
0.9222 (0.01)
0.9350 (0.01)

21.25 (0.00)
10.06 (0.03)
9.30 (0.03)
4.52 (0.42)
17.60 (0.00)
20.19 (0.00)
18.95 (0.00)

NU

No-transfer
1/Si
1/X1.25
i
1/Xi
1/X0.75
i
1/X0.5
i
1/X0.25
i

0.9236
0.8968 (0.00)
0.8835 (0.00)
0.8991 (0.00)
0.9152 (0.00)
0.9276 (0.01)
0.9350 (0.01)

18.10 (0.00)
10.23 (0.03)
9.29 (0.03)
8.77 (0.03)
17.59 (0.00)
18.37 (0.00)
18.95 (0.00)

PN

No-transfer
1/Si
1/X1.25
i
1/Xi
1/X0.75
i
1/X0.5
i
1/X0.25
i

0.9711 (0.26)
0.9756 (0.40)
0.9635 (0.18)
0.9650 (0.19)
0.9665 (0.20)
0.9681 (0.22)
0.9696 (0.23)

11.49 (0.01)
6.36 (0.17)
8.37 (0.05)
5.99 (0.10)
9.49 (0.03)
9.59 (0.01)
12.40 (0.01)

NS

No-transfer
1/Si
1/X1.25
i
1/Xi
1/X0.75
i
1/X0.5
i
1/X0.25
i

0.9571 (0.13)
0.9491 (0.03)
0.9660 (0.21)
0.9636 (0.14)
0.9619 (0.14)
0.9604 (0.15)
0.9590 (0.14)

9.78 (0.01)
9.82 (0.02)
11.33 (0.01)
9.35 (0.01)
8.59 (0.06)
3.59 (0.23)
5.55 (0.19)

3. Regression analysis
3.1. Testing of the null hypothesis …H0 † : ˆ 0; ˆ 1
Linear regression was therefore carried out using Eq. (2) with the weighted variable
1/Xi for WDM, NU and PN data sets, and with the weighted variable 1/Xi0.5 for NS data
set, and the t-test used to test H0: =0, =1 in the model. In our data, N=nr, and a
and b are weighted least squares measures of and . Calculated t-statistics are:
ta ˆ … ÿ 0†=sa

sa ˆ

tb ˆ …b ÿ 1†=sb

sb ˆ

p
p

…MSE =nr ‡ x2 s2b †

…5†

‰MSE =…xi ÿ x†2 Š

…6†

44

J. Yang et al. / Agricultural Systems 64 (2000) 37±53

where MSE is the mean square of experimental error, calculated as shown later.
Signi®cance of the values of coecients a and b were tested by comparing ta and tb
with the signi®cance level of t0.05 (nrÿ2) or t0.01 (nrÿ2).
3.2. Lack of ®t test
In regression analysis with replicates of dependent variables, the lack of ®t test
was carried out by the F ratio of MSLF over MSE. Values of both mean squares were
obtained as follows. The sums of squares of residuals (SSR) were partitioned into
two parts, the sums of squares of the randomised error (SSE) and the sums of
squares of the lack of ®t (SSLF). The degrees of freedom of residuals (DFR) were
then partitioned into the degree of freedom of randomised error (DFE) and the
degree of freedom of lack of ®t (DFLF), where:
SSR ˆ SSE ‡ SSLF
SSR ˆ …yij ÿ a ÿ bxi †2
SSE ˆ …yij ÿ yi †2
SSLF ˆ SSR ÿ SSE

DFR ˆ DFE ‡ DFLF
DFR ˆ …nr ÿ 2†
DFE ˆ n…r ÿ 1†
DFLF ˆ …n ÿ 2†

MSLF ˆ …SSLF =DFLF †
MSE ˆ …SSE =DFE †

…7†

…8†

and where yi is the measured mean of the ith measurement. The variance ratio is:
FLF ˆ MSLF =MSE

…9†

Statistical inferences from the above analysis of variance of residuals were drawn by
comparing FLF with the signi®cance level of F0.05 (DFLF, DFE) or F0.01 (DFLF, DFE).
3.3. Goodness of ®t test
R2 is a commonly used statistic for testing the goodness of ®t and is de®ned as:
R2 ˆ SSU =SSY ˆ 1 ÿ …SSR =SSY †

…10†

Following the earlier method, the results of regression with a lack of ®t test on
simulated and measured data from the four state variables are listed in Table 2.
Graphical displays of Y against X and regression lines for NU and PN are shown in
Fig. 3 and 4.

4. Analysis of di€erence
Regression analysis in model evaluation actually tests its precision, i.e. the degree to
which simulated values approach a linear function of the measured data (Willmott et

45

J. Yang et al. / Agricultural Systems 64 (2000) 37±53

Table 2
Weighted linear regression, Y=a+bX, with a lack of ®t test on simulated data (xi) and measured data yija
Variablesb

Th

WDM

42
50
57
61

NU

42
50
57
61

PN

42
50
57
61

NS

42
50
57
61

a
b

b

FLF

R2

0.93**
1.06**
1.29**
1.36**

3.60**
2.75**
4.60**
7.39**

0.9498
0.9591
0.9515
0.9373

1.04
1.16**
1.33**
1.38**

4.42**
3.68**
6.34**
8.30**

0.9448
0.9567
0.9408
0.9236

1.19**
0.64**
0.10
0.32*

0.78**
0.88*
0.97
0.91*

6.92**
4.76**
5.12**
3.69**

0.3768
0.6258
0.7369
0.7641

4.11
6.77
6.25
9.50**

1.17**
1.15**
1.09
1.03

2.27*
2.67**
2.25**
4.09**

0.8607
0.8228
0.8131
0.7824

a
0.007*
0.007*
0.006
0.009*
ÿ0.42*
0.175
ÿ0.12
0.074

*,** Refer to 0.05 and 0.01 signi®cance levels.
WDM, NU and PN data were weighted by 1/Xi for the regression; NS data were weighted by 1=X0:5
i for the regression.

al., 1985). In model testing, however, our interest is in the model's accuracy, i.e. the
extent to which simulated values approach the measured data. This can be achieved
by examining the di€erence directly from dij=yijÿxi. It is equal to the test of the
goodness of ®t of model Y=X, but not Y=a+bX. In this circumstance, the regression method with a=0 or b=1 uses the following equation to calculate R2, e.g.
R2=1ÿ(residual sum of squares/uncorrected total sum of squares) (SAS Institute,
1990b), to avoid situations where R2>1 or R2PN (Table 2),
indicating that the poorest linear relationship detected is for PN when Th is set to 42
days (Fig. 4a).
Table 3 provides detailed information for assessing the di€erence between measured and simulated values. Four paired t-values in Table 3 are not signi®cant at the
0.05 level, indicating that there are no statistically signi®cant di€erences between

50

J. Yang et al. / Agricultural Systems 64 (2000) 37±53

means of simulated and measured data. This conclusion is valuable because it cannot be deduced from a graphical display. It shows that all the di€erences between
measured and simulated values in these regions of the response surfaces can be
attributed to experimental error. All signi®cant values of the paired t and F(Y=X)
LF
values indicate that further modi®cation of the model is needed. This focuses attention on regions of WDM and NU for Th=61 in the ®rst instance because of the
and paired t-values. F(Y=X)
values calculated in Table 3
higher values of both F(Y=X)
LF
LF
were based on the model Y=X, indicating the extent to which the di€erences
(yijÿxi) di€er from experimental error. By comparison, the values of FLF in Table
2 were based on the model Y=a+bX, indicating the extent to which the residual
errors (yijÿaÿbxi) di€ered from experimental error. For this reason, the F(Y=X)
LF
values are the real measures of the accuracy of model simulation.
Most of the E values are negative (Table 3), indicating that the model generally
underestimated the measured data except for PN at Days 57 and 61 and WDM at
Day 42. The magnitude of underestimation increased with time for the WDM and
NU variables, but the reverse is true for PN and NS variables. MAE and RMSE
showed that the largest di€erence occurred for NU when Th=61. C values indicate
that the largest relative error is NU>NS>PN ranked by values of EF and EF1 (Table
3), giving a similar result to that drawn from R2 in Table 2.
It has been found that the di€erence statistics in Table 3 are strongly correlated
with the outcome of some tests but not others. The tests can be grouped according
to the degree of correlation (Table 4). Thus, RMSE, MAE and C were strongly
correlated with one another, but not with any of the others. Likewise, paired t and
were strongly correlated, but not with the other tests as were EF and EF1. A
F(Y=X)
LF
single linkage dendrogram for the correlation matrix (rij>0) in Table 4, for eight
di€erence statistics is shown in Fig. 5.
Table 4 shows that values of E have a strong negative correlation with C, MAE
and RMSE. In this study, the negative correlations result from negative E values
which are caused by X>Y in Eq. (1). In contrast, the other seven statistics are all

Table 4
Correlation coecient matrix rij for eight di€erence statistics listed in Table 3a

F(Y=X)
LF
Paired t
E
MAE
RMSE
C
EF
EF1
a

F(Y=X)
LF

Paired t

1.00
0.89
0.07
ÿ0.22
ÿ0.22
0.23
0.08
0.08

1.00
ÿ0.15
0.02
0.02
0.51
0.26
0.28

E

MAE

RMSE

C

EF

EF1

1.00
ÿ0.96
ÿ0.96
ÿ0.83
ÿ0.02
ÿ0.05

1.00
0.99
0.82
0.01
0.03

1.00
0.82
0.01
0.04

1.00
0.29
0.34

1.00
0.99

1.00

Values with bold font represent correlation coecients among each group.

J. Yang et al. / Agricultural Systems 64 (2000) 37±53

51

Fig. 5. Single linkage dendrogram for the correlation matrix rij (>0) in Table 4 for eight di€erence
statistics.

positive irrespective of whether the di€erence D=(YÿX) is negative or positive. The
correlations between E and C, MAE and RMSE are not universally applicable and for
this reason the negative correlation has not been used in the cluster analysis in Fig. 5.
We conclude from Fig. 5 that only one statistic from each group needs be used to
test the validation of the model. Group 1 includes RMSE, MAE and C (grouped by rij
50.82) and provides measures of the magnitude of the di€erence. Group 2 includes
EF and EF1 (grouped by rij50.99), and gives measures the goodness of match or the
and paired t-statistics (grouped
co-ordinate of the di€erence. Group 3 includes F(Y=X)
LF
by rij50.89) and gives a measure of the probability of signi®cant di€erence. Group 4
includes only E and provides a measure of direction of di€erence (i.e. under- or overestimation). The maximum correlation coecient between Groups 1 and 2 is very
small (i.e. rij=0.51), and between Groups 3 and 4 is only 0.07 (Table 4).

6. Conclusions
Data from the nitrogen experiment were shown to violate normality and equal
variance assumptions and transformation is necessary to ensure standard statistical
analyses are carried out correctly. Shapiro and Wilk's W-test and White's w2 test
have been found to be useful methods for testing the normality and heteroskedasticity of "i . Two types of transformations were employed in this study to
ensure both normality and equal variance, and 1/Xi was required for the WDM, NU
and PN data, and 1=Xi0:5 was required for the NS data. In addition, transformation
to normalise the data set D (YÿX) was also necessary where the paired t-test was
used. Regression analysis with a lack of ®t test was used to evaluate a simulation
model, but this method proved to be very precise and it should be interpreted carefully. The FLF test is a precise measure for lack of ®t, and is sensitive to both
experimental error and degrees of freedom. Whitmore (1991) suggested that a baltest. Furthermore,
ance of DFLF and DFE should be considered in the F(Y=X)
LF

52

J. Yang et al. / Agricultural Systems 64 (2000) 37±53

Reckhow et al. (1990) suggested that a prediction b0 could be used for testing ˆ b0
but not for testing ˆ 1, where b0 can be decided from a real subject (i.e. b0=1,
jj>0 is an acceptable error). In addition, if the t-test shows that parameters and
do not deviate from H0 when regression analysis is carried out, R2 can be used to
check the goodness of ®t. Thus, if both of these tests are carried out, we can then draw
a ®nal conclusion, whereas a decision based on any one of them can be misleading.
Di€erence measures shown in Eqs. (14)±(19) have advantages over test statistics in
that they are easy to interpret and do not need data transformation, but the methods
are not fully developed. Many authors believe that there is no robust statistic which
can be used to draw conclusions in model evaluations and therefore several methods
need to be used together to give a double check. In this study, we found that eight
di€erence statistics (Table 3) were strongly correlated with each other allowing one
to be chosen from each correlated group so as to save time without losing accuracy.
We suggest that the same conclusion can be reached by using either RMSE, EF1,
and E as the di€erence measures because each
paired t and E or MAE, EF, F(Y=X)
LF
of them is present in Groups 1, 2, 3 and 4 (Fig. 5). In addition, graphical display
should be used as a visual aid for comparison between simulation and measurement
and for interpretation of the statistics used in the testing and evaluation processes.
Acknowledgements
Financial support through an ORS award from UK is gratefully acknowledged.
We also thank Andrew Mead from Horticulture Research International, UK, for his
statistical comments on the ®rst draft of this paper.
References
Addiscott, T.M., Whitmore, A.P., 1987. Computer simulation of changes of soil mineral nitrogen and crop
nitrogen during autumn, winter and spring. Journal of Agricultural Science, Cambridge 109, 141±157.
Aigner, D.J., 1971. Basic Eeconometrics. Prentice-Hall, Englewood Cli€s, NJ.
Bowker, D.W., 1993. Dynamic models of homogeneous systems. In: Fry, J.C. (Ed.), Biological Data
Analysis: A Practical Approach. Oxford University Press, Oxford, pp. 313±343.
Greenwood, D.J., Draycott, A., 1989a. Experimental validation of an N-response model for widely different crops. Fertiliser Research 18, 153±174.
Greenwood, D.J., Draycott, A., 1989b. Quantitative relationships for growth and N content of di€erent
vegetable crops grown with and without ample fertiliser-N on the same soil. Fertiliser Research 18, 175±188.
Greenwood, D.J., Rahn, C.R., Draycott, A., Vaidyanathan, L.V., Paterson, C., 1996. Modelling and
measurement of the e€ects of fertilizer-N and crop residue incorporation on N-dynamics in vegetable
cropping. Soil Use and Management 12, 13±24.
Kabat, P., Marshall, B., van den Broek, B.J., 1995. Comparison of simulation results and evaluation of
parameterization schemes. In: Katab, P., Marshall, B., Van den Broek, B.J., Vos, J., Van Keulen, H.
(Eds.), Modelling and Parameterization of the Soil-plant-atmosphere System. A Comparison of Potato
Growth Models. Wageningen Pers, Wageningen, pp. 439±501.
Klepper, O., Rouse, D.I., 1991. A procedure to reduce parameter uncertainty for complex models by
comparison with real system output illustrated on a potato growth model. Agricultural Systems 36,
375±395.

J. Yang et al. / Agricultural Systems 64 (2000) 37±53

53

Knepell, P.L., Arangno, D.C., 1993. Simulation Validation: A Con®dence Assessment Methodology.
IEEE Computer Society Press, Los Alamitos, CA.
Loague, K.M., Freeze, R.A., 1985. A comparison of rainfall-runo€ modelling techniques on small upland
catchments. Water Resources Research 21, 329±348.
Loague, K., Green, R.E., 1991. Statistical and graphical methods for evaluating solute transport models:
overview and application. Journal of Contaminant Hydrology 7, 51±73.
O'Leary, G.J., Connor, D.J., 1996. A simulation model of the wheat crop in response to water and
nitrogen supply: II. Model validation. Agricultural Systems 52, 31±55.
Pindyck, R.S., 1981. Econometric Models and Economic Forecasts. McGraw-Hill Inc, USA.
Reckhow, K.H., Clements, J.T., Dodd, R.C., 1990. Statistical evaluation of mechanistic water-quality
models. Journal of Environmental Engineering 116, 250±268.
Royston, J.P., 1982. An extension of Shapiro and Wilk's W Test for Normality to large samples. Applied
Statistics 31, 115±124.
SAS Institute Inc, 1989. SAS/STAT User Guide, Version 6, 4th Edition, Volume 2. SAS Institute, Cary,
NC.
SAS Institute Inc, 1990aa. SAS Procedures Guide. Version 6, 3rd Edition. Cary, NC, USA.
SAS Institute Inc, 1990bb. SAS/ETS Software, Applications Guide 2, Version 6, First Edition: Econometric Modelling, Simulation, and Forecasting. Cary, NC, USA.
Shapiro, S.S., Wilk, M.B., 1965. An analysis of variance test for normality (complete samples). Biometrika 52, 591±611.
Snedecor, G.W., Cochran, W.G., 1976. Statistical Methods, 6th Edition. The Iowa State University Press,
Ames, IA, USA.
Sokal, R.R., Rohif, F.J., 1987. Introduction to Biostatistics, 2nd Edition. W.H. Freeman, New York.
Sutherland, R.A., Wright, C.C., Verstraeten, L.M.J, Greenwood, D.J., 1986. The de®ciency of the `economic optimum' application for evaluating models which predict crop yield response to nitrogen fertilizer. Fertilizer Research 10, 251±262.
White, H., 1980. A heteroskedasticity-consistent coveriance matrix estimator and a direct test for heteroskedasticity. Econometrica 48, 817±829.
Whitmore, A.P., 1991. A method for assessing the goodness of computer simulation of soil processes.
Journal of Soil Science 42, 289±299.
Willmott, C.J., Ackleson, S.G., Davis, R.E., Feddema, J.J., Klink, K.M., Legates, D.R., O'Donnell, J.,
Rowe, C.M., 1985. Statistics for the evaluation and comparison of models. Journal of Geophysical
Research 90, 8995±9005.
Yang, J., 1999. Testing and evaluation of a nitrogen simulation model, N_ABLE, using independent ®eld
data. PhD thesis, The University of Reading, Reading, UK
Yang J., Wadsworth, G.A., Rowell, D.L., Burns, I.G., 1999. Evaluating a crop nitrogen simulation
model, N_ABLE, using a ®eld experiment with lettuce. Nutrient Cycling in Agroecosystems, 55, 221±
230.