Greek Statistical Institute
Proceedings of the 18th Panhellenic Statistics Conference (2005), pp. 485-494
A DISCREPANCY BASED MODEL SELECTION
CRITERION

Kyriacos Mattheou and Alex Karagrigoriou
University of Cyprus

ABSTRACT
The aim of this work is to develop a new criterion of model selection using a general
technique based on measures of discrepancy. The new criterion is constructed using the Power
Divergence introduced by Basu et al. (1998) and is shown to be an asymptotically unbiased
estimator of the expected overall discrepancy between the true and the fitted models.

1. INTRODUCTION
A model selection criterion can be constructed as an approximately unbiased
estimator of an expected “overall discrepancy” (or divergence), a nonnegative
quantity which measures the “distance” between the true model and a fitted
approximating model. A well known divergence is the Kullback-Leibler discrepancy
that was used by Akaike (1973) to develop the Akaike Information Criterion (AIC).
Measures of discrepancy or divergence between two probability distributions have
a long history. A unified analysis was recently provided by Cressie and Read (1984)
who introduced the so called power divergence family of statistics for multinomial
goodness-of-fit tests. The fit of the model for the behaviour of a population can be
assessed by comparing expected ($n\pi_i$) and observed ($X_i$) frequencies using the
family of power divergence statistics

$$2 n I^{\lambda}(X/n : \pi) = \frac{2}{\lambda(\lambda+1)} \sum_i X_i \left[ \left( \frac{X_i}{n\pi_i} \right)^{\lambda} - 1 \right], \qquad \lambda \in \mathbb{R} \setminus \{0,-1\}, \quad \sum_i X_i = n, \tag{1.1}$$

which consists of the statistics evaluated for all choices of $\lambda$ in $\mathbb{R} \setminus \{0,-1\}$. Note that
for $\lambda = 1$ the statistic (1.1) becomes the well known Pearson $X^2$ statistic while for
$\lambda \to 0$ it becomes the loglikelihood ratio statistic $G^2$,

$$G^2 = 2 \sum_i \mathrm{observed}_i \, \log \left[ \frac{\mathrm{observed}_i}{\mathrm{expected}_i} \right], \tag{1.2}$$

which coincides with the Kullback-Leibler distance.
The term power divergence describes the fact that the statistic measures the
divergence of the “expected” from the “observed frequencies” through a (weighted)
sum of powers of the term (observed / expected).
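As a quick numerical illustration (not part of the original paper), the following Python sketch evaluates (1.1) for a small made-up multinomial sample and confirms that $\lambda = 1$ reproduces Pearson's $X^2$ while $\lambda \to 0$ approaches $G^2$:

```python
import numpy as np

def power_divergence(obs, probs, lam):
    """Cressie-Read power divergence statistic 2nI^lambda of eq. (1.1).

    obs   : observed cell counts X_i
    probs : hypothesized cell probabilities pi_i
    lam   : power parameter lambda (any real except 0 and -1)
    """
    obs = np.asarray(obs, dtype=float)
    exp = obs.sum() * np.asarray(probs, dtype=float)   # expected counts n*pi_i
    return 2.0 / (lam * (lam + 1.0)) * np.sum(obs * ((obs / exp) ** lam - 1.0))

# Toy data: 100 observations over four equiprobable cells (illustrative values only)
obs = np.array([18, 25, 32, 25])
probs = np.full(4, 0.25)
exp = obs.sum() * probs

pearson = np.sum((obs - exp) ** 2 / exp)     # classical Pearson X^2
g2 = 2 * np.sum(obs * np.log(obs / exp))     # loglikelihood ratio G^2

print(power_divergence(obs, probs, 1.0), pearson)   # lambda = 1 recovers X^2
print(power_divergence(obs, probs, 1e-8), g2)       # lambda -> 0 approaches G^2
```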
Note that in the continuous case, the power divergence between the true and the
hypothesized distributions $g$ and $f$ takes the form [Cressie and Read (1988), p. 125]

$$I^{\lambda}(g,f) = \frac{1}{\lambda(\lambda+1)} \int g(z) \left[ \left( \frac{g(z)}{f(z)} \right)^{\lambda} - 1 \right] dz, \tag{1.3}$$

which for $\lambda \to 0$ becomes the Kullback-Leibler distance, which was used to
construct the AIC criterion as mentioned before.
A new measure of discrepancy was recently introduced by Basu et al. (1998). This
new family of discrepancy measures, which is given in (2.1), is referred to as the class
of density power divergences and is indexed by a single parameter $a$. These measures
quantify the divergence between two densities $f$ and $g$ through the integral of powers
of the terms

$$\left\{ f^{1+a}(z) - g(z) f^{a}(z) \right\} \quad \text{and} \quad \left\{ \tfrac{1}{a}\, g^{1+a}(z) - \tfrac{1}{a}\, g(z) f^{a}(z) \right\}.$$

Observe that for $a = 1$ the discrepancy becomes the $L_2$ distance between $f$ and
$g$, while for $a \to 0$ it becomes (as in the case of the Cressie and Read family) the
Kullback-Leibler distance (see Lemma 2.1, Section 2).
In this paper, we develop a new model selection criterion which is shown to be an
approximately unbiased estimator of the expected overall discrepancy that
corresponds to Basu’s density power divergence.


2. POWER DIVERGENCE AND THE EXPECTED OVERALL
DISCREPANCY
In parametric estimation, many methods have been introduced in order to obtain an
estimator of the true parameter. Some of them are known as density-based minimum
divergence methods, i.e. estimating methods which are based on the minimization of
some appropriate divergence between the true model $g$ and a fitted approximating
model $f$. This class of methods includes as special cases the classical maximum
likelihood method as well as minimum chi-squared methods based on families of
chi-squared distances [Beran (1977), Cressie and Read (1984), Lindsay (1994)]. One
of the most recently proposed discrepancies is Basu's Power Divergence between $g$
and $f$ [Basu et al. (1998)], which is defined as:

$$d_a(g,f) = \int \left\{ f^{1+a}(z) - \left(1 + \frac{1}{a}\right) g(z) f^{a}(z) + \frac{1}{a}\, g^{1+a}(z) \right\} dz, \qquad a > 0, \tag{2.1}$$

where $g$ is the true model, $f$ the fitted approximating model, and $a$ a positive
number.
Lemma 2.1. The limit of the divergence (2.1) when $a \to 0$ is the Kullback-Leibler
distance:

$$d_0(g,f) = \lim_{a \to 0} d_a(g,f) = \int g(z) \log \left\{ \frac{g(z)}{f(z)} \right\} dz.$$

Proof.

$$\begin{aligned}
d_0(g,f) = \lim_{a \to 0} d_a(g,f)
&= \lim_{a \to 0} \int \left\{ f^{1+a}(z) - \left(1 + \frac{1}{a}\right) g(z) f^{a}(z) + \frac{1}{a}\, g^{1+a}(z) \right\} dz \\
&= \lim_{a \to 0} \int f^{1+a}(z)\, dz - \lim_{a \to 0} \int g(z) f^{a}(z)\, dz + \lim_{a \to 0} \int g(z)\, \frac{g^{a}(z) - f^{a}(z)}{a}\, dz \\
&= \int f(z)\, dz - \int g(z)\, dz + \int g(z) \lim_{a \to 0} \frac{g^{a}(z) - f^{a}(z)}{a}\, dz \\
&= 1 - 1 + \int g(z) \lim_{a \to 0} \left\{ g^{a}(z) \log \left[ g(z) \right] - f^{a}(z) \log \left[ f(z) \right] \right\} dz \\
&= \int g(z) \log \left\{ \frac{g(z)}{f(z)} \right\} dz,
\end{aligned}$$

where the limit of $\left( g^{a}(z) - f^{a}(z) \right)/a$ is evaluated by l'Hôpital's rule, differentiating
numerator and denominator with respect to $a$.
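To make Lemma 2.1 concrete, here is a small Python check (not from the paper) that evaluates $d_a(g,f)$ of (2.1) by numerical integration for two normal densities, chosen purely for illustration, and watches it approach the Kullback-Leibler distance as $a \to 0$:

```python
import numpy as np
from scipy import stats
from scipy.integrate import quad

def d_a(g_pdf, f_pdf, a, lo=-15.0, hi=15.0):
    """Basu density power divergence d_a(g, f) of eq. (2.1), a > 0, by quadrature."""
    integrand = lambda z: (f_pdf(z) ** (1 + a)
                           - (1 + 1 / a) * g_pdf(z) * f_pdf(z) ** a
                           + (1 / a) * g_pdf(z) ** (1 + a))
    return quad(integrand, lo, hi)[0]

# true model g = N(0, 1) and fitted model f = N(0.5, 1.2^2): illustrative choices only
g = stats.norm(0.0, 1.0).pdf
f = stats.norm(0.5, 1.2).pdf

kl = quad(lambda z: g(z) * np.log(g(z) / f(z)), -15, 15)[0]   # Kullback-Leibler distance
for a in (1.0, 0.5, 0.1, 0.01, 0.001):
    print(a, d_a(g, f, a), kl)   # d_a(g, f) approaches the KL distance as a -> 0
```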

Consider a random sample $X_1, \ldots, X_n$ from the true distribution $g$ and a
candidate model $f_{\vartheta}$ from a parametric family of models $\{ f_{\vartheta} \}$, indexed by an
unknown parameter $\vartheta \in \Theta$. The following Lemma provides the first derivative of
Basu's power divergence (2.1) between $g$ and $f_{\vartheta}$.

Lemma 2.2. The first derivative of (2.1) with $f = f_{\vartheta}$ is:

$$(a+1) \left[ \int u_{\vartheta}(z) f_{\vartheta}^{1+a}(z)\, dz - E_g\!\left( u_{\vartheta}(Z) f_{\vartheta}^{a}(Z) \right) \right], \qquad a > 0,$$

where $u_{\vartheta} = \frac{\partial}{\partial \vartheta} \left( \log(f_{\vartheta}) \right)$ and $Z$ is a random variable with distribution $g$.

Proof.

$$\begin{aligned}
\frac{\partial}{\partial \vartheta}\, d_a(g, f_{\vartheta})
&= (a+1) \int f_{\vartheta}^{a}(z) f_{\vartheta}'(z)\, dz - \left( \frac{a+1}{a} \right) E_g\!\left( a f_{\vartheta}^{a-1}(Z) f_{\vartheta}'(Z) \right) \\
&= (a+1) \left[ \int \frac{\partial}{\partial \vartheta}\!\left( \log f_{\vartheta}(z) \right) f_{\vartheta}^{1+a}(z)\, dz - E_g\!\left( \frac{\partial}{\partial \vartheta}\!\left( \log f_{\vartheta}(Z) \right) f_{\vartheta}^{a}(Z) \right) \right] \\
&= (a+1) \left[ \int u_{\vartheta}(z) f_{\vartheta}^{1+a}(z)\, dz - E_g\!\left( u_{\vartheta}(Z) f_{\vartheta}^{a}(Z) \right) \right].
\end{aligned}$$


The minimum density power divergence estimator $\hat{\theta}$ of the parameter $\vartheta$ is
generated by minimising

$$\int f_{\vartheta}^{1+a}(z)\, dz - \left(1 + \frac{1}{a}\right) n^{-1} \sum_{i=1}^{n} f_{\vartheta}^{a}(X_i) \tag{2.2}$$

with respect to $\vartheta$, since $n^{-1} \sum_{i=1}^{n} f_{\vartheta}^{a}(X_i)$ is an estimate of
$\int g(z) f_{\vartheta}^{a}(z)\, dz \equiv E_g\!\left( f_{\vartheta}^{a}(Z) \right)$.

For general families, as can easily be seen from equation (2.2) and Lemma 2.2,
the estimating equations are of the form

$$U_n(\vartheta) \equiv n^{-1} \sum_{i=1}^{n} u_{\vartheta}(X_i) f_{\vartheta}^{a}(X_i) - \int u_{\vartheta}(z) f_{\vartheta}^{1+a}(z)\, dz = 0, \tag{2.3}$$

where $u_{\vartheta}(z) = \partial \log f_{\vartheta}(z) / \partial \vartheta$ is the maximum likelihood score function. Note
that this estimating equation is unbiased when $g = f_{\vartheta}$.
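As a hedged illustration of the estimation step (not part of the paper), the Python sketch below minimises the empirical objective (2.2) for a normal candidate model $N(\mu, \sigma^2)$, for which $\int f_{\vartheta}^{1+a}(z)\,dz$ has the closed form $(2\pi\sigma^2)^{-a/2}/\sqrt{1+a}$; the contaminated sample and the particular values of $a$ are arbitrary choices made for the demonstration:

```python
import numpy as np
from scipy import stats
from scipy.optimize import minimize

def dpd_objective(params, x, a):
    """Empirical objective (2.2) for a normal model f_theta = N(mu, sigma^2)."""
    mu, log_sigma = params
    sigma = np.exp(log_sigma)                                  # keep sigma positive
    int_f_1pa = (2 * np.pi * sigma ** 2) ** (-a / 2) / np.sqrt(1 + a)
    return int_f_1pa - (1 + 1 / a) * np.mean(stats.norm.pdf(x, mu, sigma) ** a)

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0, 1, 95), rng.normal(8, 1, 5)])   # 5% gross outliers

for a in (0.1, 0.25, 0.5):
    res = minimize(dpd_objective, x0=[np.median(x), 0.0], args=(x, a))
    print(a, res.x[0], np.exp(res.x[1]))   # location/scale estimates resist the outliers
print("MLE:", x.mean(), x.std())           # the a -> 0 limit: sample mean and sd
```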
Some motivation for the form of the divergence (2.1) can be obtained by looking at
the location model, where $\int f_{\vartheta}^{1+a}(z)\, dz$ is independent of $\vartheta$. In this case, the
proposed estimators maximise $\sum_{i=1}^{n} f_{\vartheta}^{a}(X_i)$, with the corresponding estimating
equations being of the form

$$\sum_{i=1}^{n} u_{\vartheta}(X_i) f_{\vartheta}^{a}(X_i) = 0. \tag{2.4}$$

This can be viewed as a weighted version of the efficient maximum likelihood score
equation. When a > 0 , (2.4) provides a relative-to-the-model downweighting for
outlying observations; observations that are wildly discrepant with respect to the
model will get nearly zero weights. In the fully efficient case a = 0 , all
observations, including very severe outliers, get weights equal to one.
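The downweighting can be seen directly from the weights $f_{\vartheta}^{a}(X_i)$ attached to the score terms in (2.4). The short Python check below (an illustration with made-up numbers, not taken from the paper) prints the relative weights that a fitted standard normal assigns to a few typical points and to two gross outliers:

```python
import numpy as np
from scipy import stats

# Weights f_theta^a(X_i) from (2.4) for a fitted N(0, 1) model
x = np.array([-1.2, 0.3, 0.8, 1.5, 6.0, 9.0])   # the last two points are gross outliers
for a in (0.0, 0.25, 0.5):
    w = stats.norm.pdf(x, 0.0, 1.0) ** a
    print(a, np.round(w / w.max(), 4))   # relative weights: outliers get ~0 when a > 0
```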
To construct the new criterion for goodness of fit we shall consider the quantity:

$$W_{\vartheta} = \int \left\{ f_{\vartheta}^{1+a}(z) - \left(1 + \frac{1}{a}\right) g(z) f_{\vartheta}^{a}(z) \right\} dz, \qquad a > 0, \tag{2.5}$$

which is the same as (2.1) without the last term, which remains constant irrespective of
the model $f_{\vartheta}$ used. Observe that the quantity (2.5) can also be written as:

$$W_{\vartheta} = \int f_{\vartheta}^{1+a}(z)\, dz - \left(1 + \frac{1}{a}\right) E_g\!\left( f_{\vartheta}^{a}(Z) \right), \qquad a > 0. \tag{2.6}$$

Our target theoretical quantity is

$$E\!\left( W_{\hat{\theta}} \right), \tag{2.7}$$

where $\hat{\theta}$ is the estimator of the parameter that minimizes (2.2). Observe that $E\!\left( W_{\hat{\theta}} \right)$
can be viewed as the average distance between $g$ and $f_{\vartheta}$ and is known as the
expected overall discrepancy between $g$ and $f_{\vartheta}$. Note that the target quantity gets a
different value for each candidate model $f_{\vartheta}$ used. Our purpose is to obtain unbiased
estimates of the theoretical quantities (2.7), for each $f_{\vartheta}$, which will then be used as a
new criterion for model selection denoted by DIC (Divergence Information
Criterion). The model $f_{\vartheta}$ selected will be the one for which DIC will be minimized.
This is discussed in Section 3.
The following Lemma provides the second derivative of (2.6). Observe that the first
derivative of (2.6) is given in Lemma 2.2.

Lemma 2.3. The second derivative of (2.6) is:

$$\frac{\partial^2 W_{\vartheta}}{\partial \vartheta^2} = (a+1) \Big\{ (a+1) \int \left[ u_{\vartheta}(z) \right]^2 f_{\vartheta}^{1+a}(z)\, dz - \int i_{\vartheta}\, f_{\vartheta}^{1+a}\, dz + E_g\!\left( i_{\vartheta}(Z) f_{\vartheta}^{a}(Z) \right) - E_g\!\left( a \left[ u_{\vartheta}(Z) \right]^2 f_{\vartheta}^{a}(Z) \right) \Big\},$$

where $u_{\vartheta} = \frac{\partial}{\partial \vartheta}\!\left( \log(f_{\vartheta}) \right)$ and $i_{\vartheta} = -\left( u_{\vartheta} \right)' = -\frac{\partial^2}{\partial \vartheta^2}\!\left( \log(f_{\vartheta}) \right)$.

Proof.

$$\begin{aligned}
\frac{\partial^2 W_{\vartheta}}{\partial \vartheta^2}
&= (a+1) \Big\{ \int \left[ -i_{\vartheta}(z) f_{\vartheta}^{1+a}(z) + u_{\vartheta}(z)(a+1) f_{\vartheta}^{a}(z) f_{\vartheta}'(z) \right] dz \\
&\qquad\qquad - E_g\!\left( -i_{\vartheta}(Z) f_{\vartheta}^{a}(Z) + a\, u_{\vartheta}(Z) f_{\vartheta}^{a-1}(Z) f_{\vartheta}'(Z) \right) \Big\} \\
&= (a+1) \Big\{ \int \left[ (a+1) \left[ u_{\vartheta}(z) \right]^2 f_{\vartheta}^{1+a}(z) - i_{\vartheta}(z) f_{\vartheta}^{1+a}(z) \right] dz \\
&\qquad\qquad - E_g\!\left( -i_{\vartheta}(Z) f_{\vartheta}^{a}(Z) + a \left[ u_{\vartheta}(Z) \right]^2 f_{\vartheta}^{a}(Z) \right) \Big\} \\
&= (a+1) \Big\{ (a+1) \int \left[ u_{\vartheta}(z) \right]^2 f_{\vartheta}^{1+a}(z)\, dz - \int i_{\vartheta}\, f_{\vartheta}^{1+a}(z)\, dz \\
&\qquad\qquad + E_g\!\left( i_{\vartheta}(Z) f_{\vartheta}^{a}(Z) \right) - a\, E_g\!\left( \left[ u_{\vartheta}(Z) \right]^2 f_{\vartheta}^{a}(Z) \right) \Big\}.
\end{aligned}$$

Lemma 2.4. If the true distribution $g$ belongs to the parametric family $\{ f_{\vartheta} \}$, then the
second derivative of (2.6) simplifies to:

$$\frac{\partial^2 W_{\vartheta}}{\partial \vartheta^2} = (a+1) J, \tag{2.8}$$

where $J = \int \left[ u_{\vartheta}(z) \right]^2 f_{\vartheta}^{1+a}(z)\, dz$. Also, the first derivative of (2.6), under the same
assumption, is equal to 0.

Proof. If the true distribution $g$ belongs to the parametric family $\{ f_{\vartheta} \}$, then:

$$E_g\!\left( \left[ u_{\vartheta}(Z) \right]^2 f_{\vartheta}^{a}(Z) \right) = \int \left[ u_{\vartheta}(z) \right]^2 f_{\vartheta}^{1+a}(z)\, dz$$

and

$$E_g\!\left( i_{\vartheta}(Z) f_{\vartheta}^{a}(Z) \right) = \int i_{\vartheta}\, f_{\vartheta}^{1+a}(z)\, dz,$$

so that

$\frac{\partial^2 W_{\vartheta}}{\partial \vartheta^2} = (a+1) J$. It is obvious that the first derivative is 0.

Theorem 2.1. Under the assumptions of Lemma 2.4, the expected overall
discrepancy at $\vartheta = \hat{\theta}$ is given by

$$E\!\left( W_{\hat{\theta}} \right) = W_{\theta} + \frac{(a+1)}{2}\, E\!\left[ \left( \hat{\theta} - \theta \right)^2 J(\theta) \right]. \tag{2.9}$$

Proof. Using a Taylor expansion of the quantity $W_{\vartheta}$ around the true parameter $\theta$ and
under the assumptions of Lemma 2.4, $W_{\vartheta}$ simplifies to:

$$W_{\vartheta} = W_{\theta} + \frac{(a+1)}{2} \left( \vartheta - \theta \right)^2 J(\theta), \tag{2.10}$$

where $J(\theta) = \int \left[ u_{\theta}(z) \right]^2 f_{\theta}^{1+a}(z)\, dz$.

It is easily seen that the expectation of $W_{\vartheta}$ at $\vartheta = \hat{\theta}$ is given by (2.9).

3. THE NEW CRITERION DIC
In this section we introduce the new criterion, which we show to be an
approximately unbiased estimator of (2.7). First we have to estimate (2.6), because the
true distribution $g$ is unknown. So we use a quantity based on the empirical distribution
function and define $Q_{\vartheta}$ to be:

$$Q_{\vartheta} = \int f_{\vartheta}^{1+a}(z)\, dz - \left(1 + \frac{1}{a}\right) \frac{1}{n} \sum_{i=1}^{n} f_{\vartheta}^{a}(X_i), \qquad a > 0. \tag{3.1}$$
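As a small numerical aside (not from the paper), the sketch below checks that $Q_{\vartheta}$ of (3.1) tracks $W_{\vartheta}$ of (2.6) at the true parameter for a standard normal model; when $g = f_{\theta}$, $W_{\theta}$ reduces to $-\frac{1}{a}\int f_{\theta}^{1+a}(z)\,dz$, and the value of $a$ and the sample sizes are arbitrary choices:

```python
import numpy as np
from scipy import stats

a, mu, sigma = 0.25, 0.0, 1.0
int_f = (2 * np.pi * sigma ** 2) ** (-a / 2) / np.sqrt(1 + a)   # int f_theta^{1+a} dz
w_theta = int_f - (1 + 1 / a) * int_f                            # W_theta when g = f_theta

rng = np.random.default_rng(2)
for n in (50, 500, 5000, 50000):
    x = rng.normal(mu, sigma, n)
    q_theta = int_f - (1 + 1 / a) * np.mean(stats.norm.pdf(x, mu, sigma) ** a)
    print(n, q_theta, w_theta)   # Q_theta approaches W_theta as n grows
```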

The following Lemma provides the derivatives of $Q_{\vartheta}$.

Lemma 3.1. The first derivative of (3.1) is:

$$\frac{\partial Q_{\vartheta}}{\partial \vartheta} = (a+1) \left[ \int u_{\vartheta}(z) f_{\vartheta}^{1+a}(z)\, dz - \frac{1}{n} \sum_{i=1}^{n} u_{\vartheta}(X_i) f_{\vartheta}^{a}(X_i) \right], \qquad a > 0.$$

The second derivative of (3.1) is:

$$\frac{\partial^2 Q_{\vartheta}}{\partial \vartheta^2} = (a+1) \Big\{ (a+1) \int \left[ u_{\vartheta}(z) \right]^2 f_{\vartheta}^{1+a}(z)\, dz - \int i_{\vartheta}\, f_{\vartheta}^{1+a}\, dz + \frac{1}{n} \sum_{i=1}^{n} i_{\vartheta}(X_i) f_{\vartheta}^{a}(X_i) - \frac{1}{n} \sum_{i=1}^{n} a \left[ u_{\vartheta}(X_i) \right]^2 f_{\vartheta}^{a}(X_i) \Big\},$$

where $u_{\vartheta} = \frac{\partial}{\partial \vartheta}\!\left( \log(f_{\vartheta}) \right)$ and $i_{\vartheta} = -\left( u_{\vartheta} \right)' = -\frac{\partial^2}{\partial \vartheta^2}\!\left( \log(f_{\vartheta}) \right)$.

Proof. The proof is similar to the proofs of Lemma 2.2 and Lemma 2.3.
The following theorem has been proved by Basu et al. (1998).

Theorem 3.1 [Basu et al. (1998)]. Under certain regularity conditions, for $\hat{\theta}$ which
minimizes (2.2), we have, as $n \to \infty$,
(i) $\hat{\theta}$ is consistent for $\theta$, and
(ii) $n^{1/2}\left( \hat{\theta} - \theta \right)$ is asymptotically normal with mean zero and variance $J^{-2} K$, where
$J = J(\theta)$ and $K = K(\theta)$, under the assumption that the true distribution $g$ belongs to the
parametric family $\{ f_{\vartheta} \}$ with $\theta$ the true value of the parameter, are given by:

$$J = \int \left[ u_{\theta}(z) \right]^2 f_{\theta}^{1+a}(z)\, dz \tag{3.2}$$

$$K = \int \left[ u_{\theta}(z) \right]^2 f_{\theta}^{1+2a}(z)\, dz - \xi^2, \tag{3.3}$$

where $\xi = \int u_{\theta}(z) f_{\theta}^{1+a}(z)\, dz$.

By the weak law of large numbers,

$$\left[ \frac{\partial Q_{\vartheta}}{\partial \vartheta} \right]_{\theta} \xrightarrow{P} \left[ \frac{\partial W_{\vartheta}}{\partial \vartheta} \right]_{\theta} \tag{3.4}$$

and

$$\left[ \frac{\partial^2 Q_{\vartheta}}{\partial \vartheta^2} \right]_{\theta} \xrightarrow{P} \left[ \frac{\partial^2 W_{\vartheta}}{\partial \vartheta^2} \right]_{\theta} \tag{3.5}$$

as $n \to \infty$.
Theorem 3.2. The expectation of $Q_{\vartheta}$ evaluated at $\theta$ is given by

$$E\!\left( Q_{\theta} \right) \equiv E\!\left( Q_{\hat{\theta}} \right) + \frac{a+1}{2}\, E\!\left[ \left( \theta - \hat{\theta} \right)^2 J \right].$$

Proof. Since $\hat{\theta} \xrightarrow{P} \theta$ as $n \to \infty$, from Theorem 3.1, equations (3.4) and (3.5),
and under the assumption that the true distribution $g$ belongs to the parametric family
$\{ f_{\vartheta} \}$, and from Lemma 2.4, we have:

$$\left[ \frac{\partial Q}{\partial \vartheta} \right]_{\hat{\theta}} \xrightarrow{P} 0$$

and

$$\left[ \frac{\partial^2 Q}{\partial \vartheta^2} \right]_{\hat{\theta}} \xrightarrow{P} (a+1) J,$$

so that for large $n$ we have, for a Taylor expansion of the quantity $Q_{\vartheta}$ around the
estimator $\hat{\theta}$ which minimizes Basu's discrepancy, the following approximation:

$$Q_{\vartheta} = Q_{\hat{\theta}} + \frac{a+1}{2} \left( \vartheta - \hat{\theta} \right)^2 J.$$

Substituting the true value $\theta$ for $\vartheta$ and taking expectations on both sides we have
the desired result.
Theorem 3.3. The expected overall discrepancy evaluated at $\hat{\theta}$ is given by:

$$E\!\left[ W_{\hat{\theta}} \right] = E\!\left( Q_{\hat{\theta}} \right) + (a+1)\, E\!\left[ \left( \hat{\theta} - \theta \right)^2 J \right].$$

Proof. Observe that the expectation of (3.1) for $\vartheta = \theta$ yields $E\!\left( Q_{\theta} \right) = W_{\theta}$.
Combining the above relation and the result of Theorem 3.2 and using equation
(2.9) we obtain the desired result for the expected overall discrepancy.
The above result for $p$-dimensional $\hat{\theta}$ can be expressed as:

$$E\!\left[ W_{\hat{\theta}} \right] = E\!\left( Q_{\hat{\theta}} \right) + (a+1)\, E\!\left[ \left( \hat{\theta} - \theta \right)' J \left( \hat{\theta} - \theta \right) \right].$$

Taking into consideration [see Basu et al. (1998)] that

$$J = (2\pi)^{-\frac{a}{2}} \left(1+a\right)^{-\left(1+\frac{p}{2}\right)} \Sigma^{-\left(1+\frac{a}{2}\right)} \qquad \text{and} \qquad \operatorname{Var}\!\left(\hat{\theta}\right) = \left(1 + \frac{a^2}{1+2a}\right)^{1+\frac{p}{2}} \Sigma,$$

it can be easily seen that

$$\left( \hat{\theta} - \theta \right)' J \left( \hat{\theta} - \theta \right) = (2\pi)^{-\frac{a}{2}} \left( \frac{1+a}{1+2a} \right)^{1+\frac{p}{2}} \left( \hat{\theta} - \theta \right)' \Sigma^{-\frac{a}{2}} \left[ \operatorname{Var}\!\left(\hat{\theta}\right) \right]^{-1} \left( \hat{\theta} - \theta \right)$$

and that for small $a$

$$\left( \hat{\theta} - \theta \right)' \Sigma^{-\frac{a}{2}} \left[ \operatorname{Var}\!\left(\theta\right) \right]^{-1} \left( \hat{\theta} - \theta \right)$$

is approximately $\chi^2_p$, where $\left[ \operatorname{Var}\!\left(\theta\right) \right]$ is the $p \times p$
asymptotic covariance matrix of the maximum likelihood estimator of the $p$-dimensional
parameter $\theta$. The new criterion DIC is defined by

$$DIC = Q_{\hat{\theta}} + (a+1)\, (2\pi)^{-\frac{a}{2}} \left( \frac{1+a}{1+2a} \right)^{1+\frac{p}{2}} p \tag{3.6}$$

such that

$$E\!\left( DIC \right) \approx E\!\left( W_{\hat{\theta}} \right),$$

which implies that DIC is an approximately unbiased estimator of $E\!\left( W_{\hat{\theta}} \right)$.
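To indicate how (3.6) might be used in practice, here is a hedged Python sketch (not from the paper) that computes DIC for two candidate families fitted to the same sample, a normal and a Laplace model, each with $p = 2$ estimated parameters. For simplicity the parameters are estimated by maximum likelihood rather than by the minimum density power divergence estimator of Section 2, and the closed forms of $\int f^{1+a}\,dz$ for these two families are used:

```python
import numpy as np
from scipy import stats

def dic_penalty(a, p):
    """Penalty term of (3.6): (a+1)(2*pi)^(-a/2) ((1+a)/(1+2a))^(1+p/2) p."""
    return (a + 1) * (2 * np.pi) ** (-a / 2) * ((1 + a) / (1 + 2 * a)) ** (1 + p / 2) * p

def dic_normal(x, a):
    """DIC for a fitted normal model; int f^{1+a} dz = (2*pi*sigma^2)^(-a/2)/sqrt(1+a)."""
    mu, sigma = x.mean(), x.std()
    int_f = (2 * np.pi * sigma ** 2) ** (-a / 2) / np.sqrt(1 + a)
    q = int_f - (1 + 1 / a) * np.mean(stats.norm.pdf(x, mu, sigma) ** a)
    return q + dic_penalty(a, p=2)

def dic_laplace(x, a):
    """DIC for a fitted Laplace model; int f^{1+a} dz = (2b)^(-a)/(1+a)."""
    mu = np.median(x)
    b = np.mean(np.abs(x - mu))
    int_f = (2 * b) ** (-a) / (1 + a)
    q = int_f - (1 + 1 / a) * np.mean(stats.laplace.pdf(x, mu, b) ** a)
    return q + dic_penalty(a, p=2)

rng = np.random.default_rng(1)
x = rng.normal(2.0, 1.5, size=1000)          # data actually generated from a normal model
print("normal :", dic_normal(x, a=0.1))      # the candidate with the smaller DIC is selected
print("laplace:", dic_laplace(x, a=0.1))
```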

4. DISCUSSION
Note that the family of density power divergences is indexed by a single parameter $a$ which
controls the trade-off between robustness and asymptotic efficiency of the parameter
estimators, which are the minimizers of this family of divergences. When $a \to 0$,
Basu’s density power divergence is the Kullback-Leibler divergence and the method
is maximum likelihood estimation; when a = 1 , the divergence is the L2 -distance,
and a robust but inefficient minimum mean squared error estimator ensues.
We are most interested in small values of a > 0 , near zero. There can be no universal
way of selecting an appropriate a parameter when applying our estimation methods.
The value of $a$ specifies the underlying distance measure and typically dictates to
what extent the resulting methods become statistically more robust than the maximum
likelihood methods, and should be thought of as an algorithmic parameter. The
robustness of the proposed method can be easily understood in the case of the
location model, where for $a > 0$ the estimating equations given in (2.4) provide a
downweighting for observations wildly discrepant with respect to the underlying
model. One way of selecting the parameter $a$ is to fix the efficiency loss, at the ideal
parametric model employed, at some low level, like five or ten percent. Other ways
could in some practical applications involve prior notions of the extent of
contamination of the model.
This criterion could be used in applications where outliers or contaminated
observations are involved. Preliminary simulations with a contamination proportion
of approximately 10% show that DIC has a tendency towards underestimation, selecting
the true model as well as smaller models, in contrast with AIC, which overestimates
the true model.

Acknowledgments
The authors wish to thank Tasos Christofides for fruitful conversations and an
anonymous referee for insightful comments and suggestions that greatly improved the
quality of the paper.
ΠΕΡΙΛΗΨΗ (Summary)
The aim of this work is to develop a new model selection criterion using a general
technique based on measures of discrepancy or divergence. The new criterion is
constructed using the Power Divergence introduced by Basu et al. (1998) and is shown
to be an asymptotically unbiased estimator of the expected overall discrepancy
between the true and the fitted models.

REFERENCES
Akaike, H. (1973). Information theory and an extension of the maximum likelihood
principle. Second International Symposium on Information Theory. (B. N. Petrov
and F. Csaki, eds.), 267-281, Akademiai Kiado, Budapest.
Basu, A., Harris, I. R., Hjort, N. L. and Jones, M. C. (1998). Robust and efficient
estimation by minimising a density power divergence. Biometrika, 85, no. 3, 549–
559.
Beran, R. (1977). Minimum Hellinger distance estimates for parametric models, Ann.
Statist., 5, 445–463.
Cressie, N. and Read, T. R. C. (1984). Multinomial goodness-of-fit tests. J. R. Statist.
Soc., B 46, 440–454.
Cressie, N. and Read, T. R. C. (1988). Goodness-of-Fit Statistics for Discrete
Multivariate Data, Springer Verlag, New York.

Lindsay, B. G. (1994). Efficiency versus robustness: the case for minimum
Hellinger distance and related methods. Ann. Statist., 22, 1081-1114.
