Journal of Business & Economic Statistics, July 2014, Vol. 32, No. 3, 445–456
DOI: 10.1080/07350015.2014.903086

Varying Naïve Bayes Models With Applications to Classification of Chinese Text Documents
Guoyu GUAN and Jianhua GUO
Key Laboratory for Applied Statistics of the Ministry of Education, and School of Mathematics and Statistics,
Northeast Normal University, Changchun 130024, P. R. China (guangy599@nenu.edu.cn; jhguo@nenu.edu.cn)

Hansheng WANG
Department of Business Statistics and Econometrics, Guanghua School of Management, Peking University,
Beijing 100871, P. R. China (hansheng@gsm.pku.edu.cn)

Document classification is an area of great importance for which many classification methods have been developed. However, most of these methods cannot generate time-dependent classification rules. Thus, they are not the best choices for problems with time-varying structures. To address this problem, we propose a varying naïve Bayes model, which is a natural extension of the naïve Bayes model that allows for a time-dependent classification rule. A kernel smoothing method is developed for parameter estimation, and a BIC-type criterion is proposed for feature selection. Asymptotic theory is developed and numerical studies are conducted. Finally, the proposed method is demonstrated on a real dataset, which was generated by the Mayor Public Hotline of Changchun, the capital city of Jilin Province in Northeast China.
KEY WORDS: BIC; Chinese document classification; Screening consistency; Time-dependent classification rule.

1. INTRODUCTION

Document classification (Manevitz and Yousef 2001) is an important area of statistical applications, for which many methods have been developed. These methods include linear discriminant analysis (Johnson and Wichern 2003; Fan and Fan 2008; Leng 2008; Qiao et al. 2010; Shao et al. 2011; Mai, Zou, and Yuan 2012, LDA), support vector machines (Wang, Zhu, and Zou 2006; Zhang 2006; Wu and Liu 2007; Liu, Zhang, and Wu 2011, SVM), classification and regression trees (Breiman et al. 1998; Hastie, Tibshirani, and Friedman 2001, CART), k-nearest neighbors (Ripley 1996; Hastie, Tibshirani, and Friedman 2001, KNN), boosting (Bühlmann and Yu 2003; Zou, Zhu, and Hastie 2008; Zhu et al. 2009), random forests (Breiman 2001; Biau, Devroye, and Lugosi 2008, RF), and many others. Recently, Wu and Liu (2013) proposed a novel SVM method for functional data; see also Ramsay and Silverman (2005) for more relevant discussions. However, how to construct a classification rule for usual (nonfunctional) data with a time-varying structure remains unclear. Thus, practical applications with time-varying data structures demand novel classification methods with time-dependent structures.
We propose a novel solution to address this problem called the varying naïve Bayes (VNB) model, which is a natural extension of the original naïve Bayes model (Lewis 1998; Murphy 2012, NB). The key difference is that the parameters in VNB are assumed to vary smoothly according to a continuous index variable (e.g., time). This generates a time-dependent classification rule. The proposed method is motivated by real applications. For example, the Mayor Public Hotline (MPH) is an important project led by the local government of Changchun city, the capital city of Jilin Province in Northeast China, which has an area of 20,604 km² and a population of 7.9 million. The MPH project


gives local residents the opportunity to call the Mayor’s office
and report various public issues via the appeal hotline “12345.”
The typical issues reported include: local crime, education, public utility, transportation, and many others. Each phone call is
recorded and converted into a text message in Chinese by an operator. Next, experienced staff in the Mayor’s office are required
to manually classify these documents into different classes according to their corresponding functional departments in the
local government (e.g., transportation department and police
department).
Clearly, manually classifying these documents accurately into the correct functional departments is a challenging task, both technically and physically. Technically, there are many departments in the local government. Thus, the staff responsible for this task need to be highly familiar with the structure of the entire government and the functionality of each department. Physically, the total number of text documents that need to be processed every day is tremendous. Because the MPH project has operated successfully over the years, the daily volume of MPH calls has increased from about 10 calls in 2000 to several thousand at present. Classifying this huge number of documents by human labor is a virtually impossible task. Thus, addressing this challenging task by statistical learning becomes a problem of major practical importance.
To address this problem, we employ the concept of vector
space modeling (Lee, Chuang, and Seamons 1997) and collect

a bag of the most frequently used Chinese keywords. These


keywords are indexed by j with 1 ≤ j ≤ p. Subsequently, for a given document i (1 ≤ i ≤ n), we define a binary feature Xij = 1 if the jth keyword appears in the ith document, and we define Xij = 0 otherwise. In this manner, the original text document i is converted into a binary vector Xi = (Xi1, . . . , Xip)⊤ ∈ {0, 1}p. Because the total number of keywords (i.e., p) involved in the MPH project is huge, the feature dimension p is ultrahigh. Let Yi be the corresponding class label (i.e., the associated functional department). To solve the problem, a classification method is required so that a regression relationship can be constructed from Xi to Yi. Among the possible solutions, we find that naïve Bayes (NB) is particularly attractive for the following reasons.
First, NB is theoretically elegant. More specifically, NB assumes that different binary features are mutually independent after conditioning on the class label. Therefore, the likelihood function can be derived analytically and the maximum likelihood estimator can be obtained easily. As a result, NB is computationally easier than more sophisticated methods, such as the latent Dirichlet allocation method (Blei, Ng, and Jordan 2003). In addition, the actual performance of NB on real datasets is competitive. Depending on the dataset, NB might not be the single best classification method. However, its performance is competitive across many different datasets. Thus, it has been ranked as one of the top ten algorithms in data mining; see, for example, Lewis (1998) and Wu and Kumar (2008). It is remarkable that some other popular document representation schemes, such as term (keyword) frequency-inverse document frequency (Salton and McGill 1983), do not perform well on the MPH dataset. This is because the frequencies of most keywords are extremely low in a single document: they are either one or zero in most cases, because the majority of keywords appear no more than once in each document.
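As an illustration of this encoding, the following is a minimal Python sketch that builds the binary feature matrix from already-segmented documents. The keyword list and the two toy documents are hypothetical stand-ins, not the actual MPH vocabulary.

```python
import numpy as np

# Hypothetical keyword vocabulary (in practice, the p most frequently
# used Chinese keywords extracted from the corpus).
keywords = ["交通", "噪音", "供暖", "教育", "治安"]

# Two toy documents, already segmented into words.
documents = [["供暖", "不足", "噪音"],
             ["交通", "拥堵", "交通"]]

# X[i, j] = 1 if the j-th keyword appears in the i-th document, 0 otherwise.
# Repeated occurrences still yield 1, matching the binary encoding above.
X = np.array([[int(kw in doc) for kw in keywords] for doc in documents])
print(X)  # [[0 1 1 0 0]
          #  [1 0 0 0 0]]
```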
Despite its popularity in practice, the performance of NB can be improved further. This is particularly true for the MPH dataset. In particular, we find that MPH documents recorded at different times of day (i.e., the recording time) might follow different classification patterns. For example, traffic issues are more likely to be reported during rush hour and less likely at midnight. Unfortunately, the classical NB method cannot use this valuable information. As a result, its prediction accuracy is suboptimal. This motivated us to develop the VNB method. This new method adopts a standard naïve Bayes formulation for documents recorded at the same time of day. However, documents recorded at different times are allowed to have different classification patterns. This is achieved by allowing the model parameters to vary smoothly and nonparametrically according to the recording time. A kernel smoothing technique is used to estimate the unknown parameters, and a BIC-type criterion is proposed to identify important features. The new method's outstanding performance is confirmed numerically on both simulated data and the MPH dataset. Our research is motivated by the MPH project, but the methodology we have developed is applicable to any classification problem with binary features and a time-varying structure.
The remainder of this article is organized as follows. The
next section describes the VNB method. The methods used for



parameter estimation and feature screening (Fan and Lv 2008;
Wang 2009) are also discussed. Numerical studies based on
both simulated data and the MPH datasets are presented in
Section 3. Finally, the article concludes with a brief discussion
in Section 4. All of the technical details are provided in the
Appendix.

2. THE METHODOLOGY DEVELOPMENT

2.1 Varying Naïve Bayes Model

Recall that each MPH record is indexed by i with 1 ≤ i ≤ n. The associated high-dimensional binary feature is given by Xi = (Xi1, . . . , Xip)⊤ ∈ {0, 1}p, and the corresponding class label is recorded by Yi ∈ {1, . . . , m}. Define P(Yi = k) = πk and P(Xij = 1|Yi = k) = θkj. Then, a standard naïve Bayes (NB) model assumes that

$$P(X_i = x \mid Y_i = k) = \prod_{j=1}^{p} \theta_{kj}^{x_j} (1 - \theta_{kj})^{1 - x_j},$$

where x = (x1, . . . , xp)⊤ ∈ {0, 1}p represents a particular realization of Xi. Next, we use Ui to denote the index variable (i.e., the recording time) of the ith document. Without loss of generality, we assume that Ui has been standardized appropriately such that Ui ∈ [0, 1]. To take Ui into consideration, we propose the following varying naïve Bayes (VNB) model,

$$P(X_i = x \mid Y_i = k, U_i = u) = \prod_{j=1}^{p} \{\theta_{kj}(u)\}^{x_j} \{1 - \theta_{kj}(u)\}^{1 - x_j}, \qquad (2.1)$$

where u is a particular realization of Ui , θkj (u) = P (Xij =
1|Yi = k, Ui = u), and P (Yi = k|Ui = u) = πk (u). Both θkj (u)
and πk (u) are assumed to be unknown but smooth functions in u.
In other words, we assume that different features (i.e., Xij) are independent, conditional on the class label (i.e., Yi) and the index variable (i.e., Ui).
Clearly, the key difference between NB and VNB is that VNB
allows the model parameters (i.e., θkj and πk ) to be related nonparametrically to the recording time u. As a result, VNB adopts
a standard NB formulation for documents recorded at the same
time of day. However, the classification patterns of documents
with different recording times are allowed to be (but not necessarily) different. In an extreme situation, if the classification
pattern does not change according to the recording time, the
parameters involved in (2.1) become constants. Thus, the VNB
model reduces back to a standard NB model. Therefore, VNB
contains NB as a special case. However, if the classification
pattern does change according to the recording time, VNB has a much greater modeling capability and classification accuracy.
To estimate the unknown parameters, we employ the concept
of local kernel smoothing (Fan and Gijbels 1996), which yields


the following simple estimators,

$$\hat\pi_k(u) = \Big\{\sum_{i=1}^{n} K_h(U_i - u)\Big\}^{-1} \sum_{i=1}^{n} K_h(U_i - u) Z_{ik}, \qquad (2.2)$$

$$\hat\theta_{kj}(u) = \Big\{\sum_{i=1}^{n} K_h(U_i - u) Z_{ik}\Big\}^{-1} \sum_{i=1}^{n} K_h(U_i - u) X_{ij} Z_{ik}, \qquad (2.3)$$

where Zik = I(Yi = k), Kh(t) = h^{−1}K(t/h), and the kernel function K(t) is a probability density function symmetric about 0. Then, based on (for example) the results of Li and Racine (2006), we know that these two estimators are consistent, as long as h → 0 and nh → ∞. Following Li and Racine (2006), we set h = αn^{−1/5}, where α is a tuning constant that needs to be selected based on the data. In practice, different α values should be used in the numerators and denominators of (2.2) and (2.3), for different k and j. These are selected by minimizing the cross-validated misclassification error. To simplify the notation, we use a common α throughout the rest of this study. Thus, the posterior probability can be estimated as

$$\hat P(Y_i = k \mid X_i = x, U_i = u) = \frac{\hat\pi_k(u) \prod_{j=1}^{p} \{\hat\theta_{kj}(u)\}^{x_j} \{1 - \hat\theta_{kj}(u)\}^{1 - x_j}}{\sum_{k'} \hat\pi_{k'}(u) \prod_{j=1}^{p} \{\hat\theta_{k'j}(u)\}^{x_j} \{1 - \hat\theta_{k'j}(u)\}^{1 - x_j}}. \qquad (2.4)$$

Subsequently, the unknown functional department of a new observation with X0 = x and U0 = u can be predicted as Ŷ0 = argmaxk P̂(Y0 = k|X0 = x, U0 = u).

2.2 Feature Selection

Because the total number of frequently used Chinese keywords is huge in the MPH project, the feature dimension p is ultrahigh. However, most of the keywords (or features) are irrelevant for classification. As a result, the task of feature screening (Fan and Lv 2008; Wang 2009) becomes important. Intuitively, if a feature j is irrelevant for classification, its response probability, that is, θkj(u), should be the same across all the functional departments (i.e., k) and for almost all of the recording time (i.e., u). Thus, we have E{θkj(Ui) − θj(Ui)}² = 0 for every k, where θj(u) = P(Xij = 1|Ui = u) = Σk πk(u)θkj(u). By contrast, if a feature j is relevant for classification, the response probability θkj(u) will be different for at least two functional departments and for a nontrivial amount of the recording time. As a result, we have E{θkj(Ui) − θj(Ui)}² ≠ 0 for some k. This suggests that whether a feature j is relevant for classification is determined fully by

$$\phi_j = \sum_{k=1}^{m} \pi_k E\{\theta_{kj}(U_i) - \theta_j(U_i)\}^2,$$

where πk = P(Yi = k) = E{πk(Ui)}. Thus, a feature j is relevant for classification if and only if φj ≠ 0. Then, the true model can be defined as MT = {1 ≤ j ≤ p : φj > 0}. We define MF = {1, . . . , p} as the full model. We also define a generic notation M = {j1, . . . , jd} as an arbitrary model with Xij1, . . . , Xijd as the relevant features. Next, for a finite dataset, the parameter φj can be estimated as

$$\hat\phi_j = n^{-1} \sum_{k=1}^{m} \sum_{i=1}^{n} \hat\pi_k \{\hat\theta_{kj}(U_i) - \hat\theta_j(U_i)\}^2,$$

where π̂k = n^{−1} Σi Zik and θ̂j(Ui) = Σk π̂k(Ui)θ̂kj(Ui). Intuitively, features with larger φ̂j values are more likely to be relevant for classification. By contrast, those with smaller φ̂j values are less likely to be relevant. This suggests that the true model MT can be estimated by M̂c = {1 ≤ j ≤ p : φ̂j > c}, for some appropriately determined critical value c ≥ 0. Indeed, by the following theorem, we know that M̂c is consistent for MT, provided the critical value c is selected appropriately.

Theorem 1. Assuming conditions (C1)–(C4) in Appendix A, for any constant α > 0 and c = 2^{−1} minj∈MT φj, we have P(M̂c = MT) → 1 as n → ∞.
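To make the estimators concrete, here is a minimal numpy sketch of (2.2), (2.3), the prediction rule based on (2.4), and the screening statistic φ̂j. All function and variable names are ours, the bandwidth h is taken as given rather than cross-validated, and the code is an illustrative reading of the formulas rather than the authors' implementation.

```python
import numpy as np

def gaussian_kernel(t):
    """Gaussian density, used as the kernel K(t)."""
    return np.exp(-0.5 * t ** 2) / np.sqrt(2 * np.pi)

def vnb_estimates(X, Y, U, u, h, m):
    """Kernel estimators (2.2)-(2.3) at a single time point u.

    X: (n, p) binary features; Y: (n,) labels in {0, ..., m-1};
    U: (n,) recording times in [0, 1]; h: bandwidth.
    Returns pi_hat with shape (m,) and theta_hat with shape (m, p).
    """
    w = gaussian_kernel((U - u) / h) / h                 # K_h(U_i - u)
    Z = np.eye(m)[Y]                                     # Z_ik = I(Y_i = k)
    pi_hat = (w @ Z) / w.sum()                           # equation (2.2)
    theta_hat = (w[:, None] * Z).T @ X / ((w @ Z)[:, None] + 1e-12)  # (2.3)
    return pi_hat, theta_hat

def predict(x, u, X, Y, U, h, m, eps=1e-10):
    """Posterior probabilities (2.4) for a new observation (x, u)."""
    pi_hat, th = vnb_estimates(X, Y, U, u, h, m)
    th = np.clip(th, eps, 1 - eps)                       # guard the logarithms
    loglik = np.log(th) @ x + np.log(1 - th) @ (1 - x) + np.log(pi_hat + eps)
    prob = np.exp(loglik - loglik.max())
    return prob / prob.sum()                             # argmax gives Y_0-hat

def phi_hat(X, Y, U, h, m):
    """Screening statistic phi-hat_j for every feature j."""
    n, p = X.shape
    pi_bar = np.bincount(Y, minlength=m) / n             # pi-hat_k
    phi = np.zeros(p)
    for i in range(n):                                   # average over the U_i
        pi_u, th_u = vnb_estimates(X, Y, U, U[i], h, m)
        th_bar = pi_u @ th_u                             # theta-hat_j(U_i)
        phi += pi_bar @ (th_u - th_bar) ** 2
    return phi / n
```

Under the paper's recipe, one would set h = αn^{−1/5} (or the effective-sample-size variants described in Section 3) with α tuned by cross-validated misclassification error, and then screen with M̂c = {j : φ̂j > c}.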
2.3 Critical Value Selection

According to Theorem 1, we know that M̂c is a consistent estimator of MT, provided the critical value c is selected appropriately. However, the selection of c is not clear in practice. By considering all the possible c values on [0, ∞), a sequence of candidate models can be generated, which constitutes a solution path M = {M̂c : c ≥ 0}. Although there are infinitely many different choices for c, the resulting choices for M̂c are finite. More specifically, let {(1), (2), . . . , (p)} be a particular permutation of {1, 2, . . . , p} such that φ̂(1) > φ̂(2) > · · · > φ̂(p). Thus, the solution path can also be given by M = {M̂(d) : 0 ≤ d ≤ p} with M̂(0) = ∅ and M̂(d) = {(1), (2), . . . , (d)} for 1 ≤ d ≤ p, which is a finite set with a total of p + 1 nested candidate models. According to Theorem 1, we must have MT ∈ M with probability tending to one. Thus, the original problem of critical value determination for c is converted into a problem of model selection with respect to M. Then, we follow Wang and Xia (2009) and consider the following BIC-type criterion,

$$\mathrm{BIC}(\mathcal M) = -2n^{-2} \sum_{i=1}^{n} \sum_{l=1}^{n} K_h(U_i - U_l) \log \hat P(X_i, Y_i \mid U_l, \mathcal M) + df(\mathcal M) \times \frac{\log(nh)}{nh}, \qquad (2.5)$$

where the estimated conditional probability function P̂(Xi, Yi|Ul, M) is defined as

$$\prod_{k=1}^{m} \{\hat\pi_k(U_l)\}^{Z_{ik}} \cdot \prod_{j \in \mathcal M} \prod_{k=1}^{m} \big[\{\hat\theta_{kj}(U_l)\}^{X_{ij}} \{1 - \hat\theta_{kj}(U_l)\}^{1 - X_{ij}}\big]^{Z_{ik}} \cdot \prod_{j' \notin \mathcal M} \{\hat\theta_{j'}(U_l)\}^{X_{ij'}} \{1 - \hat\theta_{j'}(U_l)\}^{1 - X_{ij'}},$$

and df(M) = (m − 1) + m|M| + (p − |M|) is the number of parameters that need to be estimated in P̂(Xi, Yi|Ul, M). Under the assumption 1 ≤ |MT| ≤ n given in Appendix A, the best model is selected as M̂ = M̂(d̂), where d̂ = argmin1≤d≤min{n,p} BIC(M̂(d)). Then, the screening consistency of this method can be established by the following theorem in the sense of Fan and Lv (2008) and Wang (2009). Our extensive


simulation experience also suggests that this BIC-type method
works quite well.
Theorem 2. Assuming conditions (C1)–(C4) in Appendix A, for any constant α > 0, we have P(MT ⊂ M̂) → 1 as n → ∞.
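As a sketch of how the criterion could be computed, the following scans (2.5) along the nested solution path; names are ours, a plain Gaussian kernel with a fixed bandwidth is assumed, and the dense (n, m, p) precomputation is for clarity only, not efficiency.

```python
import numpy as np

def select_features_bic(X, Y, U, h, m, phi, eps=1e-10):
    """Scan the BIC-type criterion (2.5) over the nested path
    M-hat_(d) = {(1), ..., (d)}, with (1), (2), ... sorting phi-hat
    in decreasing order. Returns the selected feature index set."""
    n, p = X.shape
    order = np.argsort(-phi)                           # phi-hat, descending
    Z = np.eye(m)[Y]                                   # Z_ik
    W = np.exp(-0.5 * ((U[:, None] - U[None, :]) / h) ** 2) \
        / (h * np.sqrt(2 * np.pi))                     # W[i, l] = K_h(U_i - U_l)
    pi_hat = W.T @ Z / W.sum(axis=0)[:, None]          # (n, m): pi-hat_k(U_l)
    denom = W.T @ Z                                    # sum_i K_h Z_ik
    theta = np.einsum('il,ik,ij->lkj', W, Z, X) / (denom[:, :, None] + eps)
    theta = np.clip(theta, eps, 1 - eps)               # (n, m, p)
    marg = np.clip(np.einsum('lk,lkj->lj', pi_hat, theta), eps, 1 - eps)
    best, best_bic = order[:1], np.inf
    for d in range(1, min(n, p) + 1):
        M, Mc = order[:d], order[d:]
        ll = 0.0
        for l in range(n):
            th_y = theta[l][Y]                         # rows picked by Y_i
            lp = (np.log(pi_hat[l] + eps)[Y]
                  + (X[:, M] * np.log(th_y[:, M])
                     + (1 - X[:, M]) * np.log(1 - th_y[:, M])).sum(axis=1)
                  + (X[:, Mc] * np.log(marg[l, Mc])
                     + (1 - X[:, Mc]) * np.log(1 - marg[l, Mc])).sum(axis=1))
            ll += W[:, l] @ lp                # sum_i K_h(U_i - U_l) log P-hat
        df = (m - 1) + m * d + (p - d)
        bic = -2.0 * ll / n ** 2 + df * np.log(n * h) / (n * h)
        if bic < best_bic:
            best_bic, best = bic, M
    return best
```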
3. NUMERICAL STUDIES

To demonstrate the finite sample performance of the proposed VNB method, a number of simulation experiments are conducted and the MPH dataset is analyzed. As mentioned in Section 2.1, different α values should be used in the numerators and denominators of (2.2) and (2.3), for different k and j. However, this is not feasible for ultrahigh dimensional features because there are too many parameters. In practice, we set h = αnk^{−1/5} in Σi Kh(Ui − u)Zik and h = αnkj^{−1/5} in Σi Kh(Ui − u)Xij Zik, where nk = Σi Zik and nkj = Σi Zik Xij are the effective sample sizes of these estimators. Throughout the rest of this study, we use the Gaussian kernel K(t) = (√(2π))^{−1} exp{−t²/2}, and the optimal proportionality constant α of the bandwidths is selected by five-fold cross-validation, minimizing the misclassification error. Various sample sizes, feature dimensions, and true model sizes are considered in the simulation studies. For each fixed parameter setting, 100 simulation replications are conducted. For each simulated dataset, the BIC estimator M̂ is obtained. The percentage of incorrect zeros (Fan and Li 2001), that is, PIZ = 100% × |(MF\M̂) ∩ MT|/|MT|, is computed and averaged. Similarly, the percentage of correct zeros is calculated, that is, PCZ = 100% × |(MF\M̂) ∩ (MF\MT)|/|MF\MT|. Finally, to evaluate the classification accuracy of the resulting model, another 1000 independent test samples are generated. The M̂-based VNB classification accuracy is then evaluated on the test samples, and the average misclassification error (AME) is computed. For comparison's sake, the AME values of KNN, AdaB (Adaptive Boosting), RF, SVM, NB, VNBπ, and VNB, all based on the true model MT, are also included. In this case, the VNBπ method refers to the classification rule (2.4) where θ̂kj(u) is replaced with θ̂kj = (Σi Zik Xij)/(Σi Zik). In other words, the parameter πk(u) is still allowed to vary according to u, but θkj is fixed as a constant.
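In this notation, the two screening metrics are simple set arithmetic; the index sets in the small sketch below are hypothetical, chosen only to exercise the formulas.

```python
# PIZ: relevant features wrongly screened out (in percent).
# PCZ: irrelevant features correctly dropped (in percent).
p = 500
M_F = set(range(p))                    # full model
M_T = set(range(20))                   # true model, d0 = 20
M_hat = set(range(15)) | {100, 200}    # a hypothetical selected model

dropped = M_F - M_hat
PIZ = 100 * len(dropped & M_T) / len(M_T)
PCZ = 100 * len(dropped & (M_F - M_T)) / len(M_F - M_T)
print(PIZ, PCZ)                        # 25.0 and roughly 99.6
```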

3.1 Example 1: A Naïve Bayes Model

First, we consider a standard NB model, where the parameters are fixed constants that do not change according to u. Thus, the standard NB method is expected to have the best performance. By contrast, because of the unnecessary noise introduced by the recording time, the VNB method should perform worse. We then use this example to numerically investigate the efficiency loss experienced by the VNB method when the true model is actually NB.

More specifically, the recording time Ui is generated from a uniform distribution on [0, 1]. Next, we generate the associated functional department Yi ∈ {1, . . . , m} with probability P(Yi = k|Ui = u) = P(Yi = k) = 1/m and m = 3. Given Yi and Ui, the jth keyword indicator Xij is generated from a binary distribution with probability P(Xij = 1|Yi = k, Ui = u) = θkj for j ∈ MT, and P(Xij = 1|Yi = k, Ui = u) = θj for j ∉ MT, where MT = {1, . . . , d0} is the true model with size d0. In addition, {θkj}1≤k≤m,j∈MT and {θj}j∉MT are simulated from a uniform distribution on [0.1, 0.9]. Thus, both the parameters πk(u) and θkj(u) are actually constants that do not change according to the recording time u.

The detailed simulation results are provided in Table 1, which shows that the MT-based NB method performs best in terms of AME. This is expected because the true model is indeed NB

Table 1. Simulation results for Example 1

                          MT-based AME(%)                          M̂-based results(%)
p     d0   n      KNN   AdaB   RF    SVM   NB    VNBπ   VNB    AME    PCZ     PIZ
500   20   200    16.3  15.8   13.4  14.2   9.8   9.8   10.1   11.0   99.7    22.3
           500    14.9  13.5   12.2  12.9   9.3   9.4    9.5    9.6   99.9    11.7
           1000   14.3  13.0   11.6  12.4   9.1   9.1    9.4    9.5   99.9     8.9
      30   200    12.4  13.3   10.2  11.2   6.6   6.6    6.8    7.6   99.7    28.1
           500    11.1  11.0    8.7   9.2   5.9   5.9    6.1    6.2   99.9    16.6
           1000   10.7  10.4    8.0   8.7   5.6   5.7    5.9    6.0  100.0    12.2
      50   200     4.8   7.1    4.6   5.6   1.6   1.6    1.7    2.0   99.7    22.3
           500     4.4   5.7    3.7   3.5   1.4   1.4    1.4    1.5   99.9    12.2
           1000    3.6   5.3    3.4   3.1   1.4   1.4    1.5    1.5   99.9     8.0
1000  20   200    15.8  15.7   13.8  14.4   9.9   9.9   10.2   11.5   99.7    21.8
           500    14.4  13.6   12.1  12.6   9.2   9.2    9.4    9.5   99.9    11.3
           1000   14.3  12.8   11.3  12.3   8.9   8.9    9.1    9.2   99.9     9.1
      30   200    13.2  13.6   10.1  11.3   6.5   6.6    6.8    7.9   99.7    29.1
           500    11.6  11.2    8.6   9.1   5.8   5.8    6.1    6.4   99.9    17.3
           1000   10.8  10.4    8.2   8.7   5.6   5.6    5.8    6.0   99.9    11.7
      50   200     5.3   7.4    4.5   5.5   1.7   1.7    1.8    2.1   99.7    23.4
           500     4.5   5.7    3.6   3.5   1.4   1.4    1.5    1.6   99.9    12.3
           1000    4.1   5.1    3.3   3.1   1.4   1.4    1.4    1.4   99.9     8.0


Table 2. Simulation results for Example 2

                          MT-based AME(%)                          M̂-based results(%)
p     d0   n      KNN   AdaB   RF    SVM   NB    VNBπ   VNB    AME    PCZ     PIZ
500   20   200    26.4  30.4   22.9  26.3  24.2  24.6   17.7   24.6   91.9    11.3
           500    24.1  28.1   20.3  24.1  22.9  23.0   15.6   16.4   98.9     9.7
           1000   23.2  27.1   19.5  23.4  22.6  22.7   14.8   15.2  100.0     9.0
      30   200    23.5  29.0   19.3  24.7  22.3  22.7   14.5   19.7   92.9    18.0
           500    20.8  25.7   16.6  21.6  21.0  21.1   12.1   12.9   99.3    14.3
           1000   19.5  24.4   15.4  20.5  20.6  20.7   11.1   11.8   99.9    12.0
      50   200    15.6  24.2   13.9  21.3  17.0  17.2    8.8   11.8   91.7    12.8
           500    13.2  21.4   11.3  16.9  16.1  16.2    6.6    7.1   98.4     9.3
           1000   12.2  20.5   10.0  15.6  15.3  15.4    5.7    6.1   99.9     7.2
1000  20   200    26.1  30.3   22.8  26.0  23.7  24.0   17.7   29.3   91.6    11.2
           500    23.7  27.8   20.5  24.2  23.1  23.3   15.7   16.8   99.3    10.2
           1000   23.3  27.1   19.3  23.5  22.7  22.9   14.8   15.2   99.9     9.1
      30   200    23.2  28.6   19.8  24.4  22.5  22.9   14.7   25.6   90.2    14.8
           500    20.7  25.9   16.7  21.7  21.6  21.8   12.1   13.3   99.1    13.7
           1000   19.8  24.8   15.8  20.8  20.9  21.0   11.3   11.8   99.9    12.0
      50   200    15.6  24.5   13.9  21.2  17.0  17.2    8.9   14.6   92.0    11.9
           500    13.4  21.9   11.3  17.0  15.7  15.8    6.6    7.4   98.6     9.6
           1000   11.9  20.5   10.0  15.6  15.4  15.5    5.6    6.1   99.9     7.5

(not VNB). However, the performance of MT-based VNB is comparable. This suggests that even if the true model is NB, the efficiency loss experienced by VNB is very limited. Next, we interpret the results obtained with the M̂-based VNB method. We find that a larger sample size n results in smaller AME values, if both p and d0 are fixed. This is expected because a larger sample size yields more accurate estimators. Furthermore, we find that the PCZ value approaches 100% and the PIZ value approaches 0% rapidly as n increases. This confirms that the proposed BIC-type criterion is consistent for feature screening. Meanwhile, with fixed p and n, we find that a larger true model size d0 produces smaller AME values, because the inclusion of more relevant features improves the prediction. Finally, with fixed d0 and n, we find that a larger feature dimension p results in worse performance in terms of AME. This is also reasonable because a larger feature dimension is more challenging for feature screening, and it results in worse prediction.

3.2 Example 2: A Varying Naïve Bayes Model

We then consider a situation where the data are generated from a genuine VNB model. We aim to investigate the level of improvement obtained using VNB compared with NB and other possible competitors. More specifically, the functional department Yi and the recording time Ui are generated in the same manner as in Example 1 with m = 3. Conditional on Yi and Ui, the high-dimensional feature Xi is generated according to the VNB model (2.1) with

$$P(X_{ij} = 1 \mid Y_i = k, U_i = u) = \theta_{kj}(u) = \begin{cases} 0.5 + 0.4\sin\{\pi(u + R_{kj})\}, & j \in \mathcal M_T, \\ 0.5 + 0.4\sin\{\pi(u + R_{0j})\}, & j \notin \mathcal M_T, \end{cases}$$

where {Rkj}1≤k≤m,j∈MT and {R0j}j∉MT are simulated from a uniform distribution on [0, 1]. We then follow Example 1 and replicate the experiment randomly 100 times. The detailed results are summarized in Table 2, which shows that NB and the other competitors are not competitive with VNB in terms of AME. This is expected because VNB is the only classification method with the capacity to produce a time-dependent classification rule. The VNBπ method allows π̂k(u) to vary with u, but it keeps θ̂kj(u) as a constant, which results in it performing much worse than VNB. Finally, we find that the M̂-based results are qualitatively similar to those in Table 1. Thus, a more detailed discussion is omitted.
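For reference, here is a minimal sketch of this data-generating process, using one of the Table 2 configurations (p = 500, d0 = 20, n = 200); the variable names are ours.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, m, d0 = 200, 500, 3, 20            # one configuration from Table 2

R_k = rng.uniform(0, 1, size=(m, d0))    # R_kj for relevant features
R_0 = rng.uniform(0, 1, size=p - d0)     # R_0j, shared across classes

U = rng.uniform(0, 1, size=n)            # recording times
Y = rng.integers(0, m, size=n)           # P(Y_i = k | U_i) = 1/m

# theta_kj(u) = 0.5 + 0.4 sin{pi (u + R)}; irrelevant features use R_0j,
# so their response probability does not depend on the class label.
theta = np.empty((n, p))
theta[:, :d0] = 0.5 + 0.4 * np.sin(np.pi * (U[:, None] + R_k[Y]))
theta[:, d0:] = 0.5 + 0.4 * np.sin(np.pi * (U[:, None] + R_0[None, :]))
X = (rng.uniform(size=(n, p)) < theta).astype(int)   # binary keyword indicators
```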
3.3 MPH Data

To demonstrate the practical utility of the proposed VNB method, we consider an MPH dataset. More specifically, the dataset used consists of 13,613 documents from m = 8 different functional departments with p = 4,778 keywords. All of the observations are collected between 8:00 AM and midnight. The recording times are then standardized to [0, 1]. Half of these documents (i.e., n = 6,806) are selected randomly for training, whereas the remainder are used for testing. At each testing recording time, all parameters of the VNB model are estimated by the kernel estimators (2.2) and (2.3) based on the training data. To visualize the time-varying pattern, the estimators {π̂k(u) : 1 ≤ k ≤ 8} with their 90% confidence bands based on 1000 bootstrap replications of the training set are plotted in Figure 1. Similarly, the time-varying pattern of θ̂kj(u) can be obtained in the same manner (e.g., see Figure 2). These two figures show that the confidence bands become wider as u tends to 1. This is because events are more likely to happen in the daytime and less likely to happen at midnight, which results in the effective sample sizes becoming smaller as u tends to 1.


Figure 1. {π̂k(u) : 1 ≤ k ≤ 8} (solid lines) together with their 90% confidence bands (dashed lines) based on 1000 bootstrap replications of the training set. The time-varying pattern can be clearly observed. [Eight panels, k = 1, . . . , 8, with u on the horizontal axis.]

Thus, such a time-dependent classification method should be more practical for this type of data. We then apply the proposed BIC method to the training set, and 489 keywords are selected as relevant features. Based on these selected keywords, a VNB classifier is estimated and its misclassification error (ME) is evaluated on the testing set. For comparison's sake, the ME values of the full-model-based KNN, AdaB, RF, SVM, NB, and VNBπ are also included. Alternatively, a simple way to deal with the recording time is to include it directly as the (p + 1)th feature (i.e., Xi,p+1 = Ui; for NB in particular, set Xi,p+1 = I(Ui > n^{−1} Σi′ Ui′)). The actual performance of this approach is evaluated for all of the methods except VNBπ and VNB. The detailed results are summarized in Table 3. As one can see, VNBπ performs better than NB, which confirms that the class probabilities {π̂k(u) : 1 ≤ k ≤ 8} are truly time-varying. In addition, VNB performs best and improves the prediction accuracy further.
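The bootstrap bands shown in Figures 1 and 2 can be produced along the following lines. This is a sketch under our own naming: it resamples training cases with replacement and re-estimates π̂k(u) of (2.2) on a grid; the θ̂kj(u) bands are analogous.

```python
import numpy as np

def kern(t):
    return np.exp(-0.5 * t ** 2) / np.sqrt(2 * np.pi)

def pi_hat(Y, U, u, h, m):
    """Kernel estimator (2.2) of pi_k(u) at a single point u."""
    w = kern((U - u) / h) / h
    Z = np.eye(m)[Y]
    return (w @ Z) / w.sum()

def bootstrap_pi_band(Y, U, u_grid, h, m, B=1000, level=0.90, seed=0):
    """Pointwise bootstrap confidence bands for pi-hat_k(u) over a grid."""
    rng = np.random.default_rng(seed)
    n = len(Y)
    boot = np.empty((B, len(u_grid), m))
    for b in range(B):
        idx = rng.integers(0, n, size=n)       # resample with replacement
        for g, u in enumerate(u_grid):
            boot[b, g] = pi_hat(Y[idx], U[idx], u, h, m)
    lo = np.quantile(boot, (1 - level) / 2, axis=0)
    hi = np.quantile(boot, (1 + level) / 2, axis=0)
    return lo, hi                               # each of shape (len(u_grid), m)
```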

Figure 2. {θ̂kj1(u) : 1 ≤ k ≤ 8} (solid lines) together with their 90% confidence bands (dashed lines) based on 1000 bootstrap replications of the training set, where j1 = argmaxj∈M̂ {Σk Σi π̂k[θ̂kj(Ui) − θ̂kj]²}. Different time-varying patterns can be clearly observed. [Eight panels, k = 1, . . . , 8, with u on the horizontal axis.]


Table 3. Results of MPH Example

Classification methods           KNN     AdaB    RF     SVM     NB     VNBπ    VNB
ME(%) based on p features        23.58   36.99   6.07   29.19   6.77   5.83    5.32
ME(%) based on p + 1 features    26.47   36.99   6.76   29.15   6.76   —       —


4. CONCLUDING REMARKS

In this study, we developed a novel method, VNB, for Chinese document classification. This new method can be viewed as a natural extension of the classical NB method, but with a much improved modeling capability. A nonparametric kernel method was used to estimate the unknown parameters, and a BIC-type screening criterion (Wang and Xia 2009) was used to select important features. The excellent performance of the new method was confirmed on both simulated data and the MPH dataset. To conclude the article, we identify two interesting topics for future study.

First, the current VNB model only handles a univariate index variable. In more complex situations, the index variable is likely to be multivariate. For example, in addition to the recording time, demographic information (e.g., age and gender) about the caller can also be collected. This generates a multivariate index variable. In this case, modeling πk(u) and θkj(u) directly as fully nonparametric functions in u is unlikely to be the optimal choice. Thus, various semiparametric models might be preferable, for example, varying coefficient models (Hastie and Tibshirani 1993; Cai, Fan, and Li 2000), single index models (Xia 2006; Kong and Xia 2007), and partially linear models (Härdle, Liang, and Gao 2000; Fan and Huang 2005).
Second, the current VNB model assumes that every feature has a time-varying effect. In fact, it is likely that the effects of some features do not change according to time. Thus, there might exist some feature j such that θkj(u) = θkj for every 1 ≤ k ≤ m and 0 ≤ u ≤ 1. We then define

$$\mu_j = \sum_{k=1}^{m} \pi_k E\{\theta_{kj}(U_i) - \theta_{kj}\}^2.$$

Accordingly, {θkj(u) : 1 ≤ k ≤ m} are constant functions if and only if μj = 0. Next, we define a varying model as V = {1 ≤ j ≤ p : μj > 0}. In practice, μj can be estimated by

$$\hat\mu_j = n^{-1} \sum_{k=1}^{m} \sum_{i=1}^{n} \hat\pi_k \{\hat\theta_{kj}(U_i) - \hat\theta_{kj}\}^2.$$

Thus, the varying model can be estimated as V̂ = {1 ≤ j ≤ p : μ̂j > ξ} for some ξ > 0. Theoretically, we can prove that P(V̂ = V) → 1 as n → ∞, if ξ is selected in an appropriate manner. However, the practical selection of ξ is still an ongoing research project. Further research is needed in this direction.
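For what it is worth, μ̂j admits the same kind of direct implementation as φ̂j; the following self-contained sketch (names are ours) computes it from the kernel estimator (2.3) and its constant NB counterpart.

```python
import numpy as np

def kern(t):
    return np.exp(-0.5 * t ** 2) / np.sqrt(2 * np.pi)

def mu_hat(X, Y, U, h, m):
    """Sketch of mu-hat_j: squared distance between the time-varying
    estimate theta-hat_kj(U_i) and the constant estimate theta-hat_kj."""
    n, p = X.shape
    Z = np.eye(m)[Y]
    pi_bar = Z.mean(axis=0)                             # pi-hat_k
    th_const = Z.T @ X / Z.sum(axis=0)[:, None]         # constant theta-hat_kj
    mu = np.zeros(p)
    for i in range(n):
        w = kern((U - U[i]) / h) / h                    # K_h(U - U_i)
        th_u = (w[:, None] * Z).T @ X / ((w @ Z)[:, None] + 1e-12)
        mu += pi_bar @ (th_u - th_const) ** 2
    return mu / n

# The varying model would then be estimated as V-hat = {j : mu_hat[j] > xi}
# for a suitably chosen threshold xi.
```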
APPENDIX A. TECHNICAL CONDITIONS
To investigate the asymptotic properties of the proposed VNB
method, the following technical conditions are needed.

(C1) (Kernel Assumption) K(t) is a symmetric and bounded
density function, with finite second-order moment and
bounded first-order derivative on R1 .
(C2) (Smoothness Assumption) πk (u), θkj (u) and f (u) have
bounded first-order and second-order derivatives on [0, 1],
for all k and j, where f (u) is the density function of Ui .
(C3) (Dimension Assumption) The true model size is assumed
to be 1 ≤ |MT | ≤ n, and log p ∝ nξ for some 0 < ξ <
3/5, as n → ∞.
(C4) (Boundedness Assumption) There exist some positive
constants fmax and ν < min{1/m, 1/3}, such that ν ≤
f (u) ≤ fmax , πk (u) ≥ ν and ν ≤ θkj (u) ≤ 1 − ν, for all
k, j and u ∈ [0, 1]. Moreover, for j ∈ MT , there exist some k ∈ {1, . . . , m} and some set Skj ⊂ [0, 1] with
measure no less than some positive constant τ , such that
inf u∈Skj |θkj (u) − θj (u)| ≥ ν.
Both conditions (C1) and (C2) are quite standard and have been used extensively in the existing literature (Fan and Gijbels 1996; Li and Racine 2006). Under these assumptions, we have f̂(u) = n^{−1} Σi Kh(Ui − u) →p f(u), π̂k(u) →p πk(u), and θ̂kj(u) →p θkj(u), as long as h → 0 and nh → ∞. Condition (C3) gives the exponential divergence speed of p as n → ∞. Under condition (C4), we can see clearly that minj∈MT φj > 0 and φj = 0 for j ∉ MT. Thus, all relevant features are separated from the irrelevant ones, which can be identified consistently based on an appropriate critical value c.
APPENDIX B. A TECHNICAL LEMMA
Lemma B.1. Let 0 < ν < 1/3. Then there exists a positive constant C (depending on ν) such that the following inequality holds,

$$\inf_{\nu \le A \le 1-\nu,\; \nu \le B \le 1-\nu,\; |A-B| \ge \nu} \Big\{ A\log\frac{A}{B} + (1-A)\log\frac{1-A}{1-B} \Big\} \ge C. \qquad (B.1)$$

Proof. By Jensen's inequality, it is clear that A log(A/B) + (1 − A) log{(1 − A)/(1 − B)} ≥ 0, and the equality holds if and only if A = B. Now, we consider a binary function g(x, y) = x log(x/y) + (1 − x) log{(1 − x)/(1 − y)}, where x, y ∈ [ν, 1 − ν] and |x − y| ≥ ν. Then, we can easily obtain the first-order and second-order partial derivatives of g with respect to y,

$$\frac{\partial g}{\partial y} = \frac{y - x}{y(1-y)}, \qquad \frac{\partial^2 g}{\partial y^2} = \frac{x - 2xy + y^2}{y^2(1-y)^2} > \frac{x^2 - 2xy + y^2}{y^2(1-y)^2} > 0.$$

It follows that, given x, g(x, y) is a convex function of y and, over the constrained region, reaches its minimum at y = x + ν or y = x − ν. Next, consider the two functions g(x, x + ν) with x ∈ [ν, 1 − 2ν] and g(x, x − ν) with x ∈ [2ν, 1 − ν], which have the same minimum. Without loss of generality, we may consider only the continuous function g(x, x + ν). Thus, there exists some x0 ∈ [ν, 1 − 2ν] such that g(x0, x0 + ν) = minx∈[ν,1−2ν] g(x, x + ν). Recalling Jensen's inequality, g(x0, x0 + ν) > 0 is obvious. Finally, taking C = g(x0, x0 + ν), inequality (B.1) holds.
APPENDIX C. PROOF OF THEOREM 1
To prove the theorem, it suffices to show that maxj |φ̂j − φj| →p 0. To this end, we write φ̂j − φj as follows:

$$\hat\phi_j - \phi_j = n^{-1}\sum_{k=1}^{m}\sum_{i=1}^{n}(\hat\pi_k - \pi_k)\{\hat\theta_{kj}(U_i) - \hat\theta_j(U_i)\}^2 + n^{-1}\sum_{k=1}^{m}\sum_{i=1}^{n}\pi_k\big[\{\hat\theta_{kj}(U_i) - \hat\theta_j(U_i)\}^2 - \{\theta_{kj}(U_i) - \theta_j(U_i)\}^2\big] + \sum_{k=1}^{m}\pi_k\Big[n^{-1}\sum_{i=1}^{n}\{\theta_{kj}(U_i) - \theta_j(U_i)\}^2 - E\{\theta_{kj}(U_i) - \theta_j(U_i)\}^2\Big], \qquad (C.1)$$

where θ̂j(Ui) = Σk π̂k(Ui)θ̂kj(Ui) and θj(Ui) = Σk πk(Ui)θkj(Ui). By the Law of Large Numbers, we know that π̂k →p πk. Next, note that 1 ≤ k ≤ m, m is a finite number, and {θ̂kj(Ui) − θ̂j(Ui)}² ≤ 1. We then have that the first term on the right-hand side of (C.1) converges to 0 in probability. Thus, the uniform consistency of φ̂j is implied by the following important conclusions (for a fixed k):

$$\max_i |\hat\pi_k(U_i) - \pi_k(U_i)| \le \sup_{u\in[0,1]}|\hat\pi_k(u) - \pi_k(u)| = o_p(1), \qquad (C.2)$$

$$\max_{j,i} |\hat\theta_{kj}(U_i) - \theta_{kj}(U_i)| \le \max_j\sup_{u\in[0,1]}|\hat\theta_{kj}(u) - \theta_{kj}(u)| = o_p(1), \qquad (C.3)$$

$$\max_j\Big|n^{-1}\sum_{i=1}^{n}\{\theta_{kj}(U_i) - \theta_j(U_i)\}^2 - E\{\theta_{kj}(U_i) - \theta_j(U_i)\}^2\Big| = o_p(1). \qquad (C.4)$$

These conclusions are proved in detail subsequently.

Step 1. We first consider (C.2) and (C.3). Under the technical conditions (C1)–(C3), one can verify that (C.2) and (C.3) are implied by the following three conclusions:

$$\sup_{u\in[0,1]}\Big|n^{-1}\sum_{i=1}^{n}K_h(U_i - u) - f(u)\Big| = o_p(1),$$

$$\sup_{u\in[0,1]}\Big|n^{-1}\sum_{i=1}^{n}K_h(U_i - u)Z_{ik} - f(u)\pi_k(u)\Big| = o_p(1),$$

and

$$\max_j\sup_{u\in[0,1]}\Big|n^{-1}\sum_{i=1}^{n}K_h(U_i - u)Z_{ik}X_{ij} - f(u)\pi_k(u)\theta_{kj}(u)\Big| = o_p(1). \qquad (C.5)$$

Because the proof techniques are similar, we only provide details for (C.5). To this end, we define uv = v/N for v = 0, 1, 2, . . . , N and some N > 0. We then have

$$\max_j\sup_{u\in[0,1]}\Big|n^{-1}\sum_{i=1}^{n}K_h(U_i - u)Z_{ik}X_{ij} - f(u)\pi_k(u)\theta_{kj}(u)\Big| \le \max_{0\le v\le N}\max_j\Big|n^{-1}\sum_{i=1}^{n}K_h(U_i - u_v)Z_{ik}X_{ij} - E\{K_h(U_i - u_v)Z_{ik}X_{ij}\}\Big| \qquad (C.6)$$

$$+ \max_{0\le v\le N}\max_j\big|E\{K_h(U_i - u_v)Z_{ik}X_{ij}\} - f(u_v)\pi_k(u_v)\theta_{kj}(u_v)\big| \qquad (C.7)$$

$$+ \max_{1\le v\le N}\max_j\sup_{u\in[(v-1)/N,\,v/N]}\Big|n^{-1}\sum_{i=1}^{n}[K_h(U_i - u_v) - K_h(U_i - u)]Z_{ik}X_{ij}\Big| \qquad (C.8)$$

$$+ \max_{1\le v\le N}\max_j\sup_{u\in[(v-1)/N,\,v/N]}\big|f(u)\pi_k(u)\theta_{kj}(u) - f(u_v)\pi_k(u_v)\theta_{kj}(u_v)\big|. \qquad (C.9)$$

For fixed k, j, and u ∈ [0, 1], define a statistic Wi = Kh(Ui − u)ZikXij. Obviously, {Wi : 1 ≤ i ≤ n} are independent and identically distributed random variables taking values in a bounded interval [0, h^{−1}M1], where M1 = supt K(t) < ∞ by condition (C1). Thus we have |Wi − E{Wi}| ≤ h^{−1}M1 and var(Wi) = E[Wi − E{Wi}]² ≤ h^{−2}M1². Based on these facts, we apply Bernstein's inequality (Lin and Bai 2010) and obtain

$$P\Big(\Big|n^{-1}\sum_{i=1}^{n}W_i - E\{W_i\}\Big| > \epsilon\Big) \le 2\exp\Big\{-\frac{(n\epsilon)^2}{2nh^{-2}M_1^2 + 2(h^{-1}M_1)(n\epsilon)/3}\Big\}, \qquad (C.10)$$

for any u ∈ [0, 1] and any positive constant ε > 0. Next, by Bonferroni's inequality,

$$P\Big(\max_{0\le v\le N}\max_j\Big|n^{-1}\sum_{i=1}^{n}K_h(U_i - u_v)Z_{ik}X_{ij} - E\{K_h(U_i - u_v)Z_{ik}X_{ij}\}\Big| > \epsilon\Big) \le 2(N+1)p\exp\Big\{-\frac{(n\epsilon)^2}{2nh^{-2}M_1^2 + 2(h^{-1}M_1)(n\epsilon)/3}\Big\} = \exp\Big\{\log 2 + \log(N+1) + \log p - \frac{3\alpha^2\epsilon^2 n^{3/5}}{6M_1^2 + 2\alpha M_1\epsilon n^{-1/5}}\Big\}. \qquad (C.11)$$

We then consider the right-hand side of (C.7),

$$\max_{0\le v\le N}\max_j\big|E\{K_h(U_i - u_v)Z_{ik}X_{ij}\} - f(u_v)\pi_k(u_v)\theta_{kj}(u_v)\big| = \max_{0\le v\le N}\max_j\Big|\int K_h(u - u_v)f(u)\pi_k(u)\theta_{kj}(u)\,du - f(u_v)\pi_k(u_v)\theta_{kj}(u_v)\Big| \le \sup_{u^*\in[0,1]}\max_j\Big|\int K_h(u - u^*)g_j(u)\,du - g_j(u^*)\Big| \le 2^{-1}M_2\kappa h^2 = 2^{-1}M_2\kappa\alpha^2 n^{-2/5}, \qquad (C.12)$$

where gj(u) = f(u)πk(u)θkj(u), M2 = supu∈[0,1] maxj |g̈j(u)|, and κ = ∫t²K(t)dt are finite by conditions (C1), (C2), and (C4). The detailed proof of inequality (C.12) can be found in the existing literature (Li and Racine 2006). Note that |ZikXij| ≤ 1 and |uv − u| ≤ 1/N for any u ∈ [(v − 1)/N, v/N]. We know immediately that

$$\max_{1\le v\le N}\max_j\sup_{u\in[(v-1)/N,\,v/N]}\Big|n^{-1}\sum_{i=1}^{n}[K_h(U_i - u_v) - K_h(U_i - u)]Z_{ik}X_{ij}\Big| \le \max_{1\le v\le N}\sup_{u\in[(v-1)/N,\,v/N]}\sup_{u'\in[0,1]}|K_h(u' - u_v) - K_h(u' - u)| \le h^{-2}N^{-1}\dot K_{\max} = \alpha^{-2}n^{2/5}N^{-1}\dot K_{\max}, \qquad (C.13)$$

where K̇max = supt |K̇(t)| and K̇(t) is the first-order derivative of K(t). By condition (C1) we know that K̇max is finite. Similarly, by condition (C2), one can verify that the right-hand side of (C.9) is upper bounded by N^{−1}M3 for some positive constant M3. This result, together with (C.6)–(C.9) and (C.11)–(C.13), implies that

$$P\Big(\max_j\sup_{u\in[0,1]}\Big|n^{-1}\sum_{i=1}^{n}K_h(U_i - u)Z_{ik}X_{ij} - f(u)\pi_k(u)\theta_{kj}(u)\Big| > \epsilon\Big) \le \exp\Big\{\log 2 + \log(N+1) + \log p - \frac{3\alpha^2(\epsilon/4)^2 n^{3/5}}{6M_1^2 + 2\alpha M_1(\epsilon/4)n^{-1/5}}\Big\} + P(2^{-1}M_2\kappa\alpha^2 n^{-2/5} > \epsilon/4) + P(\alpha^{-2}n^{2/5}N^{-1}\dot K_{\max} > \epsilon/4) + P(N^{-1}M_3 > \epsilon/4). \qquad (C.14)$$

We can then set N = n^{3/5}. Thus, the last three terms on the right-hand side of (C.14) are equal to 0 for sufficiently large n. Therefore, under condition (C3) and as n → ∞, the right-hand side of (C.14) converges toward 0.

Step 2. We next consider (C.4). Define a statistic Ti = {θkj(Ui) − θj(Ui)}², which is a nonparametric function of Ui. It is clear that {Ti : 1 ≤ i ≤ n} are independent and identically distributed random variables taking values in the interval [0, 1]. Then, we have |Ti − E{Ti}| ≤ 1 and var(Ti) = E[Ti − E{Ti}]² ≤ 1. Based on these facts, Bernstein's inequality can be used immediately,

$$P\Big(\Big|n^{-1}\sum_{i=1}^{n}T_i - E\{T_i\}\Big| > \epsilon\Big) \le 2\exp\Big(-\frac{n\epsilon^2}{2 + 2\epsilon/3}\Big).$$

Next, by Bonferroni's inequality, we have

$$P\Big(\max_j\Big|n^{-1}\sum_{i=1}^{n}T_i - E\{T_i\}\Big| > \epsilon\Big) \le 2p\exp\Big(-\frac{n\epsilon^2}{2 + 2\epsilon/3}\Big) = \exp\Big(\log 2 + \log p - \frac{n\epsilon^2}{2 + 2\epsilon/3}\Big). \qquad (C.15)$$

Then by condition (C3), we know that the right-hand side of (C.15) converges toward 0. This completes the entire proof.
APPENDIX D. PROOF OF THEOREM 2
According to Theorem 1, given the technical conditions (C1)–(C4) and some appropriate critical value c, we must have P(MT ∈ M) → 1. This implies that P(M̂(d0) = MT) → 1, where d0 = |MT| is the true model size. To prove Theorem 2, the following inequality is needed:

$$P(\mathcal M_T \subset \hat{\mathcal M}) \ge P\big(\mathcal M_T \subset \hat{\mathcal M} \,\big|\, \hat{\mathcal M}_{(d_0)} = \mathcal M_T\big)\, P\big(\hat{\mathcal M}_{(d_0)} = \mathcal M_T\big).$$

As a result, it suffices to show that P(MT ⊂ M̂ | M̂(d0) = MT) → 1, which is equivalent to P(BIC(M̂(d)) > BIC(M̂(d0))) → 1 for M̂(d) ⊂ M̂(d0) = MT. The detailed proof is divided into the following three steps.

Step 1. We start with the difference of the BIC values between two models as

$$\mathrm{BIC}(\hat{\mathcal M}_{(d)}) - \mathrm{BIC}(\hat{\mathcal M}_{(d_0)}) = \frac{2}{n}\sum_{j\in\hat{\mathcal M}_{(d_0)}\setminus\hat{\mathcal M}_{(d)}}\sum_{l=1}^{n}\hat G_j(U_l) - (m-1)(d_0 - d)\frac{\log(nh)}{nh}, \qquad (D.1)$$

where the notation Ĝj(Ul) is represented by

$$\hat G_j(U_l) = \sum_{k=1}^{m}\hat\pi_k(U_l)\hat f(U_l)\Big[\hat\theta_{kj}(U_l)\log\frac{\hat\theta_{kj}(U_l)}{\hat\theta_j(U_l)} + \{1 - \hat\theta_{kj}(U_l)\}\log\frac{1 - \hat\theta_{kj}(U_l)}{1 - \hat\theta_j(U_l)}\Big].$$

For simplicity, some tedious manipulation of (D.1) is omitted. Next, define a nonparametric function Gj(u) as

$$G_j(u) = \sum_{k=1}^{m}\pi_k(u)f(u)\Big[\theta_{kj}(u)\log\frac{\theta_{kj}(u)}{\theta_j(u)} + \{1 - \theta_{kj}(u)\}\log\frac{1 - \theta_{kj}(u)}{1 - \theta_j(u)}\Big].$$

By condition (C4) and Lemma B.1, it is clear that minj∈MT E{Gj(U1)} ≥ mCν³τ and E{Gj(U1)} = 0 for j ∉ MT.

Step 2. The next thing to prove is the consistency result n^{−1} Σl Ĝj(Ul) →p E{Gj(U1)} for every j. By the triangle inequality, it is implied by the following two conclusions:

$$\sup_{u\in[0,1]}\big|\hat G_j(u) - G_j(u)\big| = o_p(1), \qquad (D.2)$$

$$\Big|\frac{1}{n}\sum_{l=1}^{n}G_j(U_l) - E\{G_j(U_1)\}\Big| = o_p(1). \qquad (D.3)$$

Now, we provide some details for (D.2) first. After some tedious manipulation, we obtain the following equalities, for fixed j and u ∈ [0, 1]:

$$\hat G_j(u) = \sum_{k=1}^{m}\Big[A_{m+k}\log\frac{A_{m+k}\sum_{k'=1}^{m}A_{k'}}{A_k\sum_{k'=1}^{m}A_{m+k'}} + (A_k - A_{m+k})\log\frac{(A_k - A_{m+k})\sum_{k'=1}^{m}A_{k'}}{A_k\sum_{k'=1}^{m}(A_{k'} - A_{m+k'})}\Big],$$

$$G_j(u) = \sum_{k=1}^{m}\Big[a_{m+k}\log\frac{a_{m+k}\sum_{k'=1}^{m}a_{k'}}{a_k\sum_{k'=1}^{m}a_{m+k'}} + (a_k - a_{m+k})\log\frac{(a_k - a_{m+k})\sum_{k'=1}^{m}a_{k'}}{a_k\sum_{k'=1}^{m}(a_{k'} - a_{m+k'})}\Big],$$

where Ak = n^{−1} Σi Kh(Ui − u)Zik, Am+k = n^{−1} Σi Kh(Ui − u)ZikXij, ak = f(u)πk(u), and am+k = f(u)πk(u)θkj(u), for 1 ≤ k ≤ m. By the detailed proof of (C.14) in Appendix C, it is not difficult to obtain the following exponential inequality under conditions (C1), (C2), and (C4). Given k (1 ≤ k ≤ 2m), for any ε > 0 and sufficiently large n,

$$P\Big(\sup_{u\in[0,1]}|A_k - a_k| > \epsilon\Big) \le C_1\exp(-C_2 n^{3/5}\epsilon^2), \qquad (D.4)$$

where C1 and C2 are some positive constants. Because the proof techniques are similar but not particularly difficult, the detailed proof of (D.4) is not reproduced here. For a given u, define g(A) ≐ Ĝj(u) and g(a) ≐ Gj(u), where A = (A1, . . . , A2m)⊤ and a = (a1, . . . , a2m)⊤. By Taylor's theorem, it follows that

$$|\hat G_j(u) - G_j(u)| = |g(A) - g(a)| = |(A - a)^\top\dot g(\tilde a)| \le \|\dot g(\tilde a)\|_\infty\,\|A - a\|_1,$$

where ġ(·) is the first-order gradient of g(·) and ã = (ã1, . . . , ã2m)⊤ lies between a and A. Under condition (C4), we know that ν³ ≤ ak < fmax for every k and u ∈ [0, 1]. Thus, on the event

$$\mathcal A_n = \Big\{\sup_{u\in[0,1]}\|A - a\|_1 \le 2^{-1}\nu^3\Big\}, \qquad (D.5)$$

we have 2^{−1}ν³ ≤ ãk < fmax + 2^{−1}ν³ for every k. We then immediately know that ‖ġ(ã)‖∞ is bounded by some positive constant M4. Subsequently, on the event An, we also have

$$\sup_{u\in[0,1]}|\hat G_j(u) - G_j(u)| = |\hat G_j(u^*) - G_j(u^*)| \le M_4\,\|A - a\|_1\big|_{u=u^*} \le M_4\sup_{u\in[0,1]}\|A - a\|_1,$$

where u∗ = argmaxu∈[0,1] |Ĝj(u) − Gj(u)|. We next consider another two events

$$\mathcal B_n = \Big\{\sup_{u\in[0,1]}|\hat G_j(u) - G_j(u)| > \epsilon\Big\}, \qquad \mathcal C_n = \Big\{\sup_{u\in[0,1]}\|A - a\|_1 > M_4^{-1}\epsilon\Big\}.$$

By inequality (D.5), we have P(Bn|An) ≤ P(Cn|An). Thus, P(Bn) can be bounded by

$$P(\mathcal B_n) \le P(\mathcal B_n \mid \mathcal A_n)P(\mathcal A_n) + P(\mathcal A_n^c) \le P(\mathcal C_n \mid \mathcal A_n)P(\mathcal A_n) + P(\mathcal A_n^c) \le P(\mathcal C_n) + P(\mathcal A_n^c). \qquad (D.6)$$

Let 0 < ε ≤ 2^{−1}M4ν³; the right-hand side of (D.6) can then be bounded by

$$\le 2P(\mathcal C_n) \le 2P\Big(\max_{1\le k\le 2m}\sup_{u\in[0,1]}|A_k - a_k| > 2^{-1}m^{-1}M_4^{-1}\epsilon\Big). \qquad (D.7)$$

By Bonferroni's inequality, the right-hand side of (D.7) can be further bounded by

$$\le 2\sum_{k=1}^{2m}P\Big(\sup_{u\in[0,1]}|A_k - a_k| > 2^{-1}m^{-1}M_4^{-1}\epsilon\Big).$$

Together with inequality (D.4) and the fact that m is a finite number, we obtain the following uniform consistency result. For any 0 < ε ≤ 2^{−1}M4ν³ and sufficiently large n,

$$P\Big(\sup_{u\in[0,1]}|\hat G_j(u) - G_j(u)| > \epsilon\Big) \le C_3\exp(-C_4 n^{3/5}\epsilon^2), \qquad (D.8)$$

where C3 and C4 are some positive constants.

We next provide a detailed proof for (D.3). By condition (C4), it is clear that |Gj(Ul) − E{Gj(U1)}| < M5 and var{Gj(Ul)} ≤ M5², where M5 = fmax(1 − ν){log(1 − ν) − log ν}. Based on these facts, Bernstein's inequality can be used,

$$P\Big(\Big|\frac{1}{n}\sum_{l=1}^{n}G_j(U_l) - E\{G_j(U_1)\}\Big| > \epsilon\Big) \le 2\exp\Big(-\frac{n\epsilon^2}{2M_5^2 + 2M_5\epsilon/3}\Big), \qquad (D.9)$$

for any ε > 0. Thus, combining the conclusions of (D.8) and (D.9), for any 0 < ε ≤ 2^{−1}M4ν³ and sufficiently large n, the following inequality holds,

$$P\Big(\Big|\frac{1}{n}\sum_{l=1}^{n}\hat G_j(U_l) - E\{G_j(U_1)\}\Big| > \epsilon\Big) \le C_5\exp(-C_6 n^{3/5}\epsilon^2), \qquad (D.10)$$

where C5 and C6 are some positive constants.

Step 3. For simplicity, we use the notation S = M̂(d0)\M̂(d) throughout the rest of the proof. According to Equation (D.1), we immediately know that

$$P\big(\mathrm{BIC}(\hat{\mathcal M}_{(d)}) \le \mathrm{BIC}(\hat{\mathcal M}_{(d_0)})\big) = P\Big(\frac{1}{n}\sum_{j\in\mathcal S}\sum_{l=1}^{n}\hat G_j(U_l) \le (m-1)(d_0 - d)\frac{\log(nh)}{2nh}\Big). \qquad (D.11)$$

Under the assumption h = αn^{−1/5}, for any ε > 0 and sufficiently large n, the probability (D.11) can be bounded by

$$\le P\Big(\frac{1}{n}\sum_{j\in\mathcal S}\sum_{l=1}^{n}\hat G_j(U_l) < (d_0 - d)\epsilon\Big). \qquad (D.12)$$

Then by Bonferroni's inequality, the right-hand side of (D.12) can be bounded by

$$\le P\Big(\min_{j\in\mathcal S}\frac{1}{n}\sum_{l=1}^{n}\hat G_j(U_l) < \epsilon\Big) \le \sum_{j\in\mathcal S}P\Big(\frac{1}{n}\sum_{l=1}^{n}\hat G_j(U_l) < \epsilon\Big). \qquad (D.13)$$

Let 0 < ε ≤ min{2^{−1}mCν³τ, 2^{−1}M4ν³} and M̂(d0) = MT. Together with inequality (D.10) and minj∈MT E{Gj(U1)} ≥ mCν³τ, the right-hand side of (D.13) can be further bounded by

$$\sum_{j\in\mathcal S}P\Big(\frac{1}{n}\sum_{l=1}^{n}\hat G_j(U_l) < E\{G_j(U_1)\} - \epsilon\Big) \le \sum_{j\in\mathcal S}P\Big(\Big|\frac{1}{n}\sum_{l=1}^{n}\hat G_j(U_l) - E\{G_j(U_1)\}\Big| > \epsilon\Big)$$

ACKNOWLEDGMENTS

... and No. 11271032), Fok Ying Tung Education Foundation, the Business Intelligence Research Center at Peking University, and the Center for Statistical Science at Peking University.