9a06c etri manuscript

3/14/2017

https://etrij.etri.re.kr/etrij/author/Confirmation.do
User : akhriza
Logout

ABOUT ETRI JOURNAL

ARCHIVE

e­SUBMISSION

PEER REVIEW
 > Paper Status > Submission Information

Submission Information
Article information

Update Info.

Authors information


Manuscript Type

Regular Paper : RP1703­0175

Title

The Novel Push­Front Fibonacci Windows Model for Finding the Emerging Patterns with Better Completeness and Accuracy

Abstract

Mining emerging patterns (EPs) has been well­studied. Principally, to find EP in streaming transaction data, the streaming is first divided
into some time­windows containing a number of transactions; itemsets are generated from transactions in each window and then the
emergent of itemsets is evaluated between two windows. In tilted time windows model (TTWM), it is assumed that people need
support with finest accuracy from the most recent windows, and accept coarser accuracy from the older windows. Therefore, limited
array’s elements are used to keep all support data in a way to condense old windows by merging them inside one element. Capacity of
elements in accommodating the windows is modeled using a particular sequence number, where some front elements’ capacity is set to
one window in order to serve most accurate support. However, in a stream, as new data come, current element’s updating algorithms
create many null elements in array hence data become incomplete. This article proposes the Push­front TTWM and Push­front Fibonacci
windows model (PF­FWin) as solutions. Experimental work shows that PF­FWin is able to grab more EPs than existing TTWM based

models.

Keywords

Data stream mining, Emerging patterns, Fibonacci numbers, Frequent Patterns, Time windows models

Research Field

ICT Strategy

Self Evaluation

Q) What do you think are the innovative ideas, new results, new applications or new developments of high interest that significantly
impact the field?
A) Contributions are novel time windows models: Push­Front Tilted Time Windows Model and Push­Front Fibonacci Windows Model which
improve the accuracy of traditional Tilted­time windows model, and its derivation models in finding the Emerging Patterns

Resubmission

 


Related Papers

 

Remarks

 

Submitted File

Original on Mar. 08, 2017 ETRI_Manuscript_v2.pdf

Copyright

copyright_form_2.pdf (551 KB)

ETRI Journal Editorial Office 

Electronics and Telecommunications Research Institute

218 Gajeongno, Yuseong­gu, Daejeon, 34129, Rep. of Korea
etrij@etri.re.kr http://etrij.etri.re.kr Phone: +82 42 860 6127

https://etrij.etri.re.kr/etrij/author/Confirmation.do

1/1

3/14/2017

https://etrij.etri.re.kr/etrij/author/Confirmation.do
User : akhriza
Logout

ABOUT ETRI JOURNAL

ARCHIVE

e­SUBMISSION

PEER REVIEW

 > Paper Status > Submission Information

Submission Information
Article information

1st Author 

Update Info.

Authors information

Corresponding Author

Name

 Tubagus Mohammad Akhriza

Position

 First Author


Organization

  Pradnya Paramita School of Informatics
Management and Computer

Department

 Informatics Technique

Address

 Jl. LA Sucipto 249A, Malang, 65125, East Java (Jawa Timur)

Phone number

 +62­822­9910­6869

FAX


 

Mobile

 

E­mail

 akhriza@stimata.ac.id

ORCID

 0000­0002­6674­1704

2nd Author
Name

Ying Hua Ma

Position


Co­Author

Organization

Shanghai Jiao Tong University

Department

School of Information Security Engineering

Phone number

+86­21­34205025

E­mail

ma­yinghua@sjtu.edu.cn

Name


Jian Hua Li

Position

Co­Author

Organization

Shanghai Jiao Tong University

Department

School of Information Security Engineering

Phone number

+86­21­34205025

E­mail


lijh888@sjtu.edu.cn

ORCID
3rd Author

ORCID

ETRI Journal Editorial Office 

Electronics and Telecommunications Research Institute
218 Gajeongno, Yuseong­gu, Daejeon, 34129, Rep. of Korea
etrij@etri.re.kr http://etrij.etri.re.kr Phone: +82 42 860 6127

https://etrij.etri.re.kr/etrij/author/Confirmation.do

1/1

The Novel Push-Front Fibonacci Windows Model
for Finding the Emerging Patterns with Better

Completeness and Accuracy



sources such as censors and social media in information era [1].
Making sense of these huge data streams is a task that
Mining emerging patterns (EPs) has been well-studied. continues to rely heavily on human judgement and therefore,
Principally, to find EP in streaming transaction data, the recognizing patterns contained in the streaming becomes a
streaming is first divided into some time-windows crucial job [2]-[3]. Among the patterns, the emerging patterns
containing a number of transactions; itemsets are (EPs) are one which is worth to mining as they reflect the
generated from transactions in each window and then the itemsets’ popularity change from one period of time to another.
emergent of itemsets is evaluated between two windows. The problem of mining EP has been well-studied [4]-[7] and
In tilted time windows model (TTWM), it is assumed that EP concept also has been broadly implemented to solve
people need support with finest accuracy from the most problems in many areas such as to find change in networks [6]
recent windows, and accept coarser accuracy from the and customer behavior in online markets [7].
older windows. Therefore, limited array’s elements are
The streaming can be first divided and loaded into several
used to keep all support data in a way to condense old time-windows to find EP in a streaming data. A window
windows by merging them inside one element. Capacity of contains several transactions and the itemsets are generated
elements in accommodating the windows is modeled using from transactions in each window. An itemset is said emerging
a particular sequence number, where some front elements’ from two ordered pair windows W1 to W2 if its support’s
capacity is set to one window in order to serve most growth rate from W1 to W2 satisfies minimum support growth
accurate support. However, in a stream, as new data come, (mingrowth) threshold [4]. A mechanism called tilted-time
current element’s updating algorithms create many null windows model (TTWM) is proposed to maintain itemsets’
elements in array hence data become incomplete. This support recorded in all windows [8]. The model assumes that
article proposes the Push-front TTWM and Push-front people need supports with finest accuracy only from some
Fibonacci windows model (PF-FWin) as solutions. recent windows, but they accept coarser accuracy from older
Experimental work shows that PF-FWin is able to grab windows [9]-[10]. The challenge is how to condense supports
more EPs than existing TTWM based models.
data found in a big number of windows inside smaller number
Keywords: Data stream mining, Emerging patterns, of elements of array, i.e. sup[i], in effective and efficient way.
Time windows models
The maximum number of windows that can be merged into
array’s elements, or element’s capacity, can be modeled using a
I. Introduction
certain numbers sequence. Element sup[0] is particularly
storing support in currently being processed window, but with
Data are being produced and streamed continually in a big
a purpose to serve the finest support values, some front
volume and high velocity by tremendous number of unusual
elements’ capacities are set to contain only one window. Two
models derived from TTWM: Logarithmic tilted-time
Abstract

windows model (LWin) uses geometric-like sequence G = {1;
1[1]; 2[2]; 4[4]; 8[8]; so on}. Therefore, N windows are
accommodated in (1+2log(N))-elements array [10]; while
another model [11], Fibonacci windows model (FWin) uses
Fibonacci sequence F = {1, 1, 2, 3, 5, so on} to organize the
elements’ capacities, thus N windows are fit to n–elements
array, (N ≥ n), in a relation given in Eq. (1)
n–1

N = ∑ X. sup[i] .vol
i=0

Sup[i].vol represents element’s volume i.e. number of
windows being merged inside sup[i], which is either equal to
null or F[i]. Sequence of elements’ volume is particularly called
the volume sequence.
The main problem of traditional TTWM hence LWin and
FWin are the occurrence of null elements even when they
should provide the most accurate support. This situation is a big
disadvantage in the process to find EP which checks the
support’s growth between elements sup[0] and sup[j], j > 0. As
a consequence of such data incompleteness, EPs cannot be
found at some timestamps hence the trend of pattern’s
popularity.
This article, firstly, introduces a novel model, called Pushfront Tilted-time Windows Model (PF-TTWM), as the solution
for data incompleteness of TTWM. A push-front mechanism is
applied to eliminate null, and many front elements can even
provide support with finest accuracy. Secondly, this work also
proposes a novel Push-front Fibonacci windows model (PFFwin), developed using PF-TTWM’s concept. Fibonacci
sequence F is also used to organize element’s capacity in order
N windows can always be accommodated in n–elements array
(N ≥ n) by relation given in Eq. (1). However, X.sup[i],vol
∈{1, 2, …, F[i]} in PF-FWin, while in FWin, X.sup[i],vol ∈
{0, F[i]}.
Performance of PF-FWin is compared to FWin and LWin by
processing one million transactions dataset generated using
IBM Quest data generator. Process of accommodating the
streaming N windows data into n–elements array is modeled
using a deterministic finite automaton. Transactions are
processed by online streaming method. The new window is
created when the number of processed transaction in a window
exceeded 10K transactions. The emergent of X is evaluated by
inspecting its support stored in X.sup[0] and another X.sup[j], j
> 0. X is mined out as EP if X’s support and support growth
meet the given thresholds. As the results, number of EP found
using PF-FWin, is up to 90% and 275% bigger than using
LWin and FWin respectively. These results prove that better
completeness and accuracy offered by PF-FWin can improve
the number of EP found by this model with the runtime
performance of PF-FWin is comparable to the other models.

In addition to PF-TTWM and PF-FWin novelties, another
contribution of this work is a suggestion about data
completeness and accuracy as two main quality characteristics
that should be possessed by the models. The rest of article is
organized as follows. Section 2 contains review on literatures
related to our work. Section 3 defines the proposed solution.
Section 4 discusses experimental work and its result, while
Section 5 draws the conclusion of this work.

II. Literature Review
Discussion in this section includes the problem of EP mining,
TTWM and two state-of-arts windows models derived from
TTWM, i.e. Logarithmic TTWM and Fibonacci Windows
Model

1. Problem of Mining the Emerging Patterns
The problem of mining EP has been well studied since 1999
[5]-[7], and the implementation of EP in many areas are found
in literatures as well. For example, implementation in medical
area is found to distinguish similar diseases [12], to classify
cancer diagnosis data [13], and for profiling of leukemia
patients [14]. In networking environment, EP is implemented
to find significant difference in network data stream [6], or for
masquerader detection [15].
EP is a class of frequent pattern, where the itemsets are
generated using association rule mining. An itemset X is said
frequent in dataset D if X’s support the D,
���� X ≥ minsupp , a user-given minimum support
threshold [16]. In other words, EP contains related items
because their appearance altogether in D is frequent with
respect to minsupp.
EP is formally defined as follows [5]. Given an ordered pair
of datasets D and D , set I contains all items in both datasets,
and X  I is an itemset. Let supi X denotes support of X in
Di , growth rate of X from D to D denoted as GR(X), it is
defined as Eq. (2)
0

GR X =
sup2 X
sup
{ 1 X

if sup1 X =0 and sup2 X =0
if sup1 X =0 and sup2 X ≠0
otherwise

Given  > 1 is a mingrowth, an itemset X is said to be an emerging patterns (or simplify -EP or EP) from D to D if
GR(X) ≥ . EP mining problem is, for a given , to find all EPs. In a special case when sup1 X =

and GR(X) ≥ , X

is called as Jumping EP (JEP).
EP mining concept was initially applied to two static datasets
of D and D . The data in both datasets come from same

source, but collected at two different time stamps; for example,
transactional data of a supermarket, recorded in different
months. Alternatively, D and D are also possible from
different sources, for example, transactional data from different
supermarkets. However, this notion has been adapted to two
classes of dynamic datasets accordingly, such as in data stream
environment. The examples are to find the change of network
by inspecting the IP traffic patterns captured from network
routers [6], or to approximate EP from two different class
streaming data [7].
EP can also be found in one class of streaming or temporal
dataset. In this case, the streaming is divided into several timewindows. Each window contains a number of transactions. EP
is evaluated between two windows W and W , where W
is the most recent window. A method named Dual Support
Apriori for Temporal data is proposed to find the emerging
trends from temporal dataset using Sliding windows model [4].
But long term itemsets popularity evaluation at multiple time
windows cannot be performed because this model only
provides two windows: the current and previous one

accurate supports in last one hour processes. As the answer,
four elements i.e. sup[0,0,0,0] are initially created to
accommodate supports from four windows of ¼-hour data
processing respectively. Fig. 1 demonstrates TTWM when
updating the elements. After all ¼-hour elements are occupied,
or X.sup[1,1,1,1], as support found in current window will be
put in sup[0], supports in ¼-hour elements firstly are merged
and stored into the 1-hour element. Next, all ¼-hour elements
are nullified and sup[0] is now ready to store the current
support, or the array becomes sup[1,0,0,0,4] (Fig. 1). This
approach can facilitates one day data processing only by 4 (¼hour element) + 24 (1-hour elements) = 28 elements, instead of
using 4  24 = 96 elements to store all ¼-hour processing
results for 24 hours. It is an efficient solution [7]-[9].

3. The Logarithmic Tilted-Time Windows Model

A model derived from TTWM, named Logarithmic TTWM
or LWin was firstly introduced in [10] with the goal to store
and maintain supports of itemsets for long term. Decision
maker got benefit from this approach since time-related query
about pattern’s popularity trend can be evaluated, such as
2. The Tilted-Time Windows Model
patterns that popular once, seasonally popular or everlasting
TTWM was proposed to store and maintain itemset’s popular. Such benefit cannot be harvested from the other data
supports from certain number of windows for long term. As windows models [9], [10], such as sliding windows [4], [17],
explained in [8]-[10], people actually need the most accurate [18], damped windows [19], [20] and landmark windows
supports only from some recent windows, while coarser models [21], [22].
accuracy is acceptable from older windows. Therefore, support
LWin is designed to deal the storing results of FP mining in
data in big number of windows should be condensed efficiently offline data stream. The streaming transactions are divided into
into a smaller number of array’s elements.
N windows, containing uniform number of transactions. The
Technically, each itemset X stores its support data in an n– logarithmic approach is proposed to reduce number of
elements array, denoted as X.sup[i], or sup[i] to simplify. Each elements more significantly, compared to the original TTWM.
sup[i] has a capacity, i.e. maximum number of windows that N windows can be accommodated in only {1+2log(N)}
can be merged inside respective element, while sup [0] elements. For instance, 1024 windows are fit in only 11
particularly stores support in currently being processed window. elements, which is very memory-efficient.
LWin uses a sequence, G = {1}∪{1[1], 2[2], 4[4], 8[8], so
TTWM models the capacities of sup[i] into a particular
numbers sequence i.e. the capacity sequence. The smaller the on}, to organize the capacity of sup[i] elements, i  0. The right
capacity should create finer accuracy of support. Capacity of side of union symbol is actually a geometric sequence with
some front elements is set to one window to create finest ratio r = 2, but aside of each element (called: main element)
support’s accuracy. The rest elements have different capacities, sup[i], there is an intermediate element, named sub[i] in this
but the capacity usually sup[i] ≤ sup[i], i< j.
article, having the same capacity with its main element i.e. G[i].
Sup[0] has sub[0], but not written here as it is always set to null
[7]-[9]. Those articles also explain that the sequence’s ratio is a
positive integer and is also changeable in according with the
analyst need, although they only explain how LWin works
using r = 2. We can see three 1s in sequence G which
Fig. 1. Elements Updating in Tilted-Time Windows Model
respectively represents the most accurate support stored in
sup[0], sup[1] and sub[1].
Given 2n–elements array, these are sup[i] and sub[i], i = 0 to
Supposed the transactions are collected and then processed
n–1.
The updating mechanism of array’s elements in LWin
in every ¼-hour, but the manager requires only the most

applies the following procedures:
(1) Sup[0] is always filled with the support in current window
(2) Sup[1] is filled only by shifting Sup[0] into it
(3) Sub[i] is filled only by shifting Sup[i] into it, while sup[i], i
> 1, is filled by sup[i] = sup[i–1] + sub[i–1]; and sub[i–1]
is nullified afterward.
(4) The first updating target is a sub[i] = 0, with i is the lowest
index, and the updating is continued to its main element
sup[i], followed by updating all sup[j] and sub[j], j < i,
using procedure number (1), (2) and (3).
(5) When all sup[i] and sub[i] are occupied; increment n = n +
1, and create a new pair elements sup[n] and sub[n] at the
back of array. Sup[n] becomes the first updating target,
followed by all sup[i] and sub[i], i < n, using procedure (1),
(2) and (3).
However, LWin must overcome two situations in its
elements’ updating process. If N is exactly � , then N
windows are fit to (n+1)-main–elements array. E.g. for N = 8
thus n = 3, eight windows are exactly accommodated in four
elements, where the capacity sequence is like {1, 1[], 2[], 4[]};
[] represents an empty sub[i]. However, if N is nearly � , then
N windows will be accommodated by (n+1)-main–elements
array, plus some intermediate element(s). E.g. for N = 9, thus n
 3, 9 windows will be condensed into 4-elements array plus 1
intermediate element where the formed capacity sequence is
like {1, 1, [1], 2, 4}.
Illustration of volume sequence when X if found in N = 1 to
8 windows is given in Fig. 2. Dashed line, bold arrow and thin
arrow respectively represent pair of sup[i]-sub[i], first and next
done steps. From such illustration, it is known that number of
elements hence the memory needed by each itemset is actually
maximum 2  {1 + 2log(N)}.

Fig. 2. Volume Sequence and Updating Steps in LWin

4. The Fibonacci Windows Model
FWin or Fibonacci Windows Model was introduced with the
goal to improve the memory-efficiency issue occurred in LWin
usage, and also to deal with the online data stream mining [11].
The main differences between FWin and LWin is that instead
of using “double” geometric sequence, FWin uses Fibonacci
sequence, F = {1, 1, 2, 3, 5, so on} to organize elements’
capacity, thus no intermediate element needed in updating
process. In addition, distance between capacities in F[i] is
smaller than distance in G[i], i>2. According to the principle of
TTWM, FWin provides better accuracy than LWin does.
Given n–elements array, sup[i], i = 0 to n–1, elements

updating procedures in FWin are described as follows:
(1) Sup[0] is used to accommodate support in current window
(2) Sup[1] is updated by shifting sup[0] into it
(3) Sup[i], i > 1, is updated by merging sup[i–1] and sup[i–2]
into it, or sup[i] = sup[i–1] + sup[i–2]; and then sup[i–1] is
nullified afterward. By decrementing i = i – 2, procedure
(3) is repeated.
(4) The first updating target is a sup[i] = 0, with i is the lowest
index, and the updating is continued to all sup[j], j < i,
using procedure number (1), (2) and (3).
(5) When there is no sup[i] = 0 left, increment n = n + 1, so a
new element is created at the array’s back, sup[n]. Since
sup[n] = 0, it becomes the first updating target by using
procedure (4).
Illustration of volume sequence updating when X found in N
= 1 to 7 windows is given in Fig. 3. Bold arrow and thin arrow
respectively represent first and next done steps.

Fig. 3. Volume sequence and Updating Steps in FWin

Using FWin, N windows can be accommodated in n–
elements array, N  n, such as Eq. (1). Value of sup[i].vol is
either 0 or F[i]. For instance, 15-elements array can
accommodate 1596 windows. If one window is equal to oneday data processing, 15-elements array is efficiently fit for 4.37
years data processing, which is also very efficient.
However, both LWin and FWin still possess some
shortcomings particularly in data incompleteness issue.
Elements sup[1] or sub[1] that should provide the most
accurate support, even still possible to be empty. Consequently,
queries about support values at several past timestamps cannot
be answered. EP in those timestamps also cannot be evaluated
hence also the change of itemset’s popularity. Solution on these
issues is offered in this work via the proposal about a novel
Push-Front Tilted-Time Windows Model and also its
derivation model, i.e. Push-Front Fibonacci Windows Model
which are explained in next section.

III. The Proposed Methodology
This section explains details of PF-TTWM, followed by the
proposed Push-Front Fibonacci Windows Model. An
automaton for PF-Fwin as updating mechanism is given in this
section, and then, EP mining criteria are explained.

1. The Push-Front Tilted-Time Windows Model

Illustration of the proposed Pushed-front tilted-time windows
– PF-FWin – model is given in Fig. 4. The main difference
between PF-TTWM and TTWM is when all elements reached
their respective capacity, PF-TTWM inserts a new element at
array’s front, so there is no null element created. In a case given
in previous Subsection II-2, the push-front operation makes
X.sup = [1,1,1,1] became X.sup = [1,1,1,1,1], instead of
X.sup[1,0,0,0,4], such as produced by TTWM. Support in
sup[3] becomes sup[4] and represents 1-hour element, while
new sup[0] will be filled with new data just arrived.
The main goal of tilting the time-windows in PF-TTWM is
still the same as TTWM’s goal, i.e. to reduce memory used to
store the supports; but with an improvement to eliminate null
elements in PF-TTWM. On the contrary, many front elements’
volumes are filled with 1. Such improvement makes PFTTWM can provide better data completeness and accuracy,
compared to traditional TTWM.

step means that after sup[n–1].vol reached its capacity,
then the next updating target is sup[n–2], and so on.
(3) If all sup[i].vol reached their capacity, i.e. the volume
sequence = Fibonacci sequence, a new element is pushed
at front of array, sup[0], while current array’s indices are
all incremented by one accordingly.
Fig. 5 describes the illustration of updating mechanism and
the change of volume sequence in PF-FWin. Bold arrow and
thin arrow respectively represent first and next done steps,
where pf-arrow represents the push-front operation and
merged-arrow shows the merging process sup[i] += sup[i–1].

Fig. 5. Volume sequence and Updating Steps in PF-FWin

3. The PF-FWin Updating Automaton

Fig. 4. The Proposed Push-Front Tilted-Time Windows Model

Programmatically, push-front operation is implementable
directly in C++ language using pushfront() method. It is a
member function of Deque (double-ended queue) container.
Amortized time complexity of pushfront() is O(1), which is
very efficient, because the element insertion is done at the
beginning of the data structure.

2. The Push-Front Fibonacci Windows Model
Notion of PF-TTWM is used to develop PF-FWin with a
purpose to improve FWin’s performance to find EP. Using PFFWin, N windows, in which X is found, can always be
accommodated into n–elements array (N ≤ n) by referring Eq.
(1), where X. sup[i] .vol ∈ { , , … , F[i]}.
This approach differs from FWin’s (also LWin’s) approach in
term that volume of sup[i] is not only null or F[i] (also G[i] in
LWin). Given n–elements array, sup[i], i = 0 to n–1, the
complete updating procedures are below:
(1) For i = n–1 to 0, do /*Here, the last element i.e. sup[n–1]
becomes the first updating target */
a. Sup[i] += sup[i–1]
b. Shift sup[i–1]  sup[i–2],…,  sup[1]  sup[0]
c. Particularly, if i = 0, then Sup[0]  support in current
window.
(2) If volume of sup[n–1] is equal to a Fibonacci element
F[n–1] then n = n–1, and procedure (1) is repeated. This

PF-FWin’s updating procedures above are modeled in
deterministic finite automaton called PFwin Automaton
(PFwinA). Formally the automaton is defined as PFwinA(S, I,
S0, Z, f), with tuples are explained as follows:
(1) State set S = {Uc: update current element i.e. sup[m], m =
n–1, Up: update previous elementi.e sup[m–1], Pf: push
front the array}
(2) Input set I = {isFiboSeq, isNotFiboSeq, isFiboElement,
isNotFiboElement},
(3) A start state S0 = {Pf}
(4) Stop states set Z = {Uc, Up, Pf}
(5) Next-State function f: S  I  S, a function to direct the
current state to the next state, after reading the input.
Transition table of function f is given in Table 1, while the
transition graph is given in Fig. 6. Additionally, Table 1
also lists the step taken after a current state read an input.
PFwinA describes that four conditions may be appearing
during the updating process, and they are defined as the input
of automaton. First condition is called isFiboSeq. It means the
volume sequence generates a Fibonacci sequence or
sup[i].vol = F[i]. Second condition is isNotFiboSeq which is
true if sup[n–1].vol = F[n–1] but sup[i].vol  F[i]. Third
condition is isFiboElement, which is reached when sup[n–
1].vol = F[n–1]; Last condition is isNotFiboElement to explain
that at sup[n–1].vol < F[n–1]. If a condition is met after the
current state is reached, step given in last column of Table 1 is
applied. Current element being updated is always sup[m],
where m = n – 1. When previous element should be updated,
m is decremented by one, m = m – 1 and sup[m] is updated;
and current element is now reset to sup[m].

The iteration is started by pushing the first element at array’s
front, thus PFwinA is started from Pf state, where N = 1, or
sup[n–1].vol = 1. The automaton can stop at any state and then
it is waiting for the next new window addition.

{sup[i].vol} = a Fibonacci sequence generates sup[n] = F[n], n
> 1, i = 0 to n–1.
Proof: supposed volume sequence = a Fibonacci sequence,
thus {sup[i].vol} = {1, 1, 2, …., F[n–2], F[n–1]}. The first
push-front makes volume sequence = {1}∪{1, 1, 2, .., F[n–2],
F[n–1]}. Summation of F[n–1] + F[n–2] creates the next
Fibonacci element F[n], and stored in sup[n].

4. Finding EP with PF-FWin
Fig. 6. PFWin Automaton Graph Transition
Table 1. State Transition Table for PFwin Automaton
No

Current
State

Input

Next
State

1

Pf

isFiboSeq

Pf

2

Pf

isnotFiboSeq

Uc

3

Uc

isFiboSeq

Pf

4

Uc

isFiboElement

Up

5

Uc

isNotFiboElement

Uc

6

Up

isFiboSeq

Pf

7

Up

isFiboElement

Up

8

Up

isNotFiboElement

Uc

Step taken
Push-front a new
element sup[0]
Update current
element sup[m]
Push-front a new
element sup[0];
Update previous
element: m = m–1;
update sup[m]
Update current
element sup[m]
Push-front a new
element sup[0]
Update previous
element: m = m–1;
update sup[m]
Update current
element sup[m]

To proof that PFwinA can generate volume sequence as
defined in Eq. (1) using push-front approach, the following
properties are developed and their proofs are given.
Property 1: when Fibonacci sequence has not been
developed yet, PFwinA is repeatedly updating sup[m] until
sup[m].vol = F[m]. We have to proof that for sS – {Pf}, f(s,
isNotFiboElement) = t  t = Uc; Pf is excluded since it
represents a state where isFiboSeq is true
Proof: by contradiction, assume sS – {Pf} where f(s,
isNotFiboElement) = t  t  Uc. By inserting all sS – {Pf}
into next-state function above: f({Up, Uc}, isNotFiboElement)
= f(Up,isNotFiboElement)∪f(Uc,isNotFiboElement) = {Uc};
since all function results t=Uc, thus assumption is not correct,
therefore Property 1 is correct.
Property 2: After sup[n–1].vol=F[n–1], PFwinA continues
to update sup[n–2]. The proof is clearly given in Table 1 that
sS, f(s, isFiboElement) = t  t = Up, which means,
PFwinA is updating the previous element i.e. by decrementing
m=m–1, and then updating sup[m].
Property 3: the second push-front operation taken after

Finding EP over the data stream with PF-FWin (and other
TTWM models) is to cope with big supports as the merging
results of some supports. As a consequence, GR(X)
formulation and mingrowth become two main keys to affect to
the number of EP that can be found from the streaming.
This article evaluates the emergent of X by inspecting
X.sup[0], i.e. the current support and another X.sup[j], j > 0. X
is said an -EP, > 0, if two following criteria are met:
(1) X is frequent in current window, or X.sup[0] minsupp
(2) GR(X) from sup[j] to sup[0] , where GR(X) is defined
in this work as Eq. (3):
GR X =

(X. sup[0] − X. sup[j] )
X. sup[j]

Such GR(X) formula is slightly different from one defined in
original work (introduced in Section II), because sup[j], j >1
possibly contains a big support value. Thus GR is defined so
that X’s support growth is said significant if it reaches a certain
percentage, e.g. 20%.

IV. Experimental Works, Results and Discussions
The experimental works are conducted to evaluate
performance of PF-FWin compared to LWin and FWin to find
EP in online transaction data streaming. EP quantity from three
methods becomes performance indicator of experiment. Online
streaming means that itemsets are directly generated from each
transaction as it arrives. Each generated itemset will also be
directly evaluated whether it is an EP or not using criteria
explained in previous section, Subsection 4. Therefore, it is also
necessary to evaluate runtime performance of each model to
organize the array when accommodating supports found in a
big number of windows.
Programs of three models are developed using C++ and
compiled in Visual Studio Express 2013’s environment. Those
programs are individually run on three different PCs but with a
same specification i.e. Intel Core i3 CPU @3.30Ghz-3.30GHz,
2GB RAM, under Windows 7 32-Bit. A container in C++’s
Standard Template Library, i.e. deque, is used to store the
supports in all models. A member function of deque, i.e.

pushfront(), is used to run push-front operation in PF-FWin’s
elements updating process.

1. Experimental Design

or N is almost reaching 2n, such as N = 15. Volume sequence
for N = 15 in using LWin is {1, 1[1], 2[2], 4[4]}, and as seen,
all sup[i] and sub[i], i> 0, are filled; thus when EP evaluation is
performed, the LWin algorithm must double-check all
elements. As comparison, volume sequence of FWin and PFFWin when N = 15 is {1, 1, 2, 3, 0, 8} and {1, 1, 1, 1, 3, 8}
respectively, but number of EP found by LWin, FWin and PFFWin when N = 15 is 15, 11 and 66 EP respectively. The
number of element in array of all models is actually quite
similar. The reason of PF-FWin produces more EP than the
other is because the presence of many 1s in front of the array.
These instances show that better support’s accuracy of PFFWin has improved the number of EP that can be found by this
model.

Dataset of experiment is generated by IBM Quest Data
Generator. It contains one million transactions, and 243 items,
where one transaction has 3-10 items. New window is created
after 10K transactions are processed in a window, thus there are
100 windows that must be organized by each model. Minsupp
is set to 100, while mingrowth  = 20%. The other higher
mingrowth values, e.g. 50%, have also been experimented, but
number of found EP is not so big, thus it would not be too
interesting to be discussed. Nevertheless, another experiment is
also performed using the same dataset and thresholds, but a
new window is created in every 10 minutes transactions
processing.
Not all itemsets are generated from each transaction, but only
itemsets with length ≤ 3 items. The reason is actually made
based on characteristic of FP, where the longer itemsets usually
have smaller support [16]. It means that short itemsets have
higher probability to appear frequently hence emerging quickly.
Fig. 7. Number of EP found by LWin, FWin, and PF-FWin per 10K
In addition, since we perform online data stream, memory
Transactions
resources must not also be wasted by filling it with long
itemsets with small possibility to be frequent and emerging in
The worst case of PF-FWIn is occurred when all sup[i].vol
short term.
reached their capacity, or when volume sequence = a Fibonacci
sequence because the array only contains two 1s in its front
2. Results and Discussions
elements. Fig. 7 shows that LWin outperforms PF-FWin in two
windows i.e. N = 7 and 33. If elaborated, PF-FWin’s and
Number of EP found by three models is charted in Fig. 7.
LWin’s volume sequences in N = 7 are {1, 1, 2, 3} and {1, 1[1],
Totally there are 4,708 EPs found by PF-FWin, while FWin
2, 4} respectively; while for N = 33 is {1, 1, 2, 3, 5, 8, 13} and
and LWin only found 1,255 and 2,469 EPs respectively. It
{1, 1[1], 2, 4, 8, 16} for respective models. Number of EP
means that performance of PF-FWin to find EP is about 1.9
found by these models in N = 7 is 5 and 6 EP respectively, and
times and 3.75 times or about 90% and 275% higher than
in N = 13 is 58 and 66 EP respectively. All sup[i].vol in PFLWin and Fwin, respectively.
FWin reached their capacity, thus the array only has two 1s in
The runtime performance does not show significant
its first elements. In contrast, array in LWin has three 1s and it
difference between three methods, whereas PF-FWin applies a
makes more itemsets for the mingrowth. Nevertheless, these
quite different approach compared to the traditional ones. PFinstances are good proof that having more 1s in array opens
FWin finished processing all 1M transactions by creating 100
opportunity for the algorithm to find more EP.
windows in 11H(our):27(M)inutes, while LWin and FWin
The best case for PF-FWin is reached after the volume
respectively spent 11H:48M and 11H:28M. Two methods of
sequence = Fibonacci sequence and push-front operations are
FWin are faster than LWin. In total, 814,895 different itemsets
performed thrice afterward. The reason of this situation is
in 100 windows or about 814.959 itemsets/window/10K
because many 1s are generated in front of such sequence. After
transactions are generated.
volume sequence = Fibonacci sequence, the first push-front
Compared to two other models, LWin took the longest time
operation creates a new element in front of array; the second
to finish all processes because it uses “double” geometric
push-front merges sup[n–2] into sup[n–1], thus sup[n–1] =
sequence to organize the array’s elements, thus number of
F[n–1]; and the third push-front updates sup[n–2] += sup[n–3].
elements that must be checked becomes doubled as well. Our
As the result, many 1s are then pushed into the front of
observation concluded that the time complexity worst case of
sequence which accordingly, support’s accuracy is improved
LWin is when all or almost all sup[i] and sub[i] are occupied,
and also becoming the best at this state. An instance for this

explanation can be gotten by comparing volume sequences
when N = 12 and N = 15.

Fig. 8.

Number of EP found by LWin, FWin, and PF-FWin per 10
Minutes

On the other side, result from another experiment also quite
similar with previous result (Fig. 8). Number of EP found by
PF-FWin, FWin and LWin respectively is 3,523, 686 and 1,962
EPs. In this case, FP-FWin also outperforms the other models
up to 5.13 and 1.79 times respectively.
First experiment approach shows that as number of
transactions in all windows is uniform, the number of itemsets
created and number of EP found in each window are constant.
However, those numbers are not the same for each window
when the second experiment approach is performed. This
phenomenon is happened more likely because of two reasons.
First, although the timer was set to 10 minutes (or other
duration), but the computer results for the time calculation have
milliseconds difference. It somehow affects to the number of
itemsets generated in each window as well. Second, such
itemsets’ difference affects to itemset’s opportunity to be
frequent and then emerging in next windows. It also becomes
the reason that at certain windows, number of EP found by
FWin and LWin is bigger than those found by PF-FWin.
Almost all windows created by both experiment approaches.
However, PF-FWin always outperforms the other two models
to find EP. If we review the GR’s formula, then an itemset X
can become an EP only if X.sup[j], hence X.sup[j].vol is not
that big. Among of three models, only PF-FWin can meet such
requirement because in certain circumstances, many Sup[i].vol
contain only one window. Therefore, even if  is made bigger
than 0.2, such as 0.5, PF-FWin still becomes the winner.
On the contrary, FWin produces the lowest number of EP,
because several elements are empty in some situations. For
example, when N = 14, volume sequence of PF-FWin, FWin
and LWin respectively is {1, 1, 1, 1, 2, 8}, {1, 0, 2, 3, 0, 8} and
{1, 1, 2[2], 4[4]}. While LWin has even two 1s in front of the
array, FWin only contains one 1; its sup[1].vol is even empty
whereas it should provide the most accurate support.
Accordingly, very little number of itemsets has opportunity to
be frequent and hence emerging in FWin. Adversely, PF-Fwin
provides not only better completeness, but also better accuracy
since many elements contain only one window.

V. Conclusion
The proposed PF-TTWM and its implementation shows that
PF-FWin has superiority to find more EP and runtime
performance in online data stream environment compared to
LWin and FWin. This work also suggests that data
completeness and accuracy are two quality characteristics that
must be provided by tilted time-windows approaches. As
shown in experiment, PF-FWin has better data completeness
and accuracy to makes this model can find more EP efficiently
than two other models. FP-FWin has been proven in this work
as the best option to find EP or predicting the trend of patterns’
popularity in online data stream.

Acknowledgment
This work was supported by the National Natural Science
Foundation of China under the grant No.61171173.

References
[1]

[2]
[3]

[4]

[5]

[6]

[7]

[8]

[9]

B. Wixom et al., “The current state of business intelligence in
academia: The arrival of big data”, Communications of the
Association for Information Systems, Vol. 34 (Article 1) ,2014
A. Balliu et al., “A Big Data analyzer for large trace logs”,
Computing, Vol. 98, No. 12, 2015, pp.1225–1249
A. Gandomi, A. Haider, “Beyond the hype: Big data concepts,
methods, and analytics”, Infor. Management, Int. Jour. Of,
Elsevier, Vol. 35, 2015, pp. 137 – 144.
M. S. Khan, et al., “A sliding windows based dual support
framework for discovering emerging trends from temporal data,”
Knowledge-based System, International Journal on, Vol. 10,
2010, pp.316 – 322
G. Dong, J. Li, “Efficient mining of emerging patterns:
discovering trends and differences,” KDD, ACM International
Conference on, 1999, pp. 43-52
G. Cormode, S. Muthukrishnan, “What's new: finding
significant differences in network data streams”, IEEE/ACM
Trans. Netw. 13(6), 2005, pp.1219-1232
H. Alhammady, K. Ramamohanarao, “Mining Emerging
Patterns and Classification in Data Streams,” IEEE/WIC/ACM
Int’l Conference on Web Intelligence, 2005, pp.272-275
Y. Chen, et al., “Multidimensional regression analysis of timeseries data streams”, In Proc. 2002 Int. Conf. Very Large Data
Bases (VLDB'02), 2002, pp.323.334.
V.E. Lee, R. Jin, G. Agrawal, “Frequent pattern mining in data
streams,” in Frequent Pattern Mining, C. C. Aggarwal, J.W. Han
(eds.), Springer, 2014, pp.199–224.

[10] C. Giannella, et al., “Mining frequent patterns in data streams at
multiple time granularities,” In: H. Kargupta, A. Joshi, D.
Sivakumar, Y. Yesha (eds) Data mining: next generation
challenges and future direction, MIT/AAI Press, 2004, pp.191–
212.
[11] T.M., Akhriza, Y.H. Ma, Y.H., J.H. Li, “A novel Fibonacci
windows model for finding emerging patterns over online data
stream”, Proc of IEEE SSIC, Shanghai, China, 2015, pp1-8.
[12] P. Kralj, et al., “Contrast Set Mining for Distinguishing Between
Similar Diseases,” LNCS, Volume 4594, 2007.
[13] J.Y. Li, et al., “Discovery of Significant Rules for Classifying
Cancer Diagnosis Data,” Bioinformatics, Vol. 19 (suppl. 2),
2003, pp.93-102
[14] J.Y. Li et al., “Simple Rules Underlying Gene Expression
Profiles of More than Six Subtypes of Acute Lymphoblastic
Leukemia (ALL) Patients”, Bioinformatics, Vol. 19, 2003,
pp.71-78
[15] L.J Chen, G.Z. Dong, “Masquerader Detection Using OCLEP:
One Class Classification Using Length Statistics of Emerging
Patterns”, Int’l Workshop on INformation Processing over
Evolving Networks (WINPEN), 2006
[16] C. Borgelt, “Frequent Item Set Mining in Wiley Interdisciplinary
Reviews: Data Mining and Knowledge Discovery”,
J. Wiley & Sons, Vol. 2 No. 6, 2012, pp.437-456.
[17] C. Lee, C. Lin, M. Chen, “Sliding-window filtering: an efficient
algorithm for incremental mining,” Information and knowledge
management (CIKM), ACM Int’l conf. on, 2001, pp.263–270
[18] H-f. Li, S-Y. Lee, “Mining frequent itemsets over data streams
using efficient window sliding techniques”, Expert system with
application, Journal of, Vol. 36, 2009, pp1466-1477.
[19] J.H. Chang, W.S. Lee, “Finding recent frequent itemsets
adaptively over online data streams,” In: L. Getoor et al. (eds)
Ninth ACM SIGKDD Int’l conf. on know. Disc. and data
mining, Washington DC, August, 2003, pp.487 – 492
[20] J.H. Chang, W.S. Lee, “estWin: Online data stream mining of
recent frequent itemsets by sliding window method,”
Information Science, Journal of, Vol. 31, No. 2, 2005, pp.76–90
[21] G.S. Manku, R. Motwani, R., “Approximate frequency counts
over data streams,” Proc. Of 28th international conference on
very large data bases, Hong Kong, 2002, pp.346–357
[22] J. Cheng, Y. Ke, Y., W. Ng, “A survey on algorithms for mining
frequent itemsets over data streams.” Knowl Inf Syst, Journal of,
Vol. 16, 2008, pp.1–27