
2015 IEEE 39th Annual International Computers, Software & Applications Conference

Dynamic Query Scheduling for
Online Integration of Semistructured Data
Handoko
School of Computer Science and Software Engineering
University of Wollongong, Australia
Email: h629@uowmail.edu.au

Janusz R. Getta
School of Computer Science and Software Engineering
University of Wollongong, Australia
Email: jrg@uow.edu.au

Abstract—In data integration systems a user request issued at a central site is decomposed into a number of sub-requests, which are later processed at the remote sites. The results are sent back to the central site for data integration, and the results of integration are returned to the user. Data integration systems often fail to achieve their best performance due to unpredictable data arrival rates. Traditionally, data integration requires the complete results from the remote sites to be available at the central site before the final computations begin. An online integration system starts and continues the computations at the central site shortly after every piece of data is received from the remote sites.

Many works and algorithms have been proposed to improve the performance of online integration under unpredictable data arrival rates. Most of them address the problem at one or more of three levels, i.e. the operator, scheduling, and query execution plan levels [1]. This paper proposes a mixed dynamic strategy in response to this problem as follows:

Execution of an online integration plan under a static scheduling strategy causes poor performance of a data integration system, as unnecessary computations are executed in some circumstances. This paper proposes dynamic scheduling for online integration plans, which employs a data increment monitoring system that is able to dynamically change the data integration plans whenever necessary.

• we propose an online integration system that computes every single increment of data received at a central site as soon as it arrives. We employ an algebraic system for semistructured data which has enough properties to generate an increment expression for every data container in a data integration expression.

Keywords—Online data integration, dynamic query scheduling,
semistructured data


• we propose a system that reduces the expensive I/O cost of materialization updates through priority execution of data arrivals and materialization management.

I. INTRODUCTION

Data integration systems employ classical distributed query processing, where a query execution plan (QEP) is generated at compile time and a query engine executes the query at runtime. Although this approach is well established, integration processing at the central site can perform badly because the central site does not have enough information about the remote sites' behavior. The complexity of the sub-queries sent to the remote sites, the load of a particular remote site, and the characteristics of the network determine the data arrival rate at the central site. In addition, the characteristics of sub-query results are difficult to assess [1]. Moreover, evaluating a data integration expression only after all data are available at the central site makes a data integration system slow to respond.

The structure of this paper is as follows: Section 2 describes previous work. Sections 3 and 4 cover the online integration system architecture and the scheduling of online integration plans. Then, we discuss the evaluation of online integration plans in Section 5, and Section 6 concludes the paper.
II. PREVIOUS WORK

Data integration systems for semistructured data require a data model and an algebra that allow for efficient processing of semistructured data. The XAnswer and XAL systems provide operators on relational-like data structures, which require a transformation from the tree structure to the relational model and vice versa. TAX (Tree Algebra for XML) is based on a tree view of XML documents [2]. The XTreeGraph [3] system handles a set of overlapping trees. The XML algebra proposed in [2] is a tree-based algebra generalizing the relational algebra.

Online integration is a continuous process that combines data transmitted over a network with data already available at a central site. Online integration applies online processing to instantly compute the smallest unit of increment data without waiting for the entire set of data to be available. Then, the result of the computation is combined with the current result to obtain a new state of processing.


Data integration systems can be classified into two groups, i.e. materialized view (data warehousing) and virtual view (virtualization). Viyanon [4] proposed a data integration system for XML data that detects the similarity of subtrees, whereas Poggi [5] proposed an integration system based on an identification of nodes.

In online integration systems, user requests are transformed into global query expressions. After decomposition of the global query expressions, data integration expressions are generated, where results from the remote sites become their arguments in the form of data containers. A set of increment expressions is generated from the data integration expressions such that every data container is assigned to an increment expression. From the increment expressions generated earlier, online integration plans are produced, and increment data is processed through the evaluation of a selected online integration plan.
0730-3157/15 $31.00 © 2015 IEEE
DOI 10.1109/COMPSAC.2015.248


In order to overcome this critical and time-consuming task of integration systems for data warehousing (DW) and business intelligence (BI), researchers have proposed many algorithms and techniques. Fegaras [6] and Bonifati [7] proposed systems for incremental maintenance of XML views. Meanwhile, a system that maintains materialized XQuery views by performing incremental updates to gain better access to data sources is proposed in [8].
The dynamic scheduling strategy proposed in [9] modifies the query plan whenever an unexpected delay (initial delay, bursty arrival, or slow delivery) occurs at any data source. Bouganim [1] employs monitoring of the arrival rate and available memory in the execution strategy of an integration system.

Fig. 2. (a) A syntax tree of a global query expression and decomposition
strategy to balance central and remote site processing (b) A data integration
expression (c) Increment expression for increment data δ1


III. ONLINE INTEGRATION SYSTEM ARCHITECTURE

In this paper an online integration system contains a mediator and a number of wrappers (see Fig. 1). A mediator transforms a user query into a global query expression and applies standard syntax-based optimisation, e.g. moving filtering before binary operations. The restructuring operation (π) is pushed to the top of a global query expression.

(1) Filtering: σϕ (r ∪ s) = σϕ (r) ∪ σϕ (s);
(2) Union: (r ∪ s) ∪ t = r ∪ (s ∪ t) = r ∪ s ∪ t;
(3) Join: (r ∪ s) ⊲⊳ρ,ϕ t = (r ⊲⊳ρ,ϕ t) ∪ (s ⊲⊳ρ,ϕ t);
(4) Antijoin: (r ∪ s) ∼ϕ t = (r ∼ϕ t) ∪ (s ∼ϕ t), and
(5) r ∼ϕ (s ∪ t) = (r ∼ϕ s) ∼ϕ ((r ∼ϕ s) ⋉ϕ t).

When the data containers r, s, and t share a common tree pattern that satisfies the propositional condition of the operations, then the distributivity of the antijoin operation over join, union, and antijoin can be determined as follows:
(6) (r ∼ϕ s) ⊲⊳ρ,ϕ t = (r ⊲⊳ρ,ϕ t) ∼ϕ (s ⊲⊳ρ,ϕ t);

(7) r ⊲⊳ρ,ϕ (s ∼ϕ t) = (r ⊲⊳ρ,ϕ s) ∼ϕ (r ⊲⊳ρ,ϕ (s ⋉ϕ t));
(8) (r ∼ϕ s) ∪ t = (r ∪ t) ∼ϕ (s ∼ϕ t);
(9) r ∪ (s ∼ϕ t) = (r ∪ s) ∼ϕ (t ∼ϕ (r ⋉ϕ s));
(10) (r ∼ϕ s) ∼ϕ t = r ∼ϕ s ∼ϕ t;
(11) r ∼ϕ (s ∼ϕ t) = (r ∼ϕ s) ∪ (r ⋉ϕ (s ⋉ϕ t))
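Under the simplified relational view of data containers described in this paper (each document reduced to a row), these distributivity rules can be exercised directly. The following sketch is a toy check of rule (3) only; the containers, tuples, and nested-loop join helper are hypothetical and not the paper's implementation:

```python
# A toy check of distributivity rule (3) under the simplified
# relational view of data containers (each document is one row).
# The containers r, s, t and the join condition are hypothetical.

def join(r, t, cond):
    """Nested-loop join: concatenate every pair satisfying cond."""
    return {a + b for a in r for b in t if cond(a, b)}

r = {(1, "a"), (2, "b")}     # a data container
s = {(3, "c")}               # an increment arriving for r
t = {(1, "x"), (3, "y")}     # another container

cond = lambda a, b: a[0] == b[0]   # join on the first component

# Rule (3): (r ∪ s) ⋈ t = (r ⋈ t) ∪ (s ⋈ t)
lhs = join(r | s, t, cond)
rhs = join(r, t, cond) | join(s, t, cond)
assert lhs == rhs
```

The same pattern checks the union and antijoin rules; it is this distributivity that lets an increment be pushed inside an expression without recomputing the whole join.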
B. Data Integration Expression
In order to achieve an optimal integration solution, the global query expression is decomposed into a number of sub-expressions, where the remote sites do part of the computations and the central site integrates their results to produce a final result. Borrowing from [10], query decomposition and data integration expressions are formally defined in Definitions 2 and 3.


Definition 2: Query decomposition is a process that transforms a global query expression e(x1 , . . . , xn ) into an expression f (q1 , . . . , qk ) where for all i = 1, . . . , k, qi = ei (xi1 , . . . , xij ), {xi1 , . . . , xij } ⊆ {x1 , . . . , xn }, xi1 , . . . , xij point to the same remote site, and qi , qj might be sent to the same remote site. The results of processing f (q1 , . . . , qk ) are identical to the results of processing e(x1 , . . . , xn ).


Fig. 1. Online integration system architecture

Definition 1: Let {x1 , ..., xn } be a set of pointers to the
data containers with XML documents located at remote sites.
A global query expression e(x1 , ..., xn ) is an expression built
from the operations of filtering (σ), join (⊲⊳), semijoin, antijoin
(∼) and union (∪) on {x1 , ..., xn }[10].

Definition 3: Let f (q1 , ..., qk ) be a result of decomposition of e(x1 , . . . , xn ). Let Di be the result of processing qi at a remote site for i = 1, . . . , k. A data integration expression f (D1 , ..., Dk ) is an expression obtained from f (q1 , . . . , qk ) by a systematic replacement of the symbols q1 , . . . , qk with the data containers D1 , . . . , Dk .

A. XML Algebra Rules
The online integration system operates on an algebraic system defined in [10]. The operators are conceptually consistent with the basic operators of the relational data model. If we simplify each document in a data container so that it represents a row in a relational table, then the system of operations reduces to the relational algebra.

Fig. 2a shows an example of a global query expression e(x1 , . . . , x6 ) where {x1 , . . . , x4 } are located at a single remote site S1 , x5 is at a remote site S2 , and x6 is at a remote site S3 . Suppose the decomposition strategy for the subtree rooted at the node labeled 1 is such that nodes 1 and 2 are processed at the central site, and the subtree rooted at node 3 is computed at the remote site S1 (see Fig. 2b). The results of these computations are transferred to the central site, which collects all received data in data containers (D) for further processing.

XML algebra operators possess the common associativity and distributivity properties of the relational algebra; the distributivity properties over the union operation are given in rules (1)-(5) listed earlier.


For example, let D1 , . . . , D5 be the data containers in an expression f (D1 , . . . , D5 ) = ((D1 ⊲⊳ D2 ) ⊲⊳ D3 ) ⊲⊳ (D4 ∼ D5 ), and let δ1 be an increment of the data container D1 . The transformation of the data containers produces the following:

To simplify the idea of creating increment expressions, we assume that each data container Di occurs only once in a data integration expression.
C. Increment Expression


The data integration expression generated in the earlier step must be transformed into a form that allows instant computation when an increment of a data container arrives at the central site. We consider a data increment (δ) to be a complete XML document and leave increments of fragmented documents for future work. Let δij be an increment of a data container Di ; then a data container is formed as Di = δi1 ∪ δi2 ∪ . . . ∪ δin . In this work the data integration operation is a union operation.

Therefore we get g1 = (((δ1 ⊲⊳ D2 ) ⊲⊳ D3 ) ⊲⊳ M3 ) where M3 = (D4 ∼ D5 ), g2 = (((D1 ⊲⊳ δ2 ) ⊲⊳ D3 ) ⊲⊳ M3 ), g3 = ((M1 ⊲⊳ δ3 ) ⊲⊳ M3 ), g4 = (M2 ⊲⊳ (δ4 ∼ D5 )), and g5 = δ5 , where M1 = (D1 ⊲⊳ D2 ) and M2 = (M1 ⊲⊳ D3 ).
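The increment expression g1 can be exercised on toy data. The sketch below is a minimal check, on hypothetical tuple containers keyed on their first component, that evaluating f with D1 ∪ δ1 gives the old result of f united with g1 ; the join and antijoin helpers are simplifications, not the paper's physical operators:

```python
# A toy check that f(D1 ∪ δ1, ..., D5) = f(D1, ..., D5) ∪ g1(δ1, M3)
# for f = ((D1 ⋈ D2) ⋈ D3) ⋈ (D4 ∼ D5). Containers hold tuples
# keyed on their first component; all data here is hypothetical.

def join(r, s):
    return {a + b[1:] for a in r for b in s if a[0] == b[0]}

def antijoin(r, s):
    keys = {b[0] for b in s}
    return {a for a in r if a[0] not in keys}

D1 = {(1, "a1")}
delta1 = {(3, "a3")}                       # an increment of D1
D2 = {(1, "b1"), (3, "b3")}
D3 = {(1, "c1"), (3, "c3")}
D4 = {(1, "d1"), (2, "d2"), (3, "d3")}
D5 = {(2, "e2")}

def f(d1):
    return join(join(join(d1, D2), D3), antijoin(D4, D5))

M3 = antijoin(D4, D5)                      # intermediate materialization
g1 = join(join(join(delta1, D2), D3), M3)  # increment expression for δ1

assert f(D1 | delta1) == f(D1) | g1
```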

Definition 4: Let Di be a data container, δi be an increment of Di , and Ma = ha (D1 , . . . , Dk ) : a = 1, . . . , j be a set of intermediate materializations. An increment expression gi (δi , M1 , . . . , Mj ) is an expression to compute an increment (δi ) against the intermediate materializations. gi has the form of a left/right-deep expression such that gi = gij (. . . (gi2 (gi1 (δi , M1 ), M2 ), . . .), Mj ). gij operates on the XML operators, i.e. join, antijoin, semijoin and union. In a special case, Ma can be an identity function of any data container Da .

Figure 2(c) shows an example of increment processing for δ1 in the form of a syntax tree. The result of processing an increment expression g1 is combined with the previous final materialization f (D1 , D2 , D3 , D4 , D5 ) to get a new state of processing. At the end of the transformation process we get a number of increment expressions {g1 , . . . , gk } and a list of intermediate materializations {M1 , . . . , Mj } to compute increment data. Since an intermediate materialization (Mi ) is a result of the computation of a data integration expression ha (D1 , . . . , Dk ), updating an intermediate materialization is performed in exactly the same way.

Theorem 1: Any data integration expression f (D1 , . . . , Di ∪ δi , . . . , Dk ) can always be transformed into one of the equivalent expressions:
f (D1 , . . . , Di , . . . , Dk ) ∪ gi (δi , M1 , . . . , Mj )   (1)
f (D1 , . . . , Di , . . . , Dk ) ∼ gi (δi , M1 , . . . , Mj )   (2)

δ1 : f (D1 , D2 , D3 , D4 , D5 ) ∪ (((δ1 ⊲⊳ D2 ) ⊲⊳ D3 ) ⊲⊳ M3 );
δ2 : f (D1 , D2 , D3 , D4 , D5 ) ∪ (((D1 ⊲⊳ δ2 ) ⊲⊳ D3 ) ⊲⊳ M3 );
δ3 : f (D1 , D2 , D3 , D4 , D5 ) ∪ ((M1 ⊲⊳ δ3 ) ⊲⊳ M3 );
δ4 : f (D1 , D2 , D3 , D4 , D5 ) ∪ (M2 ⊲⊳ (δ4 ∼ D5 ));
δ5 : f (D1 , D2 , D3 , D4 , D5 ) ∼ δ5 .


D. Online Integration Plans
In the next step, an online integration plan is generated for
every increment expression from the earlier stage.

Proof: For k = 1, the expression f operates on a unary operator. Then, according to rule (1), f (D1 ∪ δ1 ) = f (D1 ) ∪ g(δ1 ) holds. For k = 2, the expression f operates on binary operators. Then, according to rules (2)-(5), the transformation of f (D1 , Di ∪ δi ) into f (D1 , Di ) ∪ g(δi , D1 ) or f (D1 , Di ) ∼ g(δi , D1 ) holds. Assume that for k data containers f (D1 , . . . , Di ∪ δi , . . . , Dk ) can be transformed into expression (1) or (2). Let f (D1 , . . . , Di ∪ δi , . . . , Dk+1 ) be an extension of f (D1 , . . . , Di ∪ δi , . . . , Dk ) with a data container Dk+1 by one of the operators ∪, ⊲⊳, or ∼.

Definition 5: Let Di be a data container and Ma : a = 1, . . . , j be a set of materializations. gix : x = 1, . . . , j is a series of computations for the increment expression gi (δi , M1 , . . . , Mj ), and di is a plan to compute gi . ps : s = 1, . . . , m is a sequence of steps, where in each step a simple expression is evaluated. Each ps might be associated with either gix , a data container update, or a materialization update. An online integration plan for Di is defined as di : p1 ; . . . ; pm , where each step ps : s = 1, . . . , m is evaluated to accomplish the increment operation by passing its result to the next step.

We extend the expression f (D1 , . . . , Di ∪ δi , . . . , Dk ) with a union (∪) operator, where f (D1 , . . . , Di ∪ δi , . . . , Dk ) is the first argument and Dk+1 is the second argument. The expression is transformed as follows:
= f (D1 , . . . , Di ∪ δi , . . . , Dk ) ∪ Dk+1
= (f (D1 , . . . , Dk ) ∪ g(δi , M1 , . . . , Mj )) ∪ Dk+1 , (induction hypothesis)
= (f (D1 , . . . , Dk ) ∪ Dk+1 ) ∪ g(δi , M1 , . . . , Mj )
= f (D1 , . . . , Dk+1 ) ∪ g(δi , M1 , . . . , Mj )
Since Ma = ha (D1 , . . . , Dk ) : a = 1, . . . , j, we are able to form Ma′ = h′a (D1 , . . . , Dk+1 ) : a = 1, . . . , j. Then:
= f (D1 , . . . , Dk+1 ) ∪ g(δi , M1′ , . . . , Mj′ )
The same steps can be applied to prove the extension of f (D1 , . . . , Di ∪ δi , . . . , Dk ) with a data container Dk+1 by the operators ⊲⊳ and ∼.

An online integration plan di is generated in the following steps: (1) a step p1 is created to update the data container Di ; (2) then, a step ps is created for every gix : x = 1, . . . , j of the increment expression; (3) next, we add a step pm+1 to combine the last result of the increment computation with the final materialization (Me ); (4) last, we update all affected intermediate materializations by performing the same procedure for all Ma = ha (D1 , . . . , Dk ), but without data container and materialization updates.
In the example above, the transformation of the increment expression g1 creates the following online integration plan:
d1 : D1 = (D1 ∪ δ1 ); δd1 = (δ1 ⊲⊳ D2 ); δd2 = (δd1 ⊲⊳ D3 ); δd3 = (δd2 ⊲⊳ M3 ); Me = (Me ∪ δd3 ); M1 = (M1 ∪ δd1 ).
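The plan d1 can be read as a straight-line program over the central-site state, each step passing its result to the next. A minimal sketch follows; the container contents and join helper are hypothetical, and materialization I/O is elided:

```python
# A straight-line sketch of plan d1. Container contents and the
# join helper are hypothetical; materialization I/O is elided.

def join(r, s):
    return {a + b[1:] for a in r for b in s if a[0] == b[0]}

state = {
    "D1": {(1, "a1")},
    "D2": {(1, "b1"), (2, "b2")},
    "D3": {(1, "c1"), (2, "c2")},
    "M1": {(1, "a1", "b1")},            # D1 ⋈ D2 so far
    "M3": {(1, "x1"), (2, "x2")},       # D4 ∼ D5 so far
    "Me": set(),                        # final materialization
}
delta1 = {(2, "a2")}                    # incoming increment of D1

state["D1"] |= delta1                  # p1: D1 = D1 ∪ δ1
dd1 = join(delta1, state["D2"])        # p2: δd1 = δ1 ⋈ D2
dd2 = join(dd1, state["D3"])           # p3: δd2 = δd1 ⋈ D3
dd3 = join(dd2, state["M3"])           # p4: δd3 = δd2 ⋈ M3
state["Me"] |= dd3                     # p5: Me = Me ∪ δd3
state["M1"] |= dd1                     # p6: M1 = M1 ∪ δd1
```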

The transformation of a data integration expression into an increment expression is done by applying the distributivity properties, and it is performed in the following steps: (1) we replace Di in the data integration expression with Di ∪ δi ; (2) then, we use rules (1)-(11) explained earlier to move δi inside the expression such that it forms expression (1) or expression (2).

IV. SCHEDULING OF ONLINE INTEGRATION PLAN

During the compilation process, an integration controller prepares a set of online integration plans in which every data container (Di ) is assigned to an integration plan.


When any data container is completed, the proposed system identifies which materializations can be excluded from the materialization update process.

A. Static Scheduling
In static scheduling mode, a single increment received at any data container triggers the execution of the assigned integration plan. An integration plan allows any increment received at the central site to be computed, producing a new final state of the computation. The static scheduling strategy is performed in the following way:

In this paper we assume that the central site is equipped with a single processor, to show the main idea of rescheduling online integration plans. Parallel dynamic scheduling algorithms which employ parallel processing such as pipelining, parallel processors, or multi-threading are not covered in this discussion, but our approach can be extended to them by considering the dependencies between steps in the integration plans.

pick a plan for the corresponding increment data
for each step in integration plan di do
    load data into main memory if needed
    perform an algebraic operation
    if (step = materialization update) then
        perform writing to secondary storage

The first consideration is to reduce the number of documents involved in the computation process. In an online data integration system an increment expression can be union-ed (∪) or antijoin-ed (∼) with the previous final materialization (see expressions (1) and (2)), and processing increment data whose increment expression is in the second form (∼) may reduce the number of documents in the computation results. By giving a higher priority to processing increments which have this form, we might increase the performance of the online integration system. In the previous example, D5 is the data container whose increment data forms an increment expression in the second form (f (D1 , D2 , D3 , D4 , D5 ) ∼ δ5 ), whereas the other data containers possess increment expressions in the first form. Then, all increment data from D5 will have the highest priority for execution.

A static scheduling strategy for an online integration plan might create poor performance in several circumstances:
1) At the initial stage. At this stage most of the data containers are empty; therefore executing all steps in the corresponding integration plan is unlikely to give the correct answer to the user query. The most important step to be executed at this stage is updating the data container itself;
2) A sequence of increments at one data container. When several increments occur at a single data container, updates to the intermediate materializations are unnecessary, as the corresponding materializations will not be in use for the next computation. The increment results are kept in a temporary list;
3) At the ending stage. At this stage most data containers are completed. Then, updating an intermediate materialization can be skipped if we identify that the particular materialization will not participate in any further computation.

Then, we minimize materialization updates through management of increment processing. Every increment arriving at the central site is labeled to identify its associated data container. Data containers with a higher data arrival rate will receive increment data more often than those with a lower arrival rate. Let δi and δj be a sequence of increments, where δi arrives at the central site before δj . Two consecutive increments can occur in three possible conditions, as follows:

B. Dynamic Scheduling
Dynamic scheduling in this paper eliminates the inefficiency of static scheduling described earlier. We employ a monitoring system (Fig. 1) to continuously collect the behavior of increment data, data containers, and materializations. The monitoring system consists of an increment queue, an integration controller, a materialization dependency table, temporary increment lists, and a data container state table.

1) Both increments (δi and δj ) occur at a single data container. For example, both increments are associated with the data container D1 , where δi = δ11 and δj = δ12 . In further discussion this is named a sequence of Type 1.
2) Both increments occur at different data containers (Di , Dj ), and the data containers form the expression of an intermediate materialization ha (Di , Dj ). For example, based on the data integration expression in Figure 2(c), δi is an increment of data container D1 (e.g. δ11 ) and δj is an increment of data container D2 (e.g. δ21 ). The intermediate materialization M1 is the computation result of the expression ha (D1 , D2 ) = (D1 ⊲⊳ D2 ). This is an increment sequence of Type 2.
3) Both increments occur at different data containers (Di , Dj ), and the computation of the increment δj requires an updated materialization which involves the data container Di . In the example of Figure 2(c), δi can be an increment of data container D1 (e.g. δ11 ) and δj an increment at data container D3 , D4 or D5 . If δj is at data container D3 (e.g. δ31 ), then the materialization M1 needs to be updated before the computation of the increment δ31 is initiated. In further discussion this is named Type 3.
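Classifying a consecutive pair (δi , δj ) from a materialization dependency table can be sketched as follows. The argument and usage tables below are a hypothetical encoding of the Fig. 2c example, not the paper's data structures:

```python
# A sketch classifying two consecutive increments into Types 1-3.
# MAT_ARGS / USES are a hypothetical encoding of the Fig. 2c example:
# the direct arguments of each materialization, and the
# materializations used by each container's increment expression.

MAT_ARGS = {"M1": {"D1", "D2"}, "M2": {"M1", "D3"},
            "M3": {"D4", "D5"}, "Me": {"M2", "M3"}}
USES = {"D1": {"M3"}, "D2": {"M3"}, "D3": {"M1", "M3"},
        "D4": {"M2"}, "D5": {"Me"}}

def involves(mat, container):
    """Does materialization mat transitively depend on container?"""
    return any(a == container or (a in MAT_ARGS and involves(a, container))
               for a in MAT_ARGS[mat])

def classify(di, dj):
    """Type of the pair: δi at container di arrives before δj at dj."""
    if di == dj:
        return 1                      # Type 1: same data container
    if any(di in args and dj in args for args in MAT_ARGS.values()):
        return 2                      # Type 2: co-arguments of some Ma
    if any(involves(m, di) for m in USES[dj]):
        return 3                      # Type 3: δj needs a Ma touched by δi
    return 0                          # otherwise independent

assert classify("D1", "D1") == 1
assert classify("D1", "D2") == 2
assert classify("D1", "D3") == 3
```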

The increment queue is responsible for managing the documents received at the central site, and it serializes concurrent increments. The materialization dependency table keeps the relationships between intermediate materializations and data containers. The table contains information about which materializations are affected by an increment of a data container, and it determines which data containers include a particular intermediate materialization in their integration plans. This table is created during the process of generating the increment expressions and integration plans.
The integration controller contains a collection of integration plans for every data container. As the center of the dynamic monitoring system, it utilizes all components to decide whether to continue the current plan, skip some steps of the current plan, or terminate the plan. Temporary increment lists are used to keep temporary increment results which will later be combined into a materialization. A temporary increment list is associated with each materialization, and is flushed to the corresponding materialization whenever needed. Meanwhile, the data container state table is used to determine the status of the data containers.

We propose dynamic scheduling based on a sliding window model. The increments in a sliding window are labeled with their priority and sorted accordingly. Priority labeling and sorting work in the following steps:


2) δ51 : after updating D5 , the monitoring system identifies that D4 is empty; an early termination is then performed;
3) δ41 : all prepared steps are executed, including the materialization update, because the next increment occurs at D1 ;
4) δ11 , δ12 : the integration plans for the increments δ11 and δ12 perform all prepared steps except the materialization update, as the next consecutive increment is also at D1 ;
5) δ21 : all prepared steps are executed;
6) δ31 : all prepared steps are executed except the M2 update, as D4 and D5 are completed and M2 will be unused.

1) we collect a set of increments from a sliding window;
2) then, we give the highest priority to all data containers whose increment expressions are in the second form (see expression (2)). Increments from these data containers are scheduled in the first rows. If there are increments from more than one such data container, the increments that appear most often get the higher priority;
3) next, we select the increments of the data container that appears most often among all data containers in the current sliding window;
4) last, the next increment is determined by the current increment. We choose a next increment which satisfies Type 1, followed by increments of Type 2, and the rest of the increments get the least priority. We repeat from step 3 until all increments in the sliding window are scheduled.
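The steps above can be sketched on the window of the running example. The antijoin-form set and Type-2 pairs below are a hypothetical encoding of Fig. 2c; the code is a minimal illustration, not the paper's scheduler:

```python
from collections import Counter

# A sketch of priority labeling/sorting for a sliding window.
# ANTIJOIN_FORM and TYPE2 are a hypothetical encoding of the
# running example (D5 has the second-form increment expression).

ANTIJOIN_FORM = {"D5"}
TYPE2 = {frozenset({"D1", "D2"}), frozenset({"D4", "D5"})}

def schedule(window):
    """window: (container, increment) pairs in arrival order."""
    freq = Counter(c for c, _ in window)
    pending, out = list(window), []

    def take(pred, key=None):
        cands = [p for p in pending if pred(p)]
        if not cands:
            return False
        best = max(cands, key=key) if key else cands[0]
        out.append(best); pending.remove(best)
        return True

    # step 2: antijoin-form containers first, most frequent first
    while take(lambda p: p[0] in ANTIJOIN_FORM, key=lambda p: freq[p[0]]):
        pass
    # steps 3-4: most frequent container, then chain Type 1 / Type 2
    while pending:
        last = out[-1][0] if out else None
        if last and take(lambda p: p[0] == last):                 # Type 1
            continue
        if last and take(lambda p: frozenset({last, p[0]}) in TYPE2):
            continue                                              # Type 2
        take(lambda p: True, key=lambda p: freq[p[0]])            # step 3

    return [i for _, i in out]

window = [("D1", "d11"), ("D1", "d12"), ("D2", "d21"),
          ("D3", "d31"), ("D4", "d41"), ("D5", "d51")]
assert schedule(window) == ["d51", "d41", "d11", "d12", "d21", "d31"]
```

On this window the sketch reproduces the ordering derived in the example below (δ51 first, then δ41 , δ11 , δ12 , δ21 , δ31 ).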

At the ending stage of online data integration, we consider permanent termination of increment processing. Permanent termination is a process that cancels all running plans and stops the current online integration process when no new increment can have an impact on the final result. For example, let D4 and D5 in Fig. 2c be completed and let the computation of (D4 ∼ D5 ) return nothing. Then, the computation of any increment at D1 , D2 , and D3 produces nothing in the final result. In this case, dynamic scheduling allows the termination of the current plan, and terminates the current integration process.

For example, let a sequence of increments δ11 ← δ12 ← δ21 ← δ31 ← δ41 ← δ51 arrive at the central site for the integration expression in Figure 2(c). Suppose all the increments fit in a sliding window. Then, priority labeling and sorting produce the sequence of increments δ51 ← δ41 ← δ11 ← δ12 ← δ21 ← δ31 .

An increment which has been processed is removed from the current sliding window. Then, the dynamic scheduling system allows the sliding window to slide and collect new increments. The process of labeling and sorting is repeated with the new increments in the new sliding window. Sorting the current sliding window with new increments is much easier because the data in the previous sliding window has already been sorted.

The proposed dynamic scheduling system reduces materialization updates by plan termination and plan delay. Plan termination removes unneeded computation steps when the current step has no impact on the rest of the plan. For example, in the online integration plan d1 , when the computation of (δ1 ⊲⊳ D2 ) returns nothing, the remaining steps of plan d1 can be eliminated, and plan d1 is terminated.

It may happen that a significant delay exists between arrivals of increments. The performance of dynamic scheduling might fall to that of static scheduling if all steps in an integration plan are executed before a new increment arrives. Dynamic scheduling might employ statistical information about previous increments to predict the next likely increment, so that the decision to continue or defer a materialization update can be made. In this paper we assume that the delay between arrivals of increments is relatively small.

Plan delay is employed to defer materialization updates if the corresponding materialization is not involved in the next plan. By delaying a plan, we collect the results of increments in a list and perform the materialization update only when it is necessary. The delayed update is flushed when the corresponding materialization takes part in the next integration plan. Algorithm 1 shows a generic dynamic scheduling algorithm.

Algorithm 1: Dynamic scheduling system

V. EVALUATION OF ONLINE INTEGRATION PLANS

Online integration performance is determined by the decomposition strategy and the online integration plans. The performance of a decomposition strategy is measured by the processing cost at the central and remote resources, and the communication cost of bringing data to the central site [10]. Meanwhile, the performance of online integration plans is based on the utilization of the materializations at the central site where the integration process is performed.

while (not empty queue) do
    Get increment data from a sliding window;
    Sort increment data based on their priority;
    foreach (sorted increment data) do
        Prepare online integration plan;
        foreach (step pi in integration plan) do
            if (pi is a materialization (Mi) update and Mi is not used by the next plan) then
                Defer step pi;
            else
                Execute step pi;
                if (pi has no result) then
                    Terminate the remaining steps of the current plan;
                end
        end
    end
    Slide to the next sliding window;
end

The performance of online integration plans is determined at the central site and measured by a total cost function:

CT = Σj=1..m (CPj + CIj)    (3)
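As a toy instance of the cost function (3), with hypothetical per-step costs for a six-step plan in which only the materialization steps incur I/O:

```python
# A toy evaluation of the total cost function (3) for a six-step
# plan; all cost figures are hypothetical.
cpu = [0.2, 0.4, 0.3, 0.3, 0.1, 0.1]   # CPj for each step pj
io  = [0.0, 0.0, 0.0, 4.0, 5.0, 5.0]   # CIj: materialization access only
CT = sum(cp + ci for cp, ci in zip(cpu, io))
assert abs(CT - 15.4) < 1e-9
```

The I/O terms dominate, which is why the scheduling strategies above concentrate on deferring or skipping materialization updates.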

where CPj and CIj represent the CPU cost and the I/O cost of a single step pj in an online integration plan, respectively. Although CPU time is typically ignored, it is important to measure CPj because it includes the pattern tree matching component, which is used in most operations of the XML algebra to identify the nodes of interest. If the number of nodes in an XML document is denoted by |X| = m and the number of nodes in the pattern

Using the sequence of increments described earlier, dynamic scheduling performs the following steps:
1) every increment is labeled and sorted by its priority, which gives us: δ51 ← δ41 ← δ11 ← δ12 ← δ21 ← δ31 ;


tree by |P | = n, then the worst-case time complexity of pattern tree matching is O(m ∗ n).
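The O(m ∗ n) bound corresponds to comparing every document node with every pattern node. A naive sketch with trees flattened to node lists (node shapes hypothetical, not the paper's matcher):

```python
# A naive sketch of pattern tree matching flattened to node lists:
# every document node is compared with every pattern node, giving
# the O(m * n) worst case. Node shapes here are hypothetical.

def match_nodes(doc_nodes, pattern_labels):
    """Return document nodes whose label occurs in the pattern."""
    hits = []
    for d in doc_nodes:                  # m document nodes
        for p in pattern_labels:         # n pattern nodes
            if d["label"] == p:          # one test per (d, p) pair
                hits.append(d)
                break
    return hits

doc = [{"label": "book"}, {"label": "title"}, {"label": "price"}]
assert [h["label"] for h in match_nodes(doc, ["title", "price"])] \
       == ["title", "price"]
```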

and I/O costs for accessing materializations. At the initial state of integration, at the ending state of integration, and for consecutive increments, executing the whole sequence of an online integration plan is unnecessary and cost-ineffective. The system is able to stop a plan when there is no increment result to pass to the next operation, while still being able to update an intermediate materialization if needed. As it is based on a tree-based algebra generalizing the relational algebra, this system is also applicable to online integration systems based on the relational model.

The steps ps mostly operate on an increment against a data container or a materialization. After the online integration has been processed for a period of time, the materializations grow significantly larger than the increments. Increments and data containers are small enough to reside in transient memory, whereas materializations, which normally have a huge size, reside in secondary storage. For a step ps that computes an increment against a data container, no I/O cost is incurred because no access to secondary storage is needed.

During the decomposition process, it may happen that a data container Di appears as multiple arguments in a data integration expression. In this case, increment processing is performed for each argument Di serially, such that one argument Di is computed at a time, and the other Di arguments are treated as different data containers. Working on increments of document fragments remains a challenge for our future work.

For a step ps which computes an increment against a materialization, the I/O cost of reading the materialization from secondary storage is the main cost component. The CPU cost can be ignored because it is much smaller than the I/O cost. Since processing materialization data is critical, the implementation of the physical XML algebra operations is an important factor. In general, a step ps operates on two arguments where one argument (the increment) is smaller than the other (the materialization). In such conditions, when the increment fits in main memory, a block-nested loop implementation of the binary operations of the XML algebra is preferable. The block-nested loop allows us to force the increment to be the outer relation, which is then computed against one block of materialized documents at a time. In a few cases where the increment gets larger (e.g. ps = Mi ∼ δ), the block-nested loop implementation significantly increases the I/O cost if the materialization and the increment results do not fit in transient memory, and secondary storage has to be employed to store the increment results.
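The block-nested loop evaluation described above can be sketched as follows; the read_blocks helper simulating block-wise I/O and all data are hypothetical:

```python
# A sketch of block-nested loop evaluation: the materialization is
# streamed from secondary storage one block at a time, and the
# in-memory increment is joined against each block. read_blocks
# simulates the block-wise I/O and is hypothetical.

def read_blocks(materialization, block_size):
    for i in range(0, len(materialization), block_size):
        yield materialization[i:i + block_size]    # one "page" read

def bnl_join(increment, materialization, cond, block_size=2):
    out = []
    for block in read_blocks(materialization, block_size):
        for d in increment:            # increment resides in main memory
            for m in block:
                if cond(d, m):
                    out.append(d + m)
    return out

inc = [(1, "a")]                       # small increment
mat = [(1, "x"), (2, "y"), (1, "z")]   # large materialization
res = bnl_join(inc, mat, lambda d, m: d[0] == m[0])
assert sorted(res) == [(1, "a", 1, "x"), (1, "a", 1, "z")]
```

The materialization is scanned exactly once regardless of the increment size, which is what makes this layout attractive when the increment fits in main memory.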

VII. ACKNOWLEDGMENT

This work is supported by Directorate General of Higher
Education (Dikti), Indonesian Ministry of National Education.
REFERENCES
[1] L. Bouganim, F. Fabret, C. Mohan, and P. Valduriez, "Dynamic query scheduling in data integration systems," in Proceedings of the 16th International Conference on Data Engineering, 2000, pp. 425-434.
[2] Handoko and J. R. Getta, "An XML algebra for online processing of XML documents," in The 15th International Conference on Information Integration and Web-based Applications & Services, IIWAS '13, Vienna, December 2-4, 2013.
[3] W. May, "Logic-based XML data integration: a semi-materializing approach," Journal of Applied Logic, vol. 3, no. 2, 2005, pp. 271-307. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S1570868304000618
[4] W. Viyanon, S. Madria, and S. Bhowmick, "XML data integration based on content and structure similarity using keys," in On the Move to Meaningful Internet Systems: OTM 2008, ser. Lecture Notes in Computer Science. Springer Berlin Heidelberg, 2008, vol. 5331, pp. 484-493. [Online]. Available: http://dx.doi.org/10.1007/978-3-540-88871-0_35
[5] A. Poggi and S. Abiteboul, "XML data integration with identification," in Database Programming Languages, ser. Lecture Notes in Computer Science. Springer Berlin Heidelberg, 2005, vol. 3774, pp. 106-121. [Online]. Available: http://dx.doi.org/10.1007/11601524_7
[6] L. Fegaras, "Incremental maintenance of materialized XML views," in Database and Expert Systems Applications, ser. Lecture Notes in Computer Science, A. Hameurlain, S. Liddle, K.-D. Schewe, and X. Zhou, Eds. Springer Berlin Heidelberg, 2011, vol. 6861, pp. 17-32. [Online]. Available: http://dx.doi.org/10.1007/978-3-642-23091-2_2
[7] A. Bonifati, M. Goodfellow, I. Manolescu, and D. Sileo, "Algebraic incremental maintenance of XML views," ACM Trans. Database Syst., vol. 38, no. 3, Sep 2013, pp. 14:1-14:45. [Online]. Available: http://doi.acm.org.ezproxy.uow.edu.au/10.1145/2508020.2508021
[8] M. EL-Sayed, L. Wang, L. Ding, and E. A. Rundensteiner, "An algebraic approach for incremental maintenance of materialized XQuery views," in Proceedings of the 4th International Workshop on Web Information and Data Management, ser. WIDM '02. New York, NY, USA: ACM, 2002, pp. 88-91. [Online]. Available: http://doi.acm.org.ezproxy.uow.edu.au/10.1145/584931.584950
[9] T. Urhan, M. J. Franklin, and L. Amsaleg, "Cost-based query scrambling for initial delays," SIGMOD Rec.,
vol. 27, no. 2, Jun. 1998, pp. 130–141. [Online]. Available:
http://doi.acm.org.ezproxy.uow.edu.au/10.1145/276305.276317
[10] Handoko and J. R. Getta, “Query decomposition strategy for integration
of semistructured data,” in The 16th International Conference on Information Integration and Web-based Applications & Services, IIWAS ’14,
Hanoi - December 4-6, 2014.

The last type of ps is an operation to update materializations. At least one materialization update exists in an online
integration plan, which is updating a final materialization. If
a materialization update is performed using a union operation
(equation 1), then increment results (δj ) are appended to a
secondary storage where the materialization is stored. I/O cost
to update materialization by union operation is determined by
the number of block B(δdij ) to be appended to a secondary
storage. When materialization update is in the form of antijoin
operation (equation 2), then I/O cost is determined by the cost
model described earlier with an additional I/O cost to write
the increment results to a secondary storage.
I/O cost is generally more expensive than CPU cost,
therefore dynamic scheduling is employed to eliminate unnecessary materialization updates in order to minimize the
costs. Materialization update can be postponed by collecting
increment results to be integrated in a list. Whenever the
materialization (Mi ) is included in the next online integration
plan, operation to update materialization is triggered and Mi
will be updated with a corresponding list of increment results.
VI.

ACKNOWLEDGMENTS

C ONCLUSIONS AND F UTURE W ORK

Dynamic scheduling system proposed in this paper optimizes the online integration of semistructured data by removing unnecessary computations in the running phase. Increment
data arrived in an online integration system is assigned to
an online integration plan which is a sequence of simple
expression evaluations over increment data and materialization.
Effectiveness of an online integration plan is measured by a
cost model which involves CPU costs to compute operations,
