
Load Balancing Distributed Inverted Files
Mauricio Marin
mmarin@yahoo-inc.com

Carlos Gomez
cgomez@dcc.uchile.cl

Yahoo! Research
Santiago, Chile

ABSTRACT
This paper presents a comparison of scheduling algorithms applied to load balancing the query traffic on distributed inverted files. We implemented a number of algorithms taken from the literature and propose a novel method to formulate the cost of query processing so that these algorithms can be used to schedule queries onto processors. We avoid measuring load balance at the search engine side because this can lead to imprecise evaluation. Instead, our method is based on the simulation of a bulk-synchronous parallel computer at the broker machine side. This simulation determines an optimal way of processing the queries and provides a stable baseline upon which both the broker and the search engine can tune their operation in accordance with the observed query traffic. We conclude that the simplest load balancing heuristics are good enough to achieve efficient performance. Our method can be used in practice by broker machines to schedule queries efficiently onto the cluster processors of search engines.

Categories and Subject Descriptors
H.3.3 [Information Storage and Retrieval]: Information
Search and Retrieval—Search process

General Terms
Algorithms, Performance

Keywords
Inverted Files, Parallel and Distributed Computing

1. INTRODUCTION

Cluster-based search engines use distributed inverted files [10] for dealing efficiently with high traffic of user queries.
An inverted file is composed of a vocabulary table and a set
of posting lists. The vocabulary table contains the set of
relevant terms found in the text collection. Each of these terms is associated with a posting list which contains the identifiers of the documents where the term appears in the collection, along with additional data used for ranking purposes.
To solve a query, it is necessary to get the set of documents
associated with the query terms and then perform a ranking
of these documents in order to select the top K documents
as the query answer.
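To fix ideas, the following minimal Python sketch (names and the toy frequency-sum ranking are ours, not a production scheme) shows the structure just described: a vocabulary table mapping each term to its posting list, and a basic solve-a-query step that fetches the lists for the query terms and ranks the candidate documents.

from collections import defaultdict

# Vocabulary table: term -> posting list of (doc_id, frequency) pairs.
index = defaultdict(list)

def add_document(doc_id, text):
    # Build posting lists; real systems store additional ranking data.
    freqs = defaultdict(int)
    for term in text.lower().split():
        freqs[term] += 1
    for term, f in freqs.items():
        index[term].append((doc_id, f))

def solve_query(terms, K):
    # Fetch the posting list of every query term and rank the candidate
    # documents; here the "ranking" is a toy sum of term frequencies.
    scores = defaultdict(int)
    for term in terms:
        for doc_id, f in index[term]:
            scores[doc_id] += f
    top = sorted(scores.items(), key=lambda x: -x[1])
    return top[:K]  # the top-K documents form the query answer

add_document(1, "parallel query processing")
add_document(2, "query scheduling on parallel machines")
print(solve_query(["parallel", "query"], K=1))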

Well-known Web search engines take a pragmatic approach to the parallelization of inverted files, namely the document partitioned approach. Documents are evenly distributed onto P processors and an independent inverted file is constructed for each of the P sets of documents. The disadvantage is that each user query has to be sent to all P processors, and imbalance can arise at the posting list level (which increases disk access and interprocessor communication costs). The advantage is that document partitioned indexes are easy to maintain, since the insertion of new documents can be done locally. This locality is also extremely convenient for the posting list intersection operations required to solve queries, as intersections come for free in terms of communication costs. Intersecting posting lists is necessary to determine the set of documents that contain all of the terms present in a given user query.
A competing approach is the term partitioned index, in which a single inverted file is constructed from the whole text collection, and the terms with their respective posting lists are then evenly distributed onto the processors. The term partitioned inverted file, however, destroys the possibility of computing intersections for free in terms of communication cost, and thereby one is compelled to use strategies such as caching and the smart distribution of terms onto processors to increase locality for the most frequent terms (which can be detrimental to overall load balance). On the other hand, it is not necessary to broadcast queries to all processors (which reduces communication costs), and disk latency costs are smaller since they are paid once per posting list retrieval per query; it is well known that with current cluster technology it is faster to transfer blocks of bytes through the interprocessor network than from RAM to disk. Nevertheless, load balance is sensitive to queries referring to particular high-frequency terms, making it necessary to use posting list caching strategies to overcome imbalance in disk accesses.
Both strategies can be efficient, depending on the method used to perform the final ranking of documents. In particular, the term partitioned index is better suited for methods that do not require posting list intersections. We have observed that the balance of disk accesses (that is, posting list fetching) and of document ranking are the most relevant factors affecting the performance of query processing; the balance of interprocessor communication depends on the balance of these two components. From empirical evidence we have observed that moderate imbalance in communication is not detrimental to performance, that good balance in document ranking is always relevant, and that good balance in disk accesses is crucial for ranking methods requiring the intersection of posting lists.
When a given problem of size N is solved on P processors, optimal load balance for many applications tends to be achieved when there is sufficient slackness, namely when N/P is large enough. Slackness is also useful for hiding overheads and other inefficiencies such as poor scalability. Given the size of the Web, current search engines operate under huge slackness, since the number of processors at data centers is similar to the average rate of queries per second. In this context document partitioned inverted files work well, but this neither implies that they make efficient use of computational resources nor guarantees the same scenario at larger query traffic demanding the use of more and more processors. We have observed that even at large slackness, sudden imbalance can produce unstable behavior such as the saturation of communication buffers (which in our case implies a program crash).
Some work has been done on the problem of load balancing query traffic on inverted files, applying simple heuristics such as least loaded processor first or round-robin [7]. However, the literature on parallel computing offers a number of more sophisticated strategies for the static and dynamic scheduling of tasks onto processors. It is not clear whether those strategies are useful in this particular application of parallel computing, namely whether they can significantly improve performance under high query traffic. Moreover, previous work applies these heuristics at the search engine side without considering the actual factors leading to imbalance, because the scheduling decisions are not built upon a cost model that is independent of the current operation of the search engine.
Queries arrive at the processors from a receptionist machine that we call the broker. In this paper we study the case in which the broker is responsible for assigning work to the processors; jobs badly scheduled onto the processors can result in high imbalance. To this end the broker uses a scheduling algorithm. A simple approach is to distribute the queries uniformly at random onto the processors in a blind manner, namely, as they arrive at the broker they are scheduled in a circular round-robin fashion. A more sophisticated scheduling strategy demands more computing power from the broker, so in our view this cost should be paid only if load balance improves significantly. Notice that the proposals of this paper can be extended to the case of two or more broker machines by simple composition.
The key point in the use of any scheduling algorithm is the proper representation of the actual load imposed on the processors. For distributed inverted files we need to properly represent the cost of disk accesses and document ranking and, very importantly, their relationship. A method for this purpose is the main contribution of this paper, along with an evaluation of the effectiveness of a number of scheduling algorithms in the distributed inverted file context. Our method, together with the scheduling strategy found to be most efficient in this paper, can be considered a practical new strategy for processing queries in search engines.
Most implementations of distributed inverted files reported so far are based on the message passing approach to parallel computing, in which we can see combinations of multithreaded and computation/communication overlapped systems. Artifacts such as threads are potential sources of overhead and can produce unpredictable running times. Yet another source of unpredictable behavior is the disk accesses used to retrieve the posting lists. Runs are too dependent on the particular state of the machine and its fluctuations, and thereby predicting the current load balance in order to perform proper job scheduling can be involved and inaccurate. An additional complication related to measurement and control is that a corrective action in one part of the system can affect the measures taken in another part, and this can propagate circularly.
We think that what is needed is a way to measure load balance that is independent of the current operation of the search engine. We propose a precise and stable way to measure load balance and perform job scheduling: the broker simulates a bulk-synchronous parallel (BSP) computer [8] and takes decisions based on its cost model. This well-structured form of parallel computation is known to allow a very precise evaluation of the costs of computation and communication. For a fully asynchronous multithreaded search engine, the broker simulates a BSP machine and decides where to route queries, whereas the processors also simulate a BSP machine in order to tune their operation (thread activity) to the pace set by the BSP machine. These simulations are simple and their overheads are very low. Certainly the asynchronous search engine can be replaced by an actual BSP search engine, in which case the simulation becomes an actual execution at the cluster side.

The BSP cost model provides an architecture independent way to both measure load balance and relate the costs of posting list fetching and document ranking. It has been shown elsewhere [8] that a BSP machine is able to simulate any asynchronous computation to within small constant factors, so the simulated BSP computer is expected to work efficiently if scheduling is effected properly. This provides a stable setting for comparing scheduling algorithms.
The BSP model of parallel computing [9] is as follows. The computation is organized as a sequence of supersteps. During a superstep, the processors may perform computations on local data and/or send messages to other processors. The messages are available for processing at their destinations by the next superstep, and each superstep ends with the barrier synchronization of the processors. The underlying communication library ensures that all messages are available at their destinations before the next superstep starts.
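A minimal sketch of this execution model (names are ours): messages produced during superstep s become visible in the destination inboxes only at superstep s + 1, which is what the barrier guarantees.

P = 4  # number of processors

def bsp_run(supersteps, step):
    # inboxes[p] holds the messages available to processor p this superstep.
    inboxes = [[] for _ in range(P)]
    for s in range(supersteps):
        outgoing = [[] for _ in range(P)]
        for p in range(P):
            # step() computes on local data and returns (dest, msg) pairs.
            for dest, msg in step(s, p, inboxes[p]):
                outgoing[dest].append(msg)
        # Barrier: messages become available only at the next superstep.
        inboxes = outgoing

def step(s, p, msgs):
    # Toy computation: processor 0 emits a token, then it circulates.
    if s == 0 and p == 0:
        return [(1, "token")]
    return [((p + 1) % P, m) for m in msgs]

bsp_run(supersteps=3, step=step)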
The remainder of this paper is organized as follows. Section 2 presents a bulk-synchronous method for parallel query processing and Section 3 presents our approach to performing query scheduling and measuring load balance. Section 4 presents experimental results. Section 5 describes a particular case, derived from the method of Section 3, in which ad-hoc load balancing can be effected for the document partitioned inverted file. Section 6 presents conclusions.

2. PARALLEL QUERY PROCESSING

The broker simulates the operation of a BSP search engine that is processing the queries. The method employed by the BSP machine is as follows. Query processing is divided into “atoms” of size K, where K is the number of documents presented to the user as part of the query answer. These atoms are scheduled in a round-robin manner across supersteps and processors. The asynchronous tasks are given K-sized quanta of processor time, communication network and disk accesses. These quanta are granted during supersteps, namely they are processed in a bulk-synchronous manner. Since all atoms are equally sized, the net effect is that no particular task can prevent others from using the resources. This is because (i) computing the solution to a given query can take the processing of several atoms, (ii) the search engine can start processing a new query as soon as any query finishes, and (iii) the processors are barrier synchronized and all messages are delivered to their destinations at the end of each superstep. It is not difficult to see that this scheme is optimal provided that we find an “atom” packing strategy that produces optimal load balance and minimizes the total number of supersteps required to complete a given set of queries (this is directly related to the critical path for the set of queries).
The simulation assumes that at the beginning of each superstep the processors find in their input message queues both new queries placed there by the broker and messages with pieces of posting lists related to the processing of queries that arrived in previous supersteps. The processing of a given query can take two or more supersteps to complete. The processor at which a given query arrives is called the ranker for that query, since it is in this processor that the associated document ranking is performed.

Every query is processed in two major steps. The first one consists of fetching a K-sized piece of every posting list involved in the query and sending these pieces to the ranker processor. In the second step, the ranker performs the actual ranking of documents and, if necessary, asks for additional K-sized pieces of the posting lists in order to produce the K best ranked documents that are passed to the broker as the query results. We call these iterations; thus the ranking process can take one or more iterations to finish. In every iteration, a new piece of K pairs (doc id, frequency) of the posting list of every term involved in the query is sent to the ranker. At any given interval of time, the ranking of two or more queries can take place in parallel at different processors, along with the fetching of K-sized pieces of posting lists associated with new queries.
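The following simplified sketch (ours; a real engine stops when the top-K set is stable rather than when the lists are exhausted) mimics this iteration protocol for a single query whose ranker is local: in each iteration a K-sized piece of every term's posting list is fetched and merged at the ranker.

K = 2  # documents per answer, and size of each posting-list piece

# Hypothetical per-term posting lists: term -> [(doc_id, freq), ...]
posting_lists = {"parallel": [(1, 3), (2, 1), (5, 2)],
                 "query":    [(1, 2), (5, 1), (7, 4)]}

def fetch_piece(term, iteration):
    # Step 1: fetch the next K-sized piece of this term's posting list.
    lst = posting_lists[term]
    return lst[iteration * K:(iteration + 1) * K]

def rank_query(terms):
    # Step 2 (at the ranker): merge the pieces and iterate; here we simply
    # stop when the posting lists are exhausted.
    scores, iteration = {}, 0
    while True:
        pieces = [fetch_piece(t, iteration) for t in terms]
        if not any(pieces):
            break  # no more postings: ranking is complete
        for piece in pieces:
            for doc_id, f in piece:
                scores[doc_id] = scores.get(doc_id, 0) + f
        iteration += 1
    top = sorted(scores.items(), key=lambda x: -x[1])[:K]
    return top, iteration

print(rank_query(["parallel", "query"]))  # top-K docs and iteration count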

3. SCHEDULING FRAMEWORK

The broker uses the BSP cost model to evaluate the cost of its scheduling decisions. The cost of a BSP program is the cumulative sum of the costs of its supersteps, and the cost of each superstep is the sum of three quantities: w + hG + L, where w is the maximum computation performed by any processor, h is the maximum number of message words sent/received by any processor, with each word costing G units of running time, and L is the cost of barrier synchronizing the processors. The effect of the computer architecture is included through the parameters G and L, which are increasing functions of P. Analogously to communication, the average cost of each access to disk can be represented by a parameter D.
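In code form, the cost of one superstep is a direct transcription of this formula; the sketch below (the parameter values are placeholders, and the disk term d·D is our addition following the remark about the parameter D) accumulates the cost of a BSP program over its supersteps.

def superstep_cost(w, h, d, G, L, D):
    # w: max local computation on any processor
    # h: max number of words sent/received by any processor
    # d: max number of disk accesses on any processor
    # G: cost per communicated word; L: barrier cost; D: cost per disk access
    return w + h * G + d * D + L

def program_cost(supersteps, G, L, D):
    # Total BSP cost: cumulative sum over all supersteps.
    return sum(superstep_cost(w, h, d, G, L, D) for (w, h, d) in supersteps)

# Example with placeholder machine parameters:
print(program_cost([(100, 20, 4), (80, 35, 2)], G=1.5, L=50.0, D=10.0))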

The broker performs the scheduling by maintaining two windows which account for the processors' work-load through the supersteps. One window is for the ranking operations effected per processor per superstep, and the other is for the posting list fetches from disk, also per processor per superstep. In each window cell we keep the count of the number of operations of each type.
Selecting the ranker processor for a given query takes two phases. In the first one, the cost of the list fetches is reflected in the corresponding window in accordance with the type of inverted file (document or term partitioning). In the second phase, the ranking of documents is reflected in the other window; every decision on where to perform the ranking has an impact on the balance of computation and communication, and it is here that a task scheduling algorithm is employed. Each alternative placement is evaluated with the BSP cost model, so the windows reflect the history of previous queries and iterations.
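A minimal sketch of this bookkeeping, assuming a fixed-length window and using the ranking counts as a crude proxy for the communication term h (our simplification, not the paper's exact accounting):

import numpy as np

W, P = 8, 4   # window length in supersteps, number of processors

# Two windows: operation counts per superstep (rows) per processor (cols).
rank_win = np.zeros((W, P), dtype=int)    # document ranking operations
fetch_win = np.zeros((W, P), dtype=int)   # posting list fetches from disk

def window_cost(s, G, L, D):
    # BSP cost of superstep s as seen through the windows; communication
    # is approximated by the ranking counts, since each ranking receives
    # K-sized pieces of posting lists.
    w = rank_win[s].max()
    h = rank_win[s].max()
    d = fetch_win[s].max()
    return w + h * G + d * D + L

def best_ranker(s, G=1.5, L=50.0, D=10.0):
    # Evaluate each candidate ranker placement with the BSP cost model
    # and keep the cheapest one.
    costs = []
    for p in range(P):
        rank_win[s, p] += 1                 # tentative placement
        costs.append(window_cost(s, G, L, D))
        rank_win[s, p] -= 1                 # undo
    return int(np.argmin(costs))

rank_win[0] = [5, 1, 2, 3]
print(best_ranker(0))   # -> 1, a lightly loaded processor in superstep 0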
The optimization goal for the scheduling algorithms is as follows. We set an upper limit on the total number of list fetches allowed to take place in each processor in each superstep (we use 1.5R, where R is the average). Notice that we cannot change the processor where a list fetch takes place, but we can defer it one or more supersteps to avoid imbalance coming from these operations. This has an effect on the other window, since it also defers the ranking of the associated documents; these deferrals provide the combinations that the scheduling algorithms are in charge of evaluating and selecting from.
The optimization goal is to achieve an imbalance of about 15% in the document ranking operation. Imbalance is measured via efficiency, which for a measure X is defined as the ratio average(X)/maximum(X) ≤ 1 over the P processors.
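The efficiency metric and the fetch-deferral test translate directly into code (the 1.5R limit and the roughly 15% imbalance target are the values stated above):

def efficiency(load):
    # Efficiency of a measure X over the P processors: average/maximum.
    m = max(load)
    return sum(load) / (len(load) * m) if m > 0 else 1.0

def fetch_exceeds_limit(fetches, p):
    # A list fetch at processor p is deferred to a later superstep when it
    # would exceed 1.5R, with R the average number of fetches per processor.
    R = sum(fetches) / len(fetches)
    return fetches[p] + 1 > 1.5 * R

rankings = [10, 12, 9, 11]
print(efficiency(rankings))                  # 0.875: within the ~15% target
print(fetch_exceeds_limit([6, 2, 3, 1], 0))  # True: defer this fetch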
All this is based on the assumption that the broker is able to predict the number of iterations demanded by every query, which requires cooperation from the search engine. If it is a BSP search engine the solution is simple, since the broker can determine which queries were retained for further iterations from the answers coming from the search engine. For instance, for a query requiring just one iteration the answer must arrive within two supersteps of the search engine. To this end the answer messages indicate the current superstep of the search engine, and the broker updates its own superstep counter by taking the maximum over these messages.
For fully multi-threaded asynchronous search engines it is necessary to collect statistics at the broker side for the queries arriving back from the search engine. Data for the most frequent query terms is kept cached, whereas other terms are initially given one iteration. If the search engine is implemented using the round-robin query processing scheme described in the previous section, then the exact number of iterations can be calculated for each query. For asynchronous search engines using any other form of query processing, the broker can predict how those computations could have been done by a BSP search engine from the response times of the queries sent for processing. This is effected as follows. The broker predicts the operation of the hypothetical BSP engine every Nq completed queries by assuming that Q = q P new queries are received in each superstep. For this period of ∆ units of time, the observed value of Q can be estimated using the G/G/∞ queuing model. Let S be the sum of the differences δq = DepartureTime - ArrivalTime over the queries, that is, the sum of the intervals of time elapsed between the arrival of each query and the end of its complete processing. Then the average q for the period is given by S/∆. This is because the number of active servers in a G/G/∞ model is defined as the ratio of the arrival rate of events to the service rate of events (λ/µ): if n queries are received by the processor during the interval ∆, then the arrival rate is λ = n/∆ and the service rate is µ = n/S, so that λ/µ = S/∆. The total number of supersteps for the period is then given by Nq/Q, and the average running time demanded by each superstep is δs = ∆ Q/Nq, so that the number of iterations for any query i is given by δqi/δs.
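The following sketch transcribes this estimation; variable names are ours, and we take ∆ as the span between the first arrival and the last departure of the Nq completed queries.

def predict_bsp(arrivals, departures, P):
    # arrivals/departures: times of the last Nq completed queries.
    Nq = len(arrivals)
    delta = max(departures) - min(arrivals)  # length of the period
    dq = [d - a for a, d in zip(arrivals, departures)]
    S = sum(dq)                              # total in-system time
    q = S / delta                            # active queries (G/G/inf)
    Q = q * P                                # queries per superstep
    supersteps = Nq / Q                      # supersteps in the period
    ds = delta * Q / Nq                      # running time per superstep
    iterations = [x / ds for x in dq]        # iterations per query
    return Q, supersteps, iterations

Q, n_ss, iters = predict_bsp([0, 1, 2, 3], [4, 5, 5, 6], P=4)
print(Q, n_ss, iters)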

4. EVALUATION OF SCHEDULING ALGORITHMS

We studied the suitability of the proposed method by evaluating different scheduling algorithms using actual implementations of the document and term partitioned inverted files. Our BSP windowing simulation scheme allows a comparison of well-known algorithms under the same conditions.
Under circular allocation of queries to processors, the load balance problem is equivalent to the case of balls thrown uniformly at random into a set of P baskets. As more balls of size K are thrown into the baskets, it becomes more likely that the baskets end up with a similar number of balls. In each superstep the broker throws Q = q P new queries onto the processors. However, the effect of balls coming from previous supersteps, as a result of queries requiring two or more iterations, is not clear. The number of iterations depends on the combined effect of the specific terms present in the query and the respective lengths of their posting lists. This makes a case for exploring the performance of well-known scheduling algorithms in this context.
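A quick simulation (ours, purely illustrative) makes the point: the efficiency average(load)/maximum(load) of randomly filled baskets improves as the number of balls per basket grows, which is why low query traffic suffers more imbalance.

import random

def balls_efficiency(balls, P, trials=1000):
    # Throw `balls` balls uniformly at random into P baskets and report
    # the mean efficiency average(load)/maximum(load) over many trials.
    total = 0.0
    for _ in range(trials):
        load = [0] * P
        for _ in range(balls):
            load[random.randrange(P)] += 1
        total += (sum(load) / P) / max(load)
    return total / trials

P = 32
for q in (8, 32, 128):
    print(q, balls_efficiency(q * P, P))  # efficiency grows with traffic q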
The scheduling algorithms evaluated include both dynamic and static ones [5, 6, 1, 3, 2, 4]. In some cases we even gave them the advantage of exploring the complete query log used in our experiments in order to formulate a schedule of the queries. In others we allowed them to take batches of q P queries to compute the schedule. We define the makespan as the maximum amount of work performed by any processor.
The algorithms evaluated are the following:

[A1] Round-robin: distribute the queries circularly onto the processors.

[A2] Graham's algorithm (least loaded processor, LLP): every time a task j arrives, select the least loaded processor as the ranker.

[A3] Limit algorithm: every time a task j arrives, the makespan is computed as the maximum work load over the machines. Then a processor i is selected so as to minimize the difference between the makespan and the work load of processor i plus the running time of the new task j; if there is no such machine, the least loaded processor is selected.

[A4] Optimal limit algorithm: like A3, but the limit is (Σi L(i))/P, where L(i) is the work load of processor i and the sum runs over all processors.

[A5] LPT (longest processing time): the tasks are sorted in decreasing order of processing time and then assigned in that order using LLP.

[A6] FFD (first-fit decreasing): the optimal makespan is set as in A4. The tasks are sorted in decreasing order and assigned in that order, each to a processor chosen so as not to exceed the optimal makespan; if there is no such machine, the task is assigned the LLP way.

[A7] BFD (best-fit decreasing): like FFD, but the task is assigned to the processor where the difference between the optimal makespan and the work load of that processor plus the running time of the task is minimum; if there is no such machine, the LLP strategy is used.
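For concreteness, the following sketch (ours, with arbitrary task costs) implements three of these heuristics: A1, A2 (Graham's LLP) and A5 (LPT), which is simply LLP applied to the tasks sorted by decreasing cost.

def round_robin(costs, P):                      # A1
    load = [0.0] * P
    for j, c in enumerate(costs):
        load[j % P] += c
    return load

def least_loaded(costs, P):                     # A2: Graham's LLP
    load = [0.0] * P
    for c in costs:
        p = min(range(P), key=lambda i: load[i])
        load[p] += c
    return load

def lpt(costs, P):                              # A5: LLP on sorted tasks
    return least_loaded(sorted(costs, reverse=True), P)

costs = [5.0, 3.0, 8.0, 2.0, 7.0, 4.0, 6.0, 1.0]
for algo in (round_robin, least_loaded, lpt):
    load = algo(costs, 4)
    print(algo.__name__, max(load))             # makespan = max work load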
The results were obtained using a 12GB sample of the Chilean Web taken from the www.todocl.cl search engine. We also used a smaller sample of 2GB to increase the effect of imbalance. Queries were selected at random from a set of 127,000 queries taken from the todocl log, and the iterations for every query finished with the top K = 1024 documents. The experiments were performed on a 32-node cluster with dual processors (2.8 GHz). We used BSP, MPI and PVM realizations of the document and term partitioned inverted files. The scheduling algorithms were executed to generate the queries injected into each processor during the runs. The BSP search engine does not perform any load balancing strategy of its own, so its performance relies on the proper assignment of queries to its processors. In every run we process 10,000 queries in each processor; that is, the total number of queries processed in each experiment reported below is 10,000 P. Thus running times are expected to grow with P due to the O(log P) effect of the interprocessor communication network.
The following figures show running times for the BSP realizations of the document and term partitioned inverted files. These are results from actual executions on the 32-node cluster. We used the BSP simulation at the broker machine side to schedule the queries onto the processing nodes (processors). Below we compare predictions made from the BSP simulations with the observations in the actual BSP search engine.
Figures 1 and 2 show that the strategies A1, A2, A5, A6 and A7 achieve similar performance, while A3 and A4 show poor performance. These are results for an actual query log submitted by real users, in which most queries contain one or two terms. To see whether the same holds for more terms per query, we artificially increased the number of terms by using composite queries obtained by packing together several queries selected uniformly at random from the query log, achieving an average of 9 terms per query. This can represent a case in which queries are expanded by the search engine to include related terms such as synonyms. The results are presented in Figures 3 and 4 and also show that A1, A2, A5, A6 and A7 achieve similar performance. The same is observed in Figures 5 and 6 for composite queries containing 34 terms. Among all the algorithms considered, A1 is the simplest to implement and its efficiency is outstanding.
Notice that the performance of the document partitioned index becomes very poor compared to that of the term partitioned index for queries with a large number of terms. Its performance is highly degraded by the broadcast operations required to copy larger queries to each processor and by their consequences, such as increased disk activity. Also, running times for low query traffic (q = 8) are larger than for high query traffic (q = 32). This is explained by the balls-into-baskets situation described at the beginning of this section: imbalance is larger with low query traffic, and it has a significant impact on running time which the scheduling algorithms cannot solve completely. However, the comparative performance among the scheduling algorithms remains unchanged.
[Figure 1: Document partitioned inverted index. Running time (sec) per scheduling algorithm A1 to A7, for P = 4, 8, 16, 32 processors and traffic q = 8 and q = 32.]

[Figure 2: Term partitioned inverted index. Running time (sec) per scheduling algorithm A1 to A7, for P = 4, 8, 16, 32 processors and traffic q = 8 and q = 32.]

[Figure 3: Document partitioned inverted index (large number of terms per query). Running time (sec) per scheduling algorithm A1 to A7, for P = 4, 8, 16, 32 processors and traffic q = 8 and q = 32.]

[Figure 4: Term partitioned inverted index (large number of terms per query). Running time (sec) per scheduling algorithm A1 to A7, for P = 4, 8, 16, 32 processors and traffic q = 8 and q = 32.]

On the other hand, for a small number of processors all the algorithms behave practically the same. This is because the slackness N/P (with P = 4) is large enough and overall balance is good; in this case scheduling is almost unnecessary. However, for a relatively larger number of processors the slackness is small and the scheduling algorithms have to cope with significant imbalance from the balls-and-baskets problem.
Finally, Figure 7 shows results obtained using strategy A1 for two asynchronous message passing realizations of inverted files, implemented using the PVM and MPI communication libraries. The results are similar to those of the BSP inverted file, which is clear evidence that the BSP cost model is a good predictor of actual performance. We used these asynchronous realizations of inverted files to test our model for predicting, at the broker side, the execution of an associated BSP search engine. The results were an almost perfect prediction of the number of supersteps and iterations per query. Most time differences were below 1%, with some cases in which the difference increased to no more than 5%. This is because under steady state query traffic the G/G/∞ model is a remarkably precise predictor of the average number of queries per superstep.

[Figure 5: Document partitioned inverted index (very large number of terms per query). Running time (sec) per scheduling algorithm A1 to A7, for P = 4, 8, 16, 32 processors and traffic q = 8 and q = 32.]

[Figure 6: Term partitioned inverted index (very large number of terms per query). Running time (sec) per scheduling algorithm A1 to A7, for P = 4, 8, 16, 32 processors and traffic q = 8 and q = 32.]

5. SCHEDULING FOR RESPONSE TIMES

The decomposition of query processing into the so-called round-robin quanta of size K can be exploited at the broker side to schedule ranking and disk access operations in a way that gives more priority to queries requiring few iterations. The case shown in the previous sections was intended to optimize query throughput; in this section we adapt our method to improving the response time of small queries. The difference here is that new queries are injected into the current superstep only when the same number of current queries have finished processing. We describe the method in the context of the document partitioned index.

We assume an index realization in which it is necessary to perform the intersection of the posting lists of the query terms and then rank the top-K results obtained in each processor. In this case disk accesses and intersections are well balanced across processors, and it is necessary to determine the processor (ranker) at which the ranking for a given query is effected. We propose a strategy that both improves the load balance of the ranking process and prevents small queries from being delayed by queries requiring a larger number of disk accesses.
The broker maintains a vocabulary table indicating the total number of disk accesses required to retrieve the posting list of each term. For an observed average arrival rate of Q queries per unit time, the broker can predict the operation of a BSP computer and take decisions about where to schedule the ranking of each query.
During a simulated superstep the broker gives each query a chance to make a disk access to retrieve K pairs (doc id, freq) of its posting lists. For terms appearing in two or more queries, the broker grants only one access and caches the retrieved data, so that this bucket of size K can be used by the other queries without requiring extra disk accesses. Disk accesses are granted in a circular manner among the available queries until a sufficient number of rankings is reached to ensure an efficiency above 0.8 for this process. Thus small queries pass quickly through these rounds of disk accesses. The efficiency is calculated considering that the rankings are scheduled with the “least loaded processor first” load balancing heuristic [4].
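A sketch of one such round of grants (ours; the query and cache structures are hypothetical) with the 0.8 efficiency threshold as the stopping condition:

def grant_accesses(queries, P, cache, rank_load, min_rankings, threshold=0.8):
    # queries: query id -> set of terms still needing disk accesses.
    pending = []
    while queries:
        # Circular round: each live query gets one disk access grant;
        # accesses to terms already fetched hit the cache for free.
        for qid in list(queries):
            terms = queries[qid]
            term = next(iter(terms))
            if term not in cache:
                cache.add(term)        # grant the access, cache the K pairs
            terms.discard(term)        # cached or fetched: no extra access
            if not terms:
                del queries[qid]       # all pieces in: ranking is pending
                p = min(range(P), key=lambda i: rank_load[i])  # LLP [4]
                rank_load[p] += 1
                pending.append((qid, p))
        m = max(rank_load)
        eff = sum(rank_load) / (P * m) if m else 1.0
        if len(pending) >= min_rankings and eff > threshold:
            break                      # enough well-balanced rankings
    return pending

# Example: queries needing different numbers of accesses; wait for Q/4 = 2.
queries = {1: {"a"}, 2: {"a", "b"}, 3: {"c", "d", "e"}}
print(grant_accesses(queries, P=4, cache=set(), rank_load=[0]*4,
                     min_rankings=2))

Note how small queries (fewer pending terms) leave the rounds of disk accesses first, so their rankings become pending, and can be scheduled, earlier.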

[Figure 7: Scheduling A1 on asynchronous inverted files; D stands for the document partitioned index and T for the term partitioned index. Running time (sec) of the BSP, MPI and PVM realizations on the 2GB and 12GB samples, for P = 4, 8, 16, 32 and q = 32.]
As described in the previous section, the efficiency is calculated as the ratio average(X)/maximum(X), where X is the cost of each ranking and the average and maximum values are calculated considering the given distribution of rankings onto the P processors (for simplicity we consider this cost to be linear in the number of terms). In practice it is also convenient to wait both for an efficiency above 0.8 and for a total number of scheduled rankings of (Q/4) P or more, in order to avoid an excessive increase in the number of supersteps.
Figure 8 shows the efficiencies predicted by the broker for different values of Q with P = 32, and Figure 9 shows the actual efficiencies achieved by the respective BSP search engine for the case Q = 32 and P = 4, 8, 16 and 32 processors. This figure also shows the efficiencies of the fetch+intersection and communication tasks. The figures show that the broker is able to predict well the actual load balance of the document partitioned index. The resulting scheduling of queries to rankers leads to an almost perfect load balance for Q = 32 and P = 4, 8, 16 and 32.

Figure 10 shows the average response time for queries requiring from 1 to 19 disk accesses, for P = 32 and Q = 32. The curves Q/2 and Q/4 show the cases in which the broker waits for that number of pending rankings before scheduling them onto the P processors. The curve labeled R shows a case in which the broker waits until all Q queries injected in the superstep have their rankings pending. As expected, the case R gives the same average response time to all queries independently of the number of disk accesses they require. Q/4 provides the best results, confirming that shorter queries are given more preference than in the other two cases.

6. CONCLUDING REMARKS
Surprisingly, in our experiments we have observed that the most sophisticated scheduling algorithms were not able to beat the round-robin and least loaded processor first strategies. In some cases they produced quite poor schedulings of queries to processors, increasing overall running times significantly. The proposed method for costing task assignment decisions provides a framework upon which the different scheduling algorithms can perform their tasks under exactly the same conditions.

[Figure 8: Predicted efficiencies in the round-robin document partitioned index across supersteps, for Q = 32, 16, 8 and 4.]

[Figure 9: Actual efficiencies in the round-robin document partitioned index across supersteps: fetch, communication and ranking tasks.]
The BSP cost model ensures that the cost evaluation is not disconnected from reality. Our method separates the cost of list fetching from that of document ranking, which makes these two important factors affecting the total running time independent of each other in terms of scheduling. This means that the broker does not need to be concerned with the ratio of disk to ranking costs when deciding to which processor a given query should be sent.
For search engines handling typical user queries, the round-robin strategy with the limit R on the number of disk accesses per processor per superstep is sufficient to achieve efficient performance. For example, Figures 11 and 12 show the near-optimal efficiencies achieved across supersteps by the list fetching, communication and document ranking operations in the executions using A1 shown in Figures 1 and 2. This explains the efficient performance produced by this strategy: all critical performance metrics are well balanced. Our results also show that a bad scheduling strategy can indeed degrade performance significantly. The least loaded processor first heuristic should work better for queries with a large number of terms, but we did not observe this for artificially enlarged queries. For large numbers of terms per query we also observed that the badly performing algorithms tend to equal the good ones under high query traffic (q = 32), whereas under low traffic (q = 8) they become even more inefficient than in the case of few terms per query.

[Figure 10: Query cost in the round-robin document partitioned index: average query cost vs. number of fetches per query, for the R, Q/2 and Q/4 strategies.]
Our results also show that the proposed BSP simulation method at the broker side is useful for reducing running times. Notice that this method schedules the ranking operations in a round-robin fashion, but does so considering the limits on the number of disk accesses per processor per superstep. That is, the broker maintains several supersteps of round-robin scheduled rankings, each at a different stage of completeness depending on the new queries arriving at the broker machine and the queries currently under execution in the cluster processors. The upper limits on the number of disk operations ensure that this costly process is kept well balanced, and round-robin scheduling ensures that the efficiency of the also costly ranking operations is kept over 80%. We think that this strategy is considerably more sophisticated than having a broker machine blindly scheduling queries in a circular manner onto the processors. In this sense, this paper proposes a new and low-cost strategy for load balancing distributed inverted files. Our results show that this strategy achieves efficient performance in both bulk-synchronous and message-passing search engines.

Acknowledgment: This paper has been partially funded
by Fondecyt project 1060776.

[Figure 11: Document partitioned index: efficiencies in disk accesses (Fetch), communication (Comm) and ranking across supersteps, with y-axis values from 0 to 1, for P = 4 to 32.]

[Figure 12: Term partitioned index: efficiencies in disk accesses (Fetch), communication (Comm) and ranking across supersteps, with y-axis values from 0 to 1, for P = 4 to 32.]

7. REFERENCES
[1] J. L. Bentley, D. S. Johnson, F. T. Leighton, C. C. McGeoch, and L. A. McGeoch. Some unexpected expected behavior results for bin packing. In STOC, pages 279-288, 1984.
[2] O. J. Boxma. A probabilistic analysis of the LPT scheduling rule. In Performance, pages 475-490, 1984.
[3] E. G. Coffman, Jr., M. R. Garey, and D. S. Johnson. Approximation algorithms for bin packing: a survey. In D. Hochbaum, editor, Approximation Algorithms. PWS, 1997.
[4] R. L. Graham. Bounds on multiprocessing timing anomalies. SIAM Journal on Applied Mathematics, 17(2):416-429, 1969.

[5] D. S. Johnson, A. J. Demers, J. D. Ullman, M. R.
Garey, and R. L. Graham. Worst-case performance
bounds for simple one-dimensional packing
algorithms. SIAM J. Comput., 3(4):299–325, 1974.
[6] F. T. Leighton and P. W. Shor. Tight bounds for
minimax grid matching, with applications to the
average case analysis of algorithms. In STOC, pages
91–103, 1986.
[7] A. Moffat, W. Webber, and J. Zobel. Load balancing for term-distributed parallel retrieval. In Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 348-355, 2006.
[8] D. Skillicorn, J. Hill, and W. McColl. Questions and
answers about BSP. Technical Report PRG-TR-15-96,
Computing Laboratory, Oxford University, 1996. Also
in Journal of Scientific Programming, V.6 N.3, 1997.
[9] L. Valiant. A bridging model for parallel computation.
Comm. ACM, 33:103–111, Aug. 1990.
[10] J. Zobel and A. Moffat. Inverted files for text search
engines. ACM Computing Surveys, 38(2), 2006.