
Parallel CBIR implementations with load balancing algorithms

José L. Bosque a,∗ , Oscar D. Robles a , Luis Pastor a , Angel Rodríguez b

a Dpto. de Informática, Estadística y Telemática, U. Rey Juan Carlos, C. Tulipán, s/n, 28933 Móstoles, Madrid, Spain
b Dept. de Tecnología Fotónica, UPM, Campus de Montegancedo s/n, 28660 Boadilla del Monte, Spain

Received 22 December 2003; received in revised form 23 February 2005; accepted 7 April 2006
Available online 13 June 2006

Abstract

The purpose of content-based information retrieval (CBIR) systems is to retrieve, from real data stored in a database, information that is relevant to a query. When large volumes of data are considered, as is very often the case with databases dealing with multimedia data, it may become necessary to look for parallel solutions in order to store and access the available items efficiently.

Among the range of parallel options available nowadays, clusters stand out as flexible and cost effective solutions, although the fact that they are composed of a number of independent machines makes it easy for them to become heterogeneous. This paper describes a heterogeneous cluster-oriented CBIR implementation. First, the cluster solution is analyzed without load balancing, and then, a new load balancing algorithm for this version of the CBIR system is presented.

The load balancing algorithm described here is dynamic, distributed, global and highly scalable. Nodes are monitored through a load index which allows the estimation of their total amount of workload, as well as the global system state. Load balancing operations between pairs of nodes take place whenever a node finishes its job, resulting in a receptor-triggered scheme which minimizes the system’s communication overhead. Globally, the CBIR cluster implementation together with the load balancing algorithm can cope effectively with varying degrees of heterogeneity within the cluster; the experiments presented within the paper show the validity of the overall strategy.

Together, the CBIR implementation and the load balancing algorithm described in this paper open a new path for high-performance, cost-effective CBIR systems which has not been explored before in the technical literature.

© 2006 Elsevier Inc. All rights reserved.

Keywords: Parallel implementations; CBIR systems; Load balancing algorithms

1. Introduction

The tremendous improvements experienced by computers in aspects such as price, processing power and mass storage capabilities have resulted in an explosion of the amount of information available to people. But this same wealth makes finding the “best” information a very hard task. CBIR¹ systems try to solve this problem by offering mechanisms for selecting the data items which most resemble a specific query among all the available information [12,34], although the complexity of this task depends heavily on the volume of data stored in the system. As usual, parallel solutions can be used to alleviate this problem, given the fact that the search operations present a large degree of data parallelism.

Distributed solutions on clusters offer a good cost/performance ratio to solve this problem, given their excellent scalability, fault tolerance and flexibility attributes [37,6,4]. Also, this architecture allows concurrent access to disks, considered as the main bottleneck in CBIR systems. Although homogeneous clusters could also be considered for these applications, it is difficult to keep the homogeneity of this type of system during its whole life-cycle. Among the factors that affect their configuration stability we can mention the addition of new nodes or substitution of faulty ones, technological evolution factors, and even exploitation aspects such as disk fragmentation, etc. In consequence, clusters present additional challenges, since they can easily become heterogeneous, requiring load distributions that take into consideration each node’s computational features [33]. This way, one of the critical parameters to be fixed in order to keep the efficiency high for these architectures is the workload assigned to each of the cluster nodes. Even though load balancing has received a considerable amount of interest, it is still not definitively solved, particularly for heterogeneous systems [10,18,41,45]. Nevertheless, this problem is central for minimizing the applications’ response time and optimizing the exploitation of resources, avoiding overloading some processors while others are idling.

This paper describes the architecture, implementation and performance achieved by a parallel CBIR system implemented on a heterogeneous cluster that includes load balancing. The flexibility of the architecture presented here allows the dynamic addition or removal of nodes from the cluster between two user queries, achieving reconfigurability, scalability and an appreciable degree of fault tolerance. This approach allows a dynamic management of specific databases that can be incorporated into or removed from the CBIR system depending on the desired user query. The heterogeneity of the system is managed by a new dynamic and distributed load balancing algorithm, introducing a new load index that takes into account the nodes’ computational capabilities and a more accurate measure of their workload. The proposed method introduces a very small system overhead when departing from a reasonably balanced starting point.

As mentioned before, the amount of data to be managed in CBIR systems is so huge nowadays that it is almost mandatory to use parallelism in order to achieve reasonable user response times. Two alternatives were tested in a previous work: a shared-memory multiprocessor and a cluster [6]. Since the cluster implementation has given better results, it seems advisable to introduce load balancing strategies to improve the efficiency in heterogeneous clusters. The selected approach is based on a dynamic, distributed, global and highly scalable load balancing algorithm. A heterogeneous load index based on the number of running tasks and the computational power of each node is defined to determine the state of the nodes. The algorithm automatically turns itself off in global overloading or under-loading situations.

Together, the CBIR implementation and the load balancing algorithm described in this paper open a new path for high-performance, cost-effective CBIR systems which has not been explored before in the technical literature.

The rest of this article is organized as follows: Section 2 presents an overview of parallel CBIR systems and load balancing algorithms. Section 3 presents an analysis of a sequential version of the CBIR algorithm and a brief description of its parallel implementation on a cluster (without load balancing). Section 4 describes the distributed load balancing algorithm applied to the parallel CBIR system and Section 5 details its implementation on a heterogeneous cluster. Section 6 shows the tests performed in order to measure the improvement achieved by the heterogeneous cluster version with load balancing and the results achieved. Finally, Section 7 presents the conclusions and ongoing work.

¹ Content-based information retrieval.

∗ Corresponding author.
E-mail addresses: [email protected] (J.L. Bosque), [email protected] (O.D. Robles), [email protected] (L. Pastor), [email protected] (A. Rodríguez).

J.L. Bosque et al. / J. Parallel Distrib. Comput. 66 (2006) 1062 – 1075

2. Previous work

The technological development experienced during the last 20 years has turned into a spectacular increase in the volume of data managed by information systems. This fact has led to the search for methods to automate the process of extracting structured information from these systems [12,31]. The potential importance of CBIR systems has been reflected in the variety of approaches taken while dealing with different aspects of CBIR systems. The multidisciplinary nature of this problem has often resulted in partial advances that have been integrated later on in new prototypes and commercial systems. For example, it is possible to find research work that takes into consideration man–machine interaction issues [32]; the users’ behavior from a psychological modeling standpoint [27]; multidimensional indexing techniques [5]; multimedia database management system issues [19]; pattern recognition algorithms [17]; multimedia signal processing [39]; object representation and modeling techniques [21]; benchmarks for testing the performance of CBIR systems [16,24]; etc.

In any case, most of the research effort for CBIR systems has been focused on the search for powerful representation techniques for discriminating elements among the global database. Although the data nature is a crucial factor to be taken into consideration, most often the final representation is a feature vector² extracted from the raw data, which somehow reflects its content. While dealing with 2D images, it is possible to find techniques using color, shape, or texture-based primitives. Other techniques use spatial relationships among the image components or a combination of the above-mentioned approaches. For higher-dimensionality input data, it is possible to find proposals dealing with 3D images or video sequences. Nowadays, one of the most promising research lines is to increase the abstraction level of the semantics associated to the primitives managed, representing high-level concepts derived from the images or the multimedia data.

From the computational complexity point of view, CBIR systems are potentially expensive and have user response times growing with the ever-increasing sizes of the databases associated to them. One of the most common approaches followed to reach acceptable price/performance ratios has been to exploit the algorithms’ inherent parallelism at implementation time. However, the novelty of CBIR systems hinders finding references dealing with this aspect. Some contributions that can be cited are Zaki’s compilation [43], and the contributions of Srakaew et al. [37] and Bosque et al. [6]. Another reason that has hindered widespread parallel CBIR system development is that prototype analysis demands a manual image classification stage that limits in practice the number of images used in the tests. Nevertheless, the volume of data managed by current DBs, and obviously those with multimedia information, will demand parallel optimizations for commercial implementations of CBIR systems. In those cases, load balancing operations preventing the coexistence of idling and overloaded processors will be almost required, since total response times are usually considerably improved with the introduction of even simple load balancing approaches.

² Named signature or primitive.

Load balancing techniques can be classified according to different criteria [8]. First, algorithms can be labeled as static or dynamic. Static methods perform workload distribution at compilation time, not taking into consideration the system state variations. Dynamic methods are able to redistribute workload among nodes at run time, depending on changes in the system state. The works of Rajagopalan et al. [28] and Obeloer et al. [25] are agent-based techniques. These are flexible and configurable approaches, but the amount of resources needed for agent implementation is considerably large. Grosu et al. [15] present a very different cooperative approach to the load balancing problem, considering it as a game in which each cluster node is a player and must minimize its job execution time. Banicescu et al. propose a load balancing library for scientific applications on distributed memory architectures. The library integrates dynamic loop scheduling as an object migration policy with the object migration mechanism provided by the data movement and control substrate, which is extended with a mobile object layer [2].

Load balancing algorithms can also be classified as centralized or distributed. In the first case, there is a single central node in charge of keeping the system’s information updated, making decisions and actually performing the load balancing operations. In distributed methods, every node takes part in the load balancing operations; Zaki et al. [44] show that distributed algorithms yield better results than their centralized counterparts. Last, load balancing algorithms can be classified as global or local. In the first case, a global view of the system state is kept [10]. In the second case, nodes are arranged in sets or domains, and distribution decisions are made only within each domain [9,40]. Other approaches mix this taxonomy by combining several features that could be considered mutually exclusive, like the work of Ahmad and Ghafoor [1], where a semidistributed algorithm with a two-level hierarchy is presented; their work focuses on static networks where communication latency is very important and depends on node placement. In this type of network, distributed algorithms may produce instability, scalability and bottleneck problems. The improvement of dynamic network technologies solves these problems with broadcast solutions and very low latencies. The technique proposed by Ahmad and Ghafoor [1], although interesting, is not easily applicable to general, unrestricted distributed systems: it was developed for static network environments, where latency is dependent on node location and where broadcast operations are very costly in terms of system performance. Clusters, which in the present work appear as a very attractive option for CBIR systems in terms of cost/performance ratio, present very different communication features and therefore call for a different approach.

Although a set of projects have been developed to implement CBIR systems on clusters, like the IRMA project for medical images [14] and the DISCOVIR project (distributed content-based visual information retrieval system on peer-to-peer network) [13], none of them include a load balancing algorithm to distribute the workload of the cluster nodes and therefore they cannot manage system heterogeneity.

3. CBIR system description

The experimental work presented in this paper has been performed on a test CBIR system containing information from 29.5 million color pictures. The system provides the user with a data set containing the p images considered most similar to the query one. If the result does not satisfy the user, he/she can choose one of the selected images or enter a new one that presents some kind of similarity with the desired image.

The following sections describe the heart of the CBIR system, where the signature is extracted from each image (a feature vector describing the image content), as well as the processes involved in serving a user’s query. More detailed analysis of the retrieval techniques involved in the CBIR system and the method’s stages from the standpoint of parallel optimization can be found elsewhere [30,29,6], respectively.

3.1. Signature computation

Many different approaches can be used for computing the images’ signatures, as mentioned in Section 2. In the work presented here, a primitive that represents the color information of the original image at different resolution levels has been selected. To achieve a multiresolution representation, a wavelet transform is first applied to the image [22,11].
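As a rough illustration of what a multiresolution representation looks like, the sketch below builds a pyramid by repeatedly averaging 2×2 blocks of one color channel. This corresponds, up to a scaling factor, to the approximation band of a Haar wavelet transform. The choice of Haar, the function names and the list-of-lists image layout are assumptions made for the example; the text above does not state which wavelet or which data layout the system uses.

```python
def halve_resolution(channel):
    """Average each 2x2 block of a 2D channel (Haar approximation band, up to scaling)."""
    rows, cols = len(channel), len(channel[0])
    return [
        [
            (channel[r][c] + channel[r][c + 1]
             + channel[r + 1][c] + channel[r + 1][c + 1]) / 4.0
            for c in range(0, cols - 1, 2)
        ]
        for r in range(0, rows - 1, 2)
    ]


def multiresolution(channel, levels):
    """Return [original, half-size, quarter-size, ...] for the given number of levels."""
    pyramid = [channel]
    for _ in range(levels):
        pyramid.append(halve_resolution(pyramid[-1]))
    return pyramid


# Hypothetical 4x4 channel with values 0..15.
image = [[float(r * 4 + c) for c in range(4)] for r in range(4)]
pyramid = multiresolution(image, levels=2)
sizes = [len(level) for level in pyramid]
```

A signature could then be assembled from the coarser levels of such a pyramid for each color channel, although the exact composition used by the system is described in the cited references rather than here.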

3.2. Analysis of the sequential CBIR algorithm

The search for images contained in a CBIR system can be broken down into the following stages:

(1) Input/query image introduction: The user first selects a 128×128 pixel bidimensional image to be used as a search reference. Then the system computes its signature as described above. The whole process can be efficiently implemented using an O(i_s) order algorithm, i_s being the image’s size [38]. This stage does not require high computational resources since the system deals with just one image.

(2) Query and DB image’s signature comparison and sorting: The signature obtained in the previous stage is compared with all of the DB images’ signatures using a Euclidean distance-based metric. After this process, the identifiers of the p images most similar to the input image are extracted, ranked by their similarity. Even though this process of signature comparison, selection and ranking is not very demanding from the computational point of view, it has to be performed with all of the images within the DB.

(3) Results display: The following step is to assemble a mosaic made up of the selected p images which has to be presented to the user as the search result (see Fig. 1).

(4) Query image update: If the user considers the search result to be unsatisfactory, he/she may select one of the displayed images as a new input and then return to the first stage.
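Stage (2) above can be sketched as follows. This is a minimal illustration, not the system's code: the signatures are assumed to be plain lists of floats, the image identifiers are hypothetical strings, and `heapq.nsmallest` is used for the top-p selection in place of whatever selection and sorting routine the actual system uses.

```python
import heapq
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length signatures."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def top_p(query_sig, db_sigs, p):
    """Return the p (distance, image_id) pairs closest to the query, best first.

    db_sigs maps a (hypothetical) image identifier to its signature.
    """
    scored = ((euclidean(query_sig, sig), image_id)
              for image_id, sig in db_sigs.items())
    return heapq.nsmallest(p, scored)


# Toy database with two-component signatures.
db = {"img_a": [0.0, 0.0], "img_b": [3.0, 4.0], "img_c": [1.0, 1.0]}
result = top_p([0.0, 0.0], db, p=2)
ranked_ids = [image_id for _, image_id in result]
```

Every signature in the database is visited once, which is why this stage dominates the cost of a query when the database is large.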


Fig. 1. Visual result of a query.

Upon observing the operations involved, it is possible to notice that the comparison and sorting stage involves a much larger computational load than the others. Luckily, the exploitation of data parallelism can be done just by dividing the workload among n independent nodes, since there are no dependencies. This can be accomplished by distributing off-line the CBIR images’ signatures across the processing nodes. Then, each node can compare the query image’s signature with every locally available signature. In order to also ease the storage requirements, it is possible to distribute images, signatures and computation over all of the n available nodes.

3.3. Parallel implementations without load balancing

3.3.1. Global strategy

A remarkable feature of the signature comparison and sorting stage is the problem’s fine granularity: it is possible to perform an efficient data-oriented parallelization by combining the signature comparison and sorting stages, and distributing among the different nodes only the data needed to perform this stage, which are the signatures of the DB images assigned to each node as well as a scalar defining the total number of signatures to be returned, p. It has to be noted that the amount of communications among the corresponding processes is very small, since only the input image’s signature and the p identifiers from the most similar images which have been found at each node, together with their corresponding similarity measures, have to be exchanged among the processes involved, as we will see below.

The programmed optimization strategy is based on a farm model, in which a master process distributes the data to be dealt with upon a set of slave processes which analyze the data and return the partial results to the master once they have finished their computations. Since this approach makes it possible to maintain a large degree of data handling locality, it is well suited for distributed memory multiprocessors with message passing communication. Further advantages of this solution are its good price/performance ratio and its high level of scalability whenever the number of images stored in the database is increased. In our case, the following solution has been adopted:

(1) The master process computes the signature of the input image and broadcasts it to the n slave processes.

(2) The slave processes then proceed to compare the signature of the input image with the signatures of the images assigned to their corresponding process node. Once each comparison has been performed, a check is then carried out to ascertain whether the result obtained is one of the best p images and, should that be the case, it is then incorporated into the set, which is repeatedly sorted using a bubble sorting algorithm.

(3) The slave processes forward the p image identifiers and similarity measurements to the master process after comparing and selecting the p images which are most similar within each process node.

(4) The master process collects the similarity results obtained from each of the n process nodes and sorts the n · p similarity results, truncating the sort so as to include only the best p.

(5) Finally, the master process requests the process nodes that contain the previously selected images to forward them so that they may be presented to the user and, once available, proceeds to compose a mosaic that is then displayed to the user.

[Fig. 2 diagram: the MASTER sends the query image signature and the selected images’ identifiers (if any) to SLAVE 1, SLAVE 2, ..., SLAVE n; each slave returns its list of the p most similar images and the requested images to show.]

Fig. 2. Process communication in the cluster implementation without load balancing.

Fig. 2 represents a schematic diagram of the communication between the processes involved in the unbalanced system. It must be noticed that each node of the heterogeneous cluster runs two processes: a master to attend the user queries and a slave to provide the local results achieved by each process node to the master process of the cluster node where the query has been generated. This situation is very similar to that found on a grid.

3.3.2. MPI cluster implementation

The application has been programmed using the MPI libraries as communication primitives between the master and slave processes. MPI has been selected given that it currently constitutes a standard for message passing communications on parallel architectures, offering a good degree of portability among parallel platforms [23]. The MPI implementation used is LAM 6.5.6, from the Laboratory for Scientific Computing of the University of Notre Dame, a free distribution of MPI [36].

The pseudo-code corresponding to the implementation of the master and slave processes is shown below.

Master
  loop
    Request an image to the user
    Compute its signature
    Forward the signature to each of the n slave processes using the MPI_BCAST (broadcast) primitive
    Receive the results of the n slave processes using the MPI_RECV (receive) primitive
    Sort the partial n · p comparisons selecting the top p
    Request the p most similar images to the slave processes where the corresponding images are stored
    Receive the p most similar images from the nodes containing them using the MPI_RECV primitive
    Compose the mosaic to be presented to the user
  end loop

Slave j (M being the number of images stored in process node j)
  loop
    Receive the signature of the query image forwarded from the master using the MPI_BCAST (reception from a previous broadcast) primitive
    Initialize the P_j set which shall contain the p best results of the comparisons
    for k = 1 to M do
      Find the signature of the image k
      Compare the query signature with that of the current image obtaining the similarity measurement ms_jk
      if ms_jk ∈ P_j (that is, ms_jk ranks among the p best results so far) then
        Eliminate the worst result of P_j
        Incorporate the result corresponding to the image k to P_j
        Sort P_j using a bubble sorting algorithm
      end if
    end for
    Forward P_j to the master
    if the master requests images to compose the mosaic then
      Forward the requested images
    end if
  end loop
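The farm model described by the pseudo-code can be mimicked in a single process to check its logic. The sketch below is an assumption-laden stand-in: ordinary function calls replace MPI_BCAST and MPI_RECV, each "slave" is just a dictionary partition of hypothetical image identifiers and signatures, and `heapq.nsmallest` replaces the bubble sort. It shows that merging the n local top-p lists at the master gives the same answer as searching the whole database.

```python
import heapq
import math


def distance(a, b):
    """Euclidean distance between two signatures."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def slave_search(query_sig, local_sigs, p):
    """Slave step: best p (distance, image_id) pairs among the local signatures."""
    scored = ((distance(query_sig, sig), image_id)
              for image_id, sig in local_sigs.items())
    return heapq.nsmallest(p, scored)


def master_query(query_sig, partitions, p):
    """Master step: gather the n partial lists and keep the global best p."""
    # The list comprehension stands in for the broadcast and the n receives.
    partial = [slave_search(query_sig, part, p) for part in partitions]
    merged = [item for lst in partial for item in lst]  # at most n * p entries
    return heapq.nsmallest(p, merged)


# Two hypothetical slave partitions.
partitions = [
    {"a": [0.0, 0.0], "b": [5.0, 5.0]},
    {"c": [1.0, 0.0], "d": [9.0, 9.0]},
]
best = master_query([0.0, 0.0], partitions, p=2)

# Reference: search over the union of all partitions.
whole_db = {k: v for part in partitions for k, v in part.items()}
reference = slave_search([0.0, 0.0], whole_db, 2)
```

Because each slave returns at most p candidates, the master never has to merge more than n · p entries, which is what keeps the communication volume small.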


The size of the data corresponding to each one of the p best results transferred from every slave to the master is around 336 bytes. Therefore, each slave process transfers 336 · p bytes per query. For example, for p = 20 and n = 25, the traffic involved in the response will be 336 · 20 · 25 = 168,000 bytes, that is, less than 165 kB.³

³ These figures do not take into consideration either the data corresponding to the images presented to the user or the overheads originated by the communication primitives, although the latter could be considered negligible.

4. Description of the load balancing algorithm

A dynamic, distributed, global and highly scalable load balancing algorithm has been developed for CBIR applications and tested with the CBIR parallel application previously described. A more detailed description of the load balancing algorithm can also be found in [7]. A load index based on the number of running tasks and the computational power of each node is used to determine the nodes’ state, which is exchanged among all of the cluster nodes at regular time intervals. The initiation rule is receiver-triggered and based on workload thresholds. Finally, the distribution rule takes into account the heterogeneous nature of the cluster nodes as well as the communication time needed for the workload transmission in order to divide the amount of workload between a pair of nodes in every load balancing operation. These ideas are detailed in the following sections.

4.1. State rule

The load balancing algorithm is based on a load index which estimates how loaded a node is in comparison to the rest of the nodes that compose the cluster. Many approaches can be taken to compute the load index. As in any estimation process, it is necessary to find a trade-off between accuracy and cost, since keeping frequently updated node rankings according to their workload might be costly.

The index is based on the number of tasks in the run-queue of each CPU [20]. These data are exchanged among all of the nodes in the cluster to update the global state information. Moreover, each node takes into account the following information about the rest of the cluster nodes:

• Cluster heterogeneity: each node can have a different computational power P_i, so this factor is an important parameter to take into account for computing the load index. It is defined as the inverse of the time taken by node i to process a single signature.
• Total amount of workload for each node: it is evaluated when the application begins its execution and it is updated if there are any changes in a node.
• Percentage of the workload performed by each node, W_i: it is defined as a function of the total workload, the computational power and the number of tasks in this node.
• Period of time from the last update, D, and total execution time, T.

Therefore, the updates of the number of tasks are performed as

$$N_{ave} = \frac{(N_{last} \cdot T) + (N_{cur} \cdot D)}{T + D}, \qquad (1)$$

where N_cur is the number of current running tasks in the node, N_last is the average of the number of tasks running from the last update, T is the total execution time of the N_last tasks considered, and D is the interval of time since the last update. This expression gives the average number of tasks of the node during the execution time of the application. So, the percentage of workload processed in each node, W_i, is evaluated as

[Eq. (2): not recoverable from the extracted text; the only surviving fragment is W · N_ave.]

4.2. Information rule

Given that the load balancing approach described here is dynamic, distributed and global, every node in the system needs updated information about how loaded the remaining system nodes are [42]. The selected information rule is periodic: each node broadcasts its own load index to the rest of the nodes at specific time instants. A periodic rule is necessary because each node has to compute the amount of workload processed by the rest of the cluster nodes, based on the average number of tasks per node. To evaluate the average number of tasks, the information must be updated periodically, which makes other information rules such as event-driven or on-demand rules unsuitable.

4.3. Initiation rule

The initiation rule determines when load balancing operations are performed. It is a receiver-initiated rule, where load balancing operations involve pairs of idling and heavily loaded nodes: whenever a processor finishes its assigned workload, it looks for a busy node and asks it to share part of its remaining workload. Since each node keeps information about the amount of pending work of the remaining nodes, the selection of busy nodes is simple.

The initiation rule described above minimizes the number of load balancing operations, reducing the algorithm overhead. Also, all the operations improve the system performance, because the total response times of the nodes involved in the load balancing operation are equalized, provided that there are no additional changes in their state and they are not involved in other load balancing operations.

4.4. Load balancing operation

The load balancing operation is broken down into three phases: first it is necessary to find an adequate node which will provide part of its workload (localization rule). Then, the amount of workload to be transferred has to be computed (distribution rule). Finally, the workload has to be actually transferred.
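The running average of Eq. (1) in Section 4.1 can be sketched as follows. The function mirrors the formula directly; the small `TaskAverage` helper is an assumption added for illustration, in which, after each periodic update, the new average becomes N_last and the elapsed interval D is added to T. The text does not spell out this bookkeeping, so it should be read as one plausible interpretation.

```python
def update_average_tasks(n_last, t, n_cur, d):
    """Eq. (1): time-weighted average of the number of running tasks.

    n_last: average number of tasks up to the last update
    t:      total execution time covered by n_last
    n_cur:  number of tasks currently running
    d:      time elapsed since the last update
    """
    if t < 0 or d < 0 or t + d == 0:
        raise ValueError("T and D must be non-negative and not both zero")
    return (n_last * t + n_cur * d) / (t + d)


class TaskAverage:
    """Hypothetical helper that applies Eq. (1) at every periodic update."""

    def __init__(self):
        self.n_ave = 0.0  # plays the role of N_last at the next update
        self.t = 0.0      # accumulated execution time T

    def sample(self, n_cur, d):
        self.n_ave = update_average_tasks(self.n_ave, self.t, n_cur, d)
        self.t += d  # assumption: T grows by D after each update
        return self.n_ave


single = update_average_tasks(n_last=2, t=30, n_cur=4, d=10)

avg = TaskAverage()
first = avg.sample(n_cur=2, d=10)
second = avg.sample(n_cur=4, d=10)
```

With N_last = 2 over T = 30 and N_cur = 4 over D = 10, the formula gives (60 + 40) / 40 = 2.5, so a short burst of extra tasks only moves the average in proportion to how long it lasted.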


4.4.1. Localization rule

Whenever a node finishes its workload, it looks for a sender node to start a load balancing operation. The receiver node checks the state of the rest of the cluster nodes and computes a node list, ordered by the amount of pending work. To select the sender node, the receiver checks its own position in the list and selects the node which is in the symmetric position; for example, if nodes are ranked according to their workload, the least loaded node will look for the most loaded; the second least loaded node will look for the second most loaded, and so on. In consequence, each pair of sender–receiver nodes will have between both of them a similar amount of workload.

Apart from being very simple to implement, this approach gives good results since whenever a node finishes its work it is placed at one end of the list, selecting a heavily loaded node (at the other end of the list). This way, the selection of the sender node is very coherent: the underloaded nodes take workload from the overloaded nodes, while the nodes in middle positions in the list do not receive a load balancing request (since it is very unlikely that a node placed in an intermediate position starts a load balancing operation).

Additionally, if several nodes are looking for a sender at the same time, it is unlikely that they address their requests to the same sender. This way, situations where a loaded node receives several load balancing petitions and the rest of the loaded nodes do not receive any are avoided. Finally, this approach is not time consuming, because the nodes always have up-to-date state information to make their own list.

Whenever a node receives a load balancing request, it can accept or reject it. In order to accept it, the sender node should have a minimum amount of work left. Otherwise, the sender node is close to completing its workload and the cost of the load balancing operation can be higher than finishing the remaining workload locally. In that case, the receiver node will select another node from the list using the same procedure until an adequate node is found or the end of the list is reached.

4.4.2. Distribution rule

The distribution rule computes the amount of work that has to be moved from the sender to the receiver node. An appropriate rule should take into consideration the relative nodes’ capabilities and availabilities, so that they finish processing their jobs at the same time (provided that no additional operations change their processing conditions). The communication time to transfer the workload among the nodes is also taken into consideration because the receiver node cannot run the newly assigned task until it receives the corresponding load, which introduces an additional delay. The global equilibrium is obtained through successive operations between pairs of nodes.

[...] by a large amount of external workload. Both parameters will be included in the nodes’ actual computational power, Pact_i, which is obtained as

[Eq. (3): not recoverable from the extracted text.]

This is a multi-phase application with two different phases: comparison and sorting. Whenever the load balancing operation is finished, the sender node has to finish the comparison phase with the remaining workload. Then, it must sort all the processed workload. The receiver should compare and sort the new workload. Additionally, the communication time has to be taken into account, because the receiver cannot continue the processing until it receives the new workload. Then, the distribution rule is determined by the following expressions:

$$T_s = \frac{W - W_r}{\mathit{Pact}_s} + \frac{W - W_r}{\mathit{Pact}_s}, \qquad T_r = \frac{W_r}{\mathit{Pact}_r} + \frac{W_r}{P_c} + \frac{W_r}{\mathit{Pact}_r}, \qquad W = W_s + W_r, \qquad (4)$$

where T_s and T_r are the response times of the sender and receiver processors, counted from the moment the load balancing operation is finished. W is the total workload of the sender which has not yet been processed, W_s is the remaining workload in the sender node after the load balancing operation and W_r the workload sent to the receiver. Pact_s and Pact_r are the sender and receiver current computational power. Finally, P_c is the communication power expressed in units of workload per second. The communication power is obtained by computing offline the number of signatures that can be exchanged between two of the cluster nodes per second.

This model takes into consideration two assumptions:

• The computational power for a node is the same in both the comparison and sorting phases.
• The response time in both phases and the communication time are linear with respect to the workload.

Requiring that both nodes finish at the same time (T_s = T_r) and solving these expressions, the amount of both sender and receiver workload can be computed as

$$W_r = \frac{2\,W\,\mathit{Pact}_r\,P_c}{2\,P_c\,\mathit{Pact}_s + 2\,\mathit{Pact}_r\,P_c + \mathit{Pact}_s\,\mathit{Pact}_r}, \qquad W_s = W - W_r. \qquad (5)$$

The values for both workloads W_r and W_s take into account the heterogeneity of the nodes, their current state, the communication times and the two different phases of the application.

In consequence, the load balancing algorithm described is dynamic, being able to redistribute workload among nodes at run time, depending on how the system state changes. It is also

distributed, because every node takes part in the load balancing The proposed distribution rule is based on two parameters:

operations. And finally, it is global, because a global view of the the number of running tasks NT i and the computational power

system state is always kept. The following section describes the P i of the nodes which take part in the operation. This reflects the

implementation of this algorithm on a CBIR system running in fact that the contribution of a powerful node might be hampered

a heterogeneous cluster.
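As an illustration, the distribution rule of Eq. (5) can be sketched in a few lines of Python. This is only a sketch under the two linearity assumptions stated above; the function and parameter names (`split_workload`, `response_times`, `p_sender`, `p_receiver`, `p_comm`) are hypothetical and do not come from the original implementation. The second function encodes one reading of the response-time model, in which each node spends one pass on comparison and one on sorting, and it is used here only to check that both nodes would finish at the same time.

```python
def split_workload(w_total, p_sender, p_receiver, p_comm):
    """Return (w_sender, w_receiver) following Eq. (5).

    w_total    -- W, sender workload not yet processed (e.g. signatures)
    p_sender   -- current computational power of the sender (workload units/s)
    p_receiver -- current computational power of the receiver (workload units/s)
    p_comm     -- communication power P_c (workload units/s)
    """
    if min(p_sender, p_receiver, p_comm) <= 0:
        raise ValueError("all powers must be positive")
    denom = (2 * p_comm * p_sender
             + 2 * p_receiver * p_comm
             + p_sender * p_receiver)
    w_receiver = 2 * w_total * p_receiver * p_comm / denom
    return w_total - w_receiver, w_receiver


def response_times(w_sender, w_receiver, p_sender, p_receiver, p_comm):
    """Assumed response-time model: two linear phases (compare, sort) per node,
    plus a linear transfer time for the workload sent to the receiver."""
    t_sender = 2 * w_sender / p_sender
    t_receiver = w_receiver / p_comm + 2 * w_receiver / p_receiver
    return t_sender, t_receiver


if __name__ == "__main__":
    # Hypothetical values, chosen only for illustration.
    w_s, w_r = split_workload(1000.0, 100.0, 200.0, 400.0)
    t_s, t_r = response_times(w_s, w_r, 100.0, 200.0, 400.0)
    print(f"W_s={w_s:.1f} W_r={w_r:.1f} T_s={t_s:.2f} T_r={t_r:.2f}")
```

With these hypothetical values (1000 pending signatures, a sender power of 100, a receiver power of 200 and a communication power of 400 workload units per second), the rule sends roughly 615 signatures to the receiver and both nodes finish after about 7.7 s.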

J.L. Bosque et al. / J. Parallel Distrib. Comput. 66 (2006) 1062 – 1075

5. Distributed load balancing implementation on a heterogeneous cluster

5.1. Process structure of the load balance implementation

Two replicated processes are distributed among each one of the cluster nodes:

(1) Load daemon: This process implements both the state and the information rules.
(2) Distribution daemon: It collects requests from slave nodes demanding workload and proceeds with the transfer.

Fig. 3 shows a decomposition of all the actions that must be carried out when a slave node finishes its local workload and triggers the initiation rule. First, it demands new load from the distribution process, which obtains the demanded load and sends it to the slave node so that it can continue its computations. The following section describes the structure of the group of processes and their functions.

5.2. Groups of processes

As mentioned in Section 3.3.2, communication and synchronization between processes is based on MPI. A structure of groups of processes based on communicators [23,26,35] has been implemented, where the groups make it possible to establish communication structures between processes and to use global communication functions over subsets of processes. This way, each type of process belongs to its own group:

• MPI_COMM_MS: this group is composed of the master process and all of the slave processes.
• MPI_COMM_DIST: this group is formed by the distribution processes.
• MPI_COMM_LOAD: this group is composed of the load daemon of each of the nodes.

The group concept is the most natural way to implement this process scheme, because most often the messages transmitted involve processes that belong to the same group. Fig. 4 presents this communication hierarchy.
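The grouping described above can be sketched, without any MPI dependency, as a mapping from a process rank to one of the three groups. The rank layout used here is purely an assumption for illustration (the paper does not state it): rank 0 is the master, and every node then contributes three consecutive ranks for its slave, its distribution process and its load daemon. In an MPI program, a value such as the one returned by the hypothetical `group_color` function would typically be passed as the `color` argument of `MPI_Comm_split` on `MPI_COMM_WORLD` to obtain one communicator per group.

```python
# Group identifiers, named after the communicators in the text.
GROUP_MS, GROUP_DIST, GROUP_LOAD = 0, 1, 2


def group_color(world_rank):
    """Map a world rank to its group under the assumed rank layout.

    Assumption: rank 0 is the master; ranks 1, 2, 3 are the slave,
    distribution process and load daemon of the first node, and so on.
    """
    if world_rank < 0:
        raise ValueError("rank must be non-negative")
    if world_rank == 0:
        return GROUP_MS  # the master shares a group with the slaves
    return (world_rank - 1) % 3  # 0: slave, 1: distribution, 2: load daemon


def build_groups(world_size):
    """Return the list of world ranks that belong to each group."""
    groups = {GROUP_MS: [], GROUP_DIST: [], GROUP_LOAD: []}
    for rank in range(world_size):
        groups[group_color(rank)].append(rank)
    return groups


if __name__ == "__main__":
    # One master plus two nodes with three processes each.
    print(build_groups(7))
```

Keeping one communicator per process type means that collective operations, such as a broadcast of state information among load daemons, only involve the processes that actually need the message.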

Fig. 3. General overview of the whole load balancing algorithm.

5.3. Load daemon

The main function of this process is to compute the local load index, to send this information to the load daemons of the other nodes and to transmit all of the information available to the local distribution process, whenever it is required to do so. Also, it is in charge of initializing and managing a table that stores the state of the other nodes. The table stores the following information for each of the nodes: computational power, average number of active tasks while the application is running, percentage of completed work, time of the last update, total execution time with some load level and number of signatures to be processed.

At predetermined fixed intervals, the process evaluates the load index of the node where it is running and sends the state information to the other nodes. The rest of the time it remains blocked waiting for messages from other processes; its functionality depends on the received messages. Table 1 summarizes the messages involved with the load daemon and the tasks associated with each one.

Fig. 4. Group scheme. MPI_COMM_WORLD contains the groups MPI_COMM_MS, MPI_COMM_DIST and MPI_COMM_LOAD.

Table 1
Messages and associated functions of the load daemon

Message identifier    Associated tasks
0                     Task information
1                     The distribution process has finished and demands the identifier of a transmitter node
2                     The distribution process informs about the number of signatures delivered to another node
3                     The distribution process notifies that there are no available nodes to transfer load
4                     The distribution process shows the number of signatures obtained from another node
5                     Another load daemon informs about its new number of signatures because of their transfer to another node
6                     Another load daemon reports about the new number of signatures assigned to
7                     Another load daemon tells that there are no nodes to transfer load

5.4. Distribution process

The main function of the load distribution process is to implement the initiation rule and the load balancing operation. Whenever a particular slave finishes its local work, the distribution process is alerted; it then evaluates the initiation rule, finds a candidate node, establishes the negotiation and delivers the load to the slave. On the other hand, if the node receives a load balancing request, the distribution rule must be triggered and the appropriate workload is sent to the remote node.

6. Analysis of the CBIR implementation with load balancing in a heterogeneous cluster

A set of experiments has been performed for testing the behavior of the parallel CBIR system implemented on the heterogeneous cluster using the above distributed load balancing algorithm. To compare the results achieved by the parallel CBIR system with and without the distributed load balancing algorithm, the total response time of the CBIR system, with and without load balance, has been measured. Additionally, two classical load balancing algorithms have been implemented as reference: the random algorithm [3] and the Probin algorithm [9]. The random algorithm is one of the simplest distributed load balancing algorithms because each node makes decisions based on local information. A node is considered a sender if the queue length of the CPU exceeds a predetermined and constant threshold. The receiver is selected randomly because the nodes do not share any information about the system status. The Probin algorithm is a diffusion-based algorithm, where the information is locally exchanged defining communication domains between neighbor nodes. Several levels of coordination can be established varying the domains' size.

The experiments have been executed on a heterogeneous cluster composed of 25 nodes, linked through a 100 MB/s Ethernet. Each of the process nodes features 4 GB of storage capacity in an IDE hard disk linked through DMA with 16.6 MB/s transfer speed. The PC's operating system is Linux v. 2.2.12. The heterogeneity is determined by the hard disk features. It has to be noted that this component determines each node's response, as shown in Fig. 5, since in this CBIR system (as in many others), I/O operations are predominant with respect to CPU operations.

Fig. 5. Computational power of the cluster nodes, measured in workload units/second.

Two different tests have been performed for measuring the improvement achieved by the heterogeneous cluster implementation using the distributed load balancing algorithm. The first one analyzes search operations within a 30 million image database using an underloaded system. Since none of the nodes are overloaded, this test studies how heterogeneity affects the system performance, and how this performance is improved using the load balancing algorithm. The second experiment adds some artificial external tasks to a node in order to test how well the load balancing algorithm copes with the situation of strong load unbalance. In this case the underloaded nodes should wait for the overloaded one to finish the application. The load balancing algorithm must remove the unloaded nodes' idle time.
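The random reference algorithm described earlier in this section relies only on local information, which can be sketched as follows. This is a minimal illustration of the two decisions stated in the text (threshold test and random receiver choice), not the implementation used in the experiments; the names `is_sender` and `pick_receiver`, and the idea of passing an explicit random generator, are assumptions made for this sketch.

```python
import random


def is_sender(queue_length, threshold):
    """A node acts as sender when its CPU queue length exceeds the fixed threshold."""
    return queue_length > threshold


def pick_receiver(own_id, node_ids, rng):
    """Choose a receiver uniformly at random among the other nodes.

    No state information about the candidates is used, which mirrors the
    fact that nodes do not share status information in this scheme.
    Returns None when there is no other node to choose from.
    """
    candidates = [n for n in node_ids if n != own_id]
    if not candidates:
        return None
    return rng.choice(candidates)


if __name__ == "__main__":
    rng = random.Random()  # hypothetical: any source of randomness would do
    nodes = list(range(5))
    if is_sender(queue_length=4, threshold=2):
        print("node 0 offloads work to node", pick_receiver(0, nodes, rng))
```

Because the receiver is chosen blindly, a transfer may well target a node that is itself loaded, which is one plausible reason for less consistent results than a scheme that keeps global state.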


Fig. 6. Results without external tasks (speedup with respect to the algorithm without load balancing): (a) response time and (b) speedup.

Table 2
Response time without external workload, measured in seconds

No. nodes    Without alg.    Random alg.    Probin alg.    Proposed alg.
5            14 362

Table 3
Speedup without external tasks

No. nodes    Speedup random alg.    Speedup Probin alg.    Speedup proposed alg.

Table 4
Standard deviation of the cluster nodes without external tasks

No. nodes    Without alg.    Random alg.    Probin alg.    Proposed alg.

6.1. Tests considering cluster heterogeneity and load balancing overhead

The main purposes of these tests were to detect the amount of overhead introduced by the load balancing algorithm, and how the algorithm can manage the system heterogeneity. The tests were performed on clusters with 5, 10, 15, 20 and 25 slave nodes plus a master node, in order to evaluate the algorithm scalability. The results are presented in Table 2 and in Fig. 6.

Table 2 shows that the response times are always shorter with some load balancing algorithm, which means that the overhead introduced by the algorithm is smaller than the improvements achieved by using any of the implemented load balancing algorithms. From these results two main considerations can be pointed out:

• The tested load balancing algorithms always improved the response times, by between 10% and 15%. The best results were achieved by the proposed algorithm.
• The proposed approach proved to be more stable, while the results obtained with the other algorithms were less consistent.

Fig. 6(b) and Table 3 present the speedup of these algorithms, where the speedup is measured with respect to the execution without load balancing.

An interesting parameter for estimating the methods' behavior is the standard deviation of the response times of the different cluster nodes, shown in Table 4 and in Fig. 7. The standard deviation of the nodes' response times is a measurement directly related to idling times of nodes waiting for other nodes to finish their assignments.

Fig. 7. Standard deviation without external tasks.
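The three quantities used in this analysis (speedup, relative improvement of the response time and reduction of the standard deviation of the nodes' response times) can be computed as in the following sketch. The function names and all numeric values are hypothetical and serve only as an illustration; in particular, the use of the population standard deviation (`statistics.pstdev`) is an assumption, since the text does not say which estimator was used.

```python
from statistics import pstdev


def speedup(time_without, time_with):
    """Speedup with respect to the run without load balancing."""
    return time_without / time_with


def improvement_percent(time_without, time_with):
    """Relative reduction of the response time, in percent."""
    return 100.0 * (time_without - time_with) / time_without


def stdev_reduction_percent(node_times_without, node_times_with):
    """Relative reduction of the spread of the per-node response times, in percent."""
    before = pstdev(node_times_without)
    after = pstdev(node_times_with)
    return 100.0 * (before - after) / before


if __name__ == "__main__":
    # Hypothetical per-node response times in seconds, not measured data.
    without_lb = [90.0, 100.0, 110.0, 120.0]
    with_lb = [100.0, 101.0, 102.0, 103.0]
    # The overall response time is set by the slowest node.
    print(speedup(max(without_lb), max(with_lb)))
    print(improvement_percent(max(without_lb), max(with_lb)))
    print(stdev_reduction_percent(without_lb, with_lb))
```

A small standard deviation means that all nodes finish at nearly the same time, so little time is lost waiting for the slowest node.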


The load balancing algorithm presented here decreases the standard deviation, equilibrating the response times, while with the random algorithm a slight reduction is achieved but with the Probin algorithm this value is erratic, depending highly on the probed nodes. Finally, the proposed algorithm achieves the best values of all the load balancing algorithms tested, with reductions of the standard deviation ranging from 86.45% with 5 nodes to 93.56% with 25 nodes with respect to the response times without a load balancing algorithm.

6.2. Results with system overload

For these experiments the system is slightly overloaded, having one of the nodes heavily loaded. The goal of this test is to measure the algorithm's ability to distribute the work of the loaded node among the remaining cluster nodes, without affecting the system performance. The tests were performed on a heterogeneous cluster with 5, 10, 15, 20, and 25 slave nodes and a master node, using a database of 12.5 million images. Table 5 and Fig. 8 present the results achieved in this experiment.

Fig. 8. Results with external tasks: (a) response time and (b) speedup.

Table 5
Response time with external tasks on a node, measured in seconds

No. nodes    Without alg.    Random alg.    Probin alg.    Proposed alg.
5            10 548

For these tests, the differences obtained between executions with or without load balancing were very strong. The reductions in response times range from 45% with 5 nodes to 38% with 25 nodes. As the number of nodes increases, the differences in response times decrease. Again, the best results are achieved with the proposed algorithm. Table 6 and Fig. 8(b) show the speedup achieved in these tests.

Table 6
Speedup with external tasks

No. nodes    Speedup random alg.    Speedup Probin alg.    Speedup proposed alg.

Finally, Table 7 and Fig. 9 present the standard deviation results.

Table 7
Standard deviation with external tasks

No. nodes    Without alg.    Random alg.    Probin alg.    Proposed alg.

Fig. 9. Standard deviation with external tasks.

In these tests, the reduction of the standard deviation ranged from 90% to 95%. An interesting point to be remarked is the lack of consistency of the results provided by the random algorithm. This method provides only marginal improvements with respect to the algorithm without load balancing for more than 10–15 nodes.

The Probin algorithm has a better behavior for less than 10 nodes although the relative improvements drop dramatically


Fig. 10. Response time considering a loaded node for a 25 node cluster (response time in seconds versus number of tasks per node, without algorithm and with the proposed algorithm).

Table 8
Response time increasing the number of external tasks, measured in seconds for a 25 node cluster

No. tasks    Without alg.    Proposed alg.

dencies. This allows efficient cluster implementations of CBIR systems since it is a parallel architecture that meets very well the application needs [6].

Improvements on the cluster implementation have been made by introducing a dynamic, distributed, global and scalable load balancing algorithm which has been designed specifically for the parallel CBIR application implemented on a heterogeneous cluster. An additional important feature is that the load balancing algorithm takes into account the system heterogeneity originated both by the different node computational attributes and by external factors such as the presence of external tasks.

The experiments presented here show that the amount of overhead introduced by this method is very small. In fact, this overhead is hidden by the improvements achieved whenever any degree of system heterogeneity shows up, a common situation in grid systems. All these experiments have also shown that using the load balancing algorithm results in large execution time reductions and in a more uniform distribution of the node's response times, which can be detected through strong reductions in the response times' standard deviation.

As it has been shown in the experiments presented here, another important aspect that should be stressed is the algorithm
