23.3.2 Emergence of Cloud-Based Scientific Computational Applications

We have surveyed several projects and initiatives that resulted in cloud-based applications targeted primarily at the scientific domain. They are summarized in Table 23.1. Our main emphasis is to characterize and categorize each application by its domain, parallelism model, base platform, and cloud scalability in terms of support for interconnection and integration with other types of cloud.

There are other cloud-based scientific applications that are not listed in the table. However, the listed items should be representative enough to convey the idea that cloud computing has been reaching multidisciplinary fields. Our table consists of sample applications in bioinformatics, computational biology, climatology, and computer science, with the majority categorized as bioinformatics applications. We were surprised to find such interest among researchers and practitioners outside the computer science field in solving their complex problems with the cloud computing paradigm. Success stories from those pioneers may naturally incite the curiosity of other parties to implement the paradigm, thus bringing cloud computing to broader use, especially in the scientific field.

Table 23.1 Some cloud-based projects for scientific purposes

| Projects | Domain | Purpose | Parallelism model | Platform | Cloud scalability |
|---|---|---|---|---|---|
| CloudBLAST (Matsunaga, Tsugawa, & Fortes, 2008) | Bioinformatics | Protein sequence analysis | MapReduce using BLAST as Mapper | Hadoop | A private cloud in virtual network |
| CloudBurst (Schatz, 2009) | Bioinformatics | Genomic mapping | MapReduce | Hadoop | A private or public cloud (EC2) |
| CrossBow (Langmead, Schatz, Lin, Pop, & Salzberg, 2009) | Bioinformatics | DNA sequence alignment and SNP detection | MapReduce using Bowtie and SOAPsnp | Hadoop | A private or public cloud (EC2) |
| BioVLAB-Microarray (Yang, Choi, Choi, & Pierce, 2008) | Computational biology | Microarray data analysis | Workflow | XBaya | A public cloud (EC2) |
| Cloud-based Classifiers (Moretti, Steinhaeuser, Thain, & Chawla, 2008) | Computer science | Distributed data | | | A private cloud |
| Satellite data processing (Golpayegani & Halem, 2009) | Geoinformatics | Gridding remote sensing data | MapReduce | Hadoop | A private cloud |
| MPMD climate application (Evangelinos & Hill) | | Atmosphere-ocean models | | LAM, MPICH, and GridMPI | A public cloud (EC2) |

A closer look at Table 23.1 shows that the MapReduce implementation in Hadoop has recently been the dominant approach to parallelization. In our opinion, this is driven by the ample information, documentation, samples, case studies, and enterprise support available for MapReduce. The MapReduce model itself was initially proposed and used by Google (Dean & Ghemawat, 2004), and Hadoop provides an open-source Java implementation of it. The model consists of a map function, written by the user, that takes a set of input key/value pairs and generates a set of intermediate key/value pairs, and a reduce function, also written by the user, that merges all intermediate values associated with the same intermediate key. Referring back to Fig. 23.3, the map function corresponds to the delegating and partial outputting phases, whereas the reduce function corresponds to the sorting and merging phase.

In the Hadoop implementation of MapReduce, the map and reduce functions can be replaced by any executable program. This feature is supported by a utility named Hadoop Streaming, which ships with the default Hadoop package. For example, CloudBLAST uses the Basic Local Alignment Search Tool (BLAST), through the NCBI BLAST 2 implementation, as the map function. This program takes the place of the map function by finding regions of local similarity between nucleotide or protein sequences. Another example is CrossBow, which reuses Bowtie as a substitute for the map function to perform fast and memory-efficient alignment of short reads to mammalian genomes. The reduce function is likewise replaced by an invocation of SOAPsnp, whose task is to provide single nucleotide polymorphism (SNP) calls from short-read alignment data. The ability to reuse legacy software may also have contributed to the wider adoption of the MapReduce model and Hadoop for parallel programming.
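To make the map and reduce roles described above more concrete, the following is a minimal in-memory sketch in Python. It is not code from CloudBLAST, CrossBow, or Hadoop itself: the k-mer counting task, the constant `K`, and the names `mapper`, `reducer`, and `run_job` are all assumptions chosen for illustration. The only convention borrowed from Hadoop Streaming is that records are exchanged as text lines with a tab between key and value, and that the reducer sees its input sorted by key.

```python
from itertools import groupby

K = 3  # hypothetical k-mer length, chosen only for illustration


def mapper(lines):
    """Map phase: emit one 'kmer<TAB>1' line per k-mer found in each read."""
    for line in lines:
        read = line.strip().upper()
        for i in range(len(read) - K + 1):
            yield f"{read[i:i + K]}\t1"


def reducer(sorted_lines):
    """Reduce phase: sum the values of consecutive lines that share a key."""
    pairs = (line.split("\t") for line in sorted_lines)
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{key}\t{sum(int(value) for _, value in group)}"


def run_job(lines):
    """Stand-in for the framework: map, sort by key, then reduce."""
    intermediate = sorted(mapper(lines))
    return list(reducer(intermediate))


if __name__ == "__main__":
    reads = ["ACGTAC", "GTACGT"]  # toy input, not real sequencing data
    for record in run_job(reads):
        print(record)
```

In an actual Hadoop Streaming job, the two functions would instead live in separate executables that read lines from standard input and write lines to standard output, and the framework would perform the sort between them. That separation is what allows an existing tool to stand in for either phase.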

Table 23.1 also shows that academia has begun using cloud infrastructure and platforms offered by enterprises. Amazon, with its variety of cloud services, has been a major testbed cited in the corresponding research works. Amazon EC2 and S3 are used as infrastructure cloud services, while Amazon Elastic MapReduce is used as a cloud platform for MapReduce-based parallel applications. This finding leads us to a further study of the feasibility of cloud computing for scientific purposes through our own experience in building and using a compute cloud.