23.3.2 Emergence of Cloud-Based Scientific Computational Applications
We surveyed several projects and initiatives that produced cloud-based applications targeted primarily at the scientific domain. Table 23.1 summarizes them. Our main emphasis is to characterize each application by its domain, parallelism model, base platform, and cloud scalability, where scalability refers to support for interconnection and integration with other types of cloud.
Other cloud-based scientific applications exist that are not listed in the table. However, the listed items should be representative enough to convey that cloud computing has been reaching multidisciplinary fields. The table contains sample applications in bioinformatics, computational biology, climatology, and computer science, with the majority categorized as bioinformatics applications. We were surprised by the interest of researchers and practitioners outside computer science in solving their complex problems with the cloud computing paradigm. Success stories from those pioneers may naturally spark the curiosity of other parties to implement the paradigm, thus bringing cloud computing into broader use, especially in the scientific field.
Table 23.1 Some cloud-based projects for scientific purposes
| Projects | Domain | Purpose | Parallelism model | Platform | Cloud scalability |
|---|---|---|---|---|---|
| CloudBLAST (Matsunaga, Tsugawa, & Fortes, 2008) | Bioinformatics | Protein sequence analysis | MapReduce using BLAST as Mapper | Hadoop | A private cloud in a virtual network |
| CloudBurst (Schatz, 2009) | Bioinformatics | Genomic mapping | MapReduce | Hadoop | A private or public cloud (EC2) |
| CrossBow (Langmead, Schatz, Lin, Pop, & Salzberg, 2009) | Bioinformatics | DNA sequence alignment and SNP detection | MapReduce using Bowtie and SOAPsnp | Hadoop | A private or public cloud (EC2) |
| BioVLAB-Microarray (Yang, Choi, Choi, & Pierce, 2008) | Computational biology | Microarray data analysis | Workflow | XBaya | A public cloud (EC2) |
| Cloud-based Classifiers (Moretti, Steinhaeuser, Thain, & Chawla, 2008) | Computer science | Distributed data |  |  | A private cloud |
| Satellite data processing (Golpayegani & Halem, 2009) | Geoinformatics | Gridding remote sensing data | MapReduce | Hadoop | A private cloud |
| MPMD climate application (Evangelinos & Hill, …) |  | atmosphere-ocean models |  | LAM, MPICH, and GridMPI | A public cloud (EC2) |
Observing Table 23.1 more closely, we find that the MapReduce implementation in Hadoop has recently been the dominant approach to parallelization. In our opinion, this is driven by the availability of adequate information, documentation, samples, case studies, and enterprise support for MapReduce. The MapReduce model itself was initially proposed and used by Google (Dean & Ghemawat, 2004), and Hadoop provides an open-source Java implementation of it. The model consists of a map function, written by the user, that takes a set of input key/value pairs and generates a set of intermediate key/value pairs, and a reduce function, also written by the user, that merges all intermediate values associated with the same intermediate key. Referring back to Fig. 23.3, the map function corresponds to the delegating and partial outputting phases, whereas the reduce function corresponds to the sorting and merging phase. In the Hadoop implementation of MapReduce, the map and reduce functions can be replaced by any executable program. This feature is supported by a utility named Hadoop streaming, which ships with the default Hadoop package. For example, CloudBLAST uses the Basic Local Alignment Search Tool (BLAST), through the NCBI BLAST 2 implementation, as the map function. This program takes the place of the map function by finding regions of local similarity between nucleotide or protein sequences. Another example is CrossBow, which reuses Bowtie as the map function substitute to enable fast and memory-efficient alignment of short reads to mammalian genomes. The reduce function is likewise replaced by an invocation of SOAPsnp, whose task is to provide single nucleotide polymorphism (SNP) calls from short-read alignment data. The ability to reuse legacy software may also have contributed to the wider adoption of the MapReduce model and Hadoop for parallel programming.
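
The map and reduce roles described above can be illustrated with a minimal Python sketch. It is a toy example under stated assumptions, not the method used by CloudBLAST or CrossBow: the k-mer counting task, the constant K, and the names mapper, reducer, and run_local are all hypothetical. The sketch follows the line-oriented convention commonly used with Hadoop streaming, in which each stage reads text lines and emits tab-separated key/value lines, and it emulates the framework's sort step locally.

```python
from itertools import groupby

K = 3  # hypothetical k-mer length, chosen only for this toy example


def mapper(lines):
    """Map: each input line (one sequence) -> intermediate 'kmer<TAB>1' lines."""
    for line in lines:
        seq = line.strip()
        for i in range(len(seq) - K + 1):
            yield f"{seq[i:i + K]}\t1"


def reducer(sorted_lines):
    """Reduce: merge all intermediate values that share the same key."""
    pairs = (line.split("\t") for line in sorted_lines)
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{key}\t{sum(int(value) for _, value in group)}"


def run_local(lines):
    """Emulate `mapper | sort | reducer` in one process (assumption: in a
    real Hadoop job the framework performs the sort and shuffle itself)."""
    return list(reducer(sorted(mapper(lines))))


if __name__ == "__main__":
    for out in run_local(["ACGTAC", "GTACG"]):
        print(out)
```

In a streaming job the two functions would instead live in separate executables that read standard input and write standard output, which is what allows an existing program to stand in for either stage.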
We can also see in Table 23.1 that academia has begun using the cloud infrastructure and platforms offered by enterprises. Amazon, with its variety of cloud services, has been a major testbed cited in the corresponding research. Amazon EC2 and S3 are used as infrastructure cloud services, while Amazon Elastic MapReduce is used as a cloud platform for MapReduce-based parallel applications. This finding leads us to a further study of the feasibility of cloud computing for scientific purposes, drawing on our own experience in building and using a compute cloud.