
  Big Data Storage, Sharing, and Security


OTHER BOOKS BY FEI HU

Associate Professor
Department of Electrical and Computer Engineering
The University of Alabama

Cognitive Radio Networks
with Yang Xiao
ISBN 978-1-4200-6420-9

Wireless Sensor Networks: Principles and Practice
with Xiaojun Cao
ISBN 978-1-4200-9215-8

Socio-Technical Networks: Science and Engineering Design
with Ali Mostashari and Jiang Xie
ISBN 978-1-4398-0980-8

Intelligent Sensor Networks: The Integration of Sensor Networks, Signal Processing and Machine Learning
with Qi Hao
ISBN 978-1-4398-9281-7

Network Innovation through OpenFlow and SDN: Principles and Design
ISBN 978-1-4665-7209-6

Cyber-Physical Systems: Integrated Computing and Engineering Design
ISBN 978-1-4665-7700-8

Multimedia over Cognitive Radio Networks: Algorithms, Protocols, and Experiments
with Sunil Kumar
ISBN 978-1-4822-1485-7

Wireless Network Performance Enhancement via Directional Antennas: Models, Protocols, and Systems
with John D. Matyjas and Sunil Kumar
ISBN 978-1-4987-0753-4

Security and Privacy in Internet of Things (IoTs): Models, Algorithms, and Implementations
ISBN 978-1-4987-2318-3

Spectrum Sharing in Wireless Networks: Fairness, Efficiency, and Security
with John D. Matyjas and Sunil Kumar
ISBN 978-1-4987-2635-1

Big Data: Storage, Sharing, and Security
ISBN 978-1-4987-3486-8

Opportunities in 5G Networks: A Research and Development Perspective
ISBN 978-1-4987-3954-2

  Big Data Storage, Sharing, and Security

Edited by Fei Hu

CRC Press
Taylor & Francis Group
6000 Broken Sound Parkway NW, Suite 300
Boca Raton, FL 33487-2742

© 2016 by Taylor & Francis Group, LLC
CRC Press is an imprint of Taylor & Francis Group, an Informa business

No claim to original U.S. Government works
Version Date: 20160226
International Standard Book Number-13: 978-1-4987-3487-5 (eBook - PDF)

This book contains information obtained from authentic and highly regarded sources. Reasonable efforts have been made to publish reliable data and information, but the author and publisher cannot assume responsibility for the validity of all materials or the consequences of their use. The authors and publishers have attempted to trace the copyright holders of all material reproduced in this publication and apologize to copyright holders if permission to publish in this form has not been obtained. If any copyright material has not been acknowledged please write and let us know so we may rectify in any future reprint.

Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced, transmitted, or utilized in any form by any electronic, mechanical, or other means, now known or hereafter invented, including photocopying, microfilming, and recording, or in any information storage or retrieval system, without written permission from the publishers.

  

For permission to photocopy or use material electronically from this work, please access www.copyright.com (http://www.copyright.com/) or contact the Copyright Clearance Center, Inc. (CCC), 222 Rosewood Drive, Danvers, MA 01923, 978-750-8400. CCC is a not-for-profit organization that provides licenses and registration for a variety of users. For organizations that have been granted a photocopy license by the CCC, a separate system of payment has been arranged.

  

Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and are used only for identification and explanation without intent to infringe.

Visit the Taylor & Francis Web site at http://www.taylorandfrancis.com and the CRC Press Web site at http://www.crcpress.com

For Gloria, Edwin & Edward (twins) . . .

  

Preface

Big Data is one of the hottest topics today because of the large-scale data generation and distribution in computing products. It is tightly integrated with other cutting-edge networking technologies, including cloud computing, social networks, Internet of Things, and sensor networks. The characteristics of Big Data may be summarized as four Vs: volume (great volume), variety (various modalities), velocity (rapid generation), and value (huge value but very low density). Many countries are paying close attention to this area. For example, in March 2012 the Obama Administration announced a US$200 million investment to launch the "Big Data Research and Development Plan," the second major scientific and technological development initiative after the "Information Highway" initiative in 1993.

Because Big Data is a relatively new field, many challenging issues need to be addressed today: (1) Storage—How do we aggregate heterogeneous types of data from numerous sources, and then use fast database management technology to store the Big Data? (2) Sharing—How do we use cloud computing to share Big Data among large groups of people? (3) Security—How do we protect the privacy of Big Data during network sharing? This book covers these 3S designs through detailed descriptions of the concepts and implementations.

This book is unlike other books on the subject. Because Big Data is such a new field, very few books cover its implementation. The few similar books already published mostly address basic concepts and societal impacts, and are thus not suitable for R&D people. Instead, this book discusses Big Data management from an R&D perspective.

Targeted Audiences: (1) Industry—company engineers can use this book as a reference for the design of Big Data processing and protection; many practical design principles are covered in the chapters. (2) Academia—researchers can gain much knowledge on the latest research topics in this area, and graduate students can resolve many issues by reading the chapters. They will gain a good understanding of the status and trends of Big Data management.

Book Architecture: The book consists of two sections.

Section I. Big Data Management: In this section we cover the following important topics:

Spatial management: In many applications and scientific studies, there is a growing need to manage spatial entities and their topological, geometric, or geographic properties. Analyzing such large amounts of spatial data to derive values and guide decision making has become essential.

Data transfer: A content delivery network with large data centers located around the world requires Big Data transfer for data migration, updates, and backups. As cloud computing becomes common, the capacity of the data centers and both the intranetwork and internetwork of those data centers increase.

Data processing: Dealing with "Big Data" problems requires a radical change in the philosophy of the organization of information processing. Primarily, the Big Data approach has to modify the underlying computational model to manage uncertainty in the access to information items in a huge nebulous environment.

Section II. Big Data Security: Security is a critical aspect once Big Data is integrated with cloud computing. We provide technical details on the following aspects:

Security: To achieve secure, available, and reliable cloud-based Big Data services, we not only present the state of the art of such services, but also a novel architecture to manage reliability, availability, and performance for accessing Big Data services running on the cloud.

Privacy: We examine privacy issues in the context of Big Data and the potential data mining of that data. Issues are analyzed based on the emerging unique characterizations associated with Big Data: the Big Data Lake, "thing" data, the quantified self, repurposed data, and the generation of knowledge from unstructured communication data such as Twitter Tweets. Each of those sets of emerging issues is analyzed in detail for its potential impact on privacy.

Accountability: Accountability of user data access on a specific application helps in monitoring, controlling, and assessing data usage by the user for the application. Data loss is the main source of leaked information that may compromise the privacy of individuals and/or organizations. Therefore, the natural question is, "how can data leakages be controlled and detected?" The simple answer would be audit logs and effective measures of data usage.

The chapters provide detailed technical descriptions of the models, algorithms, and implementations of Big Data management and security. There are also accurate descriptions of the state of the art and future development trends of Big Data applications. Each chapter includes references for readers' further study.

  Thank you for reading this book. We believe that it will help you with the scientific research and engineering design of Big Data systems. We welcome your feedback.

  Fei Hu

  University of Alabama, Tuscaloosa, Alabama

  

Dr. Fei Hu is currently a professor in the Department of Electrical and Computer Engineering at the University of Alabama, Tuscaloosa, Alabama. He earned his PhD degrees at Tongji University (Shanghai, China) in the field of signal processing (in 1999) and at Clarkson University (New York) in electrical and computer engineering (in 2002). He has published over 200 journal/conference papers and books. Dr. Hu's research has been supported by the U.S. National Science Foundation, Cisco, Sprint, and other sources. His research expertise can be summarized as 3S—Security, Signals, Sensors: (1) Security—This deals with overcoming different cyber attacks in a complex wireless or wired network. His current research is focused on cyber-physical system security and medical security issues. (2) Signals—This mainly refers to intelligent signal processing, that is, using machine learning algorithms to process sensing signals in a smart way to extract patterns (i.e., pattern recognition). (3) Sensors—This includes microsensor design and wireless sensor networking issues.

  


Contributors

Emad Abd-Elrahman, RST Department, Telecom SudParis, Evry, France
Ablimit Aji, Analytics Lab, Database Systems, Hewlett Packard Labs, Palo Alto, California
Usamah AlGemili, Department of Computer Science, George Washington University, Washington, DC
Adi Alhudhaif, Department of Computer Science, Prince Sattam bin Abdulaziz University, Al-Kharj, Saudi Arabia
Faisal Alsaby, Department of Computer Science, George Washington University, Washington, DC
Nadia Bennani, INSA-Lyon, LIRIS Department, University of Lyon, Lyon, France
Simon Y. Berkovich, COMStar Computing Technology Institute and Department of Computer Science, George Washington University, Washington, DC
Nevil Brownlee, Department of Computer Science, University of Auckland, Auckland, New Zealand
Lionel Brunie, INSA-Lyon, LIRIS Department, University of Lyon, Lyon, France
Thomas Cerqueus, INSA-Lyon, LIRIS Department, University of Lyon, Lyon, France
Ernesto Damiani, Department of Computer Technology, University of Milan, Milan, Italy
Manik Lal Das, Dhirubhai Ambani Institute of Information and Communication Technology, Gujarat, India
Vijay Gadepally, MIT Lincoln Laboratory, Lexington, Massachusetts
Ibrahim A. Gomaa, Computer and Systems Department, National Telecommunication Institute, Cairo, Egypt
Fouad Amine Guenane, ENST Telecom ParisTech, Paris, France
Benjamin Habegger, INSA-Lyon, LIRIS Department, University of Lyon, Lyon, France
Ariel Hamlin, MIT Lincoln Laboratory, Lexington, Massachusetts
Omar Hasan, INSA-Lyon, LIRIS Department, University of Lyon, Lyon, France
Yuh-Jong Hu, Department of Computer Science, National Chengchi University, Taipei, Taiwan
Rasheed Hussain, Department of Computer Science and Engineering, Hanyang University, Ansan, South Korea
Jeremy Kepner, MIT Lincoln Laboratory, Lexington, Massachusetts
Donghyun Kim, Department of Mathematics and Physics, North Carolina Central University, Durham, North Carolina
Harald Kosch, Department of Informatics and Mathematics, University of Passau, Passau, Germany
Duoduo Liao, COMStar Computing Technology Institute, Washington, DC
Dongxi Liu, CSIRO, Clayton South, Victoria, Australia
Wen-Yu Liu, Department of Computer Science, National Chengchi University, Taipei, Taiwan
Jianguo Lu, School of Computer Science, University of Windsor, Ontario, Canada
Aniket Mahanti, Department of Computer Science, University of Auckland, Auckland, New Zealand
Michele Nogueira, Department of Informatics, NR2—Federal University of Parana, Curitiba, Brazil
Heekuck Oh, Department of Computer Science and Engineering, Hanyang University, Ansan, South Korea
Daniel E. O'Leary, University of Southern California, Los Angeles, California
Albert Reuther, MIT Lincoln Laboratory, Lexington, Massachusetts
Jun Sakuma, Department of Computer Science, University of Tsukuba, Tsukuba, Japan
Nabil Schear, MIT Lincoln Laboratory, Lexington, Massachusetts
Ahmed Serhrouchni, ENST Telecom ParisTech, Paris, France
Emily Shen, MIT Lincoln Laboratory, Lexington, Massachusetts
Junggab Son, Department of Mathematics and Physics, North Carolina Central University, Durham, North Carolina
Mayank Varia, Boston University, Boston, Massachusetts
Dong Wang, Department of Computer Science and Engineering, University of Notre Dame, Notre Dame, Indiana
Fusheng Wang, Department of Biomedical Informatics and Department of Computer Science, Stony Brook University, Stony Brook, New York
Shenlu Wang, School of Computer Science and Engineering, University of New South Wales, Sydney, Australia
Yan Wang, School of Information, Central University of Finance and Economics, Beijing, China
J. Gerard Wolff, CognitionResearch.org, Menai Bridge, United Kingdom
Sophia Yakoubov, MIT Lincoln Laboratory, Lexington, Massachusetts
Maryam Yammahi, Department of Computer Science, George Washington University, Washington, DC, and College of Information Technology, United Arab Emirates University, Al Ain, United Arab Emirates
Arkady Yerukhimovich, MIT Lincoln Laboratory, Lexington, Massachusetts
Se-young Yu, Department of Computer Science, University of Auckland, Auckland, New Zealand
John Zic, CSIRO, Clayton South, Victoria, Australia

  


Chapter 1

Challenges and Approaches in Spatial Big Data Management

Ablimit Aji
Fusheng Wang

CONTENTS

1.1 Introduction
1.2 Big Spatial Data and Applications
    1.2.1 Spatial analytics for derived scientific data
    1.2.2 GIS and social media applications
1.3 Challenges and Requirements
1.4 Spatial Big Data Systems and Techniques
    1.4.1 MapReduce-based spatial query processing
    1.4.2 Effective spatial data partitioning
    1.4.3 Query co-processing with GPU and CPU
        1.4.3.1 Task assignment
        1.4.3.2 Effects of task granularity
1.5 Discussion and Conclusion
Acknowledgments
References

1.1 Introduction

  Advancements in computer technology and the rapid growth of the Internet have brought many changes to society. More recently, the Big Data paradigm has disrupted many industries ranging from agriculture to retail business, and fundamentally changed how businesses operate and make decisions at large. The rise of Big Data can be attributed to two main reasons:


First, high volumes of data are generated and collected from devices. The rapid improvement of high-resolution data acquisition technologies and sensor networks has enabled us to capture large amounts of data at an unprecedented scale and rate. For example, the GeoEye-1 satellite has the highest resolution of any commercial imaging system and is able to collect images with a ground resolution of 0.41 m in the panchromatic or black-and-white mode [1]; the Sloan Digital Sky Survey (SDSS), with a rate of about 200 GB per night, has amassed more than 140 TB of information [5]; and modern medical imaging scanners can capture micro-anatomical tissue details at billion-pixel resolution [13].

Second, traces of human activity and crowd-sourcing efforts are facilitated by the Internet. The proliferation of cost-effective and ubiquitous positioning technologies, mobile devices, and sensors has enabled us to collect massive amounts of spatial information on human and wildlife activity. For example, Foursquare—a popular local search and discovery service—allows users to check in at more than 60 million venues, and so far has recorded more than 6 billion check-ins [2]. Driven by the business potential, more and more businesses are providing services that are location-aware. At the same time, the Internet has made remote collaboration so easy that a crowd can now even generate a free mapping of the world autonomously. OpenStreetMap [3] is a large collaborative mapping project, generated by users around the globe, with more than two million registered users as of this writing.

In many applications and scientific studies, there is a growing need to manage spatial entities and their topological, geometric, or geographic properties. Analyzing such large amounts of spatial data to derive values and guide decision making has become essential to business success and scientific progress. For example, location-based social networks (LBSNs) are utilizing large amounts of user location information to provide geo-marketing and recommendation services. Social scientists rely on such data to study the dynamics of social systems and understand human behavior. Epidemiologists combine such spatial data with public health data to study the patterns of disease outbreak and spread. In all those domains, a spatial Big Data analytics infrastructure is a key enabler.

Over the last decade, the Big Data technology stack and its software ecosystem have evolved to cope with the most common use cases. However, modern data-intensive spatial applications require a different approach to handle the unique requirements of spatial Big Data.

In the rest of this chapter, we first provide examples of data-intensive spatial applications and discuss the unique challenges that are common to them. Then, we present major research efforts, data-intensive computing techniques, and software systems that are intended to address these challenges. Lastly, we conclude the chapter with a discussion of the future outlook of this area.

1.2 Big Spatial Data and Applications

The rapid growth of spatial data is driven not only by conventional applications, but also by emerging scientific applications and large Internet services that have become data-intensive and compute-intensive.

1.2.1 Spatial analytics for derived scientific data

With the rapid improvement of data acquisition technologies such as high-resolution tissue slide scanners and remote sensing instruments, it has become more efficient to capture extremely large volumes of scientific image data. Digitized pathology imaging, for example, has become an emerging field in the past decade, where examination of high-resolution images of tissue specimens enables novel, more effective ways of screening for disease, classifying disease states, understanding its progression, and evaluating the efficacy of therapeutic strategies. In the clinical environment, medical professionals have been relying on the manual judgment of pathologists—a process inherently subject to human bias—to diagnose and understand the disease condition.

  Today, in silico pathology image analysis offers a means of rapidly carrying out quantitative, reproducible measurements of micro-anatomical features in high-resolution pathology images and large image datasets. Medical professionals and researchers can use computer algorithms to calculate the distribution of certain cell types, and perform associative analysis with other data such as patient genetic composition and clinical treatment.

Figure 1.1 shows a protocol for an in silico pathology image analysis pipeline. From left to right, the sub-figures represent glass slides, high-resolution image scanning, whole slide images, and automated image analysis. The first three steps are data acquisition processes that are mostly done in a pathology laboratory environment, and the final step is where the computerized analysis is performed. In the image analysis step, regions of micro-anatomical objects (millions per image) such as nuclei and cells are computed through image segmentation algorithms and represented with their boundaries, and image features are extracted from these objects. Exploring the results of such analysis involves complex queries such as spatial cross-matching, overlay of multiple sets of spatial objects, spatial proximity computations between objects, and queries for global spatial pattern discovery. These queries often involve billions of spatial objects and heavy geometric computations.

Figure 1.1: Derived spatial data in pathology image analysis.

Scientific simulation also generates large amounts of spatial data. Scientists often use models to simulate natural phenomena and analyze the simulation process and data. For example, earth science uses simulation models to help predict ground motion during earthquakes. Ground motion is modeled with an octree-based hexahedral mesh, using soil density as input. Simulation tools calculate the propagation of seismic waves through the Earth by approximating the solution to the wave equation at each mesh node. During each time step, for each node in the mesh, the simulator calculates the node velocity in each spatial direction and records that information to primary storage. The simulation result is a spatiotemporal earthquake dataset describing the ground velocity response [6]. As the scale of the experiment increases, the resulting dataset also increases, and scientists often struggle to query and manage such large amounts of spatiotemporal data in an efficient and cost-effective manner.

1.2.2 GIS and social media applications

Volunteered geographic information (VGI) has further enriched the geographic information system (GIS) world with massive amounts of user-generated geographical and social data. VGI is a special case of the larger Internet phenomenon—user-generated content—in the GIS domain. Everyday Internet users can provide, modify, and share geographical data using interactive online services such as OpenStreetMap [3], Wikimapia, GoogleMap, GoogleEarth, and Microsoft's Virtual Earth. The spatial information needs to be constantly analyzed and corroborated to track changes and understand the current status. Most often, a spatial database system is used to perform such analysis.

Recently, the explosive growth of social media applications has contributed massive amounts of user-generated geographic information in the form of tweets, status updates, check-ins, Waze, and traffic reports. Furthermore, if such geospatial information is not available, automated geo-tagging/geocoding tools can infer and assign an approximate location to those contents. Analysis of such large amounts of data has implications for many applications—both commercial and academic. In [11], the authors used geospatial information to investigate the relationship between the geographic location of protestors attending demonstrations in the 2013 Vinegar protests in Brazil and the geographic location of users who tweeted about the protests. Other examples are location-based targeted advertising [24] and recommendation [18]. Those online services and GIS systems are backed by conventional spatial database systems that are optimized for different application requirements.

1.3 Challenges and Requirements

  Modern data-intensive spatial analytics applications are different from conventional applica- tions in several aspects. They involve the following:

Large volumes of multidimensional data: Conventional warehousing applications deal with data generated from business transactions. As a result, the underlying data (such as numbers and strings) tend to be relatively simple and flat. This is not the case for spatial applications, which deal with massive amounts of geometric shapes and spatial objects. For example, a typical whole slide pathology image contains more than 20 billion pixels, millions of objects, and 100 million derived image features. A single study may involve thousands of images analyzed with dozens of algorithms—with varying parameters—to generate many different result sets to be compared and consolidated, at the scale of tens of terabytes. A moderate-size healthcare operation can routinely generate thousands of whole slide images per day, leading to petabytes of analytical results per year. A single 3D pathology image could come from a thousand slices, take 1 TB of storage, and contain several million to ten million derived 3D surface objects.

High computation complexity: Most spatial queries involve multidimensional geometric computations that are often compute-intensive. While spatial filtering through minimum bounding rectangles (MBRs) can be accelerated through spatial access methods, spatial refinements such as polygon intersection verification are highly expensive operations. For example, spatial join queries such as spatial cross-matching or spatial overlay can be very expensive to process, mainly due to the polynomial complexity of many geometric computation methods. Such compute-intensive geometric computation, combined with the large volumes of Big Data, requires a high-performance solution. (A minimal sketch of this filter-and-refine pattern appears after this list.)

Complex spatial queries: Spatial queries are complex to express in current spatial data analytics systems. Most scientific researchers and spatial application developers are often interested in running queries that involve complex spatial relationships, such as nearest neighbor queries and spatial pattern queries. Such queries are not well supported in current spatial database systems, and users are frequently forced to write database user-defined functions to perform the required operations. SQL—Structured Query Language—has been widely adopted and has become the de facto standard for querying the data. While most spatial queries can be expressed in SQL, due to the structural differences in the programming model, efficient SQL-based spatial queries are often hard to write and require considerable optimization effort.
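The following Python sketch illustrates the filter-and-refine pattern described above: a cheap MBR overlap test prunes candidate pairs before the expensive exact geometric test. It is a minimal illustration under our own assumptions, not code from any system discussed in this chapter; the exact_intersects argument stands in for a real (and costly) polygon intersection routine.

    # Minimal sketch of filter-and-refine spatial join processing.
    # Polygons are lists of (x, y) vertices; MBRs are (xmin, ymin, xmax, ymax).

    def mbr(polygon):
        """Minimum bounding rectangle of a polygon."""
        xs = [x for x, _ in polygon]
        ys = [y for _, y in polygon]
        return (min(xs), min(ys), max(xs), max(ys))

    def mbr_overlap(a, b):
        """Filter step: constant-time rectangle overlap test."""
        return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

    def filter_and_refine_join(set_a, set_b, exact_intersects):
        """MBR filter first; only surviving pairs pay for the exact
        geometric refinement (the expensive, polynomial-complexity step)."""
        results = []
        for p in set_a:
            for q in set_b:
                if mbr_overlap(mbr(p), mbr(q)):    # cheap; may admit false positives
                    if exact_intersects(p, q):     # expensive; hypothetical routine
                        results.append((p, q))
        return results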

A major requirement for spatial analytics systems is fast query response. Scientific research, and analytics in general, is an iterative and exploratory process in which large amounts of data can be generated quickly for initial prototyping and validation. This requires a scalable architecture that can query spatial data on a large scale. Another requirement is to support queries on a cost-effective architecture such as commodity clusters or cloud environments. Meanwhile, scientific researchers and application developers often prefer expressive query languages over programming APIs, without worrying about how the queries are translated, optimized, and executed. With the rapid improvement of instrument resolutions, the increased accuracy of data analysis methods, and the massive scale of observed data, complex spatial queries have become increasingly compute-intensive and data-intensive.

1.4 Spatial Big Data Systems and Techniques

Two mainstream approaches for large-scale data analysis are parallel database systems [15] and MapReduce-based systems [14]. Both approaches share certain common design elements: they employ a shared-nothing architecture [25] and are deployed on a cluster of independent nodes connected via a high-speed interconnecting network; both achieve parallelism by partitioning the data and processing the query in parallel on each partition.

However, the parallel database approach has major limitations for managing and querying spatial data at massive scale. Parallel database management systems (DBMSs) tend to reduce the I/O bottleneck by partitioning data across multiple parallel disks and are not optimized for computation-intensive operations such as spatial and geometric computations. The partitioned parallel DBMS architecture often lacks effective spatial partitioning to balance data and task loads across database partitions. While it is possible to induce a spatial partitioning—fixed grid tiling, for example—and map such a partitioning to a one-dimensional attribute distribution key, such an approach fails to handle boundary objects for accurate query processing. Scaling out spatial queries through a parallel database infrastructure is possible but costly, and such an approach is explored in [27,28]. More recently, Spark [31] has emerged as a new data processing framework for handling iterative and interactive workloads.

Due to both the computational intensity and the data intensity of spatial workloads, large-scale parallelization often holds the key to achieving high-performance spatial queries. As cloud-based cluster computing technology matures and becomes economically scalable, MapReduce-based systems offer an alternative solution for data- and compute-intensive spatial analytics at large scale. Meanwhile, parallel processing of queries relies on effective data partitioning to scale. Considering that spatial workloads are often compute-intensive [7,22], utilizing hardware accelerators for query co-processing is a very promising technique, as modern computer systems are embracing heterogeneous architectures that combine graphics processing units (GPUs) and CPUs [12]. In the rest of the chapter, we elaborate each of these techniques in greater detail and summarize state-of-the-art approaches and systems.

1.4.1 MapReduce-based spatial query processing

MapReduce is a very scalable parallel processing framework designed to process flat data structures; it lacks native support for multidimensional spatial objects, and several systems have emerged over the past few years to fill this gap. Well-known systems and prototypes include HadoopGIS [8–10], SpatialHadoop [16,17], Parallel Secondo [20,21], and GIS tools for Hadoop [4,30]. These systems are based on the open source implementation of MapReduce—Hadoop—and provide similar analytics functionality. However, they differ in implementation details and architecture: HadoopGIS and SpatialHadoop are pure MapReduce-based query evaluation systems; Parallel Secondo is a hybrid system that combines a database engine with MapReduce; and GIS tools for Hadoop is a functional extension of Hive [26] with user-defined functions.

MapReduce relies on the partitioning of data to process it in parallel, and partitioning is the key to a high-performance system. In the context of large-scale spatial data analytics, an intuitive approach is to partition the dataset on the spatial attribute and assign spatial objects to partitioned regions (or tiles). The generated tiles then form parallelization units that can be processed independently in parallel. A MapReduce-based spatial query processing system takes advantage of such partitioning to achieve high performance. Algorithm 1.1 illustrates a general design framework for such systems; all the above-mentioned systems follow this framework, although implementation details may vary.

Algorithm 1.1: Typical workflow of spatial query processing on MapReduce

    A. Data/space partitioning
    B. Data storage of partitioned data on HDFS
    C. Query pre-processing (e.g., global index-based filtering)
    for tile in input collection do
        D. Index building for objects in the tile;
           tile-based spatial query processing
    end
    E. Boundary object handling
    F. Post-query processing
    G. Data aggregation
    H. Result storage on HDFS

Initially the dataset is spatially partitioned to generate tiles, as shown in step A. In step B, spatial objects are assigned unique tile identifiers (UIDs), merged, and stored into the Hadoop Distributed File System (HDFS). Step C pre-processes queries, for example, through global index-based filtering. Step D performs tile-based spatial query processing independently and in parallel across a large number of cluster nodes. Step E handles the boundary objects that arise from the partitioning. Step F is for post-query processing, and step G performs data aggregation. Finally, the query results are persisted to HDFS, where they can serve as input to the next query operator.

Following such a framework, spatial queries such as spatial join, spatial range, and nearest neighbor queries can be implemented efficiently. Reference implementations are provided in HadoopGIS, SpatialHadoop, and Parallel Secondo.
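As a concrete illustration of Algorithm 1.1, the following Python sketch runs the tile-based join workflow in a single process: a map-like phase assigns each object (by its MBR) to every uniform grid tile it touches, a reduce-like phase joins each tile independently, and a set collects the results to deduplicate matches found in multiple tiles (step E). The grid size and the MBR-only overlap test are simplifying assumptions of this sketch; a real system would distribute the per-tile work as MapReduce tasks over HDFS.

    from collections import defaultdict

    TILE = 100.0  # assumed uniform grid tile size

    def tiles_for(box):
        """Every grid tile touched by an MBR (xmin, ymin, xmax, ymax)."""
        xmin, ymin, xmax, ymax = box
        for tx in range(int(xmin // TILE), int(xmax // TILE) + 1):
            for ty in range(int(ymin // TILE), int(ymax // TILE) + 1):
                yield (tx, ty)

    def tiled_spatial_join(dataset_a, dataset_b, overlaps):
        """Datasets are lists of (obj_id, mbr); steps refer to Algorithm 1.1."""
        # Steps A-B: assign each object to every tile its MBR touches;
        # boundary objects are replicated (multi-assignment).
        tiles = defaultdict(lambda: ([], []))
        for side, dataset in ((0, dataset_a), (1, dataset_b)):
            for obj in dataset:
                for t in tiles_for(obj[1]):
                    tiles[t][side].append(obj)
        # Step D: each tile is joined independently -- the unit of
        # parallelism a MapReduce job would spread across cluster nodes.
        matches = set()
        for side_a, side_b in tiles.values():
            for id_a, box_a in side_a:
                for id_b, box_b in side_b:
                    if overlaps(box_a, box_b):
                        # Step E/G: the set deduplicates pairs found in
                        # several tiles while aggregating the final result.
                        matches.add((id_a, id_b))
        return matches

    # Example: 'a1' spans two tiles, yet the pair is reported exactly once.
    boxes_overlap = lambda a, b: (a[0] <= b[2] and b[0] <= a[2] and
                                  a[1] <= b[3] and b[1] <= a[3])
    print(tiled_spatial_join([("a1", (0, 0, 150, 50))],
                             [("b1", (120, 10, 300, 40))],
                             boxes_overlap))   # -> {('a1', 'b1')}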

1.4.2 Effective spatial data partitioning

Spatial data partitioning is an essential initial step to define, generate, and represent partitioned data. Effective data partitioning is critical for task parallelization and load balancing, and it directly affects system performance. Generally, a space-oriented partitioning can be applied to generate data partitions; the concept is illustrated in Figure 1.2, in which the spatial data is partitioned into uniform grids.

Figure 1.2: An example of spatial data partitioning.

However, there are several problems with this approach: (1) As spatial objects (e.g., polygons) may cross tile boundaries, the partitioning will produce objects spanning multiple grid cells, which need to be replicated and post-processed. If such objects account for a considerable fraction of the dataset, the overall query performance will suffer from the boundary handling overhead. (2) Fixed grid partitioning is skew-averse, whereas data in most real-world spatial applications are inherently highly skewed. In such cases, it is very likely that the parallel processing nodes assigned to process the dense regions will become stragglers, and the overall query processing efficiency will suffer [23].

The boundary problem can be addressed in two different ways—multi-assignment, single-join (MASJ) and single-assignment, multi-join (SAMJ) [19,32]. In the MASJ approach, each boundary object is replicated to each tile that overlaps with the object. During the query processing phase, each partition is processed only once without special handling of boundary objects; a de-duplication step is then initiated to remove the redundancies that result from the replication. In the SAMJ approach, each boundary object is assigned to only one tile, so during the query processing phase each tile is processed multiple times to account for the boundary objects.

Both approaches introduce extra query processing overhead. In MASJ, the replication of boundary objects incurs extra storage and computation cost. In SAMJ, only extra computation cost is incurred, by processing the same partition multiple times. HadoopGIS and SpatialHadoop take the MASJ approach and modify the query processing steps to account for replicated objects (the sketch after Algorithm 1.1 above follows the same scheme). Depending on the application requirements, such a design choice can be re-evaluated and modified to achieve better performance.

The data skew problem can be mitigated through skew-aware partitioning approaches that create balanced partitions. HadoopGIS uses a multi-step approach named SATO, which can partition a geospatial dataset into balanced regions while minimizing the number of boundary objects. SATO stands for the four main steps of the framework: Sample, Analyze, Tear, and Optimize. First, a small fraction of the dataset is sampled to identify the overall global data distribution and potential dense regions. The sampled data is analyzed with a partition analyzer that produces a coarse partition scheme in which each partition region is expected to contain roughly equal amounts of spatial objects. These coarse partition regions are then further processed with a partitioning component that tears the regions into more granular, much less skewed partitions. Finally, the generated partition meta-information is aggregated to produce multilevel partition indexes and additional partition statistics.
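A much-simplified sketch of the Sample and Tear steps follows, under assumptions of our own (uniform sampling, alternating median cuts, a target partition count); the Analyze and Optimize steps, and the handling of object extents, are omitted. The point is only to show how a small sample can yield partition regions with roughly equal object counts despite skew.

    # Simplified sketch of SATO-style skew-aware partitioning (Sample + Tear).
    import random

    def tear(region, depth=0, cap=64):
        """Tear: recursively split a sampled region with alternating x/y
        median cuts (kd-tree style) until it is no longer dense."""
        if len(region) <= cap:
            return [region]
        axis = depth % 2
        region.sort(key=lambda p: p[axis])
        mid = len(region) // 2
        return (tear(region[:mid], depth + 1, cap) +
                tear(region[mid:], depth + 1, cap))

    def sato_like_partition(points, sample_rate=0.01, target_partitions=16, seed=7):
        rng = random.Random(seed)
        # Sample: a small fraction reveals the global distribution cheaply.
        sample = [p for p in points if rng.random() < sample_rate]
        leaves = tear(sample, cap=max(1, len(sample) // target_partitions))
        # Each leaf's bounding box defines one partition region; dense areas
        # end up covered by many small regions, sparse areas by few large ones.
        return [(min(x for x, _ in leaf), min(y for _, y in leaf),
                 max(x for x, _ in leaf), max(y for _, y in leaf))
                for leaf in leaves]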


SpatialHadoop also creates balanced partitions in a similar manner. Specifically, an appropriate partition size is estimated, and rectangular boundaries are generated according to that partition parameter. Then, a MapReduce job is initiated to create spatial partitions that correspond to the configuration. An R-tree-based partitioning is used to ensure the partition size constraint.

1.4.3 Query co-processing with GPU and CPU

  Most spatial queries are compute-intensive [7,22], as they involve geometric computations on complex multidimensional spatial objects. While spatial filtering through MBRs can be accelerated through spatial access methods, spatial refinements such as polygon intersection verification are highly expensive operations. For example, spatial join queries such as spatial cross-matching or overlaying multiple sets of spatial objects on an image or map can be very expensive to process.

GPUs have been successfully utilized in numerous applications that require high-performance computation. Mainstream general-purpose GPUs come with hundreds of cores and can run thousands of threads in parallel. Compared to multi-core computer systems (dozens of cores), GPUs can scale to a large number of threads in a cost-effective manner. In the coming years, such heterogeneous parallel architectures will become dominant, and software systems must fully exploit this heterogeneity to deliver performance growth [12].

In most cases, spatial algorithms are designed for execution on CPUs, and the branch-intensive nature of CPU-based algorithms requires the algorithms to be rewritten for GPUs to obtain satisfactory performance. For this need, PixelBox is proposed in [29]. PixelBox is an algorithm specifically designed for accelerating cross-matching queries on GPUs. It first transforms the vector-based geometry representation into a raster representation using a pixelization method, and then performs operations on these representations in parallel. The pixelization method reduces the geometry calculation problem to a simple pixel-position-checking problem, which is very suitable for execution on GPUs. Since testing the position of one pixel is totally independent of testing another, the computation can be parallelized by having multiple threads process the pixels in parallel. Furthermore, since the positions of different pixels are computed against the same pair of polygons, the operations performed by different threads follow the single instruction, multiple data (SIMD) fashion—the parallel computation model that GPUs are designed for. Experimental results [29] suggest that PixelBox achieves almost an order of magnitude speedup compared to the CPU implementation and significantly reduces the cost of computation.
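To make the pixelization idea concrete, here is a toy CPU sketch (not the PixelBox implementation itself): two polygons are sampled over a shared grid, and their intersection area is approximated by counting sample points that fall inside both. The ray-casting point-in-polygon test is standard; the bounding box and step size are assumed inputs. Each pixel test is independent of every other, which is precisely what lets a GPU dedicate one thread per pixel in SIMD fashion.

    def point_in_polygon(x, y, poly):
        """Standard ray-casting point-in-polygon test; poly is [(x, y), ...]."""
        inside = False
        n = len(poly)
        for i in range(n):
            (x1, y1), (x2, y2) = poly[i], poly[(i + 1) % n]
            if (y1 > y) != (y2 > y) and x < (x2 - x1) * (y - y1) / (y2 - y1) + x1:
                inside = not inside
        return inside

    def pixelized_intersection_area(poly_a, poly_b, box, step=1.0):
        """Approximate the area of A intersect B by counting grid samples
        inside both polygons. On a GPU, each sample would be tested by its
        own thread (SIMD); this CPU sketch simply loops over the grid."""
        xmin, ymin, xmax, ymax = box
        count = 0
        y = ymin + step / 2
        while y < ymax:
            x = xmin + step / 2
            while x < xmax:
                if point_in_polygon(x, y, poly_a) and point_in_polygon(x, y, poly_b):
                    count += 1
                x += step
            y += step
        return count * step * step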

1.4.3.1 Task assignment

One critical issue for GPU-based parallelization is task assignment. For example, in a recent approach that combines MapReduce and GPU-based query processing [7], tasks arrive in the form of data partitions along with a spatial query operation on the data. Given a partition, the query optimizer has to decide which device should execute the task. Such a decision is not simple; it depends on how much speedup can be obtained by assigning the task to the CPU or the GPU. If we schedule a small task on the GPU, we may get very little speedup, and the opportunity cost of not executing some other high-speedup task on the GPU can be high.

In such cases, a predictive modeling approach can offer a reasonable solution. Similar to the speculative execution model in Hadoop, a fraction of the data (10%, for example) is used for performance profiling and model training. Then, regression or machine learning approaches are used to derive the performance model and the corresponding model parameters. Later, at runtime, the derived model is used to predict the potential speedup factor for the current task, and the task is assigned to a device accordingly.
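A minimal sketch of this profile-then-predict scheme is given below. All specifics are illustrative assumptions: an ordinary least-squares line maps task size to the GPU-over-CPU speedup observed on the profiling sample, and a fixed threshold decides when the predicted payoff justifies occupying the GPU.

    def fit_line(xs, ys):
        """Ordinary least squares for y ~ a*x + b."""
        n = len(xs)
        mx, my = sum(xs) / n, sum(ys) / n
        a = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
        return a, my - a * mx

    def make_scheduler(profiled_sizes, profiled_speedups, threshold=1.5):
        """Train on the profiling sample (~10% of data), return a router."""
        a, b = fit_line(profiled_sizes, profiled_speedups)
        def assign(task_size):
            predicted = a * task_size + b
            # Occupy the GPU only when the predicted speedup justifies it;
            # otherwise the opportunity cost favors keeping it free.
            return "GPU" if predicted >= threshold else "CPU"
        return assign

    # Illustrative profile: larger partitions showed higher GPU speedups.
    assign = make_scheduler([10, 50, 100, 200], [0.8, 1.4, 2.5, 4.1])
    print(assign(30), assign(150))   # -> CPU GPU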


1.4.3.2 Effects of task granularity

Data need to be shipped to device memory to be processed on the GPU, and such data transfer incurs a certain I/O cost. While the bandwidth between GPU and CPU memory is much higher than the bandwidth between memory and disk, the transfer cost should still be minimized to achieve optimal performance. To achieve optimal speedup, the compute-to-transfer ratio needs to be high for GPU applications. Therefore, applications need to adjust the partition granularity to fully utilize system resources. While larger partitions are ideal for achieving higher speedup on the GPU, they cause data skew, which is detrimental to MapReduce system performance. At the same time, a very small partition is not a good candidate for hardware acceleration.
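The back-of-the-envelope model below illustrates the compute-to-transfer trade-off. Every constant is an assumed, illustrative number (not a measurement), and per-partition work is modeled as quadratic in the number of objects, as in a pairwise join: small partitions are dominated by fixed launch and transfer overhead, while large ones approach the kernel's raw advantage, at the risk of skew in the MapReduce layer.

    PCIE_BW   = 12e9   # bytes/s host-device bandwidth (assumed)
    GPU_RATE  = 80e6   # geometric tests/s for the GPU kernel (assumed)
    CPU_RATE  = 4e6    # geometric tests/s on one CPU core (assumed)
    LAUNCH    = 5e-4   # fixed kernel launch/setup overhead, seconds (assumed)
    OBJ_BYTES = 200    # serialized size of one spatial object (assumed)

    def effective_gpu_speedup(n_objects):
        """CPU time over GPU time for one partition, transfer cost included.
        Work is modeled as quadratic in partition size (pairwise tests)."""
        tests = n_objects ** 2
        transfer = 2 * n_objects * OBJ_BYTES / PCIE_BW
        gpu_time = LAUNCH + transfer + tests / GPU_RATE
        cpu_time = tests / CPU_RATE
        return cpu_time / gpu_time

    for n in (100, 1000, 10000):
        print(n, round(effective_gpu_speedup(n), 1))
    # -> ~4.0, ~19.2, ~20.0: small partitions are dominated by fixed
    #    overhead; large ones approach the kernel's raw 20x advantage.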

1.5 Discussion and Conclusion

In this chapter, we have discussed several representative spatial Big Data applications, common challenges in this domain, and potential solutions to those challenges. Spatial Big Data from various application domains shares many requirements with enterprise Big Data, but it has its own unique characteristics—spatial data are multidimensional, spatial queries are complex, and spatial query processing comes with high computational complexity. As the volume of data grows continuously, we need efficient Big Data systems and data management techniques to cope with these challenges. MapReduce-based massively parallel query processing systems offer a scalable yet cost-effective solution for processing large amounts of spatial data. On top of such a framework, effective data partitioning techniques are critical for facilitating massive parallelism and achieving satisfactory query performance. Meanwhile, as multi-core computer architectures and programming techniques mature, hardware-accelerated query co-processing on GPUs can further improve query performance for large-scale spatial analytics tasks.

  Acknowledgments

  Fusheng Wang acknowledges that this material is based on work supported in part by NSF CAREER award IIS 1350885, NSF ACI 1443054, and by the National Cancer Institute under grant No. 1U24CA180924-01A1.