Big Data Principles and Paradigms

Edited by

Rajkumar Buyya
The University of Melbourne and Manjrasoft Pty Ltd, Australia

Rodrigo N. Calheiros
The University of Melbourne, Australia

Amir Vahid Dastjerdi
The University of Melbourne, Australia

AMSTERDAM • BOSTON • HEIDELBERG • LONDON • NEW YORK • OXFORD • PARIS • SAN DIEGO
SAN FRANCISCO • SINGAPORE • SYDNEY • TOKYO

Morgan Kaufmann is an imprint of Elsevier

50 Hampshire Street, 5th Floor, Cambridge, MA 02139, USA

Copyright © 2016 Elsevier Inc. All rights reserved.

No part of this publication may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopying, recording, or any information storage and retrieval system, without permission in writing from the publisher. Details on how to seek permission, further information about the Publisher's permissions policies, and our arrangements with organizations such as the Copyright Clearance Center and the Copyright Licensing Agency can be found at our website.

This book and the individual contributions contained in it are protected under copyright by the Publisher (other than as may be noted herein).

Notices

Knowledge and best practice in this field are constantly changing. As new research and experience broaden our understanding, changes in research methods, professional practices, or medical treatment may become necessary.

Practitioners and researchers must always rely on their own experience and knowledge in evaluating and using any information, methods, compounds, or experiments described herein. In using such information or methods they should be mindful of their own safety and the safety of others, including parties for whom they have a professional responsibility.

To the fullest extent of the law, neither the Publisher nor the authors, contributors, or editors assume any liability for any injury and/or damage to persons or property as a matter of products liability, negligence or otherwise, or from any use or operation of any methods, products, instructions, or ideas contained in the material herein.

Library of Congress Cataloging-in-Publication Data
A catalog record for this book is available from the Library of Congress.

British Library Cataloguing-in-Publication Data
A catalogue record for this book is available from the British Library.

ISBN: 978-0-12-805394-2

For information on all Morgan Kaufmann publications

Publisher: Todd Green
Acquisition Editor: Brian Romer
Editorial Project Manager: Amy Invernizzi
Production Project Manager: Punithavathy Govindaradjane
Designer: Victoria Pearson

List of contributors

T. Achalakul

  King Mongkut’s University of Technology Thonburi, Bangkok, Thailand

  P. Ameri

  Karlsruhe Institute of Technology (KIT), Karlsruhe, Baden-Württemberg, Germany

  A. Berry

  Deontik, Brisbane, QLD, Australia

  N. Bojja

  Machine Zone, Palo Alto, CA, USA

  R. Buyya

  The University of Melbourne, Parkville, VIC, Australia; Manjrasoft Pty Ltd, Melbourne, VIC, Australia

  W. Chen

University of New South Wales, Sydney, NSW, Australia

  C. Deerosejanadej

  King Mongkut’s University of Technology Thonburi, Bangkok, Thailand

  A. Diaz-Perez

  Cinvestav-Tamaulipas, Tamps., Mexico

  H. Ding

  Xi’an Jiaotong University, Shaanxi, China

X. Dong

  Huazhong University of Science and Technology, Wuhan, Hubei, China

  H. Duan

  The University of Melbourne, Parkville, VIC, Australia

  S. Dutta

  Max Planck Institute for Informatics, Saarbruecken, Saarland, Germany

  A. Garcia-Robledo

  Cinvestav-Tamaulipas, Tamps., Mexico

  V. Gramoli

  University of Sydney, Sydney, NSW, Australia

  X. Gu

  Huazhong University of Science and Technology, Wuhan, Hubei, China

  J. Han

  Xi’an Jiaotong University, Shaanxi, China

B. He

  Nanyang Technological University, Singapore, Singapore

S. Ibrahim

  Inria Rennes – Bretagne Atlantique, Rennes, France

  Z. Jiang

  Xi’an Jiaotong University, Shaanxi, China

  S. Kannan

  Machine Zone, Palo Alto, CA, USA

  S. Karuppusamy

  Machine Zone, Palo Alto, CA, USA

A. Kejariwal

  Machine Zone, Palo Alto, CA, USA

  B.-S. Lee

  Nanyang Technological University, Singapore, Singapore

  Y.C. Lee

  Macquarie University, Sydney, NSW, Australia

X. Li

  Tsinghua University, Beijing, China

  R. Li

  Huazhong University of Science and Technology, Wuhan, Hubei, China

  K. Li

  State University of New York–New Paltz, New Paltz, NY, USA

  H. Liu

  Huazhong University of Science and Technology, Wuhan, China

  P. Lu

  University of Sydney, Sydney, NSW, Australia

  K.-T. Lu

  Washington State University, Vancouver, WA, United States

  Z. Milosevic

  Deontik, Brisbane, QLD, Australia

  G. Morales-Luna

  Cinvestav-IPN, Mexico City, Mexico

  A. Narang

  Data Science Mobileum Inc., Gurgaon, HR, India

  A. Nedunchezhian

  Machine Zone, Palo Alto, CA, USA

  D. Nguyen

  Washington State University, Vancouver, WA, United States

  L. Ou

  Hunan University, Changsha, China


  S. Prom-on

  King Mongkut’s University of Technology Thonburi, Bangkok, Thailand

  Z. Qin

  Hunan University, Changsha, China

  F.A. Rabhi

University of New South Wales, Sydney, NSW, Australia

  K. Ramamohanarao

  The University of Melbourne, Parkville, VIC, Australia

  T. Ryan

  University of Sydney, Sydney, NSW, Australia

  R.O. Sinnott

  The University of Melbourne, Parkville, VIC, Australia

  S. Sun

  The University of Melbourne, Parkville, VIC, Australia

  Y. Sun

  The University of Melbourne, Parkville, VIC, Australia

  S. Tang

  Tianjin University, Tianjin, China

  P. Venkateshan

  Machine Zone, Palo Alto, CA, USA

  S. Wallace

  Washington State University, Vancouver, WA, United States

  P. Wang

  Machine Zone, Palo Alto, CA, USA

  C. Wu

  The University of Melbourne, Parkville, VIC, Australia

  W. Xi

  Xi’an Jiaotong University, Shaanxi, China

  Z. Xue

  Huazhong University of Science and Technology, Wuhan, Hubei, China

  H. Yin

  Hunan University, Changsha, China

  G. Zhang

  Tsinghua University, Beijing, China

  M. Zhanikeev

  Tokyo University of Science, Chiyoda-ku, Tokyo, Japan

X. Zhao

  Washington State University, Vancouver, WA, United States

W. Zheng

  Tsinghua University, Beijing, China

  A.C. Zhou

  Nanyang Technological University, Singapore, Singapore

  A.Y. Zomaya

  University of Sydney, Sydney, NSW, Australia

  About the Editors

Dr. Rajkumar Buyya is a Fellow of IEEE, a professor of Computer Science and Software Engineering, a Future Fellow of the Australian Research Council, and director of the Cloud Computing and Distributed Systems (CLOUDS) Laboratory at the University of Melbourne, Australia. He is also serving as the founding CEO of Manjrasoft, a spin-off company of the University, commercializing its innovations in cloud computing. He has authored over 500 publications and four textbooks, including Mastering Cloud Computing, published by McGraw Hill, China Machine Press, and Morgan Kaufmann for the Indian, Chinese, and international markets, respectively. He also edited several books, including Cloud Computing: Principles and Paradigms (Wiley Press, USA, Feb. 2011). He is one of the most highly cited authors in computer science and software engineering worldwide (h-index = 98, g-index = 202, 44,800+ citations). The Microsoft Academic Search Index ranked Dr. Buyya as the world's top author in distributed and parallel computing between 2007 and 2015. A Scientometric Analysis of Cloud Computing Literature by German scientists ranked Dr. Buyya as the World's Top-Cited (#1) Author and the World's Most-Productive (#1) Author in Cloud Computing.

Software technologies for grid and cloud computing developed under Dr. Buyya's leadership have gained rapid acceptance and are in use at several academic institutions and commercial enterprises in 40 countries around the world. Dr. Buyya has led the establishment and development of key community activities, including serving as foundation chair of the IEEE Technical Committee on Scalable Computing and five IEEE/ACM conferences. These contributions and Dr. Buyya's international research leadership are recognized through the award of the 2009 IEEE TCSC Medal for Excellence in Scalable Computing from the IEEE Computer Society TCSC. Manjrasoft's Aneka Cloud technology, developed under his leadership, received the 2010 Frost & Sullivan New Product Innovation Award. Recently, Manjrasoft has been recognized as one of the Top 20 Cloud Computing Companies by the Silicon Review Magazine. He served as the foundation editor-in-chief of IEEE Transactions on Cloud Computing. He is currently serving as co-editor-in-chief of Journal of Software: Practice and Experience, which was established 40+ years ago. For further information on Dr. Buyya, please visit his cyberhome.

  

Dr. Rodrigo N. Calheiros is a research fellow in the Department of Computing and Information Systems at The University of Melbourne, Australia. He has made major contributions to the fields of Big Data and cloud computing since 2009. He designed and developed CloudSim, an open source tool for the simulation of cloud platforms used at research centers, universities, and companies worldwide.

  

Dr. Amir Vahid Dastjerdi is a research fellow with the Cloud Computing and Distributed Systems (CLOUDS) Laboratory at the University of Melbourne. He received his PhD in computer science from the University of Melbourne, and his areas of interest include the Internet of Things, Big Data, and cloud computing.

  Preface

Rapid advances in digital sensors, networks, storage, and computation, along with their availability at low cost, are leading to the creation of huge collections of data. Initially, the drive for the generation and storage of data came from scientists; telescopes and instruments such as the Large Hadron Collider (LHC) generate huge amounts of data that need to be processed to enable scientific discovery. The LHC, for example, was reported to generate as much as 1 TB of data every second. Later, with the popularity of the SMAC (social, mobile, analytics, and cloud) paradigm, enormous amounts of data started to be generated, processed, and stored by enterprises. For instance, Facebook in 2012 reported that the company processed over 200 TB of data per hour. In fact, SINTEF (The Foundation for Scientific and Industrial Research) of Norway reports that 90% of the world's data has been generated in the last 2 years. These were the key motivators for the Big Data paradigm.

Unlike traditional data warehouses, which rely on highly structured data, this new paradigm unleashes the potential of analyzing any source of data, whether structured and stored in relational databases; semi-structured and emerging from sensors, machines, and applications; or unstructured and obtained from social media and other human sources.

This data has the potential to enable new insights that can change the way businesses, science, and governments deliver services to their consumers, and it can impact society as a whole. Nevertheless, for this potential to be realized, new algorithms, methods, infrastructures, and platforms are required that can make sense of all this data and provide insights while they are still of interest to analysts of diverse domains.

This has led to the emergence of the Big Data computing paradigm, which focuses on the sensing, collection, storage, management, and analysis of data from a variety of sources to enable new value and insights. This paradigm has considerably enhanced the capacity of organizations to understand their activities and improve aspects of their business in ways never imagined before; at the same time, however, it raises new concerns of security and privacy whose implications are still not completely understood by society.

To realize the full potential of Big Data, researchers and practitioners need to address several challenges and develop suitable conceptual and technological solutions for tackling them. These include life-cycle management of data; large-scale storage; flexible processing infrastructure; data modeling; scalable machine learning and data analysis algorithms; techniques for sampling and making trade-offs between data processing time and accuracy; and dealing with the privacy and ethical issues involved in data sensing, storage, processing, and actions.

  This book addresses the above issues by presenting a broad view of each of the issues, identifying challenges faced by researchers and opportunities for practitioners embracing the Big Data paradigm.

ORGANIZATION OF THE BOOK

  This book contains 18 chapters authored by several leading experts in the field of Big Data. The book is presented in a coordinated and integrated manner starting with Big Data analytics methods, going through the infrastructures and platforms supporting them, aspects of security and privacy, and finally, applications.


The content of the book is organized into four parts:

I. Big Data Science
II. Big Data Infrastructures and Platforms
III. Big Data Security and Privacy
IV. Big Data Applications

PART I: BIG DATA SCIENCE

Data Science is a discipline that emerged in the last few years, as did the Big Data concept. Although there are different interpretations of what Data Science is, we adopt the view that Data Science is a discipline that merges concepts from computer science (algorithms, programming, machine learning, and data mining), mathematics (statistics and optimization), and domain knowledge (business, applications, and visualization) to extract insights from data and transform them into actions that have an impact in the particular domain of application. Data Science is already challenging when the amount of data permits traditional analysis; it becomes particularly challenging when traditional methods lose their effectiveness due to the large volume and velocity of the data.

Part I presents fundamental concepts and algorithms in the Data Science domain that address the issues raised by Big Data. As a motivation for this part, and in the same direction as what we have discussed so far, Chapter 1 discusses how what is now known as Big Data is the result of efforts in two distinct areas, namely machine learning and cloud computing.

The velocity aspect of Big Data demands analytic algorithms that can operate on data in motion, ie, algorithms that do not assume that all the data is available all the time for decision making, so that decisions need to be made "on the go," probably with summaries of past data. In this direction, Chapter 2 discusses real-time processing systems for Big Data, including stream processing platforms that enable analysis of data in motion and a case study in finance.

The volume aspect of data demands that existing algorithms for different types of analytics are adapted to take advantage of distributed systems where memory is not shared, and thus different machines have access to different parts of the data. Chapter 3 discusses how this affects natural language processing, text mining, and anomaly detection in the context of social media.

A concept that emerged recently, benefiting from Big Data, is deep learning. The approach, derived from artificial neural networks, constructs layered structures that hold different abstractions of the same data and has applications in language processing and image analysis, among others; this is the subject of Chapter 4. Chapter 5 discusses GPU-based algorithms for graph processing.

PART II: BIG DATA INFRASTRUCTURES AND PLATFORMS

Although part of the Big Data revolution is enabled by new algorithms and methods to handle large amounts of heterogeneous data in movement and at rest, all of this would be of no value if computing infrastructures could not store and process such data at scale. As new platforms emerged, different abstractions for programmers arose that enable problems to be represented in different ways. Thus, instead of adapting the problem to fit a programming model, developers are now able to select the abstraction that is closest to the problem at hand, enabling faster, more correct software solutions to be developed. The same revolution observed in the computing part of analytics is also observed in the storage part; in recent years, new methods for persisting data were developed and adopted that are more flexible than traditional relational databases.

Part II of this book is dedicated to such infrastructures and platforms supporting Big Data. Starting with database support, Chapter 6 discusses the different NoSQL database models and systems that are available for the storage of large amounts of structured, semi-structured, and unstructured data, including key-value, column-based, graph-based, and document-based stores.
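To make these four models concrete, the sketch below shows how one and the same user record might be laid out in each kind of store. This is an illustrative sketch only; the record, field names, and layouts are our own and are not drawn from any particular chapter or product.

# How a single user record might be represented in the four NoSQL
# families discussed above (plain Python structures stand in for stores).

# Key-value store: an opaque value addressed by a single key; the
# database knows nothing about the value's internal structure.
kv_store = {
    "user:42": '{"name": "Alice", "city": "Melbourne"}',
}

# Document store: the same record, but the database understands its
# fields, so they can be queried and indexed; documents may vary in shape.
doc_store = {
    "users": [
        {"_id": 42, "name": "Alice", "city": "Melbourne", "follows": [7, 19]},
    ]
}

# Column-based store: values grouped into column families under a row key.
column_store = {
    "row:42": {
        "profile": {"name": "Alice", "city": "Melbourne"},
        "social": {"follows": "7,19"},
    }
}

# Graph store: entities as nodes and relationships as first-class edges.
graph_store = {
    "nodes": {42: {"name": "Alice"}, 7: {"name": "Bob"}},
    "edges": [(42, "FOLLOWS", 7)],
}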

As the infrastructures of choice for running Big Data analytics are shared (think of clusters and clouds), new methods were necessary to rationalize the use of resources so that all applications get their fair share of resources and can progress to a result in a reasonable amount of time. In this direction, Chapter 8 presents a novel technique for increasing the resource usage and performance of Big Data platforms, and Chapter 9 contains a survey of various techniques for optimizing many aspects of the Hadoop framework, including the job scheduler, HDFS, and HBase.

Whereas the previous chapters focused on distributed platforms for Big Data analytics, parallel platforms, which rely on many computing cores sharing memory, are also viable platforms for Big Data analytics. In this direction, Chapter 10 discusses an alternative solution that is optimized to take advantage of the large amount of memory and large number of cores available in current servers.

PART III: BIG DATA SECURITY AND PRIVACY

For economic reasons, physical infrastructures supporting Big Data are shared. This helps in rationalizing the huge costs involved in building such large-scale cloud infrastructures. Thus, whether the infrastructure is a public cloud or a private cloud, multitenancy is a certainty that raises security and privacy concerns. Moreover, data can reveal many things about its source; although many times the sources will be applications and the data generated will be in the public domain, it is also possible that data generated by devices and by the actions of humans (eg, via posts in social networks) can be analyzed in a way that individuals can be identified and/or localized, which raises further privacy issues. Part III of this book is dedicated to such security and privacy issues of Big Data.

Among the topics covered are methods to infer the location of mobile devices and to estimate human behavior in shopping activities.

PART IV: BIG DATA APPLICATIONS

All the advances in methods and platforms would be of no value if the capabilities they offer did not find practical use. This is fortunately not the case, and a range of applications in the most diverse areas were developed to fulfill the goal of delivering value via Big Data analytics. These days, financial institutions, governments, educational institutions, and researchers, to name a few, are applying Big Data analytics on a daily basis as part of their business-as-usual tasks. Part IV of this book is dedicated to such applications, featuring interesting use cases of the application of Big Data analytics.

Social media arose in the last 10 years, initially as a means to connect people. Now, it has emerged as a platform for business purposes, advertisements, delivery of news of public interest, and for people to express their opinions and emotions. Chapter 14 introduces an application in this context, namely a Big Data framework for mining opinion from social media in Thailand.

A later chapter presents a case study on the application of Big Data analytics in the energy sector; the chapter shows how data generated by smart distribution lines (smart grids) can be analyzed to enable the identification of faults in the transmission line.

e-Science is one of the first applications driving the Big Data paradigm, in which scientific discoveries are enabled by large-scale computing infrastructures. As clusters and grids became popular among research institutions, it became clear that new discoveries could be made if these infrastructures were put to work to crunch massive volumes of data collected from many scientific instruments. Acknowledging the importance of e-Science as a motivator for a substantial amount of innovation in the field, Chapter 18 concludes the book with various e-Science applications and key elements of their deployment in a cloud environment.

  Acknowledgments

  We thank all the contributing authors for their time, effort, and dedication during the preparation of this book.

  Raj would like to thank his family members, especially his wife, Smrithi, and daughters, Soumya and Radha Buyya, for their love, understanding, and support during the preparation of this book. Rodrigo would like to thank his wife, Kimie, his son, Roger, and his daughter, Laura. Amir would like to thank his wife, Elly, and daughter, Diana.

Finally, we would like to thank the staff at Morgan Kaufmann, particularly Amy Invernizzi, Brian Romer, Punitha Govindaradjane, and Todd Green, for managing the publication in record time.

  Rajkumar Buyya The University of Melbourne and Manjrasoft Pty Ltd, Australia

  Rodrigo N. Calheiros The University of Melbourne, Australia

  Amir Vahid Dastjerdi The University of Melbourne, Australia

CHAPTER 1

BIG DATA ANALYTICS = MACHINE LEARNING + CLOUD COMPUTING

C. Wu, R. Buyya, K. Ramamohanarao

1.1 INTRODUCTION

Although the term "Big Data" has become popular, there is no general consensus about what it really means. Often, many professional data analysts take Big Data to connote the process of extraction, transformation, and loading (ETL) for large datasets. A popular description of Big Data is based on three main attributes of data: volume, velocity, and variety (or 3Vs). Nevertheless, this description does not capture all the aspects of Big Data accurately. In order to provide a comprehensive meaning of Big Data, we will investigate the term from a historical perspective and see how it has been evolving from yesterday's meaning to today's connotation.

Historically, the term Big Data is quite vague and ill defined. It is not a precise term and does not carry a particular meaning other than the notion of its size. The word "big" is too generic; the question of how "big" is big and how "small" is small is relative to time, space, and circumstance. From an evolutionary perspective, the size of "Big Data" is always evolving. If we use the current global Internet traffic as a measuring stick, the meaning of Big Data volume would lie in the range between the terabyte (TB, 10^12 or 2^40 bytes) and the zettabyte (ZB, 10^21 or 2^70 bytes). Based on the historical data traffic growth rate, Cisco claimed that humans entered the ZB era in 2015. To understand the significance of the data volume's impact, let us glance at the average size of different data files, shown in Table 1.

The main aim of this chapter is to provide a historical view of Big Data and to argue that it is not just 3Vs, but rather 3^2 Vs, or 9Vs. These additional Big Data attributes reflect the real motivation behind Big Data analytics (BDA). We believe that these expanded features clarify some basic questions about the essence of BDA: what problems Big Data can address, and what problems should not be confused with BDA. These issues are covered in the chapter through an analysis of historical developments, along with the associated technologies that support Big Data processing. The rest of the chapter is organized into eight sections as follows:

1) A historical review of Big Data
2) Interpretation of Big Data 3Vs, 4Vs, and 6Vs
3) Defining Big Data from 3Vs to 3^2 Vs
4) Big Data and machine learning (ML)
5) Big Data and cloud computing
6) Hadoop, Hadoop distributed file system (HDFS), MapReduce, Spark, and Flink
7) ML + CC (cloud computing) → BDA and guidelines
8) Conclusion


Table 1 Typical Size of Different Data Files

Media      Average Size of Data File   Notes (2014)
Web page   1.6–2 MB                    Average 100 objects
eBook      1–5 MB                      200–350 pages
Song       3.5–5.8 MB                  MP3, 256 Kbps rate, average 1.9 MB per minute (3 min)
Movie      100–120 GB                  MPEG-4 format, Full High Definition, 60 frames per second, 2 hours
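As a quick sanity check on these scales, the following short Python sketch (ours, not the chapter's) reproduces the decimal and binary readings of the TB and ZB units mentioned above, as well as the per-minute arithmetic behind the song entry in Table 1:

# Decimal vs. binary readings of the byte units discussed in this section.
TB_DECIMAL = 10**12   # terabyte: 10^12 bytes
TB_BINARY = 2**40     # binary reading: 2^40 bytes
ZB_DECIMAL = 10**21   # zettabyte: 10^21 bytes
ZB_BINARY = 2**70     # binary reading: 2^70 bytes

# Table 1's song entry: an MP3 encoded at 256 Kbps for 3 minutes.
bitrate_bps = 256_000           # 256 Kbps
duration_s = 3 * 60             # 3 minutes
song_bytes = bitrate_bps * duration_s / 8

print(f"TB: {TB_DECIMAL:.2e} (decimal) vs {TB_BINARY:.2e} (binary) bytes")
print(f"ZB: {ZB_DECIMAL:.2e} (decimal) vs {ZB_BINARY:.2e} (binary) bytes")
print(f"3-min song: {song_bytes / 1e6:.2f} MB "
      f"(~{song_bytes / duration_s * 60 / 1e6:.2f} MB per minute)")
# Output: about 5.76 MB in total and about 1.92 MB per minute,
# consistent with the 3.5-5.8 MB and 1.9 MB/min figures in Table 1.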


1.2 A HISTORICAL REVIEW OF BIG DATA

In order to capture the essence of Big Data, we review the origin and history of BDA and then propose a precise definition of it.

1.2.1 THE ORIGIN OF BIG DATA

Several studies have been conducted on the historical views and developments in the BDA area. Gil Press provided a short history of Big Data starting from 1944, which was based on Rider's work. He covered 68 years of the evolution of Big Data between 1944 and 2012 and illustrated 32 Big Data-related events in recent data science history. As Press indicated in his article, the fine line between the growth of data and Big Data has become blurred. Very often, the growth rate of data has been referred to as an "information explosion"; although "data" and "information" are often used interchangeably, the two terms have different connotations. Press' study is quite comprehensive and covers BDA events up to December 2013. Since then, there have been many relevant Big Data events. Nevertheless, Press' review did cover both Big Data and data science events. To this extent, the term "data science" can be considered complementary to BDA.

In comparison with Press' review, Frank Ohlhorst traced the origin of Big Data back to 1880, when the 10th US census was held. The real problem during the 19th century was a statistics issue: how to survey and document 50 million North American citizens. Although Big Data may involve the computation of some statistical elements, these two terms have different interpretations today. Similarly, Winshuttle believes the origin of Big Data lies in the 19th century. Winshuttle argues that if data sets are so large and complex that they are beyond traditional processing and management capability, then they can be considered Big Data. In comparison to Press' review, Winshuttle's review emphasizes enterprise resource planning and implementation on cloud infrastructure. Moreover, the review also makes a prediction for data growth to 2020. The total time span of the review is more than 220 years. Winshuttle's Big Data history includes many SAP events and its data products, such as HANA.

The longest span of historical review for Big Data belongs to Bernard Marr. He traced the origin of Big Data back to 18,000 BCE. Marr argued that we should pay attention to the historical foundations of Big Data, which are the different approaches humans have taken to capture, store, analyze, and retrieve both data and information. Furthermore, Marr believed that the first person to cast the term "Big Data" was Erik Larson, who presented an article for Harper's Magazine that was subsequently reprinted in The Washington Post in 1989, because it contained two sentences featuring the words "Big Data": "The keepers of Big Data say they do it for the consumer's benefit. But data have a way of being used for purposes other than originally intended."

In contrast, Steve Lohr disagreed with Marr's view. He argued that just adopting the term alone might not carry the connotation of today's Big Data because "The term Big Data is so generic that the hunt for its origin was not just an effort to find an early reference to those two words being used together. Instead, the goal was the early use of the term that suggests its present interpretation — that is, not just a lot of data, but different types of data handled in new ways." This is an important point. Based on this reasoning, we consider that Cox and Ellsworth were the first to use the term in its present sense, because they assigned a relatively accurate meaning to the existing view of Big Data, stating: "…data sets are generally quite large, taxing the capacities of main memory, local disk and even remote disk. We call this the problem of Big Data. When data sets do not fit in main memory (in core), or when they do not fit even on local disk…" Although today's term may have an extended meaning as opposed to Cox and Ellsworth's term, this definition reflects today's connotation with reasonable accuracy.

Another historical review was contributed by Visualizing.org. It focused on the timeline of how to implement BDA. Its historical description is mainly determined by events related to the Big Data push by many Internet and IT companies, such as Google, YouTube, Yahoo, Facebook, Twitter, and Apple, and it emphasized the significant role of Hadoop in the history of BDA. Based on these studies, we show the history of Big Data, Hadoop, and its ecosystem in Fig. 1.

Undoubtedly, there will be many different views based on different interpretations of BDA. This will inevitably lead to many debates about Big Data's implications, both pros and cons.

1.2.2 DEBATES OF BIG DATA IMPLICATIONS

Pros

There have been many debates regarding Big Data's pros and cons during the past few years. Many advocates declare Big Data to be a new rock star for innovation, competition, and productivity, because data is embedded in the modern human being's life. Data generated every second by both machines and humans is a byproduct of all other activities, and it will become the basis of new epistemologies. Mayer-Schönberger and Cukier argued that Big Data would revolutionize our way of thinking, working, and living. They believe that a massive quantitative data accumulation will lead to qualitative advances at the core of BDA: ML, parallelism, metadata, and predictions: "Big Data will be a source of new economic value and innovation." Their conclusion is that data can speak for itself, and we should let the data speak.

To a certain extent, Montjoye et al. reached a supportive conclusion. They demonstrated that it is highly probable (over 90% reliability) to reidentify a person with as few as four spatiotemporal data points (eg, credit card transactions in a shopping mall) by leveraging BDA. Their conclusion is that "large-scale data sets of human behavior have the potential to fundamentally transform the way we fight diseases, design cities and perform research."
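The intuition behind this reidentification result can be reproduced with a toy simulation. The sketch below is ours and deliberately simplistic (uniformly random visits, made-up population and resolution parameters); it is not the authors' methodology or data, but it illustrates why a handful of spatiotemporal points is usually enough to single out one trace among many:

import random

# Toy unicity experiment: how often do 4 spatiotemporal points
# (place, hour) single out exactly one person among N_PEOPLE?
random.seed(1)
N_PEOPLE, N_POINTS = 10_000, 50     # population size, points per trace
PLACES, HOURS = 500, 24 * 7         # 500 venues, one week at hourly resolution

traces = [
    frozenset((random.randrange(PLACES), random.randrange(HOURS))
              for _ in range(N_POINTS))
    for _ in range(N_PEOPLE)
]

TRIALS, unique = 200, 0
for _ in range(TRIALS):
    person = random.choice(traces)
    known = random.sample(sorted(person), 4)          # 4 leaked points
    matches = sum(all(p in t for p in known) for t in traces)
    unique += (matches == 1)                          # only the true trace fits

print(f"Re-identified uniquely in {unique / TRIALS:.0%} of trials")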


1997: NASA researchers Michael Cox and David Ellsworth publish the paper "The problem of big data"
1998: Google is founded
1999: The Apache Software Foundation (ASF) is established
2000: Doug Cutting launches his indexing search project, Lucene
2000: L. Page and S. Brin write the paper "The Anatomy of a Large-Scale Hypertextual Web Search Engine"
2001: The 3Vs — Doug Laney's paper "3D Data Management: Controlling Data Volume, Velocity and Variety"
2002: Doug Cutting and Mike Cafarella start Nutch, a subproject of Lucene for crawling websites
2003: Sanjay Ghemawat et al. publish "The Google File System" (GFS)
2003: Cutting and Cafarella adopt the GFS idea and create the Nutch Distributed File System (NDFS); it later becomes HDFS
2004: Google begins to develop BigTable
2004: Yonik Seeley creates Solr, a text-centric, read-dominant, document-oriented search engine with flexible schema
2004: Jeffrey Dean and Sanjay Ghemawat publish "MapReduce: Simplified Data Processing on Large Clusters"
2005: Nutch establishes Nutch MapReduce
2005: Damien Katz creates Apache CouchDB (Cluster Of Unreliable Commodity Hardware), formerly Lotus Notes
2006: Cutting and Cafarella start Hadoop as a subproject of Nutch
2006: Yahoo Research develops Apache Pig to run on Hadoop
2007: 10gen, a start-up company, works on a Platform as a Service (PaaS); it later becomes MongoDB
2007: The Taste project
2008: Apache Hive (extends SQL), HBase (manages data), and Cassandra (schema-free) emerge to support Hadoop
2008: Mahout, a subproject of Lucene, integrates Taste
2008: Hadoop becomes a top-level ASF project
2008: TUB and HPI initiate the Stratosphere project, which later becomes Apache Flink
2009: Hadoop combines HDFS and MapReduce, sorting 1 TB in 62 seconds over 1,460 nodes
2010: Google grants a license to ASF for Hadoop
2010: Apache Spark, a cluster computing platform, extends MapReduce with in-memory primitives
2011: Apache Storm is launched as a distributed computation framework for data streams
2012: Apache Drill, a schema-free SQL query engine for Hadoop, NoSQL, and cloud storage
2012: Phase 3 of Hadoop — emergence of "Yet Another Resource Negotiator" (YARN), or Hadoop 2
2013: Mesos becomes a top-level Apache project
2014: Spark has more than 465 contributors, the most active ASF project
2015: Enter the zettabyte era

FIG. 1 A short history of Big Data.

   Cons

In contrast, some argue that Big Data is inconclusive, overstated, exaggerated, and misinformed by the media, and that data cannot speak for itself, no matter how big the data set is. It could be just another delusion, because "it is like having billions of monkeys typing, one of them will write Shakespeare," and one should "never judge a decision by its outcome — outcome bias." In other words, if one of the monkeys can type Shakespeare, we cannot conclude or infer that a monkey has sufficient intelligence to be Shakespeare.

Gary Drenik believed that the sentiment around the overeager adoption of Big Data is more like "Extraordinary Popular Delusions and the Madness of Crowds," the description made by Charles Mackay in his famous book's title. Psychologically, it is a kind of crowd emotion that seems to have a perpetual feedback loop. Drenik quoted this "madness" with Mackay's warning: "We find that whole communities suddenly fix their minds upon one subject, and go mad in its pursuit; that millions of people become simultaneously impressed with one delusion, and run after it, till their attention is caught by some new folly more captivating than the first." The issue Drenik noticed was that the hype had overtaken reality, leaving little time to think about Big Data. Former Obama campaign CTO Harper Reed had the real story in terms of the adoption of BDA: his remarks were that Big Data is "literally hard" and "expensive."

  

Danah Boyd et al. are quite skeptical regarding Big Data in terms of its volume. They argued that bigger data are not always better data from a social science perspective. In responding to "The End of Theory," Boyd asserted that theory and methodology are still highly relevant for today's statistical inference and that "The size of data should fit the research question being asked; in some cases, small is best." Boyd et al. suggested that we should not pay excessive attention to the volume of data. Philosophically, this argument is similar to the debate over John Stuart Mill's five classical methods of induction. Mill's critics argued that it is impossible to address an intelligent question merely by ingesting as much data as possible, without some theory or hypothesis. This means that we cannot make Big Data do the work of theory.

Another Big Data critique comes from David Lazer et al. They demonstrated that the Google Flu Trends (GFT) prediction is a parable and identified two issues (Big Data hubris and algorithm dynamics) that contributed to GFT's mistakes. The issue of "Big Data hubris" is that some observers believe that BDA can replace traditional data mining completely. The issue of "algorithm dynamics" refers to "the changes made by [Google's] engineers to improve the commercial service and by consumers in using that service." In other words, changes to the search algorithms directly impact users' behavior, so the collected data is shaped by deliberate algorithmic changes. Lazer concluded that there are many traps in BDA, especially for social media research; their conclusion was that "we are far from a place where they (BDA) can supplant more traditional methods or theories."

All these multiple views are due to different interpretations of Big Data and different implementations of BDA. This suggests that in order to resolve these issues, we should first clarify the definition of the term BDA and then locate the points of disagreement under the same term.

1.3 HISTORICAL INTERPRETATION OF BIG DATA

  1.3.1 METHODOLOGY FOR DEFINING BIG DATA

Intuitively, neither yesterday's data volume (absolute size) nor that of today can be defined as "big." Moreover, today's "big" may become tomorrow's "small." In order to clarify the term Big Data precisely and settle the debate, we can investigate and understand the functions of a definition based on the methodology of definition shown in Fig. 2.

Based on Baird's or Copi's approach to definition, we will first investigate the historical definition from an evolutionary perspective (lexical meaning). Then, we extend the term from 3Vs to 9Vs, or 3^2 Vs, based on its motivation (stipulative meaning), which adds more attributes to the term. Finally, we will eliminate the ambiguity and vagueness of the term and make the concept more precise and meaningful.

1.3.2 DIFFERENT ATTRIBUTES OF DEFINITIONS

Gartner — 3Vs definition

Since 1997, many attributes have been added to Big Data. Among these attributes, three are the most popular and have been widely cited and adopted. The first is the so-called Gartner interpretation.


According to Robert Baird, definitions can be lexical, functional/stipulative, real, or essential-intuitive; according to Irving M. Copi, they can be lexical, stipulative, précising, theoretical, or persuasive.

FIG. 2 Methodology of definition.

This interpretation originated from Douglas Laney's white paper published by the Meta Group, which Gartner subsequently acquired in 2004. Laney noticed that, due to the surge of e-commerce activities, data had grown along three dimensions, namely:

1. Volume, which means the incoming data stream and cumulative volume of data
2. Velocity, which represents the pace of data used to support interactions and generated by interactions
3. Variety, which signifies the variety of incompatible and inconsistent data formats and data structures

According to the history of the Big Data timeline, Laney's 3Vs definition has been widely regarded as the "common" attributes of Big Data, but he stopped short of assigning these attributes to the term "Big Data."

IBM — 4Vs definition

IBM added another attribute, or "V," for "Veracity," on top of Douglas Laney's 3Vs notation; this is known as the 4Vs of Big Data. It defines each "V" as follows:

1. Volume stands for the scale of data
2. Velocity denotes the analysis of streaming data
3. Variety indicates different forms of data
4. Veracity implies the uncertainty of data

Zikopoulos et al. explained the reason behind the additional "V," or veracity, dimension, which was added "in response to the quality and source issues our clients began facing with their Big Data initiatives." They are also aware that some analysts include other V-based descriptors for Big Data, such as variability and visibility.

Microsoft — 6Vs definition

For the sake of maximizing the business value, Microsoft extended Douglas Laney's 3Vs attributes to six Vs, adding variability, veracity, and visibility:

1. Volume stands for the scale of data
2. Velocity denotes the analysis of streaming data
3. Variety indicates different forms of data
4. Veracity focuses on the trustworthiness of data sources
5. Variability refers to the complexity of the data set; in comparison with "Variety" (or different data formats), it means the number of variables in the data set
6. Visibility emphasizes that you need a full picture of the data in order to make informed decisions

More Vs for Big Data

A 5Vs Big Data definition was also proposed by Yuri Demchenko, who added the value dimension to IBM's 4Vs definition. Since Douglas Laney first published the 3Vs in 2001, more and more "Vs" have been advocated, with as many as 11 proposed.

All these definitions, whether 3Vs, 4Vs, 5Vs, or even 11Vs, are primarily trying to articulate the aspects of data. Most of them are data-oriented definitions, but they fail to articulate Big Data clearly in relation to the essence of BDA. In order to understand the essential meaning, we have to clarify what data is.

Data is everything within the universe. This means that data is bounded only by the existing limits of technological capacity. If the technology capacity allows, there is no boundary or limitation to data.

FIG. 3 Big Data definitions compared: Douglas Laney's 3Vs (volume, velocity, variety), IBM's 4Vs (adding veracity), Yuri Demchenko's 5Vs (adding value), and Microsoft's 6Vs (adding veracity, variability, and visibility). Demchenko's diagram details each dimension: volume (transactions, tables/text/files, records/archived), velocity (stream data, batches, real time, processes, events), value (correlations, hypothetical, statistical), veracity (authenticity, trustworthiness, accountability, origin), and variety (structured, semi-structured, unstructured, multi-factors).


1.3.3 SUMMARY OF 7 TYPES OF DEFINITIONS OF BIG DATA

Big Data definitions