
  Big Data Management and Processing

  Big Data Series

  Edited by

  Kuan-Ching Li
  Guangzhou University, China
  Providence University, Taiwan

  Hai Jiang
  Arkansas State University, USA

  Albert Y. Zomaya
  University of Sydney, Australia

  CRC Press
  Taylor & Francis Group
  6000 Broken Sound Parkway NW, Suite 300
  Boca Raton, FL 33487-2742

  © 2017 by Taylor & Francis Group, LLC
  CRC Press is an imprint of Taylor & Francis Group, an Informa business

  No claim to original U.S. Government works
  Printed on acid-free paper
  International Standard Book Number-13: 978-1-4987-6807-8 (Hardback)

This book contains information obtained from authentic and highly regarded sources. Reasonable efforts have been made to publish reliable data and information, but the author and publisher cannot assume responsibility for the validity of all materials or the consequences of their use. The authors and publishers have attempted to trace the copyright holders of all material reproduced in this publication and apologize to copyright holders if permission to publish in this form has not been obtained. If any copyright material has not been acknowledged please write and let us know so we may rectify in any future reprint.

Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced, transmitted, or utilized in any form by any electronic, mechanical, or other means, now known or hereafter invented, including photocopying, microfilming, and recording, or in any information storage or retrieval system, without written permission from the publishers.

For permission to photocopy or use material electronically from this work, please contact the Copyright Clearance Center, Inc. (CCC), 222 Rosewood Drive, Danvers, MA 01923, 978-750-8400. CCC is a not-for-profit organization that provides licenses and registration for a variety of users. For organizations that have been granted a photocopy license by the CCC, a separate system of payment has been arranged.

  

Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and are used only for identification and explanation without intent to infringe.

Visit the Taylor & Francis Web site and the CRC Press Web site.

  Contents

  Paolo Balboni and Theodora Dragan

  Junwhan Kim, Roberto Palmieri, and Binoy Ravindran

  Guillaume Aupy, Anne Benoit, Loic Pottier, Padma Raghavan, Yves Robert, and Manu Shantharam

  Xiongpai Qin and Keqin Li

  Miyuru Dayarathna, Paul Fremantle, Srinath Perera, and Sriskandarajah Suhothayan

  Vito Giovanni Castellana, Antonino Tumeo, Marco Minutoli, Marco Lattuada, and Fabrizio Ferrandi

  Chapter 15 Complex Mining from Uncertain Big Data in Distributed Environments: Problems, Definitions, and Two Effective and Efficient Algorithms
  Alfredo Cuzzocrea, Carson Kai-Sang Leung, Fan Jiang, and Richard Kyle MacKinnon

  

  

  Foreword

  Big Data Management and Processing (edited by Li, Jiang, and Zomaya) is a state-of-the-art book that deals with a wide range of topical themes in the field of Big Data. The book, which probes many issues related to this exciting and rapidly growing field, covers processing, management, analytics, and applications.

  The many advances in Big Data research that we witness today stem from developments in algorithms, high-performance computing, databases, data mining, machine learning, and so on. These developments are discussed in this book. The book also showcases some of the interesting applications and technologies that are still evolving and that will lead to some serious breakthroughs in the coming few years.

  I believe that Big Data Management and Processing is a very valuable addition to the literature. It will serve as a source of up-to-date research in this continuously developing area. The book also provides an opportunity for researchers to explore the use of advanced computing technologies and their impact on enhancing our capabilities to conduct more sophisticated studies.

  I expect that Big Data Management and Processing will be well received by the research and development community. It should prove very beneficial for researchers and graduate students focusing on Big Data and will serve as a very useful reference for practitioners and application developers.

  Sartaj Sahni University of Florida

  

  Preface

  The scope of Big Data today spans many aspects and is not limited to main computing components (e.g., processors, storage devices, and visualization facilities) alone; it extends into a much larger range of issues related to management and policy. Also, “Big Data” can mean “Big Energy,” because of the pressure that data places on the variety of infrastructures needed to host, manage, and transport it. This in turn raises various monetary, environmental, and system performance concerns.

  Recent advances in software and hardware technologies have improved the handling of big data. However, many issues remain that are pertinent to the overloading caused by processing massive amounts of data, which calls for the development of various software and hardware solutions as well as new algorithms that are more capable of processing data.

  This book, Big Data Management and Processing, seeks to provide an opportunity for researchers to explore a range of big data-related issues and their impact on the design of new computing systems. The book is quite timely, since the field of big data computing as a whole is undergoing rapid changes on a daily basis. Vast literature exists today on such data processing paradigms and frameworks and their implications for a wide range of distributed platforms.

  The book is intended to be a virtual roundtable of several outstanding researchers that one might invite to attend a conference on big data computing systems. Of course, the list of topics explored here is by no means exhaustive, but most of the conclusions can be extended to other computing platforms that are not covered. A decision was made to limit the number of chapters while giving contributing authors more pages to express their ideas, so that the book remains manageable within a single volume.

  It is also hoped that the topics covered will get the readers to think of the implications of such new ideas on the developments in their own fields. The book endeavors to strike a balance between theoretical and practical coverage of innovative problem-solving techniques for a range of platforms. The book is intended to be a repository of paradigms, technologies, and applications that target the different facets of big data computing systems.

  The 21 chapters were carefully selected to provide a wide scope with minimal overlap between chapters. Each contributor was asked to cover review material as well as current developments in his or her chapter. In addition, the authors were chosen because they are leaders in their respective disciplines.

  

  Acknowledgments

  First and foremost we would like to thank and acknowledge the contributors to this volume for their support and patience, and the reviewers for their useful comments and suggestions that helped in improving the earlier outline of the book and presentation of the material. Also, we extend our deepest thanks to Randi Cohen from CRC Press (USA) for his collaboration, guidance, and most importantly, patience in finalizing this handbook. Finally, we would like to acknowledge the team from CRC Press’s production department for their extensive efforts during the many phases of this project and the timely fashion in which the book was produced.

  Editors

  Kuan-Ching Li is a professor with appointments at Guangzhou University, China and Providence University, Taiwan. He is a recipient of awards from Nvidia and support from a number of industrial companies. He has also received guest and distinguished chair professorships from universities in China and other countries. He has been actively involved in numerous conferences and workshops in program/general/steering conference chairman positions and as a program committee member, and has organized numerous conferences related to high-performance computing and computational science and engineering.

  Professor Li is the Editor-in-Chief of technical publications such as International Journal of Computational Science and Engineering (IJCSE), International Journal of Embedded Systems (IJES), and International Journal of High Performance Computing and Networking (IJHPCN), all published by Inderscience. He also serves as an editorial board member and a guest editor for a number of journals. In addition, he is the author or editor of several technical professional books published by CRC Press, Springer, McGraw-Hill, and IGI Global. His topics of interest include GPU/manycore computing, big data, and cloud. He is a Member of the AAAS, a Senior Member of the IEEE, and a Fellow of the IET.

  Hai Jiang is a professor in the Department of Computer Science at Arkansas State University, USA. He received his BS degree from Beijing University of Posts and Telecommunications, China, and his MA and PhD degrees from Wayne State University, Detroit, Michigan, USA. His current research interests include parallel and distributed systems, computer and network security, high-performance computing and communication, big data, and modeling and simulation. He has published one book and several research papers in major international journals and conference proceedings. He has served as a U.S. National Science Foundation proposal review panelist and a U.S. DoE (Department of Energy) Smart Grid Investment Grant (SGIG) reviewer multiple times.

  Professor Jiang serves as the executive editor of International Journal of High Performance Computing and Networking (IJHPCN). He is an editorial board member of International Journal of Big Data Intelligence (IJBDI), The Scientific World Journal (TSWJ), Open Journal of Internet of Things (OJIOT), and GSTF Journal on Social Computing (JSC) and a guest editor of IEEE Systems Journal, International Journal of Ad Hoc and Ubiquitous Computing, Cluster Computing, and The Scientific World Journal for multiple special issues. He has also served as a general or program chair for some major conferences/workshops (CSE, HPCC, ISPA, GPC, ScaleCom, ESCAPE, GPU-Cloud, FutureTech, GPUTA, FC, SGC). He has been involved in more than 90 conferences and workshops as a session chair or program committee member, including major conferences such as AINA, ICPP, IUCC, ICPADS, TrustCom, HPCC, GPC, EUC, ICIS, SNPD, TSP, PDSEC, SECRUPT, and ScalCom. He is a professional member of ACM and IEEE Computer Society and a representative of the U.S. NSF XSEDE (Extreme Science and Engineering Discovery Environment) Campus Champion for Arkansas State University.

  

  Albert Y. Zomaya is the chair professor of high-performance computing and networking in the School of Information Technologies, University of Sydney, Australia and also serves as the director of the Centre for Distributed and High Performance Computing. He has published more than 600 scientific papers and articles and is the author, coauthor, or editor of more than 20 books. He is the founding editor-in-chief of IEEE Transactions on Sustainable Computing and serves as an associate editor for more than 20 leading journals. He served as the editor-in-chief of IEEE Transactions on Computers from 2011 to 2014.


  Professor Zomaya is the recipient of the IEEE Technical Committee on Parallel Processing Outstanding Service Award (2011), the IEEE Technical Committee on Scalable Computing Medal for Excellence in Scalable Computing (2011), and the IEEE Computer Society Technical Achievement Award (2014). He is a chartered engineer and a fellow of AAAS, IEEE, and IET. His research interests are in the areas of parallel and distributed computing and complex systems.

  Contributors

  Syedmeysam Abolghasemi

  Min Chen

  University of Wollongong Wollongong, NSW, Australia

  Jianguo Chen

  College of Computer Science and Electronic Engineering

  Hunan University Changsha, Hunan, China

  Jinjun Chen

  Swinburne Data Science Research Institute Swinburne University of Technology Australia

  Department of Computer Science State University of New York New Paltz, New York

  Huaming Chen

  Alfredo Cuzzocrea

  DIA Department University of Trieste and ICAR-CNR Trieste, Italy

  Monica Ferreira da Silva

  The Tércio Pacitti Institute Federal University of Rio de Janeiro Brazil

  Miyuru Dayarathna WSO2 Inc.

  Mountain View, California

  Theodora Dragan

  School of Computing and Information Technology

  High Performance Computing Pacific Northwest National Laboratory Richland, Washington

  Department of Computer Science Old Dominion University Norfolk, Virginia

  Tilburg, The Netherlands and

  Antonio Juarez Alencar

  The Tércio Pacitti Institute Federal University of Rio de Janeiro Brazil

  Guillaume Aupy

  School of Engineering Vanderbilt University Nashville, Tennessee

  Paolo Balboni

  Tilburg Institute for Law, Technology, and Society

  ICT Legal Consulting Milan, Italy and European Privacy Association Brussels, Belgium

  Vito Giovanni Castellana

  Mauro Penha Bastos

  The Tércio Pacitti Institute Federal University of Rio de Janeiro Brazil

  Anne Benoit

  LIP, ENS Lyon Lyon, France

  Angelos Bilas

  Institute of Computer Science (ICS) Foundation for Research and Technology—

  Hellas (FORTH) and Department of Computer Science University of Crete, Greece

  European Privacy Association Brussels, Belgium

  Fabrizio Ferrandi

  Clarkson University Potsdam, New York

  Marco Lattuada

  Dipartimento di Elettronica, Informazione e Bioingegneria

  Politecnico di Milano Milano, Italy

  Carson Kai-Sang Leung

  Department of Computer Science University of Manitoba Winnipeg, MB, Canada

  Boyang Li

  Department of Electrical and Computer Engineering

  Kenli Li

  Junwhan Kim

  College of Computer Science and Electronic Engineering

  Hunan University and National Supercomputing Center in Changsha Changsha, Hunan, China

  Keqin Li

  College of Computer Science and Electronic Engineering

  Hunan University Changsha, Hunan, China and Department of Computer Science State University of New York New Paltz, New York

  Norman Lim

  Department of Systems and Computer Engineering

  CSIT University of the District of Columbia Washington, DC

  Department of Computer Science University of Manitoba Winnipeg, MB, Canada

  Dipartimento di Elettronica, Informazione e Bioingegneria

  Technology—Hellas (FORTH) Greece

  Politecnico di Milano Milano, Italy

  Ryan Florin

  Department of Computer Science Old Dominion University Norfolk, Virginia

  Paul Fremantle WSO2 Inc.

  Mountain View, California

  Pilar González-Férez

  Department of Computer Engineering Technology University of Murcia Murcia, Spain and Institute of Computer Science (ICS) Foundation for Research and

  Chonglin Gu

  Fan Jiang

  Department of Computer Science and Technology

  Harbin Institute of Technology Shenzhen, China

  Hejiao Huang

  Department of Computer Science and Technology

  Harbin Institute of Technology Shenzhen, China

  Xiaohua Jia

  Department of Computer Science and Technology

  Harbin Institute of Technology Shenzhen, China

  Carleton University Ottawa, ON, Canada

  Nam Ling

  Department of Computer Engineering Santa Clara University Santa Clara, California

  School of Computing and Communications

  Mountain View, California

  Florin Pop

  Department of Computer Science

  University Politehnica of Bucharest

  Bucharest, Romania

  Loic Pottier

  LIP, ENS Lyon Lyon, France

  Deepak Puthal

  University of Technology Sydney, Australia

  ECE, Virginia Tech Blacksburg, Virginia

  Xiongpai Qin

  Information School Renmin University of China Beijing, China

  Padma Raghavan

  School of Engineering Vanderbilt University Nashville, Tennessee

  Rajiv Ranjan

  School of Computing Science Newcastle University United Kingdom

  Binoy Ravindran

  ECE, Virginia Tech Blacksburg, Virginia

  Yves Robert

  Srinath Perera WSO2 Inc.

  Roberto Palmieri

  Chen Liu

  Shikharesh Majumdar

  Department of Electrical and Computer Engineering

  Clarkson University Potsdam, New York

  Yuhong Liu

  Department of Computer Engineering Santa Clara University Santa Clara, California

  Simone A. Ludwig

  Department of Computer Science North Dakota State University Fargo, North Dakota

  Richard Kyle MacKinnon

  Department of Computer Science University of Manitoba Winnipeg, MB, Canada

  Department of Systems and Computer Engineering

  Department of Computer Science Old Dominion University

  Carleton University Ottawa, Ontario, Canada

  Marco Minutoli

  High Performance Computing Pacific Northwest National

  Laboratory Richland, Washington

  Rim Moussa

  LaTICE Lab University of Tunis and ENICarthage Tunis, Tunisia

  Surya Nepal

  CSIRO Data61 Australia

  Stephan Olariu

  LIP, ENS Lyon Lyon, France and University of Tennessee

  Soror Sahri

  Chengwen Wu

  Mihaela-Andreea Vasile

  Department of Computer Science University Politehnica of Bucharest Bucharest, Romania

  Lei Wang

  School of Computing and Information Technology

  University of Wollongong Wollongong, NSW, Australia

  Yu Wang

  Department of Computer Engineering Santa Clara University Santa Clara, California

  Department of Computer Science and Technology

  High Performance Computing Pacific Northwest National

  Tsinghua University Beijing, China

  Aida Ghazi Zadeh

  Department of Computer Science Old Dominion University Norfolk, Virginia

  Guangyan Zhang

  Department of Computer Science and Technology

  Tsinghua University Beijing, China

  Weimin Zheng

  Department of Computer Science and Technology

  Laboratory Richland, Washington

  Antonino Tumeo

  LIPADE Lab University Rene Descartes Paris, France

  Jiangning Song

  Eber Assis Schmitz

  The Tércio Pacitti Institute Federal University of Rio de Janeiro Brazil

  Manu Shantharam

  Computational Research Scientist San Diego Supercomputer Center San Diego, California

  Jun Shen

  School of Computing and Information Technology

  University of Wollongong Wollongong, NSW, Australia

  Department of Biochemistry and Molecular Biology

  Hunan University Changsha, Hunan, China

  Monash University Clayton, Victoria, Australia

  Petros Sotirios Stefaneas

  Department of Mathematics School of Applied Mathematics and

  Physical Sciences National Technical University of Athens Athens, Greece

  Sriskandarajah Suhothayan WSO2 Inc.

  Mountain View, California

  Zhuo Tang

  College of Computer Science and Electronic Engineering

  Tsinghua University Beijing, China

  Paolo Balboni and Theodora Dragan

  CONTENTS

  Abstract
  1.1 Introduction
  1.1.1 Topic, Approach, and Methodology
  1.1.2 Structure and Arguments
  1.2 Business of Big Data
  1.2.1 Connection between Big Data and Personal Data
  1.2.1.1 Any Information
  1.2.1.2 Relating to
  1.2.1.3 Identified or Identifiable
  1.2.1.4 Natural Person
  1.2.2 Competition Aspects
  1.3 Reconciling Traditional and Modern Data Protection Principles
  1.3.1 Traditional Data Protection Principles
  1.3.1.1 Transparency
  1.3.1.2 Proportionality and Purpose Limitation
  1.3.2 Modern Data Protection Principles
  1.3.2.1 Accountability
  1.3.2.2 Privacy by Design and by Default
  1.3.2.3 Users’ Control of Their Own Data
  1.4 Conclusions and Recommendations

  ABSTRACT

  The overlap between big data and personal data is becoming increasingly relevant in today’s society, in light of the technological developments and, in particular, of the increased use of personal data as currency for purchasing “free” services. The global nature of big data, coupled with recently developed data analytics and the interest of companies in predicting trends and consumer preferences, makes it necessary to analyze how personal data and big data are connected. With a focus on the quality of data as a fundamental prerequisite for ensuring that outcomes are accurate and relevant, the authors explore the ways in which traditional and modern personal data protection principles apply to the big data context.

  It is not about the quantity of the data, but about the quality of it!

  All websites were last accessed on August 19, 2016.

1.1 INTRODUCTION

  It is 2016 and big data is everywhere: in the newspapers, on TV, in research papers, and on the lips of every IT specialist. This is not only due to its catchy name, but also due to the sheer quantity of data available: according to IBM, we create 2.5 quintillion (2.5 × 10^18) bytes of data every day. But what is the big deal with big data and, in particular, to what extent does it affect, or overlap with, personal data?

1.1.1 TOPIC, APPROACH, AND METHODOLOGY

  By way of introduction, the first step is to provide a definition of the concept that runs through this chapter. Various attempts at defining big data have been made in recent years, but no universal definition has been agreed upon yet. This is likely due to the constant evolution of this concept, which makes it difficult to describe without risking that the definition is either too generic or that it becomes inadequate within a short period of time.

  One attempt at a universal definition was made by Gartner, a leading information technology research and advisory company, which defines big data as “high-volume, high-velocity and/or high-variety information assets that demand cost-effective, innovative forms of information processing that enable enhanced insight, decision making, and process automation.” In this case, data are regarded as assets, which attaches an intrinsic value to them. On the other hand, the Article 29 Data Protection Working Party defines big data as “the exponential growth both in the availability and in the automated use of information: it refers to gigantic digital datasets held by corporations, governments [...].” This definition regards big data as a phenomenon composed of both the process of collecting information and the subsequent step of analyzing it. The common elements of the different definitions are therefore the size of the database and the analytical aspect, which together are expected to lead to better, more focused services and products, as well as more efficient business operations and more targeted approaches.

  Big data can be (and has been) used in an incredibly diverse range of situations. It was employed to help athletes of Great Britain’s rowing team achieve superior performance levels at the 2016 Olympic Games in Rio de Janeiro, by analyzing relevant information about their predecessors’ performance. Predictive analytics were used in order to deal with traffic in highly congested cities, paving the way [...]. Further, big data can have a great impact on medical sciences, and has already helped boost obesity research results by enabling researchers to identify links between obesity and depression that were previously unknown.

  Although big data does not always consist of personal data and could, for example, relate to technical information or to information about objects or natural phenomena, the European Data Protection Supervisor (EDPS) pointed out in its Opinion 7/2015 that “one of the greatest values of big data for businesses and governments is derived from the monitoring of human behaviour, collectively and individually.” Analyzing and predicting human behavior enables decision makers in many areas to make decisions that are more accurate, consistent, and economical, thereby enhancing the efficiency of society as a whole. A few fields of application that immediately come to mind when thinking of big data analytics based on personal data are university admissions, job recruitment, customer profiling, targeted marketing, or health services. Analyzing the information about millions of previous applicants, candidates, customers, or patients makes it easy to establish common threads and to predict all sorts of things, such as whether a specific person is fit for the job or is likely to develop a certain disease in the future.

  IBM—What Is Big Data? 2016. IBM—Bringing Big Data to the Enterprise.

  What Is Big Data?—Gartner IT Glossary—Big Data. 2012. Gartner IT Glossary.

  Article 29 Data Protection Working Party. 2013. Opinion 03/2013 on Purpose Limitation.

  Marr, Bernard. 2016. How Can Big Data and Analytics Help Athletes Win Olympic Gold in Rio 2016? Forbes.com.

  Toesland, Finbarr. 2016. Smart-from-the-Start Cities Is the Way Forward. Raconteur.

  Big Data Boosts Obesity Research Results Big Data

  An interesting study was recently conducted by the University of Cambridge Psychometrics Centre: by analyzing the social networking “likes” of 58,000 users, researchers found that they were able to predict ethnic origin with an accuracy of 95% and religious or political orientation with an accuracy [...]. Even more dramatically perhaps, they were able to predict psychological traits such as intelligence or emotional stability. The research was conducted using openly available data provided by the study subjects themselves (Facebook likes). Its results can be fine-tuned even further by cross-referencing them with data about the same subjects drawn from other sources, such as other social networking profiles or Internet usage habits. This is the point where big data starts overlapping with personal data, separated only by a blurry border: “liking” a specific rock band does not constitute personal data as such, but the ability to link this information directly to an individual or to other information makes it possible to identify what the person actually likes; furthermore, it enables inferences to be drawn about their personality, possibly revealing even sensitive political or religious preferences (as was the case in the Cambridge study). “Companies may consider most of their data to be non personal data sets, but in reality it is now rare for data generated by user activity to be completely and irreversibly anonymised,” stated the EDPS in a recent Opinion. The availability of massive amounts of data from different sources, combined with the desire to learn more about people’s habits, therefore poses a serious challenge to the individual’s right to privacy and requires that the data protection principles be carefully taken into consideration.

  A fundamental part of big data analytics, however, is that the raw data must be accurate in order to lead to accurate results; massive quantities of inaccurate data can lead to skewed results and poor decision making. Bruce Schneier, an internationally renowned security technologist, refers to this as the “pollution problem of the information age.” There is a risk that analytical applications find patterns in cases where the individual facts are not directly correlated, which may lead to unfair conclusions and may adversely affect the persons involved. Another risk is that of being trapped in an “information bubble,” with people only being shown certain information that has been predicted to be of interest to them (but may not be in reality). In an article published in 2015 by TIME magazine, Facebook’s newsfeed algorithm was explained: whereas users have access to an average of 1,500 posts per day, they only see about 300 of them, which have been preselected by an algorithm in order to correspond as much as possible with the interests and preferences of each user. The author of the article concludes that “by structuring the environment, Facebook is training people implicitly to behave in a particular way in that algorithmic environment.” Therefore, data quality is paramount to ensuring that the algorithms and analytical procedures are carried out successfully and that the predicted results correspond to reality.

  European Data Protection Supervisor. 2015. Opinion 7/2015—Meeting the Challenges of Big Data: A Call for Transparency, User Control, Data Protection by Design and Accountability.

  Kosinski, M., D. Stillwell, and T. Graepel. 2013. Private Traits and Attributes Are Predictable from Digital Records of Human Behavior. Proceedings of the National Academy of Sciences 110 (15): 5802–5805. doi: 10.1073/pnas.1218772110.

  European Data Protection Supervisor. 2014. Preliminary Opinion of the European Data Protection Supervisor. Privacy and Competitiveness in the Age of Big Data: The Interplay between Data Protection, Competition Law and Consumer Protection in the Digital Economy.

  Schneier, Bruce. 2015. Data and Goliath. New York: W.W. Norton.

  Here’s How Your Facebook News Feed Actually Works. 2015. TIME.Com.

  This chapter is aimed at analyzing the personal data protection legal compliance aspects of big data from a modern perspective, in order to identify the main challenges and to make adequate recommendations for the more efficient and lawful use of data as an asset. A few considerations are also made on the connection between big personal data analytics and competition law. The methodology is straightforward: the observations made throughout the chapter are based on the research conducted by regulatory and advisory bodies, as well as on the empirical research and practical experience of the authors. One of the chapter’s focal points is data quality. Owing to the nature of big data, raw data that are not of adequate quality (accurate, relevant, consistent, and complete) represent an obstacle in harnessing the value of the data. It is hoped that the chapter will enable the reader to gain a better understanding that sound legal compliance management can make a fundamental difference between simply collecting vast amounts of data, on the one hand, and effectively using the power of big data, on the other hand.
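  To make the quality dimensions named above (accurate, relevant, consistent, and complete) more concrete, the following is a minimal sketch, not the authors' method, of how records might be screened before they enter an analytics pipeline. The field names, the assumed plausible age range of 0 to 120 years, and the function names `quality_issues` and `split_by_quality` are hypothetical and chosen only for illustration. Relevance depends on the purpose of the processing and is deliberately not checked here.

```python
from datetime import date

# Hypothetical schema: the fields every record is expected to carry.
REQUIRED_FIELDS = ("user_id", "birth_year", "country", "signup_year")


def quality_issues(record, current_year=None):
    """Return a list of human-readable quality problems found in one record."""
    if current_year is None:
        current_year = date.today().year
    issues = []

    # Completeness: every required field must be present and non-empty.
    for field in REQUIRED_FIELDS:
        if record.get(field) in (None, ""):
            issues.append(f"incomplete: missing {field}")

    birth_year = record.get("birth_year")
    signup_year = record.get("signup_year")

    # Accuracy (crude plausibility check): assumed age range of 0 to 120 years.
    if isinstance(birth_year, int):
        age = current_year - birth_year
        if not 0 <= age <= 120:
            issues.append("inaccurate: implausible birth_year")

    # Consistency: a user cannot sign up before being born.
    if isinstance(birth_year, int) and isinstance(signup_year, int):
        if signup_year < birth_year:
            issues.append("inconsistent: signup_year precedes birth_year")

    return issues


def split_by_quality(records, current_year=None):
    """Partition records into (clean, rejected) lists."""
    clean, rejected = [], []
    for record in records:
        problems = quality_issues(record, current_year)
        (rejected if problems else clean).append(record)
    return clean, rejected
```

  In a sketch like this, rejected records would typically be corrected or set aside rather than silently analyzed, which reflects the point that large volumes of inaccurate data can skew results.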

1.1.2 STRUCTURE AND ARGUMENTS

  This chapter is organized into two main sections. The first addresses the personal data aspects of big data from a business perspective and is aimed at identifying the benefits and challenges of using big data analytics on massive personal datasets. The second deals in detail with how the traditional data protection principles should be applied to big data analytics, while also tackling modern data protection principles. Overall, the chapter aims to serve as a good basis for understanding both the positive and the negative implications of deploying big data analytics on personal datasets. In addition, the chapter will focus on the importance of the quality of the data analyzed, on the different ways in which good levels of data quality can be achieved, and on the negative consequences that may ensue when they are not.

1.2 BUSINESS OF BIG DATA

  It is by now clear: big data means big business. Data are frequently called “the oil of the 21st century” or “the fuel of the digital economy,” and the era we live in has been referred to as the “data gold rush” by Neelie Kroes, the vice president of the European Commission responsible for the Digital Agenda. This is true not only at the theoretical level but also in practice. A report by the leading consulting firm McKinsey found that “the intensity of big data varies across sectors but has reached critical mass in every sector” and that “we are on the cusp of a tremendous wave of innovation, productivity, and growth, as well as new modes of competition and value capture—all driven by big data as consumers, companies, and economic sectors exploit its potential.”

  With so much importance being given to data, it is not surprising that new business models are emerging, companies are being created, and apps and games are being designed with data collection as one of the main purposes. The most recent and compelling example is that of the Pokémon Go mobile game, which was designed to allow users to collect characters in specific places around the world. Niantic Labs, the developer of the game, which went viral within a couple of weeks, has access to data about the whereabouts of players, their connections, and other data such as area, climate, time of day, and so on. It collects data from roughly 9.5 million daily active users.

  European Commission—Press Release—Speech: The Data Gold Rush. 2014. Europa.Eu. .

McKinsey Global Institute. 2011. Big Data: The Next Frontier for Innovation, Competition, and Productivity.

See Hautala, Laura. 2016. Pokemon Go: Gotta Catch All Your Personal Data. CNET.

  This is a clear example of how apps and games are starting to develop around the business of data, but also of how the data can be collected in “fun” ways without the users necessarily being aware of how and what data are collected.

1.2.1 CONNECTION BETWEEN BIG DATA AND PERSONAL DATA

  The business of big data requires conducting a careful balancing exercise between the importance of harvesting the value of the data to foster innovation and evolution on the one hand, and the powerful impact that big data can have on many business sectors on the other hand. The manner in which personal data are collected and subsequently analyzed affects competition policy, antitrust policy, and consumer protection. In a paper published by the World Economic Forum, attention has been drawn to the fact that, “as ecosystem players look to use (mobile-generated) data, they face concerns about violating user trust, rights of expression, and confidentiality.” Big data and business are very much intertwined, and even more so when the big data in question is personal data, in particular because “for many online offerings which are presented or perceived as being ‘free’, personal information operates as a sort of indispensable currency used to pay for those services: ‘free’ online services are ‘paid for’ using personal data which have been valued in total at over EUR 300 billion and have been forecast to treble by 2020.”

  The concept of personal data is defined by Regulation (EU) 2016/679 as “any information relating to an identified or identifiable natural person (‘data subject’); an identifiable natural person is one who can be identified, directly or indirectly, in particular by reference to an identifier such as a name, an identification number, location data, an online identifier or to one or more factors specific to the physical, physiological, genetic, mental, economic, cultural or social identity of that natural person.”

  While the list of factors specific to the identity of the person has been expanded compared with the previous definition of personal data contained in Directive 95/46/EC, the main elements remain the same. These elements have been discussed and elaborated by the Article 29 Working Party in its Opinion 4/2007, which sets out four fundamental elements for determining whether a piece of information is to be considered personal data.

  According to the Opinion, these elements are: “any information,” “relating to,” “identified or identifiable,” and “natural person.”

1.2.1.1 Any Information

  All information relevant to a person is included, regardless of the “position or capacity of those persons (as consumer, patient, employee, customer, etc.).” In this case, the information can be objective or subjective and does not necessarily have to be true or proven.

  Wagner, Kurt. 2016. How Many People Are Actually Playing Pokémon Go? Recode.

World Economic Forum. 2012. Big Data, Big Impact: New Possibilities for International Development.

European Data Protection Supervisor. 2014. Preliminary Opinion of the European Data Protection Supervisor. Privacy and Competitiveness in the Age of Big Data: The Interplay between Data Protection, Competition Law and Consumer Protection in the Digital Economy.

Article 4(1), Regulation (EU) 2016/679 of the European Parliament and of the Council of 27 April 2016 on the protection of natural persons with regard to the processing of personal data and on the free movement of such data, and repealing Directive 95/46/EC (General Data Protection Regulation), Official Journal of the European Union, L 119/3, 4/5/2016.

Article 29 Data Protection Working Party. 2007. Opinion 4/2007 on the Concept of Personal Data.

Idem, p. 7.


  The words “any information” also cover information in any form: audio, text, video, images, and so on. Importantly, the manner in which the information is stored is irrelevant. The Working Party expressly addresses biometric data, as such data can be considered both as information content and as a link between the individual and the information. Because biometric data are unique to an individual, they can also be used as an identifier.

  1.2.1.2 Relating to

  Information related to an individual is information about that individual. The relationship between data and an individual is often self-evident, for example when the data are stored in an individual employee’s file or in a medical record. This is, however, not always the case, especially when the information regards objects. Such objects belong to individuals, but additional meanings

  At least one of the following three elements should be present in order to consider information to be related to an individual: “content,” “purpose,” or “result.” An element of “content” is present when the information refers to an individual, regardless of the (intended) use of the information. The “purpose” element instead refers to whether the information is used or is likely to be used “with the purpose to evaluate, treat in a certain way or influence the status or behavior of an individual.” A “result” element is present when the use of the data is likely to have an impact on a certain person’s rights and interests. These elements are alternative rather than cumulative, which means that one piece of data can relate to different individuals on the basis of different elements.
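
  The alternative (non-cumulative) nature of the three elements can be illustrated with a minimal sketch. It is not part of Opinion 4/2007: the class name, the function names, and the example parties are hypothetical, and in practice each element is the outcome of a legal assessment rather than a simple flag. The sketch only shows that a single element is enough, and that the same piece of data may relate to several individuals on different grounds.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class RelationAssessment:
    """Hypothetical record of the three elements, assessed for one individual."""
    content: bool  # the information refers to the individual
    purpose: bool  # it is used, or likely to be used, to evaluate or influence them
    result: bool   # its use is likely to have an impact on them


def relates_to(assessment: RelationAssessment) -> bool:
    # The elements are alternative, not cumulative: any one of them suffices.
    return assessment.content or assessment.purpose or assessment.result


def related_individuals(assessments: Dict[str, RelationAssessment]) -> List[str]:
    # One piece of data can relate to several individuals, each on a different element.
    return sorted(name for name, a in assessments.items() if relates_to(a))


# Hypothetical example: one service record about an object, assessed for three people.
service_record = {
    "owner": RelationAssessment(content=True, purpose=False, result=False),
    "mechanic": RelationAssessment(content=False, purpose=True, result=False),
    "bystander": RelationAssessment(content=False, purpose=False, result=False),
}
print(related_individuals(service_record))
```

  In this hypothetical case, the record relates to the owner through the “content” element and to the mechanic through the “purpose” element, while it does not relate to the bystander at all.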

  1.2.1.3 Identified or Identifiable

  “A natural person can be ‘identified’ when, within a group of persons, he or she is ‘distinguished’ from all other members of the group.” When identification has not occurred but is possible, the individual is considered to be “identifiable.”

  In order to determine whether those with access to the data are able to identify the individual, all reasonable means likely to be used either by the controller or by any other person should be taken into consideration. The cost of identification, the intended purpose, the way the processing is structured, the advantage expected by the data controller, the interest at stake for the data subjects, and the risk of organizational dysfunctions and technical failures should be taken into account in the evaluation.

  1.2.1.4 Natural Person

  Directive 95/46/EC is applicable to the personal data of natural persons, a broad concept that calls for protection wholly independent from the residence or nationality of the data subject.