Big Data Management and Processing
Edited by Kuan-Ching Li, Hai Jiang, and Albert Y. Zomaya
Big Data Series
Edited by
Kuan-Ching Li, Guangzhou University, China, and Providence University, Taiwan
Hai Jiang, Arkansas State University, USA
Albert Y. Zomaya, University of Sydney, Australia

CRC Press, Taylor & Francis Group
6000 Broken Sound Parkway NW, Suite 300
Boca Raton, FL 33487-2742
© 2017 by Taylor & Francis Group, LLC
CRC Press is an imprint of Taylor & Francis Group, an Informa business
No claim to original U.S. Government works
Printed on acid-free paper
International Standard Book Number-13: 978-1-4987-6807-8 (Hardback)
This book contains information obtained from authentic and highly regarded sources. Reasonable efforts have been made to publish reliable data and information, but the author and publisher cannot assume responsibility for the validity of all materials or the consequences of their use. The authors and publishers have attempted to trace the copyright holders of all material reproduced in this publication and apologize to copyright holders if permission to publish in this form has not been obtained. If any copyright material has not been acknowledged please write and let us know so we may rectify in any future reprint.

Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced, transmitted, or utilized in any form by any electronic, mechanical, or other means, now known or hereafter invented, including photocopying, microfilming, and recording, or in any information storage or retrieval system, without written permission from the publishers.

For permission to photocopy or use material electronically from this work, please contact the Copyright Clearance Center, Inc. (CCC), 222 Rosewood Drive, Danvers, MA 01923, 978-750-8400. CCC is a not-for-profit organization that provides licenses and registration for a variety of users. For organizations that have been granted a photocopy license by the CCC, a separate system of payment has been arranged.

Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and are used only for identification and explanation without intent to infringe.

Visit the Taylor & Francis Web site and the CRC Press Web site.

Contents
Paolo Balboni and Theodora Dragan
Junwhan Kim, Roberto Palmieri, and Binoy Ravindran
Guillaume Aupy, Anne Benoit, Loic Pottier, Padma Raghavan, Yves Robert, and Manu Shantharam
Xiongpai Qin and Keqin Li
Miyuru Dayarathna, Paul Fremantle, Srinath Perera, and Sriskandarajah Suhothayan
Vito Giovanni Castellana, Antonino Tumeo, Marco Minutoli, Marco Lattuada, and Fabrizio Ferrandi
Chapter 15 Complex Mining from Uncertain Big Data in Distributed Environments: Problems, Definitions, and Two Effective and Efficient Algorithms
Alfredo Cuzzocrea, Carson Kai-Sang Leung, Fan Jiang, and Richard Kyle MacKinnon
Big Data Management and Processing (edited by Li, Jiang, and Zomaya) is a state-of-the-art book that deals with a wide range of topical themes in the field of Big Data. The book, which probes many issues related to this exciting and rapidly growing field, covers processing, management, analytics, and applications.
The many advances in Big Data research that we witness today stem from developments in algorithms, high-performance computing, databases, data mining, machine learning, and so on. These developments are discussed in this book. The book also showcases some of the interesting applications and technologies that are still evolving and that will lead to some serious breakthroughs in the coming few years.
I believe that Big Data Management and Processing is a very valuable addition to the literature. It will serve as a source of up-to-date research in this continuously developing area. The book also provides an opportunity for researchers to explore the use of advanced computing technologies and their impact on enhancing our capabilities to conduct more sophisticated studies.
I expect that Big Data Management and Processing will be well received by the research and development community. It should prove very beneficial for researchers and graduate students focusing on Big Data and will serve as a very useful reference for practitioners and application developers.
Sartaj Sahni
University of Florida
The scope of Big Data today spans many aspects and it is not limited to main computing components (e.g., processors, storage devices, and visualization facilities) alone, but it expands into a much larger range of issues related to management and policy. Also, “Big Data” can mean “Big Energy,” because of the pressure that data places on a variety of infrastructures needed to host, manage, and transport data. This in turn raises various monetary, environmental, and system performance concerns.
Recent advances in software and hardware technologies have improved the handling of big data. However, many issues remain that stem from the overload caused by processing massive amounts of data. This calls for new software and hardware solutions, as well as new algorithms that are more capable of processing data.
This book, Big Data Management and Processing, seeks to provide an opportunity for researchers to explore a range of big data-related issues and their impact on the design of new computing systems. The book is quite timely, since the field of big data computing as a whole is undergoing rapid changes on a daily basis. Vast literature exists today on such data processing paradigms and frameworks and their implications for a wide range of distributed platforms.
The book is intended to be a virtual roundtable of several outstanding researchers that one might invite to attend a conference on big data computing systems. Of course, the list of topics explored here is by no means exhaustive, but most of the conclusions provided here should extend to other computing platforms that are not covered. The decision was made to limit the number of chapters while providing more pages for contributing authors to express their ideas, so that the book remains manageable within a single volume.
It is also hoped that the topics covered will get the readers to think of the implications of such new ideas on the developments in their own fields. The book endeavors to strike a balance between theoretical and practical coverage of innovative problem-solving techniques for a range of platforms. The book is intended to be a repository of paradigms, technologies, and applications that target the different facets of big data computing systems.
The 21 chapters were carefully selected to provide a wide scope with minimal overlap between chapters. Each contributor was asked to cover review material as well as current developments. In addition, the authors were chosen because they are leaders in their respective disciplines.
First and foremost we would like to thank and acknowledge the contributors to this volume for their support and patience, and the reviewers for their useful comments and suggestions that helped in improving the earlier outline of the book and presentation of the material. Also, we extend our deepest thanks to Randi Cohen from CRC Press (USA) for his collaboration, guidance, and most importantly, patience in finalizing this handbook. Finally, we would like to acknowledge the efforts of the team from CRC Press’s production department for their extensive efforts during the many phases of this project and the timely fashion in which the book was produced.
Kuan-Ching Li is a professor with appointments at the Guangzhou University, China and Providence University, Taiwan. He is a recipient of awards from Nvidia and support from a number of industrial companies. He has also received guest and distinguished chair professorships from universities in China and other countries. He has been actively involved in numerous conferences and workshops in program/general/steering conference chairman positions and as a program committee member, and has organized numerous conferences related to high-performance computing and computational science and engineering.
Professor Li is the Editor-in-Chief of technical publications such as International Journal of Computational Science and Engineering (IJCSE), International Journal of Embedded Systems (IJES), and International Journal of High Performance Computing and Networking (IJHPCN), all published by Inderscience. He also serves as an editorial board member and a guest editor for a number of journals. In addition, he is the author or editor of several technical professional books published by CRC Press, Springer, McGraw-Hill, and IGI Global. His topics of interest include GPU/manycore computing, big data, and cloud. He is a Member of the AAAS, a Senior Member of the IEEE, and a Fellow of the IET.
Hai Jiang is a professor in the Department of Computer Science at Arkansas State University, USA. He received his BS degree from Beijing University of Posts and Telecommunications, China, and his MA and PhD degrees from Wayne State University, Detroit, Michigan, USA. His current research interests include parallel and distributed systems, computer and network security, high-performance computing and communication, big data, and modeling and simulation. He has published one book and several research papers in major international journals and conference proceedings. He has served as a U.S. National Science Foundation proposal review panelist and a U.S. DoE (Department of Energy) Smart Grid Investment Grant (SGIG) reviewer multiple times.
Professor Jiang serves as the executive editor of International Journal of High Performance Computing and Networking (IJHPCN). He is an editorial board member of International Journal of Big Data Intelligence (IJBDI), The Scientific World Journal (TSWJ), Open Journal of Internet of Things (OJIOT), and GSTF Journal on Social Computing (JSC) and a guest editor of IEEE Systems Journal, International Journal of Ad Hoc and Ubiquitous Computing, Cluster Computing, and The Scientific World Journal for multiple special issues. He has also served as a general or program chair for some major conferences/workshops (CSE, HPCC, ISPA, GPC, ScaleCom, ESCAPE, GPU-Cloud, FutureTech, GPUTA, FC, SGC). He has been involved in more than 90 conferences and workshops as a session chair or program committee member, including major conferences such as AINA, ICPP, IUCC, ICPADS, TrustCom, HPCC, GPC, EUC, ICIS, SNPD, TSP, PDSEC, SECRUPT, and ScalCom. He is a professional member of ACM and IEEE Computer Society and a representative of the U.S. NSF XSEDE (Extreme Science and Engineering Discovery Environment) Campus Champion for Arkansas State University.
Albert Y. Zomaya is the chair professor of high-performance computing and networking in the School of Information Technologies, University of Sydney, Australia and also serves as the director of the Centre for Distributed and High Performance Computing. He has published more than 600 scientific papers and articles and is the author, coauthor, or editor of more than 20 books. He is the founding editor-in-chief of IEEE Transactions on Sustainable Computing and serves as an associate editor for more than 20 leading journals. He served as the editor-in-chief of IEEE Transactions on Computers from 2011 to 2014.
Professor Zomaya is the recipient of the IEEE Technical Committee on Parallel Processing Outstanding Service Award (2011), the IEEE Technical Committee on Scalable Computing Medal for Excellence in Scalable Computing (2011), and the IEEE Computer Society Technical Achievement Award (2014). He is a chartered engineer and a fellow of AAAS, IEEE, and IET. His research interests are in the areas of parallel and distributed computing and complex systems.
Syedmeysam Abolghasemi
Min Chen
University of Wollongong Wollongong, NSW, Australia
Jianguo Chen
College of Computer Science and Electronic Engineering
Hunan University Changsha, Hunan, China
Jinjun Chen
Swinburne Data Science Research Institute Swinburne University of Technology Australia
Department of Computer Science State University of New York New Paltz, New York
Huaming Chen
Alfredo Cuzzocrea
DIA Department University of Trieste and ICAR-CNR Trieste, Italy
Monica Ferreira da Silva
The Tércio Pacitti Institute Federal University of Rio de Janeiro Brazil
Miyuru Dayarathna WSO2 Inc.
Mountain View, California
Theodora Dragan
School of Computing and Information Technology
High Performance Computing Pacific Northwest National Laboratory Richland, Washington
Department of Computer Science Old Dominion University Norfolk, Virginia
Tilburg, The Netherlands and
Antonio Juarez Alencar
The Tércio Pacitti Institute Federal University of Rio de Janeiro Brazil
Guillaume Aupy
School of Engineering Vanderbilt University Nashville, Tennessee
Paolo Balboni
Tilburg Institute for Law, Technology, and Society
ICT Legal Consulting Milan, Italy and European Privacy Association Brussels, Belgium
Vito Giovanni Castellana
Mauro Penha Bastos
The Tércio Pacitti Institute Federal University of Rio de Janeiro Brazil
Anne Benoit
LIP, ENS Lyon Lyon, France
Angelos Bilas
Institute of Computer Science (ICS) Foundation for Research and Technology—
Hellas (FORTH) and Department of Computer Science University of Crete, Greece
European Privacy Association Brussels, Belgium
Contributors Fabrizio Ferrandi
Clarkson University Potsdam, New York
Marco Lattuada
Dipartimento di Elettronica, Informazione e Bioingegneria
Politecnico di Milano Milano, Italy
Carson Kai-Sang Leung
Department of Computer Science University of Manitoba Winnipeg, MB, Canada
Boyang Li
Department of Electrical and Computer Engineering
Kenli Li
Junwhan Kim
College of Computer Science and Electronic Engineering
Hunan University and National Supercomputing Center in Changsha Changsha, Hunan, China
Keqin Li
College of Computer Science and Electronic Engineering
Hunan University Changsha, Hunan, China and Department of Computer Science State University of New York New Paltz, New York
Norman Lim
Department of Systems and Computer Engineering
CSIT University of the District of Columbia Washington, DC
Department of Computer Science University of Manitoba Winnipeg, MB, Canada
Dipartimento di Elettronica, Informazione e Bioingegneria
Technology—Hellas (FORTH) Greece
Politecnico di Milano Milano, Italy
Ryan Florin
Department of Computer Science Old Dominion University Norfolk, Virginia
Paul Fremantle WSO2 Inc.
Mountain View, California
Pilar González-Férez
Department of Computer Engineering Technology University of Murcia Murcia, Spain and Institute of Computer Science (ICS) Foundation for Research and
Chonglin Gu
Fan Jiang
Department of Computer Science and Technology
Harbin Institute of Technology Shenzhen, China
Hejiao Huang
Department of Computer Science and Technology
Harbin Institute of Technology Shenzhen, China
Xiaohua Jia
Department of Computer Science and Technology
Harbin Institute of Technology Shenzhen, China
Carleton University Ottawa, ON, Canada
Nam Ling
Department of Computer Engineering Santa Clara University Santa Clara, California
School of Computing and Communications
Mountain View, California
Florin Pop
Department of Computer Science
University Politehnica of Bucharest
Bucharest, Romania
Loic Pottier
LIP, ENS Lyon Lyon, France
Deepak Puthal
University of Technology Sydney, Australia
ECE, Virginia Tech Blacksburg, Virginia
Xiongpai Qin
Information School Renmin University of China Beijing, China
Padma Raghavan
School of Engineering Vanderbilt University Nashville, Tennessee
Rajiv Ranjan
School of Computing Science Newcastle University United Kingdom
Binoy Ravindran
ECE, Virginia Tech Blacksburg, Virginia
Yves Robert
Srinath Perera WSO2 Inc.
Roberto Palmieri
Chen Liu
Shikharesh Majumdar
Department of Electrical and Computer Engineering
Clarkson University Potsdam, New York
Yuhong Liu
Department of Computer Engineering Santa Clara University Santa Clara, California
Simone A. Ludwig
Department of Computer Science North Dakota State University Fargo, North Dakota
Richard Kyle MacKinnon
Department of Computer Science University of Manitoba Winnipeg, MB, Canada
Department of Systems and Computer Engineering
Department of Computer Science Old Dominion University
Carleton University Ottawa, Ontario, Canada
Marco Minutoli
High Performance Computing Pacific Northwest National
Laboratory Richland, Washington
Rim Moussa
LaTICE Lab University of Tunis and ENICarthage Tunis, Tunisia
Surya Nepal
CSIRO Data61 Australia
Stephan Olariu
LIP, ENS Lyon Lyon, France and University of Tennessee
Soror Sahri
Chengwen Wu
Mihaela-Andreea Vasile
Department of Computer Science University Politehnica of Bucharest Bucharest, Romania
Lei Wang
School of Computing and Information Technology
University of Wollongong Wollongong, NSW, Australia
Yu Wang
Department of Computer Engineering Santa Clara University Santa Clara, California
Department of Computer Science and Technology
High Performance Computing Pacific Northwest National
Tsinghua University Beijing, China
Aida Ghazi Zadeh
Department of Computer Science Old Dominion University Norfolk, Virginia
Guangyan Zhang
Department of Computer Science and Technology
Tsinghua University Beijing, China
Weimin Zheng
Department of Computer Science and Technology
Laboratory Richland, Washington
Antonino Tumeo
LIPADE Lab University Rene Descartes Paris, France
Jiangning Song
Eber Assis Schmitz
The Tércio Pacitti Institute Federal University of Rio de Janeiro Brazil
Manu Shantharam
Computational Research Scientist San Diego Supercomputer Center San Diego, California
Jun Shen
School of Computing and Information Technology
University of Wollongong Wollongong, NSW, Australia
Department of Biochemistry and Molecular Biology
Hunan University Changsha, Hunan, China
Monash University Clayton, Victoria, Australia
Petros Sotirios Stefaneas
Department of Mathematics School of Applied Mathematics and
Physical Sciences National Technical University of Athens Athens, Greece
Sriskandarajah Suhothayan WSO2 Inc.
Mountain View, California
Zhuo Tang
College of Computer Science and Electronic Engineering
Tsinghua University Beijing, China
Paolo Balboni and Theodora Dragan
CONTENTS
Abstract
1.1 Introduction
1.1.1 Topic, Approach, and Methodology
1.1.2 Structure and Arguments
1.2 Business of Big Data
1.2.1 Connection between Big Data and Personal Data
1.2.1.1 Any Information
1.2.1.2 Relating to
1.2.1.3 Identified or Identifiable
1.2.1.4 Natural Person
1.2.2 Competition Aspects
1.3 Reconciling Traditional and Modern Data Protection Principles
1.3.1 Traditional Data Protection Principles
1.3.1.1 Transparency
1.3.1.2 Proportionality and Purpose Limitation
1.3.2 Modern Data Protection Principles
1.3.2.1 Accountability
1.3.2.2 Privacy by Design and by Default
1.3.2.3 Users' Control of Their Own Data
1.4 Conclusions and Recommendations

ABSTRACT
The overlap between big data and personal data is becoming increasingly relevant in today’s society, in light of the technological developments and, in particular, of the increased use of personal data as currency for purchasing “free” services. The global nature of big data, coupled with recently developed data analytics and the interest of companies in predicting trends and consumer preferences, makes it necessary to analyze how personal data and big data are connected. With a focus on the quality of data as fundamental prerequisite for ensuring that outcomes are accurate and relevant, the authors explore the ways in which traditional and modern personal data protection principles apply to the big data context.
It is not about the quantity of the data, but about the quality of it!

All websites were last accessed on August 19, 2016.
1.1 INTRODUCTION
It is 2016 and big data is everywhere: in the newspapers, on TV, in research papers, and on the lips of every IT specialist. This is not only due to its catchy name, but also due to the sheer quantity of data available: according to IBM, we create 2.5 quintillion (2.5 × 10^18) bytes of data every day. But what is the big deal with big data and, in particular, to what extent does it affect, or overlap with, personal data?
1.1.1 TOPIC, APPROACH, AND METHODOLOGY
By way of introduction, the first step is to provide a definition of the concept that runs through this chapter. Various attempts at defining big data have been made in recent years, but no universal definition has been agreed upon yet. This is likely due to the constant evolution of this concept, which makes it difficult to describe without risking that the definition is either too generic or that it becomes inadequate within a short period of time.
One attempt at a universal definition was made by Gartner, a leading information technology research and advisory company, which defines big data as “high-volume, high-velocity and/or high-variety information assets that demand cost-effective, innovative forms of information processing that enable enhanced insight, decision making, and process automation.” In this case, data are regarded as assets, which attaches an intrinsic value to them. On the other hand, the Article 29 Data Protection Working Party defines big data as “the exponential growth both in the availability and in the automated use of information: it refers to gigantic digital datasets held by corporations, governments ...” This definition regards big data as a phenomenon composed of both the process of collecting information and the subsequent step of analyzing it. The common elements of the different definitions are therefore the size of the database and the analytical aspect, which together are expected to lead to better, more focused services and products, as well as more efficient business operations and more targeted approaches.
Big data can be (and has been) used in an incredibly diverse range of situations. It was employed to help athletes of Great Britain’s rowing team achieve superior performance levels at the 2016 Olympic Games in Rio de Janeiro, by analyzing relevant information about their predecessors’ performance. Predictive analytics were used in order to deal with traffic in highly congested cities, paving the way ... Further, big data can have a great impact on medical sciences, and has already helped boost obesity research results by enabling researchers to identify links between obesity and depression that were previously unknown.
Although big data does not always consist of personal data and could, for example, relate to technical information or to information about objects or natural phenomena, the European Data Protection Supervisor (EDPS) pointed out in its Opinion 7/2015 that “one of the greatest values of big data for businesses and governments is derived from the monitoring of human behaviour, collectively and individually.” Analyzing and predicting human behavior enables decision makers in many areas to make decisions that are more accurate, consistent, and economical, thereby enhancing the efficiency of society as a whole. A few fields of application that immediately come to mind when thinking of big data analytics based on personal data are university admissions, job recruitment, customer profiling, targeted marketing, and health services. Analyzing the information about millions of previous applicants, candidates, customers, or patients makes it easy to establish common threads and to predict all sorts of things, such as whether a specific person is fit for the job or is likely to develop a certain disease in the future.

IBM - What Is Big Data? 2016. IBM - Bringing Big Data to the Enterprise.
What Is Big Data? - Gartner IT Glossary - Big Data. 2012. Gartner IT Glossary.
Article 29 Data Protection Working Party. 2013. Opinion 03/2013 on Purpose Limitation.
Marr, Bernard. 2016. How Can Big Data and Analytics Help Athletes Win Olympic Gold in Rio 2016? Forbes.com.
Toesland, Finbarr. 2016. Smart-from-the-Start Cities Is the Way Forward. Raconteur.
Big Data Boosts Obesity Research Results.
An interesting study was recently conducted by the University of Cambridge Psychometrics Centre: by analyzing the social networking “likes” of 58,000 users, researchers found that they were able to predict ethnic origin with an accuracy of 95% and religious or political orientation with an accuracy ... Even more dramatically perhaps, they were able to predict psychological traits such as intelligence or emotional stability. The research was conducted using openly available data provided by the study subjects themselves (Facebook likes). Its results can be fine-tuned even further when cross-referencing them with data about the same subjects drawn from other sources, such as other social networking profiles or Internet usage habits. This is the point where big data starts overlapping with personal data, being separated only by a blurry border: “liking” a specific rock band does not constitute personal data as such, but the ability to link this information directly to an individual or to other information makes it possible to identify what the person actually likes; furthermore, it allows inferences to be drawn about their personality, possibly revealing even sensitive political or religious preferences (as was the case in the Cambridge study). “Companies may consider most of their data to be non personal data sets, but in reality it is now rare for data generated by user activity to be completely and irreversibly anonymised,” stated the EDPS in a recent Opinion. The availability of massive amounts of data from different sources, combined with the desire to learn more about people’s habits, therefore poses a serious challenge regarding the right to privacy of the individual and requires that the data protection principles are carefully taken into consideration.
A fundamental part of big data analytics, however, is that the raw data must be accurate in order to lead to accurate results; massive quantities of inaccurate data can lead to skewed results and poor decision making. Bruce Schneier, an internationally renowned security technologist, refers to this as the “pollution problem of the information age.” There is a risk that analytical applications find patterns in cases where the individual facts are not directly correlated, which may lead to unfair conclusions and may adversely affect the persons involved. Another risk is that of being trapped in an “information bubble,” with people only being shown certain information that has been predicted to be of interest to them (but may not be in reality). In an article published in 2015 by TIME magazine, Facebook’s newsfeed algorithm was explained: whereas users have access to an average of 1,500 posts per day, they only see about 300 of them, which have been preselected by an algorithm in order to correspond as much as possible with the interests and preferences of each user. The author of the article concludes that “by structuring the environment, Facebook is training people implicitly to behave in a particular way in that algorithmic environment.” Therefore, data quality is paramount to ensuring that the algorithms and analytical procedures are carried out successfully and that the predicted results correspond with the reality.

European Data Protection Supervisor. 2015. Opinion 7/2015 - Meeting the Challenges of Big Data: A Call for Transparency, User Control, Data Protection by Design and Accountability.
Kosinski, M., D. Stillwell, and T. Graepel. 2013. Private Traits and Attributes Are Predictable from Digital Records of Human Behavior. Proceedings of the National Academy of Sciences 110 (15): 5802–5805. doi: 10.1073/pnas.1218772110.
European Data Protection Supervisor. 2014. Preliminary Opinion of the European Data Protection Supervisor. Privacy and Competitiveness in the Age of Big Data: The Interplay between Data Protection, Competition Law and Consumer Protection in the Digital Economy.
Schneier, Bruce. 2015. Data and Goliath. New York: W.W. Norton.
Here’s How Your Facebook News Feed Actually Works. 2015. TIME.Com.
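The risk noted above, that analytical applications find patterns where the individual facts are not actually related, can be illustrated with a small simulation. The following is only a sketch under stated assumptions: the function names, the sample sizes, and the use of purely random numbers as stand-ins for personal attributes are all hypothetical and are not taken from any study cited in this chapter.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def strongest_chance_correlation(n_people, n_attributes, seed=0):
    """Largest |correlation| between a random outcome and unrelated attributes.

    Every attribute is independent noise, so any correlation found here
    is a product of chance, not of a real relationship (hypothetical setup).
    """
    rng = random.Random(seed)
    outcome = [rng.gauss(0, 1) for _ in range(n_people)]
    best = 0.0
    for _ in range(n_attributes):
        attribute = [rng.gauss(0, 1) for _ in range(n_people)]
        best = max(best, abs(pearson(attribute, outcome)))
    return best


if __name__ == "__main__":
    # The more unrelated attributes are screened, the stronger the best
    # "pattern" that turns up purely by chance.
    for n_attributes in (10, 100, 2000):
        print(n_attributes, round(strongest_chance_correlation(20, n_attributes), 2))
```

With only 20 simulated individuals, screening a few thousand unrelated attributes is enough to surface an apparently strong correlation, which is why a pattern found in a large dataset should not by itself be treated as evidence about any person.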
This chapter is aimed at analyzing the personal data protection legal compliance aspects of big data from a modern perspective, in order to identify the main challenges and to make adequate recommendations for the more efficient and lawful use of data as an asset. A few considerations are also made on the connection between big personal data analytics and competition law. The methodology is straightforward: the observations made throughout the chapter are based on the research conducted by regulatory and advisory bodies, as well as on the empirical research and practical experience of the authors. One of the chapter’s focal points is data quality. Owing to the nature of big data, raw data that are not of adequate quality (accurate, relevant, consistent, and complete) represent an obstacle in harnessing the value of the data. It is hoped that the chapter will enable the reader to gain a better understanding that correct legal compliance management can make a fundamental difference between simply collecting vast amounts of data, on the one hand, and effectively using the power of big data, on the other hand.
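The four quality attributes named above (accurate, relevant, consistent, and complete) can be made concrete with a minimal rule-based check. This is a sketch only: the record layout, the field names, the plausibility range, and the rules themselves are hypothetical assumptions chosen for illustration, and real checks would depend on the dataset and on the purpose of the processing.

```python
# Hypothetical schema: fields every record must contain, and the full set
# of fields considered relevant to the (assumed) purpose of the analysis.
REQUIRED_FIELDS = ("user_id", "country", "birth_year")
RELEVANT_FIELDS = set(REQUIRED_FIELDS) | {"signup_year"}


def quality_issues(record, current_year=2016):
    """Return a list of data quality problems found in one record."""
    issues = []

    # Completeness: every required field must be present and non-empty.
    for field in REQUIRED_FIELDS:
        if record.get(field) in (None, ""):
            issues.append(f"incomplete: missing {field}")

    # Accuracy (approximated by plausibility): the value must be in a sane range.
    birth_year = record.get("birth_year")
    if isinstance(birth_year, int) and not (1900 <= birth_year <= current_year):
        issues.append("inaccurate: implausible birth_year")

    # Consistency: fields of the same record must not contradict each other.
    signup_year = record.get("signup_year")
    if (isinstance(birth_year, int) and isinstance(signup_year, int)
            and signup_year < birth_year):
        issues.append("inconsistent: signup_year precedes birth_year")

    # Relevance: flag fields that are not needed for the stated purpose.
    for field in sorted(set(record) - RELEVANT_FIELDS):
        issues.append(f"irrelevant: unexpected field {field}")

    return issues


if __name__ == "__main__":
    sample = {"user_id": "u2", "country": "", "birth_year": 2150,
              "signup_year": 2012, "shoe_size": 43}
    for issue in quality_issues(sample):
        print(issue)
```

Flagging fields that are not needed for the stated purpose is included here because it echoes the proportionality and purpose limitation principle listed in the chapter outline; whether a field is actually relevant is a legal and business judgment that a rule like this can only approximate.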
1.1.2 STRUCTURE AND ARGUMENTS
This chapter is organized into two main sections. The first addresses the personal data aspects of big data from a business perspective and is aimed at identifying the benefits and challenges of using big data analytics on massive personal datasets. The second deals in detail with how the traditional data protection principles should be applied to big data analytics, while also tackling modern data protection principles. Overall, the chapter aims to serve as a good basis for understanding both the positive and the negative implications of deploying big data analytics on personal datasets. In addition, the chapter will focus on the importance of the quality of the data analyzed, on the different ways in which good levels of data quality can be achieved, and on the negative consequences that may ensue when they are not.
1.2 BUSINESS OF BIG DATA
It is by now clear: big data means big business. Data are frequently called “the oil of the 21st century” or “the fuel of the digital economy,” and the era we live in has been referred to as the “data gold rush” by Neelie Kroes, the vice president of the European Commission responsible for the Digital Agenda. This is true not only at the theoretical level but also in practice. A report by the leading
consulting firm McKinsey found that “the intensity of big data varies across sectors but has reached critical mass in every sector” and that “we are on the cusp of a tremendous wave of innovation, productivity, and growth, as well as new modes of competition and value capture—all driven by big data as consumers, companies, and economic sectors exploit its potential.”
With so much importance being given to data, it is not surprising that new business models are emerging, companies are being created, and apps and games are being designed with data collection as one of the main purposes. The most recent and compelling example is that of the Pokémon Go mobile game, which was designed to allow users to collect characters in specific places around the
Niantic Labs, the developer of the game, which went viral within a couple of weeks, has access to data about the whereabouts of players, their connections, and other data such as area, climate, time of the day, and so on. It collects data from roughly 9.5 million daily active users.
European Commission, Press Release, Speech: The Data Gold Rush. 2014. Europa.Eu.
McKinsey Global Institute. 2011. Big Data: The Next Frontier for Innovation, Competition, and Productivity.
See Hautala, Laura. 2016. Pokemon Go: Gotta Catch All Your Personal Data. CNET.
This is an example not only of how apps and games are starting to develop around the business of data, but also of how the data can be collected in “fun” ways without the users necessarily being aware of how and what data are collected.
1.2.1 CONNECTION BETWEEN BIG DATA AND PERSONAL DATA
The business of big data requires conducting a careful balancing exercise between the importance of harvesting the value of the data to foster innovation and evolution on the one hand, and the powerful impact that big data can have on many business sectors on the other hand. The manner in which personal data are collected and subsequently analyzed affects competition policy, antitrust policy, and consumer protection. In a paper published by the World Economic Forum, attention has been drawn to the fact that, “as ecosystem players look to use (mobile-generated) data, they face concerns about violating user trust, rights of expression, and confidentiality.” Big data and business are very much intertwined, and even more so when the big data in question is personal data, in particular because “for many online offerings which are presented or perceived as being ‘free’, personal information operates as a sort of indispensable currency used to pay for those services: ‘free’ online services are ‘paid for’ using personal data which have been valued in total at over EUR 300 billion and have been forecast to treble by 2020.”
The concept of personal data is defined by Regulation (EU) 2016/679 as “any information relating to an identified or identifiable natural person (‘data subject’); an identifiable natural person is one who can be identified, directly or indirectly, in particular by reference to an identifier such as a name, an identification number, location data, an online identifier or to one or more factors specific to the physical, physiological, genetic, mental, economic, cultural or social identity of that natural person.”
While the list of factors specific to the identity of the person has been expanded compared with the previous definition of personal data contained in Directive 95/46/EC, the main elements remain the same. These elements have been discussed and elaborated by the Article 29 Working Party in its Opinion 4/2007, which identifies four fundamental elements for determining whether information is to be considered personal data.
According to the Opinion, these elements are: “any information,” “relating to,” “identified or identifiable,” and “natural person.”
1.2.1.1 Any Information
All information relevant to a person is included, regardless of the “position or capacity of those persons (as consumer, patient, employee, customer, etc.).” In this case, the information can be objective or subjective and does not necessarily have to be true or proven.
Wagner, Kurt. 2016. How Many People Are Actually Playing Pokémon Go? Recode.
World Economic Forum. 2012. Big Data, Big Impact: New Possibilities for International Development.
European Data Protection Supervisor. 2014. Preliminary Opinion of the European Data Protection Supervisor: Privacy and Competitiveness in the Age of Big Data: The Interplay between Data Protection, Competition Law and Consumer Protection in the Digital Economy.
Article 4(1), Regulation (EU) 2016/679 of the European Parliament and of the Council of 27 April 2016 on the protection of natural persons with regard to the processing of personal data and on the free movement of such data, and repealing Directive 95/46/EC (General Data Protection Regulation), Official Journal of the European Union, L 119/3, 4/5/2016.
Article 29 Data Protection Working Party. 2007. Opinion 4/2007 on the Concept of Personal Data.
Idem, p. 7.
The words “any information” also imply information in any form: audio, text, video, images, etc. Importantly, the manner in which the information is stored is irrelevant. The Working Party expressly mentions biometric data, as such data can be considered both as information content and as a link between the individual and the information. Because biometric data are unique to an individual, they can also be used as an identifier.
1.2.1.2 Relating to
Information related to an individual is information about that individual. The relationship between data and an individual is often self-evident, for example when the data are stored in an individual employee’s files or in a medical record. This is, however, not always the case, especially when the information regards objects. Such objects belong to individuals, but additional meanings

At least one of the following three elements should be present in order to consider information to be related to an individual: “content,” “purpose,” or “result.” An element of “content” is present when the information is in reference to an individual, regardless of the (intended) use of the information. The “purpose” element instead refers to whether the information is used or is likely to be used “with the purpose to evaluate, treat in a certain way or influence the status or behavior of an individual.” A “result” element is present when the use of the data is likely to have an impact on a certain person’s rights and interests.

These elements are alternatives and are not cumulative, implying that one piece of data can relate to different individuals based on diverse elements.
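The alternative (non-cumulative) nature of the three elements can be illustrated with a minimal sketch. The class name, field names, and the example individuals below are hypothetical and serve only to show the logic; whether each element is present in a real case is a legal assessment that code cannot make.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RelatesToAssessment:
    """Hypothetical record of the three elements for one piece of data and one individual."""

    content: bool  # the information is in reference to the individual
    purpose: bool  # it is used, or likely to be used, to evaluate or influence the individual
    result: bool   # its use is likely to have an impact on the individual

    def relates_to_individual(self) -> bool:
        # The elements are alternatives, not cumulative: any one of them suffices.
        return self.content or self.purpose or self.result


# Illustrative assumption: the same piece of data assessed against two different people.
assessment_for_person_a = RelatesToAssessment(content=True, purpose=False, result=False)
assessment_for_person_b = RelatesToAssessment(content=False, purpose=True, result=False)
assessment_for_person_c = RelatesToAssessment(content=False, purpose=False, result=False)

print(assessment_for_person_a.relates_to_individual())
print(assessment_for_person_b.relates_to_individual())
print(assessment_for_person_c.relates_to_individual())
```

In this sketch the same data relate to person A through the “content” element and to person B through the “purpose” element, which mirrors the point that one piece of data can relate to different individuals based on different elements.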
1.2.1.3 Identified or Identifiable
“A natural person can be ‘identified’ when, within a group of persons, he or she is ‘distinguished’ from all other members of the group.” When identification has not occurred but is possible, the individual is considered to be “identifiable.”
In order to determine whether those with access to the data are able to identify the individual, all reasonable means likely to be used either by the controller or by any other person should be taken into consideration. The cost of identification, the intended purpose, the way the processing is structured, the advantage expected by the data controller, the interest at stake for the data subjects, and the risk of organizational dysfunctions and technical failures should be taken into account in the evaluation.
1.2.1.4 Natural Person
Directive 95/46/EC is applicable to the personal data of natural persons, a broad concept that calls for protection wholly independent from the residence or nationality of the data subject.