Big Data Management and Processing


Edited by Kuan-Ching Li, Hai Jiang, and Albert Y. Zomaya

Big Data Series


Edited by

Kuan-Ching Li, Guangzhou University, China, and Providence University, Taiwan

Hai Jiang, Arkansas State University, USA

Albert Y. Zomaya, University of Sydney, Australia

CRC Press, Taylor & Francis Group, 6000 Broken Sound Parkway NW, Suite 300, Boca Raton, FL 33487-2742

© 2017 by Taylor & Francis Group, LLC. CRC Press is an imprint of Taylor & Francis Group, an Informa business.

No claim to original U.S. Government works. Printed on acid-free paper. International Standard Book Number-13: 978-1-4987-6807-8 (Hardback).

This book contains information obtained from authentic and highly regarded sources. Reasonable efforts have been made to publish reliable data and information, but the author and publisher cannot assume responsibility for the validity of all materials or the consequences of their use. The authors and publishers have attempted to trace the copyright holders of all material reproduced in this publication and apologize to copyright holders if permission to publish in this form has not been obtained. If any copyright material has not been acknowledged please write and let us know so we may rectify in any future reprint.

Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced, transmitted, or utilized in any form by any electronic, mechanical, or other means, now known or hereafter invented, including photocopying, microfilming, and recording, or in any information storage or retrieval system, without written permission from the publishers.

For permission to photocopy or use material electronically from this work, please contact the Copyright Clearance Center, Inc. (CCC), 222 Rosewood Drive, Danvers, MA 01923, 978-750-8400. CCC is a not-for-profit organization that provides licenses and registration for a variety of users. For organizations that have been granted a photocopy license by the CCC, a separate system of payment has been arranged.

Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and are used only for identification and explanation without intent to infringe.

Contents

Paolo Balboni and Theodora Dragan

Junwhan Kim, Roberto Palmieri, and Binoy Ravindran

Guillaume Aupy, Anne Benoit, Loic Pottier, Padma Raghavan, Yves Robert, and Manu Shantharam

Xiongpai Qin and Keqin Li


Miyuru Dayarathna, Paul Fremantle, Srinath Perera, and Sriskandarajah Suhothayan

Vito Giovanni Castellana, Antonino Tumeo, Marco Minutoli, Marco Lattuada, and Fabrizio Ferrandi

Chapter 15 Complex Mining from Uncertain Big Data in Distributed Environments: Problems, Definitions, and Two Effective and Efficient Algorithms

Alfredo Cuzzocrea, Carson Kai-Sang Leung, Fan Jiang, and Richard Kyle MacKinnon

Big Data Management and Processing (edited by Li, Jiang, and Zomaya) is a state-of-the-art book that deals with a wide range of topical themes in the field of Big Data. The book, which probes many issues related to this exciting and rapidly growing field, covers processing, management, analytics, and applications.

The many advances in Big Data research that we witness today stem from developments in algorithms, high-performance computing, databases, data mining, machine learning, and so on. These developments are discussed in this book. The book also showcases some of the interesting applications and technologies that are still evolving and that will lead to some serious breakthroughs in the coming few years.

I believe that Big Data Management and Processing is a very valuable addition to the literature. It will serve as a source of up-to-date research in this continuously developing area. The book also provides an opportunity for researchers to explore the use of advanced computing technologies and their impact on enhancing our capabilities to conduct more sophisticated studies.

I expect that Big Data Management and Processing will be well received by the research and development community. It should prove very beneficial for researchers and graduate students focusing on Big Data and will serve as a very useful reference for practitioners and application developers.

Sartaj Sahni

University of Florida

The scope of Big Data today spans many aspects and it is not limited to main computing components (e.g., processors, storage devices, and visualization facilities) alone, but it expands into a much larger range of issues related to management and policy. Also, “Big Data” can mean “Big Energy,” because of the pressure that data places on a variety of infrastructures needed to host, manage, and transport data. This in turn raises various monetary, environmental, and system performance concerns.

Recent advances in software and hardware technologies have improved the handling of big data. However, many issues remain concerning the overloading caused by processing massive amounts of data, which calls for the development of various software and hardware solutions as well as new algorithms that are more capable of processing data.

This book, Big Data Management and Processing, seeks to provide an opportunity for researchers to explore a range of big data-related issues and their impact on the design of new computing systems. The book is quite timely, since the field of big data computing as a whole is undergoing rapid changes on a daily basis. Vast literature exists today on such data processing paradigms and frameworks and their implications for a wide range of distributed platforms.

The book is intended to be a virtual roundtable of several outstanding researchers that one might invite to attend a conference on big data computing systems. Of course, the list of topics that is explored here is by no means exhaustive, but most of the conclusions provided here should be extended to the other computing platforms that are not covered here. There was a decision to limit the number of chapters while providing more pages for contributing authors to express their ideas, so that the book remains manageable within a single volume.

It is also hoped that the topics covered will get the readers to think of the implications of such new ideas on the developments in their own fields. The book endeavors to strike a balance between theoretical and practical coverage of innovative problem-solving techniques for a range of platforms. The book is intended to be a repository of paradigms, technologies, and applications that target the different facets of big data computing systems.

The 21 chapters are carefully selected to provide a wide scope with minimal overlap between the chapters so as to reduce duplications. Each contributor was asked to cover review material as well as current developments in his or her chapter. In addition, the authors were chosen because they are leaders in their respective disciplines.

First and foremost we would like to thank and acknowledge the contributors to this volume for their support and patience, and the reviewers for their useful comments and suggestions that helped in improving the earlier outline of the book and presentation of the material. Also, we extend our deepest thanks to Randi Cohen from CRC Press (USA) for his collaboration, guidance, and most importantly, patience in finalizing this handbook. Finally, we would like to acknowledge the efforts of the team from CRC Press’s production department for their extensive efforts during the many phases of this project and the timely fashion in which the book was produced.

Kuan-Ching Li is a professor with appointments at Guangzhou University, China, and Providence University, Taiwan. He is a recipient of awards from Nvidia and support from a number of industrial companies. He has also received guest and distinguished chair professorships from universities in China and other countries. He has been actively involved in numerous conferences and workshops in program/general/steering conference chairman positions and as a program committee member, and has organized numerous conferences related to high-performance computing and computational science and engineering.

Professor Li is the Editor-in-Chief of technical publications such as International Journal of Computational Science and Engineering (IJCSE), International Journal of Embedded Systems (IJES), and International Journal of High Performance Computing and Networking (IJHPCN), all published by Inderscience. He also serves as an editorial board member and a guest editor for a number of journals. In addition, he is the author or editor of several technical professional books published by CRC Press, Springer, McGraw-Hill, and IGI Global. His topics of interest include GPU/manycore computing, big data, and cloud. He is a Member of the AAAS, a Senior Member of the IEEE, and a Fellow of the IET.

Hai Jiang is a professor in the Department of Computer Science at Arkansas State University, USA. He received his BS degree from Beijing University of Posts and Telecommunications, China, and his MA and PhD degrees from Wayne State University, Detroit, Michigan, USA. His current research interests include parallel and distributed systems, computer and network security, high-performance computing and communication, big data, and modeling and simulation. He has published one book and several research papers in major international journals and conference proceedings. He has served as a U.S. National Science Foundation proposal review panelist and a U.S. DoE (Department of Energy) Smart Grid Investment Grant (SGIG) reviewer multiple times.

Professor Jiang serves as the executive editor of International Journal of High Performance Computing and Networking (IJHPCN). He is an editorial board member of International Journal of Big Data Intelligence (IJBDI), The Scientific World Journal (TSWJ), Open Journal of Internet of Things (OJIOT), and GSTF Journal on Social Computing (JSC) and a guest editor of IEEE Systems Journal, International Journal of Ad Hoc and Ubiquitous Computing, Cluster Computing, and The Scientific World Journal for multiple special issues. He has also served as a general or program chair for some major conferences/workshops (CSE, HPCC, ISPA, GPC, ScaleCom, ESCAPE, GPU-Cloud, FutureTech, GPUTA, FC, SGC). He has been involved in more than 90 conferences and workshops as a session chair or program committee member, including major conferences such as AINA, ICPP, IUCC, ICPADS, TrustCom, HPCC, GPC, EUC, ICIS, SNPD, TSP, PDSEC, SECRUPT, and ScalCom. He is a professional member of ACM and IEEE Computer Society and a representative of the U.S. NSF XSEDE (Extreme Science and Engineering Discovery Environment) Campus Champion for Arkansas State University.

Albert Y. Zomaya is the chair professor of high-performance computing and networking in the School of Information Technologies, University of Sydney, Australia and also serves as the director of the Centre for Distributed and High Performance Computing. He has published more than 600 scientific papers and articles and is the author, coauthor, or editor of more than 20 books. He is the founding editor-in-chief of IEEE Transactions on Sustainable Computing and serves as an associate editor for more than 20 leading journals. He served as the editor-in-chief of IEEE Transactions on Computers from 2011 to 2014.

Editors

Professor Zomaya is the recipient of the IEEE Technical Committee on Parallel Processing Outstanding Service Award (2011), the IEEE Technical Committee on Scalable Computing Medal for Excellence in Scalable Computing (2011), and the IEEE Computer Society Technical Achievement Award (2014). He is a chartered engineer and a fellow of AAAS, IEEE, and IET. His research interests are in the areas of parallel and distributed computing and complex systems.

Syedmeysam Abolghasemi

Min Chen

University of Wollongong Wollongong, NSW, Australia

Jianguo Chen

College of Computer Science and Electronic Engineering

Hunan University Changsha, Hunan, China

Jinjun Chen

Swinburne Data Science Research Institute Swinburne University of Technology Australia

Department of Computer Science State University of New York New Paltz, New York

Huaming Chen

Alfredo Cuzzocrea

DIA Department University of Trieste and ICAR-CNR Trieste, Italy

Monica Ferreira da Silva

The Tércio Pacitti Institute Federal University of Rio de Janeiro Brazil

Miyuru Dayarathna WSO2 Inc.

Mountain View, California

Theodora Dragan

School of Computing and Information Technology

High Performance Computing Pacific Northwest National Laboratory Richland, Washington

Department of Computer Science Old Dominion University Norfolk, Virginia

Tilburg, The Netherlands and

Antonio Juarez Alencar

The Tércio Pacitti Institute Federal University of Rio de Janeiro Brazil

Guillaume Aupy

School of Engineering Vanderbilt University Nashville, Tennessee

Paolo Balboni

Tilburg Institute for Law, Technology, and Society

ICT Legal Consulting Milan, Italy and European Privacy Association Brussels, Belgium

Vito Giovanni Castellana

Mauro Penha Bastos

The Tércio Pacitti Institute Federal University of Rio de Janeiro Brazil

Anne Benoit

LIP, ENS Lyon Lyon, France

Angelos Bilas

Institute of Computer Science (ICS) Foundation for Research and Technology—

Hellas (FORTH) and Department of Computer Science University of Crete, Greece

European Privacy Association Brussels, Belgium

Contributors

Fabrizio Ferrandi

Clarkson University Potsdam, New York

Marco Lattuada

Dipartimento di Elettronica, Informazione e Bioingegneria

Politecnico di Milano Milano, Italy

Carson Kai-Sang Leung

Department of Computer Science University of Manitoba Winnipeg, MB, Canada

Boyang Li

Department of Electrical and Computer Engineering

Kenli Li

Junwhan Kim

College of Computer Science and Electronic Engineering

Hunan University and National Supercomputing Center in Changsha Changsha, Hunan, China

Keqin Li

College of Computer Science and Electronic Engineering

Hunan University Changsha, Hunan, China and Department of Computer Science State University of New York New Paltz, New York

Norman Lim

Department of Systems and Computer Engineering

CSIT University of the District of Columbia Washington, DC

Department of Computer Science University of Manitoba Winnipeg, MB, Canada

Dipartimento di Elettronica, Informazione e Bioingegneria

Technology—Hellas (FORTH) Greece

Politecnico di Milano Milano, Italy

Ryan Florin

Department of Computer Science Old Dominion University Norfolk, Virginia

Paul Fremantle WSO2 Inc.

Mountain View, California

Pilar González-Férez

Department of Computer Engineering Technology University of Murcia Murcia, Spain and Institute of Computer Science (ICS) Foundation for Research and

Chonglin Gu

Fan Jiang

Department of Computer Science and Technology

Harbin Institute of Technology Shenzhen, China

Hejiao Huang

Department of Computer Science and Technology

Harbin Institute of Technology Shenzhen, China

Xiaohua Jia

Department of Computer Science and Technology

Harbin Institute of Technology Shenzhen, China

Carleton University Ottawa, ON, Canada

Nam Ling

Department of Computer Engineering Santa Clara University Santa Clara, California

School of Computing and Communications

Mountain View, California

Florin Pop

Department of Computer Science

University Politehnica of Bucharest

Bucharest, Romania

Loic Pottier

LIP, ENS Lyon Lyon, France

Deepak Puthal

University of Technology Sydney, Australia

ECE, Virginia Tech Blacksburg, Virginia

Xiongpai Qin

Information School Renmin University of China Beijing, China

Padma Raghavan

School of Engineering Vanderbilt University Nashville, Tennessee

Rajiv Ranjan

School of Computing Science Newcastle University United Kingdom

Binoy Ravindran

ECE, Virginia Tech Blacksburg, Virginia

Yves Robert

Srinath Perera WSO2 Inc.

Roberto Palmieri

Chen Liu

Shikharesh Majumdar

Department of Electrical and Computer Engineering

Clarkson University Potsdam, New York

Yuhong Liu

Department of Computer Engineering Santa Clara University Santa Clara, California

Simone A. Ludwig

Department of Computer Science North Dakota State University Fargo, North Dakota

Richard Kyle MacKinnon

Department of Computer Science University of Manitoba Winnipeg, MB, Canada

Department of Systems and Computer Engineering

Department of Computer Science Old Dominion University

Carleton University Ottawa, Ontario, Canada

Marco Minutoli

High Performance Computing Pacific Northwest National

Laboratory Richland, Washington

Rim Moussa

LaTICE Lab University of Tunis and ENICarthage Tunis, Tunisia

Surya Nepal

CSIRO Data61 Australia

Stephan Olariu

LIP, ENS Lyon Lyon, France and University of Tennessee

Soror Sahri

Chengwen Wu

Mihaela-Andreea Vasile

Department of Computer Science University Politehnica of Bucharest Bucharest, Romania

Lei Wang

School of Computing and Information Technology

University of Wollongong Wollongong, NSW, Australia

Yu Wang

Department of Computer Engineering Santa Clara University Santa Clara, California

Department of Computer Science and Technology

High Performance Computing Pacific Northwest National

Tsinghua University Beijing, China

Aida Ghazi Zadeh

Department of Computer Science Old Dominion University Norfolk, Virginia

Guangyan Zhang

Department of Computer Science and Technology

Tsinghua University Beijing, China

Weimin Zheng

Department of Computer Science and Technology

Laboratory Richland, Washington

Antonino Tumeo

LIPADE Lab University Rene Descartes Paris, France

Jiangning Song

Eber Assis Schmitz

The Tércio Pacitti Institute Federal University of Rio de Janeiro Brazil

Manu Shantharam

Computational Research Scientist San Diego Supercomputer Center San Diego, California

Jun Shen

School of Computing and Information Technology

University of Wollongong Wollongong, NSW, Australia

Department of Biochemistry and Molecular Biology

Hunan University Changsha, Hunan, China

Monash University Clayton, Victoria, Australia

Petros Sotirios Stefaneas

Department of Mathematics School of Applied Mathematics and

Physical Sciences National Technical University of Athens Athens, Greece

Sriskandarajah Suhothayan WSO2 Inc.

Mountain View, California

Zhuo Tang

College of Computer Science and Electronic Engineering

Tsinghua University Beijing, China

Paolo Balboni and Theodora Dragan

CONTENTS

Abstract
1.1 Introduction
1.1.1 Topic, Approach, and Methodology
1.1.2 Structure and Arguments
1.2 Business of Big Data
1.2.1 Connection between Big Data and Personal Data
1.2.1.1 Any Information
1.2.1.2 Relating to
1.2.1.3 Identified or Identifiable
1.2.1.4 Natural Person
1.2.2 Competition Aspects
1.3 Reconciling Traditional and Modern Data Protection Principles
1.3.1 Traditional Data Protection Principles
1.3.1.1 Transparency
1.3.1.2 Proportionality and Purpose Limitation
1.3.2 Modern Data Protection Principles
1.3.2.1 Accountability
1.3.2.2 Privacy by Design and by Default
1.3.2.3 Users’ Control of Their Own Data
1.4 Conclusions and Recommendations

ABSTRACT

The overlap between big data and personal data is becoming increasingly relevant in today’s society, in light of the technological developments and, in particular, of the increased use of personal data as currency for purchasing “free” services. The global nature of big data, coupled with recently developed data analytics and the interest of companies in predicting trends and consumer preferences, makes it necessary to analyze how personal data and big data are connected. With a focus on the quality of data as a fundamental prerequisite for ensuring that outcomes are accurate and relevant, the authors explore the ways in which traditional and modern personal data protection principles apply to the big data context.

It is not about the quantity of the data, but about the quality of it!

All websites were last accessed on August 19, 2016.

1.1 INTRODUCTION

It is 2016 and big data is everywhere: in the newspapers, on TV, in research papers, and on the lips of every IT specialist. This is not only due to its catchy name, but also due to the sheer quantity of data available: according to IBM, we create 2.5 quintillion (2.5 × 10^18) bytes of data every day. But what is the big deal with big data and, in particular, to what extent does it affect, or overlap with, personal data?

1.1.1 TOPIC, APPROACH, AND METHODOLOGY

By way of introduction, the first step is to provide a definition of the concept that runs through this chapter. Various attempts at defining big data have been made in recent years, but no universal definition has been agreed upon yet. This is likely due to the constant evolution of this concept, which makes it difficult to describe without risking that the definition is either too generic or that it becomes inadequate within a short period of time.

One attempt at a universal definition was made by Gartner, a leading information technology research and advisory company, which defines big data as “high-volume, high-velocity and/or high-variety information assets that demand cost-effective, innovative forms of information processing that enable enhanced insight, decision making, and process automation.” In this case, data are regarded as assets, which attaches an intrinsic value to them. On the other hand, the Article 29 Data Protection Working Party defines big data as “the exponential growth both in the availability and in the automated use of information: it refers to gigantic digital datasets held by corporations, governments ...” This definition regards big data as a phenomenon composed of both the process of collecting information and the subsequent step of analyzing it. The common elements of the different definitions are therefore the size of the database and the analytical aspect, which together are expected to lead to better, more focused services and products, as well as more efficient business operations and more targeted approaches.

Big data can be (and has been) used in an incredibly diverse range of situations. It was employed to help athletes of Great Britain’s rowing team achieve superior performance levels at the 2016 Olympic Games in Rio de Janeiro, by analyzing relevant information about their predecessors’ performance. Predictive analytics were used to deal with traffic in highly congested cities. Further, big data can have a great impact on medical sciences, and has already helped boost obesity research results by enabling researchers to identify links between obesity and depression that were previously unknown.

IBM - What Is Big Data? 2016. IBM - Bringing Big Data to the Enterprise.

What Is Big Data? - Gartner IT Glossary - Big Data. 2012. Gartner IT Glossary.

Article 29 Data Protection Working Party. 2013. Opinion 03/2013 on Purpose Limitation.

Marr, Bernard. 2016. How Can Big Data and Analytics Help Athletes Win Olympic Gold in Rio 2016? Forbes.com.

Toesland, Finbarr. 2016. Smart-from-the-Start Cities Is the Way Forward. Raconteur.

Big Data Boosts Obesity Research Results.

Although big data does not always consist of personal data and could, for example, relate to technical information or to information about objects or natural phenomena, the European Data Protection Supervisor (EDPS) pointed out in its Opinion 7/2015 that “one of the greatest values of big data for businesses and governments is derived from the monitoring of human behaviour, collectively and individually.” Analyzing and predicting human behavior enables decision makers in many areas to make decisions that are more accurate, consistent, and economical, thereby enhancing the efficiency of society as a whole. A few fields of application that immediately come to mind when thinking of big data analytics based on personal data are university admissions, job recruitment, customer profiling, targeted marketing, or health services. Analyzing the information about millions of previous applicants, candidates, customers, or patients makes it easy to establish common threads and to predict all sorts of things, such as whether a specific person is fit for the job or is likely to develop a certain disease in the future.

An interesting study was recently conducted by the University of Cambridge Psychometrics Centre: by analyzing the social networking “likes” of 58,000 users, researchers found that they were able to predict ethnic origin with an accuracy of 95%, as well as religious or political orientation. Even more dramatically perhaps, they were able to predict psychological traits such as intelligence or emotional stability. The research was conducted using openly available data provided by the study subjects themselves (Facebook likes). Its results can be fine-tuned even further when cross-referencing them with data about the same subjects drawn from other sources, such as other social networking profiles or Internet usage habits. This is the point where big data starts overlapping with personal data, being separated only by a blurry border: “liking” a specific rock band does not constitute personal data as such, but the ability to link this information directly to an individual or to other information makes it possible to identify what the person actually likes; furthermore, it enables inferences to be drawn about their personality, possibly revealing even sensitive political or religious preferences (as was the case in the Cambridge study). “Companies may consider most of their data to be non personal data sets, but in reality it is now rare for data generated by user activity to be completely and irreversibly anonymised,” stated the EDPS in a recent Opinion. The availability of massive amounts of data from different sources combined with the desire to learn more about people’s habits therefore poses a serious challenge regarding the right to privacy of the individual and requires that the data protection principles are carefully taken into consideration.

A fundamental part of big data analytics, however, is that the raw data must be accurate in order to lead to accurate results; massive quantities of inaccurate data can lead to skewed results and poor decision making. Bruce Schneier, an internationally renowned security technologist, refers to this as the “pollution problem of the information age.” There is a risk that analytical applications find patterns in cases where the individual facts are not directly correlated, which may lead to unfair conclusions and may adversely affect the persons involved. Another risk is that of being trapped in an “information bubble,” with people only being shown certain information that has been predicted to be of interest to them (but may not be in reality). In an article published in 2015 by TIME magazine, Facebook’s newsfeed algorithm was explained: whereas users have access to an average of 1,500 posts per day, they only see about 300 of them, which have been preselected by an algorithm in order to correspond as much as possible with the interests and preferences of each user. The author of the article concludes that “by structuring the environment, Facebook is training people implicitly to behave in a particular way in that algorithmic environment.” Therefore, data quality is paramount to ensuring that the algorithms and analytical procedures are carried out successfully and that the predicted results correspond with the reality.

European Data Protection Supervisor. 2015. Opinion 7/2015 - Meeting the Challenges of Big Data: A Call for Transparency, User Control, Data Protection by Design and Accountability.

Kosinski, M., D. Stillwell, and T. Graepel. 2013. Private Traits and Attributes Are Predictable from Digital Records of Human Behavior. Proceedings of the National Academy of Sciences 110 (15): 5802–5805. doi: 10.1073/pnas.1218772110.

European Data Protection Supervisor. 2014. Preliminary Opinion of the European Data Protection Supervisor - Privacy and Competitiveness in the Age of Big Data: The Interplay between Data Protection, Competition Law and Consumer Protection in the Digital Economy.

Schneier, Bruce. 2015. Data and Goliath. New York: W.W. Norton.

Here’s How Your Facebook News Feed Actually Works. 2015. TIME.Com.

This chapter is aimed at analyzing the personal data protection legal compliance aspects of big data from a modern perspective, in order to identify the main challenges and to make adequate recommendations for the more efficient and lawful use of data as an asset. A few considerations are also made on the connection between big personal data analytics and competition law. The methodology is straightforward: the observations made throughout the chapter are based on the research conducted by regulatory and advisory bodies, as well as on the empirical research and practical experience of the authors. One of the chapter’s focal points is data quality. Owing to the nature of big data, raw data that are not of adequate quality (accurate, relevant, consistent, and complete) represent an obstacle in harnessing the value of the data. It is hoped that the chapter will enable the reader to gain a better understanding that correct legal compliance management can make a fundamental difference between simply collecting vast amounts of data, on the one hand, and effectively using the power of big data, on the other hand.
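The four quality dimensions named above (accurate, relevant, consistent, and complete) can be made concrete with a small sketch. The Python snippet below is illustrative only and is not taken from the chapter: the record layout, the `REQUIRED_FIELDS` and `RELEVANT_FIELDS` sets, the age plausibility range, and the function name `quality_issues` are all hypothetical assumptions chosen for the example.

```python
from collections import defaultdict

# Hypothetical schema for a personal-data record (an assumption, not from the chapter).
REQUIRED_FIELDS = {"user_id", "age", "country"}
RELEVANT_FIELDS = REQUIRED_FIELDS | {"email"}


def quality_issues(records):
    """Return (index, dimension, message) tuples for a list of dict records."""
    issues = []
    countries_by_user = defaultdict(set)  # user_id -> countries seen so far

    for i, rec in enumerate(records):
        # Completeness: every required field is present and non-empty.
        missing = sorted(f for f in REQUIRED_FIELDS if rec.get(f) in (None, ""))
        if missing:
            issues.append((i, "complete", f"missing {missing}"))

        # Accuracy (assumed plausibility rule): age is an int between 0 and 120.
        age = rec.get("age")
        if age is not None and not (isinstance(age, int) and 0 <= age <= 120):
            issues.append((i, "accurate", f"implausible age {age!r}"))

        # Relevance: flag fields outside the declared set.
        extra = sorted(set(rec) - RELEVANT_FIELDS)
        if extra:
            issues.append((i, "relevant", f"unexpected fields {extra}"))

        # Consistency: one user_id should not be tied to several countries.
        uid, country = rec.get("user_id"), rec.get("country")
        if uid is not None and country:
            countries_by_user[uid].add(country)
            if len(countries_by_user[uid]) > 1:
                issues.append((i, "consistent", f"user {uid} has conflicting countries"))

    return issues
```

Real pipelines would derive such rules from the purpose for which the data were collected; the point of the sketch is only that each dimension can be turned into an explicit, checkable rule before any analytics are run.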

1.1.2 STRUCTURE AND ARGUMENTS

This chapter is organized into two main sections: the first one addresses the personal data aspects of big data from a business perspective and is aimed at identifying the benefits and challenges of using big data analytics on massive personal datasets. The second part deals in detail with how the traditional data protection principles should be applied to big data analytics, while also tackling modern data protection principles. Overall, the chapter aims to serve as a good basis for understanding both the positive and the negative implications of deploying big data analytics on personal datasets. In addition, the chapter will focus on the importance of the quality of the data analyzed, on the different ways in which good levels of data quality can be achieved, and on the negative consequences that may ensue when they are not.

1.2 BUSINESS OF BIG DATA

It is by now clear: big data means big business. Data are frequently called “the oil of the 21st century” or “the fuel of the digital economy,” and the era we live in has been referred to as the “data gold rush” by Neelie Kroes, the vice president of the European Commission responsible for the Digital Agenda. This is true not only at the theoretical level but also in practice. A report by the leading consulting firm McKinsey found that “the intensity of big data varies across sectors but has reached critical mass in every sector” and that “we are on the cusp of a tremendous wave of innovation, productivity, and growth, as well as new modes of competition and value capture—all driven by big data as consumers, companies, and economic sectors exploit its potential.”

With so much importance being given to data, it is not surprising that new business models are emerging, companies are being created, and apps and games are being designed with data collection as one of the main purposes. The most recent and compelling example is that of the Pokémon Go mobile game, which was designed to allow users to collect characters in specific places around the

Niantic Labs, the developer of the game, which went viral within a couple of weeks, has access to data about the whereabouts of players, their connections, and other data such as area, climate, time of day, and so on. It collects data from roughly 9.5 million daily active users.

European Commission—Press Release—Speech: The Data Gold Rush. 2014. Europa.Eu. .

McKinsey Global Institute. 2011. Big Data: The Next Frontier for Innovation, Competition, and Productivity.

See Hautala, Laura. 2016. Pokemon Go: Gotta Catch All Your Personal Data. CNET. .

This is a clear example of how apps and games are starting to develop around the business of data, but also of how the data can be collected in “fun” ways without the users necessarily being aware of how and what data are collected.

1.2.1 CONNECTION BETWEEN BIG DATA AND PERSONAL DATA

The business of big data requires conducting a careful balancing exercise between the importance of harvesting the value of the data to foster innovation and evolution on the one hand, and the powerful impact that big data can have on many business sectors on the other hand. The manner in which personal data are collected and subsequently analyzed affects competition policy, antitrust policy, and consumer protection. In a paper published by the World Economic Forum, attention has been drawn to the fact that, “as ecosystem players look to use (mobile-generated) data, they face concerns about violating user trust, rights of expression, and confidentiality.” Big data and business are very much intertwined, and even more so when the big data in question is personal data, in particular because “for many online offerings which are presented or perceived as being ‘free’, personal information operates as a sort of indispensable currency used to pay for those services: ‘free’ online services are ‘paid for’ using personal data which have been valued in total at over EUR 300 billion and have been forecast to treble by 2020.”

The concept of personal data is defined by Regulation (EU) 2016/679 as “any information relating to an identified or identifiable natural person (‘data subject’); an identifiable natural person is one who can be identified, directly or indirectly, in particular by reference to an identifier such as a name, an identification number, location data, an online identifier or to one or more factors specific to the physical, physiological, genetic, mental, economic, cultural or social identity of that natural person.”

While the list of factors specific to the identity of the person has been enriched compared with the previous definition of personal data contained in Directive 95/46/EC, the main elements remain the same. These elements have been discussed and elaborated by the Article 29 Working Party in its Opinion 4/2007, which identifies four fundamental elements that determine whether a piece of information is to be considered personal data.

According to the Opinion, these elements are: “any information,” “relating to,” “identified or identifiable,” and “natural person.”

1.2.1.1 Any Information

All information relevant to a person is included, regardless of the “position or capacity of those persons (as consumer, patient, employee, customer, etc.).” In this case, the information can be objective or subjective and does not necessarily have to be true or proven.

Wagner, Kurt. 2016. How Many People Are Actually Playing Pokémon Go? Recode. .

World Economic Forum. 2012. Big Data, Big Impact: New Possibilities for International Development.

European Data Protection Supervisor. 2014. Preliminary Opinion of the European Data Protection Supervisor: Privacy and Competitiveness in the Age of Big Data: The Interplay between Data Protection, Competition Law and Consumer Protection in the Digital Economy.

Article 4(1), Regulation (EU) 2016/679 of the European Parliament and of the Council of 27 April 2016 on the protection of natural persons with regard to the processing of personal data and on the free movement of such data, and repealing Directive 95/46/EC (General Data Protection Regulation), Official Journal of the European Union, L 119/3, 4/5/2016.

Article 29 Data Protection Working Party. 2007. Opinion 4/2007 on the Concept of Personal Data.

Idem, p. 7.
The words “any information” also imply information of any form: audio, text, video, images, etc. Importantly, the manner in which the information is stored is irrelevant. The Working Party expressly mentions biometric data, as such data can be considered both as information content and as a link between the individual and the information. Because biometric data are unique to an individual, they can also be used as an identifier.
1.2.1.2 Relating to
Information related to an individual is information about that individual. The relationship between data and an individual is often self-evident, for example when the data are stored in an individual employee’s files or in a medical record. This is, however, not always the case, especially when the information regards objects. Such objects belong to individuals, but additional meanings

At least one of the following three elements should be present in order to consider information to be related to an individual: “content,” “purpose,” or “result.” An element of “content” is present when the information is in reference to an individual, regardless of the (intended) use of the information. The “purpose” element instead refers to whether the information is used or is likely to be used “with the purpose to evaluate, treat in a certain way or influence the status or behavior of an individual.” A “result” element is present when the use of the data is likely to have an impact on a certain person’s rights and interests. These elements are alternatives and are not cumulative, implying that one piece of data can relate to different individuals based on diverse elements.
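For readers who think in code, the alternative (non-cumulative) nature of the three elements can be sketched as a simple logical OR. This is only an illustration, not a legal test: the `Assessment` record, the function names, and the example individuals are hypothetical and do not come from the Opinion.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Assessment:
    """Hypothetical record of how one piece of data is linked to one individual."""
    content: bool = False  # the information is in reference to the individual
    purpose: bool = False  # it is (likely to be) used to evaluate or influence them
    result: bool = False   # its use is likely to have an impact on them


def relates_to(assessment: Assessment) -> bool:
    # The elements are alternatives, not cumulative: one of them is enough.
    return assessment.content or assessment.purpose or assessment.result


def individuals_concerned(assessments: dict[str, Assessment]) -> set[str]:
    """Return every individual to whom the same piece of data relates."""
    return {person for person, a in assessments.items() if relates_to(a)}


# Illustrative only: one record, assessed against three hypothetical individuals.
record = {
    "owner": Assessment(content=True),
    "mechanic": Assessment(purpose=True),
    "passer_by": Assessment(),
}
concerned = individuals_concerned(record)
```

In this sketch the same record relates to two different individuals on the basis of two different elements, which is the situation described in the last sentence above.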
1.2.1.3 Identified or Identifiable
“A natural person can be ‘identified’ when, within a group of persons, he or she is ‘distinguished’ from all other members of the group.” When identification has not occurred but is possible, the individual is considered to be “identifiable.”
In order to determine whether those with access to the data are able to identify the individual, all reasonable means likely to be used either by the controller or by any other person should be taken into consideration. The cost of identification, the intended purpose, the way the processing is structured, the advantage expected by the data controller, the interest at stake for the data subjects, and the risk of organizational dysfunctions and technical failures should be taken into account in the evaluation.
1.2.1.4 Natural Person
Directive 95/46/EC is applicable to the personal data of natural persons, a broad concept that calls for protection wholly independent of the residence or nationality of the data subject.
