
  

Big Data

Computing

  


Edited by

  

Rajendra Akerkar

Western Norway Research Institute

Sogndal, Norway

  © 2014 by Taylor & Francis Group, LLC
CRC Press, Taylor & Francis Group, 6000 Broken Sound Parkway NW, Suite 300, Boca Raton, FL 33487-2742
International Standard Book Number-13: 978-1-4665-7838-8 (eBook - PDF)
Version Date: 20131028
CRC Press is an imprint of Taylor & Francis Group, an Informa business
No claim to original U.S. Government works

This book contains information obtained from authentic and highly regarded sources. Reasonable efforts have been made to publish reliable data and information, but the author and publisher cannot assume responsibility for the validity of all materials or the consequences of their use. The authors and publishers have attempted to trace the copyright holders of all material reproduced in this publication and apologize to copyright holders if permission to publish in this form has not been obtained. If any copyright material has not been acknowledged please write and let us know so we may rectify in any future reprint.

No part of this book may be reprinted, reproduced, transmitted, or utilized in any form by any electronic, mechanical, or other means, now known or hereafter invented, including photocopying, microfilming, and recording, or in any information storage or retrieval system, without written permission from the publishers.

Product or corporate names may be trademarks or registered trademarks, and are used only for identification and explanation without intent to infringe.

Visit the Taylor & Francis Web site at http://www.taylorandfrancis.com and the CRC Press Web site at http://www.crcpress.com

  

To

All the visionary minds who have helped create a modern data science profession

  


  Contents

Index ..................................................................................................................... 539

Preface

  In the international marketplace, businesses, suppliers, and customers create and consume vast amounts of information. Gartner predicts that enterprise data in all forms will grow up to 650% over the next five years.* According to IDC, the world’s volume of data doubles every 18 months. Digital information is doubling every 1.5 years and will exceed 1000 exabytes next year, according to the MIT Centre for Digital Research. In 2011, medical centers held almost 1 billion terabytes of data—almost 2000 billion file cabinets’ worth of information. This deluge of data, often referred to as Big Data, obviously creates a challenge to the business community and data scientists.

  The term Big Data refers to data sets whose size is beyond the capabilities of current database technology. It is an emerging field in which innovative technology offers alternatives for resolving the inherent problems that appear when working with massive data, offering new ways to reuse and extract value from information.

  Businesses and government agencies aggregate data from numerous private and/or public data sources. Private data is information that an organization exclusively stores and that is available only to that organization, such as employee data, customer data, and machine data (e.g., user transactions and customer behavior). Public data is information that is available to the public for a fee or at no charge, such as credit ratings and social media content (e.g., LinkedIn, Facebook, and Twitter). Big Data has now reached every sector in the world economy. It is transforming competitive opportunities in every industry sector, including banking, healthcare, insurance, manufacturing, retail, wholesale, transportation, communications, construction, education, and utilities. It also plays key roles in trade operations such as marketing, operations, supply chain, and new business models. It is becoming rather evident that enterprises that fail to use their data efficiently are at a large competitive disadvantage compared with those that can analyze and act on their data. The possibilities of Big Data continue to evolve swiftly, driven by innovation in the underlying technologies, platforms, and analytical capabilities for handling data, as well as the evolution of behavior among its users as humans increasingly live digital lives.

  It is interesting to note that Big Data differs from conventional data models (e.g., relational databases and data models, or conventional governance models). Thus, it is triggering organizations’ concern as they try to separate information nuggets from the data heap. The conventional models of structured, engineered data do not adequately reveal the realities of Big Data. The key to leveraging Big Data is to recognize these differences before expediting its use. The most noteworthy difference is that traditional data are typically governed in a centralized manner, whereas Big Data is self-governing. Big Data is created either by a rapidly expanding universe of machines or by users of highly varying expertise. As a result, the composition of traditional data will naturally vary considerably from that of Big Data. Traditional data serve a specific purpose and must be more durable and structured, whereas Big Data will cover many topics, but not all topics will yield useful information for the business; thus they will be sparse in relevancy and structure.

* http://www.gartner.com/it/content/1258400/1258425/january_6_techtrends_rpaquet.pdf

  The technology required for Big Data computing is developing at a satisfactory rate due to market forces and technological evolution. The ever-growing, enormous amount of data, along with advanced tools of exploratory data analysis, data mining/machine learning, and data visualization, offers a whole new way of understanding the world.

  Another interesting fact about Big Data is that not everything considered “Big Data” is in fact Big Data. One needs to delve into the scientific aspects, such as analyzing, processing, and storing huge volumes of data; that is the only way to use the tools effectively. Data developers/scientists need to know about analytical processes, statistics, and machine learning. They also need to know how to program algorithms against specific data. The core is the analytical side, but they also need the scientific background and in-depth technical knowledge of the tools they work with in order to gain control of huge volumes of data. No single tool offers all of this per se.

  As a result, the main challenge for Big Data computing is to find a novel solution that remains applicable for a long period of time, keeping in mind the fact that data sizes are always growing. This means that the key condition a solution has to satisfy is scalability. Scalability is the ability of a system to accept increased input volume without affecting its gains; that is, the gains from an input increment should be proportional to the increment itself. For a system to be totally scalable, the size of its input should not be a design parameter. Pushing the system designer to consider all possible deployment sizes to cope with different input sizes leads to a scalable architecture without primary bottlenecks. Yet, apart from scalability, there are other requisites for a Big Data–intensive computing system.

  Although Big Data is an emerging field in data science, very few books are available in the market. This book provides authoritative insights and highlights valuable lessons learned by the authors through their experience.

  Some universities in North America and Europe are doing their part to feed the need for analytics skills in this era of Big Data. In recent years, they have introduced master of science degrees in Big Data analytics, data science, and business analytics. Some contributing authors have been involved in developing a course curriculum in their respective institutions and countries. The number of courses on “Big Data” will increase worldwide, as Big Data is expected to underpin new waves of productivity growth, innovation, and consumer surplus, according to research by MGI and McKinsey’s Business Technology Office.

  The main features of this book can be summarized as follows:

  1. It describes the contemporary state of the art in a new field of Big Data computing.

  2. It presents the latest developments, services, and main players in this explosive field.

  3. Contributors to the book are prominent researchers from academia and practitioners from industry.

Organization

  This book comprises five sections, each of which covers one aspect of Big Data computing. Section I focuses on what Big Data is, why it is important, and how it can be used. Section II focuses on semantic technologies and Big Data. Section III focuses on Big Data processing—tools, technologies, and methods essential to analyze Big Data efficiently. Section IV deals with business and economic perspectives. Finally, Section V focuses on various stimulating Big Data applications. Below is a brief outline with more details on what each chapter is about.

Section I: Introduction

  Chapter 1 provides an approach to address the problem of “understanding” Big Data in an effective and efficient way. The idea is to make adequately grained and expressive knowledge representations and fact collections that evolve naturally, triggered by new tokens of relevant data coming along. The chapter also presents primary considerations on assessing fitness in an evolving knowledge ecosystem.

  Chapter 2 then gives an overview of the main features that can characterize architectures for solving a Big Data problem, depending on the source of data, on the type of processing required, and on the application context in which it should be operated.

  

  Preface

  Chapter 3 discusses Big Data from three different standpoints: the business, the technological, and the social. This chapter lists some relevant initiatives and selected thoughts on Big Data.

Section II: Semantic Technologies and Big Data

  Chapter 4 presents the foundations of Big Semantic Data management. The chapter sketches a route from the current data deluge, through the concept of Big Data, to the need for machine-processable semantics on the Web. Further, this chapter justifies the different management problems arising in Big Semantic Data by characterizing their main stakeholders by role and nature.

  A number of challenges arising in the context of Linked Data in Enterprise Integration are covered in Chapter 5. A key prerequisite for addressing these challenges is the establishment of efficient and effective link discovery and data integration techniques that scale to the large data scenarios found in the enterprise. This chapter also presents the transformation step of Linked Data integration, illustrated by two algorithms.

  Chapter 6 proposes steps toward the solution of the data access problem that end users usually face when dealing with Big Data. The chapter discusses the state of the art in ontology-based data access (OBDA) and explains why OBDA is the superior approach to the data access challenge posed by Big Data. It also explains why the field of OBDA is not yet sufficiently mature to deal satisfactorily with these problems, and it finally presents thoughts on scaling OBDA to a level where it can be readily deployed for Big Data.

  Chapter 7 addresses large-scale semantic interoperability problems of data in the domain of public sector administration and proposes practical solutions to these problems by using semantic technologies in the context of Web services and open data. This chapter also presents a case of the Estonian semantic interoperability framework of state information systems and related data interoperability solutions.

Section III: Big Data Processing

  Chapter 8 presents a new way of query processing for Big Data, where data exploration becomes a first-class citizen. Data exploration is desirable when new big chunks of data arrive speedily and one needs to react quickly. This chapter focuses on database systems technology.


  Chapter 9 explores the MapReduce model, a programming model used to develop massively parallel applications that process and generate large amounts of data. The chapter also discusses how MapReduce is implemented in Hadoop and provides an overview of its architecture.
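In essence, a MapReduce computation expresses a job as a map function that emits intermediate key–value pairs, a shuffle that groups those pairs by key, and a reduce function that aggregates each group. The idiom can be illustrated with a toy, in-memory Python sketch of the classic word-count job (an illustration only; the function names are ours, and this is not Hadoop’s actual API):

```python
from collections import defaultdict

def map_phase(documents):
    # Mapper: emit an intermediate (word, 1) pair for every word.
    for doc in documents:
        for word in doc.split():
            yield (word.lower(), 1)

def shuffle(pairs):
    # Shuffle: group all emitted values by key, as the framework
    # does between the map and reduce phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reducer: aggregate the list of values collected for each key.
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["big data computing", "big data analytics"]
result = reduce_phase(shuffle(map_phase(docs)))
# result == {"big": 2, "data": 2, "computing": 1, "analytics": 1}
```

In Hadoop, the same mapper and reducer run in parallel across many machines, with the framework handling partitioning, shuffling, and fault tolerance.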

  A particular class of stream-based joins, namely, a join of a single stream with a traditional relational table, is discussed in Chapter 10. Two available stream-based join algorithms are investigated in this chapter.
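The simplest instance of this kind of join can be sketched as an in-memory hash join in Python: the relational table is hashed on the join key, and each tuple arriving on the stream probes it (a toy sketch with hypothetical data, not one of the algorithms investigated in the chapter, which must also cope with tables too large for memory):

```python
# Toy stream-table equijoin: each arriving stream tuple probes a
# hash table built over the relational side. All names and data
# below are hypothetical.
customers = {1: "Alice", 2: "Bob"}  # relational table keyed on customer id

def stream_table_join(stream, table):
    # Join each (customer_id, amount) stream tuple with its table row;
    # tuples without a matching row are dropped.
    for cust_id, amount in stream:
        if cust_id in table:
            yield (cust_id, table[cust_id], amount)

transactions = [(1, 99.0), (3, 10.0), (2, 5.5)]  # the incoming stream
joined = list(stream_table_join(transactions, customers))
# joined == [(1, "Alice", 99.0), (2, "Bob", 5.5)]
```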

Section IV: Big Data and Business

  Chapter 11 examines the economic value of Big Data from both a macro- and a microeconomic perspective. The chapter illustrates how technology and new skills can nurture opportunities to derive benefits from large, constantly growing, dispersed data sets, and how semantic interoperability and new licensing strategies will contribute to the uptake of Big Data as a business enabler and a source of value creation.

  Nowadays businesses are enhancing their business intelligence practices to include predictive analytics and data mining, combining the best of strategic reporting and basic forecasting with advanced operational intelligence and decision-making functions. Chapter 12 discusses how Big Data technologies, advanced analytics, and business intelligence (BI) are interrelated. This chapter also presents various areas of advanced analytic technologies.

Section V: Big Data Applications

  The final section of the book covers application topics, starting in Chapter 13 with novel concept-level approaches to opinion mining and sentiment analysis that allow a more efficient passage from (unstructured) textual information to (structured) machine-processable data, in potentially any domain.

  Chapter 14 introduces the spChains framework, a modular approach to support mastering complex event processing (CEP) queries in an abridged, but effective, manner based on stream-processing block composition. The approach aims at unleashing the power of CEP systems for teams with limited insight into CEP systems.

  Real-time electricity metering operated at subsecond data rates in a grid with 20 million nodes originates more than 5 petabytes of data daily. Chapter 15 discusses the decision-making timeframes required in SCADA systems operating load shedding, the resulting optimization task, and the data management approach permitting a solution to the issue.

  Chapter 16 presents an innovative outlook on the scaling of geographical space using large street networks involving both cities and countryside. Given the street network of an entire country, the chapter proposes to decompose the network into individual blocks, each of which forms a minimum ring or cycle, such as city blocks and field blocks. The chapter further elaborates the power of the block perspective in reflecting the patterns of geographical space.

  Chapter 17 presents the influence of recent advances in natural language processing on business knowledge life cycles and processes of knowledge management. The chapter also sketches envisaged developments and market impacts related to the integration of semantic technology and knowledge management.

Intended Audience

  The aim of this book is to be accessible to researchers, graduate students, and application-driven practitioners who work in data science and related fields. It requires no previous exposure to large-scale data analysis or NoSQL tools, although acquaintance with traditional databases is an advantage.

  This book provides the reader with a broad range of Big Data concepts, tools, and techniques. A wide range of research in Big Data is covered, and comparisons between state-of-the-art approaches are provided. The book can thus help researchers from related fields (such as databases, data science, data mining, machine learning, knowledge engineering, information retrieval, and information systems), as well as students interested in entering this field of research, to become familiar with recent research developments and to identify open research challenges in Big Data. It can also help practitioners better understand the current state of the art in Big Data techniques, concepts, and applications.

  The technical level of this book also makes it accessible to students taking advanced undergraduate courses on Big Data or data science. Although such courses are currently rare, with the ongoing challenges that the areas of intelligent information/data management pose for organizations in both the public and private sectors, there is worldwide demand for graduates with skills and expertise in these areas. It is hoped that this book helps address this demand.

  In addition, the goal is to help policy-makers, developers and engineers, data scientists, as well as individuals, navigate the new Big Data landscape.


Acknowledgments

  The organization and the contents of this edited book have benefited from our outstanding contributors. I am very proud and happy that these researchers agreed to join this project and prepare a chapter for this book. I am also very pleased to see this materialize in the way I originally envisioned. I hope this book will be a source of inspiration to the readers. I especially wish to express my sincere gratitude to all the authors for their contribution to this project.

  I thank the anonymous reviewers who provided valuable feedback and helpful suggestions. I also thank Aastha Sharma, David Fausel, Rachel Holt, and the staff at CRC Press (Taylor & Francis Group), who supported this book project right from the start.

  Last, but not least, a very big thanks to my colleagues at Western Norway Research Institute (Vestlandsforsking, Norway) for their constant encour- agement and understanding.

  I wish all readers a fruitful time reading this book, and hope that they experience the same excitement as I did—and still do—when dealing with data.

  Rajendra Akerkar

  


  

Editor

  Rajendra Akerkar is a professor and senior researcher at Western Norway Research Institute (Vestlandsforsking), Norway, where his main domain of research is semantic technologies, with the aim of combining theoretical results with high-impact real-world solutions. He also holds visiting academic assignments in India and abroad. In 1997, he founded and chaired the Technomathematics Research Foundation (TMRF) in India.

  His research and teaching experience spans over 23 years in academia, including different universities in Asia, Europe, and North America. His research interests include ontologies, semantic technologies, knowledge systems, large-scale data mining, and intelligent systems.

  He received a DAAD fellowship in 1990 and is a recipient of the prestigious BOYSCAST Young Scientist award of the Department of Science and Technology, Government of India, in 1997. From 1998 to 2001, he was a UNESCO-TWAS associate member at the Hanoi Institute of Mathematics, Vietnam. He was also a DAAD visiting professor at Universität des Saarlandes and the University of Bonn, Germany, in 2000 and 2007, respectively.

  Dr. Akerkar serves as editor-in-chief of the International Journal of Computer Science & Applications (IJCSA) and as an associate editor of the International Journal of Metadata, Semantics, and Ontologies (IJMSO). He is a co-organizer of several workshops and program chair of the international conferences ISACA, ISAI, ICAAI, and WIMS. He has co-authored 13 books and approximately 100 research papers, co-edited 2 e-books, and edited 5 volumes of international conferences. He has also been actively involved in several international ICT initiatives and research and development projects for more than 16 years.

  


Contributors

   Rajendra Akerkar

  Dipankar Das

  Michael Cochez

  Faculty of Information Technology

  University of Jyväskylä Jyväskylä, Finland

  Fulvio Corno

  Department of Control and Computer Engineering

  Polytechnic University of Turin Turin, Italy

  Department of Computer Science

  Advanced Computing and Electromagnetic Unit

  National University of Singapore Singapore

  Luigi De Russis

  Department of Control and Computer Engineering

  Polytechnic University of Turin Turin, Italy

  Mariano di Claudio

  Department of Systems and Informatics University of Florence Firenze, Italy

  Gillian Dobbie

  Istituto Superiore Mario Boella Torino, Italy

  Giuseppe Caragnano

  Western Norway Research Institute Sogndal, Norway

  Distributed Systems and Internet Technology

  Mario Arias

  Digital Enterprise Research Institute National University of Ireland Galway, Ireland

  Sören Auer

  Enterprise Information Systems Department

  Institute of Computer Science III Rheinische Friedrich-Wilhelms-

  Universität Bonn Bonn, Germany

  Pierfrancesco Bellini

  Department of Systems and Informatics

  Department of Computer Science National University of Singapore Singapore

  University of Florence Firenze, Italy

  Dario Bonino

  Department of Control and Computer Engineering

  Polytechnic University of Turin Turin, Italy

  Diego Calvanese

  Department of Computer Science Free University of Bozen-Bolzano Bolzano, Italy

  Erik Cambria

  Department of Computer Science The University of Auckland Auckland, New Zealand

  Vadim Ermolayev

  Department of Computer Science

  Bin Jiang

  Department of Technology and Built Environment University of Gävle Gävle, Sweden

  Monika Jungemann-Dorner

  Senior International Project Manager

  Verband der Verein Creditreform eV

  Neuss, Germany

  Jakub Klimek

  University of Leipzig Leipzig, Germany

  Department of Computer Science National and Kapodistrian

  Herald Kllapi

  Department of Computer Science National and Kapodistrian

  University of Athens Athens, Greece

  Manolis Koubarakis

  Department of Computer Science

  National and Kapodistrian University of Athens

  Athens, Greece

  Peep Küngas

  University of Athens Athens, Greece

  Yannis Ioannidis

  Zaporozhye National University Zaporozhye, Ukraine

  Department of Computer Science

  Javier D. Fernández

  Department of Computer Science University of Valladolid Valladolid, Spain

  Philipp Frischmuth

  Department of Computer Science University of Leipzig Leipzig, Germany

  Martin Giese

  Department of Computer Science University of Oslo Oslo, Norway

  Claudio Gutiérrez

  University of Chile Santiago, Chile

  Dutch National Research Center for Mathematics and Computer Science (CWI)

  Peter Haase

  Fluid Operations AG Walldorf, Germany

  Hele-Mai Haav

  Institute of Cybernetics Tallinn University of

  Technology Tallinn, Estonia

  Ian Horrocks

  Department of Computer Science Oxford University Oxford, United Kingdom

  Stratos Idreos

  Institute of Computer Science University of Tartu

  Maurizio Lenzerini

  Technical University of Catalonia (UPC)

  Department of Computer Science National University of Singapore Singapore

  Özgür Özçep

  Department of Computer Science TU Hamburg-Harburg Hamburg, Germany

  Tassilo Pellegrin

  Semantic Web Company Vienna, Austria

  Jordà Polo

  Barcelona Supercomputing Center (BSC)

  Barcelona, Spain

  Department of Computer Science University of Leipzig Leipzig, Germany

  Dheeraj Rajagopal

  Department of Computer Science National University of Singapore Singapore

  Nadia Rauch

  Department of Systems and Informatics University of Florence Firenze, Italy

  Riccardo Rosati

  Department of Computer Science Sapienza University of Rome Rome, Italy

  Pietro Ruiu

  Daniel Olsher

  Axel-Cyrille Ngonga Ngomo

  Department of Computer Science

  Department of Computer Science

  Sapienza University of Rome Rome, Italy

  Xintao Liu

  Department of Technology and Built Environment University of Gävle Gävle, Sweden

  Miguel A. Martínez-Prieto

  Department of Computer Science

  University of Valladolid Valladolid, Spain

  Ralf Möller

  TU Hamburg-Harburg Hamburg, Germany

  Technology University of Florence

  Lorenzo Mossucca

  Istituto Superiore Mario Boella Torino, Italy

  Mariano Rodriguez Muro

  Department of Computer Science Free University of Bozen-Bolzano Bolzano, Italy

  M. Asif Naeem

  Department of Computer Science The University of Auckland Auckland, New Zealand

  Paolo Nesi

  Department of Systems and Informatics Distributed Systems and Internet

  Istituto Superiore Mario Boella

  Rudolf Schlatte

  Vagan Terziyan

  Roberto V. Zicari

  Department of Computer Science The University of Auckland Auckland, New Zealand

  Gerald Weber

  Department of Computer Science University of Oslo Oslo, Norway

  Arild Waaler

  Istituto Superiore Mario Boella Torino, Italy

  Advanced Computing and Electromagnetics Unit

  Olivier Terzo

  University of Jyväskylä Jyväskylä, Finland

  Department of Mathematical Information Technology

  Ludwig-Maximilians University Munich, Germany

  Department of Computer Science

  Marcus Spies

  University of Oslo Oslo, Norway

  Department of Computer Science

  Ahmet Soylu

  Istituto Superiore Mario Boella Torino, Italy

  Advanced Computing and Electromagnetics Unit

  Mikhail Simonov

  Fluid Operations AG Walldorf, Germany

  Michael Schmidt

  University of Oslo Oslo, Norway

  Department of Computer Science Goethe University Frankfurt, Germany

  

  


1

Toward Evolving Knowledge Ecosystems for Big Data Understanding

   Vadim Ermolayev, Rajendra Akerkar, Vagan Terziyan, and Michael Cochez

CONTENTS

Introduction .............................................................................................................4
Motivation and Unsolved Issues ..........................................................................6
    Illustrative Example ...........................................................................................7
    Demand in Industry ...........................................................................................9
    Problems in Industry .........................................................................................9
    Major Issues ...................................................................................................... 11
State of Technology, Research, and Development in Big Data Computing .. 12
    Big Data Processing—Technology Stack and Dimensions ......................... 13
    Big Data in European Research ...................................................................... 14
    Complications and Overheads in Understanding Big Data ......................20
    Refining Big Data Semantics Layer for Balancing Efficiency and
    Effectiveness ......................................................................................................23
        Focusing ........................................................................................................25
        Filtering .........................................................................................................26
        Forgetting ......................................................................................................27
        Contextualizing ............................................................................................27
        Compressing ................................................................................................29
        Connecting ....................................................................................................29
    Autonomic Big Data Computing ...................................................................30
Scaling with a Traditional Database ................................................................... 32
    Large Scale Data Processing Workflows .......................................................33
Knowledge Self-Management and Refinement through Evolution ..............34
    Knowledge Organisms, their Environments, and Features .......................36
        Environment, Perception (Nutrition), and Mutagens ............................ 37
        Knowledge Genome and Knowledge Body ............................................ 39
        Morphogenesis............................................................................................. 41
        Mutation .......................................................................................................42
        Recombination and Reproduction ............................................................44
    Populations of Knowledge Organisms .........................................................45
    Fitness of Knowledge Organisms and Related Ontologies ........................46
Some Conclusions .................................................................................................48
Acknowledgments ................................................................................................50
References ...............................................................................................................50

Introduction

  Big Data is a phenomenon that leaves hardly any information professional indifferent these days. Remarkably, application demands and developments in related disciplines have resulted in technologies that boosted data generation and storage at unprecedented scales, in terms of both volumes and rates. To mention just a few facts reported by Manyika et al. (2011): a disk drive capable of storing all the world’s music could be purchased for about US $600, and 30 billion pieces of content are shared monthly on Facebook alone. Exponential growth of data volumes is accelerated by a dramatic increase in social networking applications that allow nonspecialist users to create a huge amount of content easily and freely. Equipped with rapidly evolving mobile devices, a user becomes a nomadic gateway boosting the generation of additional real-time sensor data. The emerging Internet of Things makes every thing a source of data or content, adding billions of artificial and autonomic sources of data to the overall picture. Smart spaces, where people, devices, and their infrastructure are all loosely connected, also generate data of unprecedented volumes and with velocities rarely observed before. The expectation is that valuable information will be extracted out of all these data to help improve the quality of life and make our world a better place.

  Society is, however, left bewildered about how to use all these data efficiently and effectively. For example, a topical estimate of the need for data-savvy managers, required to take full advantage of Big Data in the United States alone, is 1.5 million (Manyika et al. 2011). A major challenge is finding a balance between the two evident facets of the whole Big Data adventure: (a) the more data we have, the more potentially useful patterns it may include, and (b) the more data we have, the less hope there is that any machine-learning algorithm is capable of discovering these patterns in an acceptable time frame. Perhaps because of this intrinsic conflict, many experts consider that Big Data brings not only one of the biggest challenges, but also one of the most exciting opportunities of the past 10 years (cf. Fan et al. 2012b).

  The avalanche of Big Data causes a conceptual divide in minds and opin- ions. Enthusiasts claim that, faced with massive data, a scientific approach “. . . hypothesize, model, test—is becoming obsolete. . . . Petabytes allow us to say: ‘Correlation is enough.’ We can stop looking for models. We can analyze the data without hypotheses about what it might show. We can throw the numbers into the biggest computing clusters the world has ever seen and let statistical

  Toward Evolving Knowledge Ecosystems for Big Data Understanding

  Skeptics counter that Big Data provides “. . . destabilising amounts of knowledge and information that lack the regulating force of philosophy” (Berry 2011). Indeed, being abnormally big does not yet mean being healthy and wealthy, and such data should be treated appropriately (Figure 1.1): with a diet, exercise, medication, or even surgery (philosophy). Those data sets for which systematic health treatment is ignored in favor of correlations will die sooner—as useless. There is a hope, however, that holistic integration of evolving algorithms, machines, and people, reinforced by research effort across many domains, will guarantee the required fitness of Big Data, assuring proper quality at the right time (Joseph 2012).

  Mined correlations, though very useful, may hint at an answer to a “what” question, but not to a “why” question. For example, if Big Data about Royal guards and their habits had been collected in the France of the 1700s, one could mine today that all musketeers who used to have red Burgundy regularly for dinner have not survived till now. A pity: red Burgundy was only one of many problems, and a very minor one. A scientific approach is needed to infer the real reasons—the work currently done predominantly by human analysts.

  Effectiveness and efficiency are the evident keys in Big Data analysis. Cradling the gems of knowledge extracted out of Big Data would only be effective if: (i) not a single important fact is left buried in the mass of data—which means completeness; and (ii) these facts are faceted adequately for further inference—which means expressiveness and granularity. Efficiency may be interpreted as the ratio of the utility of the result to the effort spent. In Big Data analytics, it could be straightforwardly mapped to timeliness. If a result is not timely, its utility (Ermolayev et al. 2004) may go down to zero, or even far below, within seconds to milliseconds for some important industrial applications such as technological process or air traffic control.
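To make the timeliness reading of efficiency concrete, the relation between delay, utility, and effort can be sketched in a few lines of Python. The exponential decay, the deadline, and the small negative utility assigned to stale results are all illustrative assumptions, not a model proposed in this chapter.

```python
import math

def utility(base_value, delay_s, deadline_s):
    """Time-discounted utility of an analysis result (illustrative).

    The value decays smoothly as the result is delayed; once the
    deadline has passed (think process control or air traffic
    control), the stale result may even carry a small negative
    utility, since acting on it does harm.
    """
    if delay_s <= deadline_s:
        return base_value * math.exp(-delay_s / deadline_s)
    return -0.1 * base_value  # too late: worse than no result at all

def efficiency(effort, base_value, delay_s, deadline_s):
    """Efficiency read as utility obtained per unit of effort spent."""
    return utility(base_value, delay_s, deadline_s) / effort
```

Under these assumptions, spending more effort on a deeper analysis only pays off if it does not push the delay past the point where the decayed utility cancels the gain, which is exactly the effectiveness/efficiency tension discussed next.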

  Notably, increasing effectiveness means increasing the effort or making the analysis computationally more complex, which negatively affects efficiency.

Figure 1.1 Evolution of data collections—dimensions (see also Figure 1.3) have to be treated with care.

  Big Data Computing

  Finding a balanced solution with a sufficient degree of automation is the challenge that is not yet fully addressed by the research community.

  One derivative problem concerns the knowledge extracted out of Big Data as the result of some analytical processing. In many cases, it may be expected that knowledge mechanistically extracted out of Big Data will also be big. Therefore, taking care of Big Knowledge (which has more value than the source data) is at least of the same importance as resolving the challenges associated with Big Data processing. Uplifting the problem to the level of knowledge is inevitable and brings additional complications, such as resolving the contradictory and changing opinions of everyone on everything. Here, an adequate approach to managing the authority and reputation of “experts” will play an important role (Weinberger 2012).

  This chapter offers a possible approach to addressing the problem of “understanding” Big Data in an effective and efficient way. The idea is to make adequately grained and expressive knowledge representations and fact collections evolve naturally, triggered by new tokens of relevant data coming along. Pursuing this way would also imply conceptual changes in the Big Data processing stack: a refined semantic layer has to be added to it, providing adequate interfaces to interlink the horizontal layers and enabling knowledge-related functionality coordinated in top-down and bottom-up directions.

  The remainder of the chapter is structured as follows. The “Motivation and Unsolved Issues” section offers an illustrative example and the analysis of the demand for understanding Big Data. The “State of Technology, Research, and Development in Big Data Computing” section reviews the relevant research on using semantic and related technologies for Big Data processing and outlines our approach to refine the processing stack. The “Scaling with a Traditional Database” section focuses on how the basic data storage and management layer could be refined in terms of scalability, which is necessary for improving efficiency/effectiveness. The “Knowledge Self-Management and Refinement through Evolution” section presents our approach, inspired by the mechanisms of natural evolution studied in evolutionary biology. We focus on a means of arranging the evolution of knowledge, using knowledge organisms, their species, and populations with the aim of balancing efficiency and effectiveness of processing Big Data and its semantics. We also provide our preliminary considerations on assessing fitness in an evolving knowledge ecosystem. Our conclusions are drawn in the “Some Conclusions” section.

  Motivation and Unsolved Issues

  Practitioners, including systems engineers and Information Technology architects, increasingly invoke


the phenomenon of Big Data in their dialog over means of improving sense-making. The phenomenon remains a constructive way of introducing others, including nontechnologists, to new approaches such as the Apache Hadoop framework. Apparently, Big Data is collected to be analyzed. “Fundamentally, big data analytics is a workflow that distills terabytes of low-value data down to, in some cases, a single bit of high-value data. . . . The goal is to see the big picture from the minutia of our digital lives” (cf. Fisher et al. 2012). Evidently, “seeing the big picture” in its entirety is the key, and this requires making Big Data healthy and understandable, in terms of effectiveness and efficiency, for analytics.
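The “terabytes of low-value data down to a single bit” workflow quoted above can be caricatured in a few lines of Python. The sensor-reading scenario and the threshold are invented for illustration; a production pipeline would shard this filter-and-reduce step across a cluster, for example, with Hadoop.

```python
def distill(readings, threshold=0.95):
    """Reduce a stream of low-value readings to one high-value bit:
    did any reading breach the threshold?  The input is consumed
    lazily, so the raw data never has to fit in memory at once.
    """
    return any(r > threshold for r in readings)

# Millions of raw values in, a single actionable bit out.
alarm = distill(x / 1_000_000 for x in range(1_000_000))
```

The point of the caricature is the asymmetry: the cost lies almost entirely in scanning the low-value input, while the high-value output is one bit.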

  In this section, the motivation for understanding Big Data in a way that improves the performance of analytics is presented and analyzed. It begins by presenting a simple example, which is used throughout the rest of the chapter. It continues with an analysis of the industrial demand for Big Data analytics. In this context, the major problems as perceived by industries are analyzed and informally mapped to unsolved technological issues.

  Illustrative Example

  Imagine a stock market analytics workflow inferring trends in share price changes. One possible way of doing this is to extrapolate from stock price data. A more robust approach, however, could be to extract these trends from market news. Hence, the incoming data for analysis would very likely be several streams of news feeds, resulting in a vast number of tokens per day. An illustrative example of such a news token is:

  Posted: Tue, 03 Jul 2012 05:01:10-04:00 LONDON (Reuters) U.S. planemaker Boeing hiked its 20-year market forecast, predicting demand for 34,000 new aircraft worth $4.5 trillion, on growth in emerging regions and as airlines seek efficient new planes to counter high fuel costs.*

  Provided that an adequate technology is available, one may extract the knowledge pictured as thick-bounded and gray-shaded elements in Figure 1.2.

  This portion of extracted knowledge is quite shallow, as it simply interprets the source text in a structured and logical way.* Unfortunately, it does not answer several important questions for revealing the motives for Boeing to hike its market forecast:

  Q1. What is an efficient new plane? How is efficiency related to the high fuel costs that are to be countered?
  Q2. Which airlines seek efficient new planes? What are the emerging regions? How could their growth be assessed?
  Q3. How are plane makers, airlines, and efficient new planes related to each other?

Figure 1.2 Semantics associated with a news data token.

  * (accessed July 5, 2012).
  * The technologies for this are under intensive development currently, for example, wit.istc.
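For readers who prefer code to class diagrams, the individual assertions behind Figure 1.2 can be sketched as plain subject-predicate-object triples. The entity and property names below follow the figure; the triple list and the query helper are illustrative, not part of the chapter’s proposal.

```python
# Individual assertions extracted from the news token (after Figure 1.2).
facts = [
    ("Boeing", "isA", "PlaneMaker"),
    ("Boeing", "basedIn", "UnitedStates"),
    ("AllNipponAirways", "isA", "Airline"),
    ("AllNipponAirways", "basedIn", "Japan"),
    ("B787-JA812A", "isA", "EfficientNewPlane"),
    ("B787-JA812A", "builtBy", "Boeing"),
    ("AllNipponAirways", "owns", "B787-JA812A"),
    ("New20YMarketForecastbyBoeing", "isA", "MarketForecast"),
    ("New20YMarketForecastbyBoeing", "hikedBy", "Boeing"),
    ("New20YMarketForecastbyBoeing", "successorOf",
     "Old20YMarketForecastbyBoeing"),
]

def objects_of(subject, predicate, triples=facts):
    """All objects o such that (subject, predicate, o) is asserted."""
    return [o for s, p, o in triples if s == subject and p == predicate]
```

Such shallow triples capture the “what” of the token; as questions Q1–Q3 show, answering “why” requires background knowledge that is not in the text itself.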