
  

Big Data

Computing

  


Edited by

  

Rajendra Akerkar

Western Norway Research Institute

Sogndal, Norway

  © 2014 by Taylor & Francis Group, LLC
CRC Press, Taylor & Francis Group, 6000 Broken Sound Parkway NW, Suite 300, Boca Raton, FL 33487-2742
International Standard Book Number-13: 978-1-4665-7838-8 (eBook - PDF)
Version Date: 20131028
CRC Press is an imprint of Taylor & Francis Group, an Informa business
No claim to original U.S. Government works

This book contains information obtained from authentic and highly regarded sources. Reasonable efforts have been made to publish reliable data and information, but the author and publisher cannot assume responsibility for the validity of all materials or the consequences of their use. The authors and publishers have attempted to trace the copyright holders of all material reproduced in this publication and apologize to copyright holders if permission to publish in this form has not been obtained. If any copyright material has not been acknowledged please write and let us know so we may rectify in any future reprint.

No part of this book may be reprinted, reproduced, transmitted, or utilized in any form by any electronic, mechanical, or other means, now known or hereafter invented, including photocopying, microfilming, and recording, or in any information storage or retrieval system, without written permission from the publishers.

Product or corporate names may be trademarks or registered trademarks, and are used only for identification and explanation without intent to infringe.

Visit the Taylor & Francis Web site at http://www.taylorandfrancis.com and the CRC Press Web site at http://www.crcpress.com

  

To

All the visionary minds who have helped create a modern data science profession

  


  Contents

Index ..................................................................................................................... 539

Preface

  In the international marketplace, businesses, suppliers, and customers create and consume vast amounts of information. Gartner predicts that enterprise data in all forms will grow up to 650% over the next five years.* According to IDC, the world’s volume of data doubles every 18 months. Digital information is doubling every 1.5 years and will exceed 1000 exabytes next year, according to the MIT Centre for Digital Research. In 2011, medical centers held almost 1 billion terabytes of data—almost 2000 billion file cabinets’ worth of information. This deluge of data, often referred to as Big Data, obviously creates a challenge to the business community and data scientists.

  The term Big Data refers to data sets whose size is beyond the capabilities of current database technology. It is an emerging field in which innovative technology offers alternatives for resolving the inherent problems that appear when working with massive data, offering new ways to reuse and extract value from information.

  Businesses and government agencies aggregate data from numerous private and/or public data sources. Private data is information that an organization exclusively stores and that is available only to that organization, such as employee data, customer data, and machine data (e.g., user transactions and customer behavior). Public data is information that is available to the public for a fee or at no charge, such as credit ratings and social media content (e.g., LinkedIn, Facebook, and Twitter). Big Data has now reached every sector in the world economy. It is transforming competitive opportunities in every industry sector, including banking, healthcare, insurance, manufacturing, retail, wholesale, transportation, communications, construction, education, and utilities. It also plays key roles in trade operations such as marketing, operations, supply chain, and new business models. It is becoming rather evident that enterprises that fail to use their data efficiently are at a large competitive disadvantage compared with those that can analyze and act on their data. The possibilities of Big Data continue to evolve swiftly, driven by innovation in the underlying technologies, platforms, and analytical capabilities for handling data, as well as the evolution of behavior among its users as humans increasingly live digital lives.

  It is interesting to note that Big Data differs from conventional data models (e.g., relational databases and data models, or conventional governance models). Thus, it is triggering organizations’ concern as they try to separate information nuggets from the data heap. The conventional models of structured, engineered data do not adequately reveal the realities of Big Data. The key to leveraging Big Data is to recognize these differences before expediting its use. The most noteworthy difference is that traditional data are typically governed in a centralized manner, whereas Big Data is self-governing. Big Data is created either by a rapidly expanding universe of machines or by users of highly varying expertise. As a result, the composition of traditional data will naturally vary considerably from that of Big Data. Traditional data serve a specific purpose and must be more durable and structured, whereas Big Data will cover many topics, but not all topics will yield useful information for the business; thus they will be sparse in relevancy and structure.

* http://www.gartner.com/it/content/1258400/1258425/january_6_techtrends_rpaquet.pdf

  The technology required for Big Data computing is developing at a satisfactory rate due to market forces and technological evolution. The ever-growing, enormous amount of data, along with advanced tools of exploratory data analysis, data mining/machine learning, and data visualization, offers a whole new way of understanding the world.

  Another interesting fact about Big Data is that not everything considered “Big Data” is in fact Big Data. One needs to delve into the scientific aspects, such as analyzing, processing, and storing huge volumes of data; that is the only way to use the tools effectively. Data developers/scientists need to know about analytical processes, statistics, and machine learning. They also need to know how to program algorithms against specific data. The core is the analytical side, but they also need the scientific background and in-depth technical knowledge of the tools they work with in order to gain control of huge volumes of data. No single tool offers all of this per se.

  As a result, the main challenge for Big Data computing is to find a novel solution that remains applicable for a long period of time, keeping in mind the fact that data sizes are always growing. This means that the key condition a solution has to satisfy is scalability. Scalability is the ability of a system to accept increased input volume without affecting its gains; that is, the gains from an input increment should be proportional to the increment itself. For a system to be totally scalable, the size of its input should not be a design parameter. Pushing the system designer to consider all possible deployment sizes to cope with different input sizes leads to a scalable architecture without primary bottlenecks. Yet, apart from scalability, there are other requisites for a Big Data–intensive computing system.

  Although Big Data is an emerging field in data science, very few books are available in the market. This book provides authoritative insights and highlights valuable lessons learned by the authors through their experience.

  Some universities in North America and Europe are doing their part to feed the need for analytics skills in this era of Big Data. In recent years, they have introduced master of science degrees in Big Data analytics, data science, and business analytics. Some contributing authors have been involved in developing a course curriculum in their respective institutions and countries. The number of courses on “Big Data” will increase worldwide, as Big Data is expected to underpin new waves of productivity growth, innovation, and consumer surplus, according to research by MGI and McKinsey’s Business Technology Office.

  The main features of this book can be summarized as follows:

  1. It describes the contemporary state of the art in a new field of Big Data computing.

  2. It presents the latest developments, services, and main players in this explosive field.

  3. Contributors to the book are prominent researchers from academia and practitioners from industry.

Organization

  This book comprises five sections, each of which covers one aspect of Big Data computing. Section I focuses on what Big Data is, why it is important, and how it can be used. Section II focuses on semantic technologies and Big Data. Section III focuses on Big Data processing—tools, technologies, and methods essential to analyze Big Data efficiently. Section IV deals with business and economic perspectives. Finally, Section V focuses on various stimulating Big Data applications. Below is a brief outline with more details on what each chapter is about.

Section I: Introduction

  Chapter 1 provides an approach to address the problem of “understanding” Big Data in an effective and efficient way. The idea is to make adequately grained and expressive knowledge representations and fact collections that evolve naturally, triggered by new tokens of relevant data coming along. The chapter also presents primary considerations on assessing fitness in an evolving knowledge ecosystem.

  Chapter 2 then gives an overview of the main features that can characterize architectures for solving a Big Data problem, depending on the source of data, on the type of processing required, and on the application context in which it should be operated.

  

  Preface

  Chapter 3 discusses Big Data from three different standpoints: the business, the technological, and the social. This chapter lists some relevant initiatives and selected thoughts on Big Data.

Section II: Semantic Technologies and Big Data

  Chapter 4 presents the foundations of Big Semantic Data management. The chapter sketches a route from the current data deluge, through the concept of Big Data, to the need for machine-processable semantics on the Web. Further, this chapter justifies the different management problems arising in Big Semantic Data by characterizing their main stakeholders by role and nature.

  A number of challenges arising in the context of Linked Data in Enterprise Integration are covered in Chapter 5. A key prerequisite for addressing these challenges is the establishment of efficient and effective link discovery and data integration techniques that scale to the large data scenarios found in the enterprise. This chapter also presents the transformation step of Linked Data integration, illustrated by two algorithms.

  Chapter 6 proposes steps toward the solution of the data access problem that end users usually face when dealing with Big Data. The chapter discusses the state of the art in ontology-based data access (OBDA) and explains why OBDA is the superior approach to the data access challenge posed by Big Data. It also explains why the field of OBDA is not yet sufficiently mature to deal satisfactorily with these problems, and it finally presents thoughts on scaling OBDA to a level where it can be readily deployed for Big Data.

  Chapter 7 addresses large-scale semantic interoperability problems of data in the domain of public sector administration and proposes practical solutions to these problems by using semantic technologies in the context of Web services and open data. This chapter also presents a case of the Estonian semantic interoperability framework of state information systems and related data interoperability solutions.

Section III: Big Data Processing

  Chapter 8 presents a new way of query processing for Big Data, where data exploration becomes a first-class citizen. Data exploration is desirable when new big chunks of data arrive speedily and one needs to react quickly. This chapter focuses on database systems technology.


  Chapter 9 explores the MapReduce model, a programming model used to develop massively parallel applications that process and generate large amounts of data. The chapter also discusses how MapReduce is implemented in Hadoop and provides an overview of its architecture.
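In essence, a MapReduce computation expresses a job as a map function that emits intermediate key–value pairs, a shuffle that groups those pairs by key, and a reduce function that aggregates each group. The idiom can be illustrated with a toy, in-memory Python sketch of the classic word-count job (an illustration only; the function names are ours, and this is not Hadoop’s actual API):

```python
from collections import defaultdict

def map_phase(documents):
    # Mapper: emit an intermediate (word, 1) pair for every word.
    for doc in documents:
        for word in doc.split():
            yield (word.lower(), 1)

def shuffle(pairs):
    # Shuffle: group all emitted values by key, as the framework
    # does between the map and reduce phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reducer: aggregate the list of values collected for each key.
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["big data computing", "big data analytics"]
result = reduce_phase(shuffle(map_phase(docs)))
# result == {"big": 2, "data": 2, "computing": 1, "analytics": 1}
```

In Hadoop, the same mapper and reducer run in parallel across many machines, with the framework handling partitioning, shuffling, and fault tolerance.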

  A particular class of stream-based joins, namely, a join of a single stream with a traditional relational table, is discussed in Chapter 10. Two available stream-based join algorithms are investigated in this chapter.
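The simplest instance of this kind of join can be sketched as an in-memory hash join in Python: the relational table is hashed on the join key, and each tuple arriving on the stream probes it (a toy sketch with hypothetical data, not one of the algorithms investigated in the chapter, which must also cope with tables too large for memory):

```python
# Toy stream-table equijoin: each arriving stream tuple probes a
# hash table built over the relational side. All names and data
# below are hypothetical.
customers = {1: "Alice", 2: "Bob"}  # relational table keyed on customer id

def stream_table_join(stream, table):
    # Join each (customer_id, amount) stream tuple with its table row;
    # tuples without a matching row are dropped.
    for cust_id, amount in stream:
        if cust_id in table:
            yield (cust_id, table[cust_id], amount)

transactions = [(1, 99.0), (3, 10.0), (2, 5.5)]  # the incoming stream
joined = list(stream_table_join(transactions, customers))
# joined == [(1, "Alice", 99.0), (2, "Bob", 5.5)]
```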

Section IV: Big Data and Business

  Chapter 11 examines the economic value of Big Data from both a macro- and a microeconomic perspective. The chapter illustrates how technology and new skills can nurture opportunities to derive benefits from large, constantly growing, dispersed data sets, and how semantic interoperability and new licensing strategies will contribute to the uptake of Big Data as a business enabler and a source of value creation.

  Nowadays businesses are enhancing their business intelligence practices to include predictive analytics and data mining, combining the best of strategic reporting and basic forecasting with advanced operational intelligence and decision-making functions. Chapter 12 discusses how Big Data technologies, advanced analytics, and business intelligence (BI) are interrelated. This chapter also presents various areas of advanced analytic technologies.

Section V: Big Data Applications

  The final section of the book covers application topics, starting in Chapter 13 with novel concept-level approaches to opinion mining and sentiment analysis that allow a more efficient passage from (unstructured) textual information to (structured) machine-processable data, in potentially any domain.

  Chapter 14 introduces the spChains framework, a modular approach to support mastering complex event processing (CEP) queries in an abridged, but effective, manner based on stream-processing block composition. The approach aims at unleashing the power of CEP systems for teams with limited insight into CEP systems.

  Real-time electricity metering operated at subsecond data rates in a grid with 20 million nodes originates more than 5 petabytes of data daily. Chapter 15 discusses the decision-making timeframes required in SCADA systems operating load shedding, the resulting optimization task, and the data management approach permitting a solution to the issue.

  Chapter 16 presents an innovative outlook on the scaling of geographical space using large street networks involving both cities and countryside. Given the street network of an entire country, the chapter proposes to decompose the network into individual blocks, each of which forms a minimum ring or cycle, such as city blocks and field blocks. The chapter further elaborates the power of the block perspective in reflecting the patterns of geographical space.

  Chapter 17 presents the influence of recent advances in natural language processing on business knowledge life cycles and processes of knowledge management. The chapter also sketches envisaged developments and market impacts related to the integration of semantic technology and knowledge management.

Intended Audience

  The aim of this book is to be accessible to researchers, graduate students, and application-driven practitioners who work in data science and related fields. It requires no previous exposure to large-scale data analysis or NoSQL tools, although acquaintance with traditional databases is an advantage.

  This book provides the reader with a broad range of Big Data concepts, tools, and techniques. A wide range of research in Big Data is covered, and comparisons between state-of-the-art approaches are provided. The book can thus help researchers from related fields (such as databases, data science, data mining, machine learning, knowledge engineering, information retrieval, and information systems), as well as students interested in entering this field of research, to become familiar with recent research developments and to identify open research challenges in Big Data. It can also help practitioners better understand the current state of the art in Big Data techniques, concepts, and applications.

  The technical level of this book also makes it accessible to students taking advanced undergraduate courses on Big Data or data science. Although such courses are currently rare, with the ongoing challenges that the areas of intelligent information/data management pose for organizations in both the public and private sectors, there is worldwide demand for graduates with skills and expertise in these areas. It is hoped that this book helps address this demand.

  In addition, the goal is to help policy-makers, developers and engineers, data scientists, as well as individuals, navigate the new Big Data landscape.


Acknowledgments

  The organization and the contents of this edited book have benefited from our outstanding contributors. I am very proud and happy that these researchers agreed to join this project and prepare a chapter for this book. I am also very pleased to see this materialize in the way I originally envisioned. I hope this book will be a source of inspiration to the readers. I especially wish to express my sincere gratitude to all the authors for their contribution to this project.

  I thank the anonymous reviewers who provided valuable feedback and helpful suggestions. I also thank Aastha Sharma, David Fausel, Rachel Holt, and the staff at CRC Press (Taylor & Francis Group), who supported this book project right from the start.

  Last, but not least, a very big thanks to my colleagues at Western Norway Research Institute (Vestlandsforsking, Norway) for their constant encour- agement and understanding.

  I wish all readers a fruitful time reading this book, and hope that they experience the same excitement as I did—and still do—when dealing with data.

  Rajendra Akerkar

  


  

Editor

  Rajendra Akerkar is a professor and senior researcher at Western Norway Research Institute (Vestlandsforsking), Norway, where his main domain of research is semantic technologies, with the aim of combining theoretical results with high-impact real-world solutions. He also holds visiting academic assignments in India and abroad. In 1997, he founded and chaired the Technomathematics Research Foundation (TMRF) in India.

  His research and teaching experience spans over 23 years in academia, including different universities in Asia, Europe, and North America. His research interests include ontologies, semantic technologies, knowledge systems, large-scale data mining, and intelligent systems.

  He received a DAAD fellowship in 1990 and is a recipient of the prestigious BOYSCAST Young Scientist award of the Department of Science and Technology, Government of India, in 1997. From 1998 to 2001, he was a UNESCO-TWAS associate member at the Hanoi Institute of Mathematics, Vietnam. He was also a DAAD visiting professor at Universität des Saarlandes and the University of Bonn, Germany, in 2000 and 2007, respectively.

  Dr. Akerkar serves as editor-in-chief of the International Journal of Computer Science & Applications (IJCSA) and as an associate editor of the International Journal of Metadata, Semantics, and Ontologies (IJMSO). He is a co-organizer of several workshops and program chair of the international conferences ISACA, ISAI, ICAAI, and WIMS. He has co-authored 13 books and approximately 100 research papers, co-edited 2 e-books, and edited 5 volumes of international conferences. He has also been actively involved in several international ICT initiatives and research and development projects for more than 16 years.

  


Contributors

   Rajendra Akerkar

  Dipankar Das

  Michael Cochez

  Faculty of Information Technology

  University of Jyväskylä Jyväskylä, Finland

  Fulvio Corno

  Department of Control and Computer Engineering

  Polytechnic University of Turin Turin, Italy

  Department of Computer Science

  Advanced Computing and Electromagnetic Unit

  National University of Singapore Singapore

  Luigi De Russis

  Department of Control and Computer Engineering

  Polytechnic University of Turin Turin, Italy

  Mariano di Claudio

  Department of Systems and Informatics University of Florence Firenze, Italy

  Gillian Dobbie

  Istituto Superiore Mario Boella Torino, Italy

  Giuseppe Caragnano

  Western Norway Research Institute Sogndal, Norway

  Distributed Systems and Internet Technology

  Mario Arias

  Digital Enterprise Research Institute National University of Ireland Galway, Ireland

  Sören Auer

  Enterprise Information Systems Department

  Institute of Computer Science III Rheinische Friedrich-Wilhelms-

  Universität Bonn Bonn, Germany

  Pierfrancesco Bellini

  Department of Systems and Informatics

  Department of Computer Science National University of Singapore Singapore

  University of Florence Firenze, Italy

  Dario Bonino

  Department of Control and Computer Engineering

  Polytechnic University of Turin Turin, Italy

  Diego Calvanese

  Department of Computer Science Free University of Bozen-Bolzano Bolzano, Italy

  Erik Cambria

  Department of Computer Science The University of Auckland Auckland, New Zealand

  Vadim Ermolayev

  Department of Computer Science

  Bin Jiang

  Department of Technology and Built Environment University of Gävle Gävle, Sweden

  Monika Jungemann-Dorner

  Senior International Project Manager

  Verband der Verein Creditreform eV

  Neuss, Germany

  Jakub Klimek

  University of Leipzig Leipzig, Germany

  Department of Computer Science National and Kapodistrian

  Herald Kllapi

  Department of Computer Science National and Kapodistrian

  University of Athens Athens, Greece

  Manolis Koubarakis

  Department of Computer Science

  National and Kapodistrian University of Athens

  Athens, Greece

  Peep Küngas

  University of Athens Athens, Greece

  Yannis Ioannidis

  Zaporozhye National University Zaporozhye, Ukraine

  Department of Computer Science

  Javier D. Fernández

  Department of Computer Science University of Valladolid Valladolid, Spain

  Philipp Frischmuth

  Department of Computer Science University of Leipzig Leipzig, Germany

  Martin Giese

  Department of Computer Science University of Oslo Oslo, Norway

  Claudio Gutiérrez

  University of Chile Santiago, Chile

  Dutch National Research Center for Mathematics and Computer Science (CWI)

  Peter Haase

  Fluid Operations AG Walldorf, Germany

  Hele-Mai Haav

  Institute of Cybernetics Tallinn University of

  Technology Tallinn, Estonia

  Ian Horrocks

  Department of Computer Science Oxford University Oxford, United Kingdom

  Stratos Idreos

  Institute of Computer Science University of Tartu

  Maurizio Lenzerini

  Technical University of Catalonia (UPC)

  Department of Computer Science National University of Singapore Singapore

  Özgür Özçep

  Department of Computer Science TU Hamburg-Harburg Hamburg, Germany

  Tassilo Pellegrin

  Semantic Web Company Vienna, Austria

  Jordà Polo

  Barcelona Supercomputing Center (BSC)

  Barcelona, Spain

  Department of Computer Science University of Leipzig Leipzig, Germany

  Dheeraj Rajagopal

  Department of Computer Science National University of Singapore Singapore

  Nadia Rauch

  Department of Systems and Informatics University of Florence Firenze, Italy

  Riccardo Rosati

  Department of Computer Science Sapienza University of Rome Rome, Italy

  Pietro Ruiu

  Daniel Olsher

  Axel-Cyrille Ngonga Ngomo

  Department of Computer Science

  Department of Computer Science

  Sapienza University of Rome Rome, Italy

  Xintao Liu

  Department of Technology and Built Environment University of Gävle Gävle, Sweden

  Miguel A. Martínez-Prieto

  Department of Computer Science

  University of Valladolid Valladolid, Spain

  Ralf Möller

  TU Hamburg-Harburg Hamburg, Germany

  Technology University of Florence

  Lorenzo Mossucca

  Istituto Superiore Mario Boella Torino, Italy

  Mariano Rodriguez Muro

  Department of Computer Science Free University of Bozen-Bolzano Bolzano, Italy

  M. Asif Naeem

  Department of Computer Science The University of Auckland Auckland, New Zealand

  Paolo Nesi

  Department of Systems and Informatics Distributed Systems and Internet

  Istituto Superiore Mario Boella

  Rudolf Schlatte

  Vagan Terziyan

  Roberto V. Zicari

  Department of Computer Science The University of Auckland Auckland, New Zealand

  Gerald Weber

  Department of Computer Science University of Oslo Oslo, Norway

  Arild Waaler

  Istituto Superiore Mario Boella Torino, Italy

  Advanced Computing and Electromagnetics Unit

  Olivier Terzo

  University of Jyväskylä Jyväskylä, Finland

  Department of Mathematical Information Technology

  Ludwig-Maximilians University Munich, Germany

  Department of Computer Science

  Marcus Spies

  University of Oslo Oslo, Norway

  Department of Computer Science

  Ahmet Soylu

  Istituto Superiore Mario Boella Torino, Italy

  Advanced Computing and Electromagnetics Unit

  Mikhail Simonov

  Fluid Operations AG Walldorf, Germany

  Michael Schmidt

  University of Oslo Oslo, Norway

  Department of Computer Science Goethe University Frankfurt, Germany

  

  


1

Toward Evolving Knowledge Ecosystems for Big Data Understanding

   Vadim Ermolayev, Rajendra Akerkar, Vagan Terziyan, and Michael Cochez

CONTENTS

Introduction .............................................................................................................4
Motivation and Unsolved Issues ..........................................................................6
    Illustrative Example ...........................................................................................7
    Demand in Industry ...........................................................................................9
    Problems in Industry .........................................................................................9
    Major Issues ...................................................................................................... 11
State of Technology, Research, and Development in Big Data Computing .. 12
    Big Data Processing—Technology Stack and Dimensions ......................... 13
    Big Data in European Research ...................................................................... 14
    Complications and Overheads in Understanding Big Data ......................20
    Refining Big Data Semantics Layer for Balancing Efficiency and
    Effectiveness ......................................................................................................23
        Focusing ........................................................................................................25
        Filtering .........................................................................................................26
        Forgetting ......................................................................................................27
        Contextualizing ............................................................................................27
        Compressing ................................................................................................29
        Connecting ....................................................................................................29
    Autonomic Big Data Computing ...................................................................30
Scaling with a Traditional Database ................................................................... 32
    Large Scale Data Processing Workflows .......................................................33
Knowledge Self-Management and Refinement through Evolution ..............34
    Knowledge Organisms, their Environments, and Features .......................36
        Environment, Perception (Nutrition), and Mutagens ............................ 37
        Knowledge Genome and Knowledge Body ............................................ 39
        Morphogenesis............................................................................................. 41
        Mutation .......................................................................................................42
        Recombination and Reproduction ............................................................44
    Populations of Knowledge Organisms .........................................................45
    Fitness of Knowledge Organisms and Related Ontologies ........................46
Some Conclusions .................................................................................................48
Acknowledgments ................................................................................................50
References ...............................................................................................................50

Introduction

  Big Data is a phenomenon that leaves hardly any information professional indifferent these days. Remarkably, application demands and developments in related disciplines have resulted in technologies that boosted data generation and storage at unprecedented scales, in terms of both volumes and rates. To mention just a few facts reported by Manyika et al. (2011): a disk drive capable of storing all the world’s music could be purchased for about US $600, and 30 billion pieces of content are shared monthly on Facebook alone. Exponential growth of data volumes is accelerated by a dramatic increase in social networking applications that allow nonspecialist users to create a huge amount of content easily and freely. Equipped with rapidly evolving mobile devices, a user becomes a nomadic gateway boosting the generation of additional real-time sensor data. The emerging Internet of Things makes every thing a source of data or content, adding billions of artificial and autonomic sources of data to the overall picture. Smart spaces, where people, devices, and their infrastructure are all loosely connected, also generate data of unprecedented volumes and with velocities rarely observed before. The expectation is that valuable information will be extracted out of all these data to help improve the quality of life and make our world a better place.

  Society is, however, left bewildered about how to use all these data efficiently and effectively. For example, a topical estimate of the need for data-savvy managers, required to take full advantage of Big Data in the United States alone, is 1.5 million (Manyika et al. 2011). A major challenge is finding a balance between the two evident facets of the whole Big Data adventure: (a) the more data we have, the more potentially useful patterns it may include, and (b) the more data we have, the less hope there is that any machine-learning algorithm is capable of discovering these patterns in an acceptable time frame. Perhaps because of this intrinsic conflict, many experts consider that Big Data brings not only one of the biggest challenges, but also one of the most exciting opportunities of the past 10 years (cf. Fan et al. 2012b).

  The avalanche of Big Data causes a conceptual divide in minds and opin- ions. Enthusiasts claim that, faced with massive data, a scientific approach “. . . hypothesize, model, test—is becoming obsolete. . . . Petabytes allow us to say: ‘Correlation is enough.’ We can stop looking for models. We can analyze the data without hypotheses about what it might show. We can throw the numbers into the biggest computing clusters the world has ever seen and let statistical

  Toward Evolving Knowledge Ecosystems for Big Data Understanding

  Skeptics counter that Big Data provides “. . . destabilising amounts of knowledge and information that lack the regulating force of philosophy” (Berry 2011). Indeed, being abnormally big does not yet mean being healthy and wealthy, and such data should be treated appropriately (Figure 1.1): with a diet, exercise, medication, or even surgery (philosophy). Those data sets for which systematic health treatment is ignored in favor of correlations will die sooner—as useless. There is a hope, however, that holistic integration of evolving algorithms, machines, and people, reinforced by research effort across many domains, will guarantee the required fitness of Big Data, assuring proper quality at the right time (Joseph 2012).

  Mined correlations, though very useful, may hint at an answer to a “what” question, but not to a “why” question. For example, if Big Data about Royal guards and their habits had been collected in the France of the 1700s, one could mine today that all musketeers who used to have red Burgundy regularly for dinner have not survived till now. A pity: red Burgundy was only one of many problems, and a very minor one. A scientific approach is needed to infer the real reasons—the work currently done predominantly by human analysts.

  Effectiveness and efficiency are the evident keys in Big Data analysis. Cradling the gems of knowledge extracted out of Big Data would only be effective if: (i) not a single important fact is left buried in the mass of data—which means completeness; and (ii) these facts are faceted adequately for further inference—which means expressiveness and granularity. Efficiency may be interpreted as the ratio of the utility of the result to the effort spent. In Big Data analytics, it could be straightforwardly mapped to timeliness. If a result is not timely, its utility (Ermolayev et al. 2004) may go down to zero, or even far below, within seconds to milliseconds for some important industrial applications such as technological process or air traffic control.
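To make the timeliness reading of efficiency concrete, the relation between delay, utility, and effort can be sketched in a few lines of Python. The exponential decay, the deadline, and the small negative utility assigned to stale results are all illustrative assumptions, not a model proposed in this chapter.

```python
import math

def utility(base_value, delay_s, deadline_s):
    """Time-discounted utility of an analysis result (illustrative).

    The value decays smoothly as the result is delayed; once the
    deadline has passed (think process control or air traffic
    control), the stale result may even carry a small negative
    utility, since acting on it does harm.
    """
    if delay_s <= deadline_s:
        return base_value * math.exp(-delay_s / deadline_s)
    return -0.1 * base_value  # too late: worse than no result at all

def efficiency(effort, base_value, delay_s, deadline_s):
    """Efficiency read as utility obtained per unit of effort spent."""
    return utility(base_value, delay_s, deadline_s) / effort
```

Under these assumptions, spending more effort on a deeper analysis only pays off if it does not push the delay past the point where the decayed utility cancels the gain, which is exactly the effectiveness/efficiency tension discussed next.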

  Notably, increasing effectiveness means increasing the effort or making the analysis computationally more complex, which negatively affects efficiency.

Figure 1.1 Evolution of data collections—dimensions (see also Figure 1.3) have to be treated with care.

  Big Data Computing

  Finding a balanced solution with a sufficient degree of automation is the challenge that is not yet fully addressed by the research community.

  One derivative problem concerns the knowledge extracted out of Big Data as the result of some analytical processing. In many cases, it may be expected that knowledge mechanistically extracted out of Big Data will also be big. Therefore, taking care of Big Knowledge (which has more value than the source data) is at least of the same importance as resolving the challenges associated with Big Data processing. Uplifting the problem to the level of knowledge is inevitable and brings additional complications, such as resolving the contradictory and changing opinions of everyone on everything. Here, an adequate approach to managing the authority and reputation of “experts” will play an important role (Weinberger 2012).

  This chapter offers a possible approach to addressing the problem of “understanding” Big Data in an effective and efficient way. The idea is to make adequately grained and expressive knowledge representations and fact collections evolve naturally, triggered by new tokens of relevant data coming along. Pursuing this way would also imply conceptual changes in the Big Data processing stack: a refined semantic layer has to be added to it, providing adequate interfaces to interlink the horizontal layers and enabling knowledge-related functionality coordinated in top-down and bottom-up directions.

  The remainder of the chapter is structured as follows. The “Motivation and Unsolved Issues” section offers an illustrative example and the analysis of the demand for understanding Big Data. The “State of Technology, Research, and Development in Big Data Computing” section reviews the relevant research on using semantic and related technologies for Big Data processing and outlines our approach to refine the processing stack. The “Scaling with a Traditional Database” section focuses on how the basic data storage and management layer could be refined in terms of scalability, which is necessary for improving efficiency/effectiveness. The “Knowledge Self-Management and Refinement through Evolution” section presents our approach, inspired by the mechanisms of natural evolution studied in evolutionary biology. We focus on a means of arranging the evolution of knowledge, using knowledge organisms, their species, and populations with the aim of balancing efficiency and effectiveness of processing Big Data and its semantics. We also provide our preliminary considerations on assessing fitness in an evolving knowledge ecosystem. Our conclusions are drawn in the “Some Conclusions” section.

  Motivation and Unsolved Issues

  Practitioners, including systems engineers and Information Technology architects, increasingly invoke


the phenomenon of Big Data in their dialog over means of improving sense-making. The phenomenon remains a constructive way of introducing others, including nontechnologists, to new approaches such as the Apache Hadoop framework. Apparently, Big Data is collected to be analyzed. “Fundamentally, big data analytics is a workflow that distills terabytes of low-value data down to, in some cases, a single bit of high-value data. . . . The goal is to see the big picture from the minutia of our digital lives” (cf. Fisher et al. 2012). Evidently, “seeing the big picture” in its entirety is the key, and this requires making Big Data healthy and understandable, in terms of effectiveness and efficiency, for analytics.
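The “terabytes of low-value data down to a single bit” workflow quoted above can be caricatured in a few lines of Python. The sensor-reading scenario and the threshold are invented for illustration; a production pipeline would shard this filter-and-reduce step across a cluster, for example, with Hadoop.

```python
def distill(readings, threshold=0.95):
    """Reduce a stream of low-value readings to one high-value bit:
    did any reading breach the threshold?  The input is consumed
    lazily, so the raw data never has to fit in memory at once.
    """
    return any(r > threshold for r in readings)

# Millions of raw values in, a single actionable bit out.
alarm = distill(x / 1_000_000 for x in range(1_000_000))
```

The point of the caricature is the asymmetry: the cost lies almost entirely in scanning the low-value input, while the high-value output is one bit.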

  In this section, the motivation for understanding Big Data in a way that improves the performance of analytics is presented and analyzed. It begins by presenting a simple example, which is used throughout the rest of the chapter. It continues with an analysis of the industrial demand for Big Data analytics. In this context, the major problems as perceived by industries are analyzed and informally mapped to unsolved technological issues.

  Illustrative Example

  Imagine a stock market analytics workflow inferring trends in share price changes. One possible way of doing this is to extrapolate from stock price data. A more robust approach, however, could be to extract these trends from market news. Hence, the incoming data for analysis would very likely be several streams of news feeds, resulting in a vast number of tokens per day. An illustrative example of such a news token is:

  Posted: Tue, 03 Jul 2012 05:01:10-04:00 LONDON (Reuters) U.S. planemaker Boeing hiked its 20-year market forecast, predicting demand for 34,000 new aircraft worth $4.5 trillion, on growth in emerging regions and as airlines seek efficient new planes to counter high fuel costs.*

  Provided that an adequate technology is available, one may extract the knowledge pictured as thick-bounded and gray-shaded elements in Figure 1.2.

  This portion of extracted knowledge is quite shallow, as it simply interprets the source text in a structured and logical way.* Unfortunately, it does not answer several important questions for revealing the motives for Boeing to hike its market forecast:

  Q1. What is an efficient new plane? How is efficiency related to the high fuel costs that are to be countered?
  Q2. Which airlines seek efficient new planes? What are the emerging regions? How could their growth be assessed?
  Q3. How are plane makers, airlines, and efficient new planes related to each other?

Figure 1.2 Semantics associated with a news data token.

  * (accessed July 5, 2012).
  * The technologies for this are under intensive development currently, for example, wit.istc.
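For readers who prefer code to class diagrams, the individual assertions behind Figure 1.2 can be sketched as plain subject-predicate-object triples. The entity and property names below follow the figure; the triple list and the query helper are illustrative, not part of the chapter’s proposal.

```python
# Individual assertions extracted from the news token (after Figure 1.2).
facts = [
    ("Boeing", "isA", "PlaneMaker"),
    ("Boeing", "basedIn", "UnitedStates"),
    ("AllNipponAirways", "isA", "Airline"),
    ("AllNipponAirways", "basedIn", "Japan"),
    ("B787-JA812A", "isA", "EfficientNewPlane"),
    ("B787-JA812A", "builtBy", "Boeing"),
    ("AllNipponAirways", "owns", "B787-JA812A"),
    ("New20YMarketForecastbyBoeing", "isA", "MarketForecast"),
    ("New20YMarketForecastbyBoeing", "hikedBy", "Boeing"),
    ("New20YMarketForecastbyBoeing", "successorOf",
     "Old20YMarketForecastbyBoeing"),
]

def objects_of(subject, predicate, triples=facts):
    """All objects o such that (subject, predicate, o) is asserted."""
    return [o for s, p, o in triples if s == subject and p == predicate]
```

Such shallow triples capture the “what” of the token; as questions Q1–Q3 show, answering “why” requires background knowledge that is not in the text itself.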