Big Data Computational Intelligence Networking 4 pdf pdf

  

Big Data and

Computational Intelligence

in Networking

  

Big Data and

Computational Intelligence

in Networking

  

Edited by

Y U L E I W U

F E I H U

G E YO N G M I N

A L B E R T Y. Z O M AYA CRC Press Taylor & Francis Group 6000 Broken Sound Parkway NW, Suite 300 Boca Raton, FL 33487-2742 © 2018 by Taylor & Francis Group, LLC CRC Press is an imprint of Taylor & Francis Group, an Informa business No claim to original U.S. Government works Printed on acid-free paper International Standard Book Number-13: 978-1-4987-8486-3 (Hardback)

This book contains information obtained from authentic and highly regarded sources. Reasonable

efforts have been made to publish reliable data and information, but the author and publisher cannot

assume responsibility for the validity of all materials or the consequences of their use. The authors and

publishers have attempted to trace the copyright holders of all material reproduced in this publication

and apologize to copyright holders if permission to publish in this form has not been obtained. If any

copyright material has not been acknowledged please write and let us know so we may rectify in any

future reprint.

  

Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced,

transmitted, or utilized in any form by any electronic, mechanical, or other means, now known or

hereafter invented, including photocopying, microfilming, and recording, or in any information

storage or retrieval system, without written permission from the publishers.

  

For permission to photocopy or use material electronically from this work, please access www.

copyright.contact the Copyright Clearance Center, Inc. (CCC),

222 Rosewood Drive, Danvers, MA 01923, 978-750-8400. CCC is a not-for-profit organization that

provides licenses and registration for a variety of users. For organizations that have been granted a

photocopy license by the CCC, a separate system of payment has been arranged.

  Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and are used only for identification and explanation without intent to infringe.

  

Library of Congress Cataloging-in-Publication Data

Names: Wu, Yulei, editor.

  Title: Big data and computational intelligence in networking / Yulei Wu, Fei Hu, Geyong Min, Albert Y. Zomaya. Description: Boca Raton, FL : CRC Press, [2018] | Includes bibliographical references and index. Identifiers: LCCN 2017028407| ISBN 9781498784863 (hardback : acid-free paper) | ISBN 9781315155678 (e-book) | ISBN 9781498784870 (e-book) | ISBN 9781351651721 (e-book) | ISBN 9781351642200 (e-book)

Subjects: LCSH: Big data. | Cloud computing. | Computer networks--Management.

| Computational intelligence. Classification: LCC QA76.9.B45 W824 2018 | DDC 005.7--dc23 LC record available a

  Visit the Taylor & Francis Web site at and the CRC Press Web site at

  Contents

  

  

  

   Marcello Trovati

  

  

   Immanuel Manohar and Fei Hu

  

5. Efficient Big Data Transfer Using Bandwidth Reservation Service in High-Performance Networks . . . . . . . . . . . . . . . . . . .

   Liudong Zuo and Michelle Mengxia Zhu

  vi Contents

   Mehdi Bahrami and Mukesh Singhal

   Deze Zeng, Chengyu Hu, Guo Ren, and Lin Gu

8. The ADMM and Its Application to Network Big Data . . . . . . . 151

  Nan Lin and Liqun Yu

   Vasileios Karyotis and Eleni Stai

  

  

Aziyati Yusoff, Norashidah Md. Din, Salman Yussof, Assad Abbas,

and Samee U. Khan

  

Joonseok Lee

  

Ion Necoara

  

Xiaoqiang Ma, Xiaoyi Fan, and Jiangchuan Liu

  

Seunghak Lee

  

  vii Contents

   Vincent W. Zheng and Hong Cao

  

Chengqiang Huang, Yulei Wu, Zuo Yuan, and Geyong Min

   Reem Y. Ali, Venkata M.V. Gunturi, Zhe Jiang, and Shashi Shekhar

   Vishrawas Gopalakrishnan and Aidong Zhang

  

Z. Y. Liu and Y. B. Guo

  

  

  Recent years have witnessed a deluge of network data propelled by the emerging online social media, user-generated video contents, and global-scale communi- cations, bringing people into the era of big data. Such network big data holds much critical and valuable information including customer experiences, user behaviors, service levels, and other contents, which could significantly improve the efficiency, effectiveness, and intelligence on the optimization of the current Internet, facilitate the smart network operation and management, and help service providers and content providers reduce capital expenditure (CapEx) and opera- tional expenditure (OpEx) while maintaining a relatively high-level quality of service (QoS) and quality of experience (QoE).

  Typical examples of network intelligence received from network big data include rapid QoE impairment detection and mitigation, optimization of network asset utilization, proactive maintenance, rapid outage restoration, and graceful disaster recovery. These aims can be achieved from high-level computational intelligence based on emerging analytical techniques such as big data pro- cessing, Web analytics, and network analytics employing software tools from advanced analytics disciplines such as machine learning, data mining, and pre- dictive analytics. The computational intelligence for big data analysis is playing an ever-increasingly important role in supporting the evolution of the current Internet toward the next-generation intelligent Internet.

  However, the unstructured, heterogeneous, sheer volume and complex nature of network big data pose great challenges on the computational intelligence of these emerging analytical techniques due to high computational overhead and communication cost, non-real-time response, sparse matrix-vector multi- plications, and high convergence time. It is therefore of critical importance to understand network big data and design novel solutions of computational intelligence, scaling up for big data analytics of large-scale networks to auto- matically discover the hidden and valuable information available for smart network operations, management, and optimization. This has been established as

  x Preface

  a new cross-discipline research topic in computer science, requiring anticipation of technical and practical challenges faced by mixed methods across multiple disciplines.

  In this book, we have invited world experts in this area to contribute the chapters that cover the following four parts:

  1. Part 1: Basics of Networked Big Data: This part helps understand the prop- erties, characteristics, challenges, and opportunities of networked big data, geospatial data, and wireless big data. This part covers the following:

  a. Mathematical properties: A variety of aspects related to networks, including their topological and dynamical properties, as well as their applications to real-world examples

  b. Geospatial data and geospatial semantic web: Challenges and opportunities of the geospatial semantic web brought for sharing and utilizing big geospatial data

  c. Big data over wireless networks: Typical scenarios, various chal- lenges, and potential solutions for wireless transmission of big data

  2. Part 2: Network Architecture for Big Data Transmissions: This part presents new proposals and network architectures to ensure efficient big data transmissions and streaming big data processing.

  a. Big data transfer: Challenges of bandwidth reservation service for efficient big data transfer and the potential solutions b. Internet of Things (IoT): A dynamic and independent Cloud com- puting architecture based on a service-oriented architecture for IoT devices, to allow users to freely transfer their IoT devices from one vendor to another c. Streaming big data processing: How to maximize QoS and minimize

  OpEx when performing task scheduling and resource allocation in geo-distributed Clouds

  3. Part 3: Analysis and Processing of Networked Big Data: This part explains how to perform big data analytics based on emerging analytical techniques such as big data analytics, Web analytics, network analytics, and advanced analytics disciplines such as machine learning, data mining, and predictive analytics. This part covers the following areas:

  a. Alternating direction method of multiplier (ADMM): Its applications to large-scale network optimizations b. Dynamic network management and optimization: Rethink of current

  xi Preface

  network evolution on the computation of key network metrics; hyperbolic big data analytics c. Predictive analytics and smart retrieval: Utilize the network big data by performing a data, information, knowledge, and wisdom (DIKW) hierarchy to the product of its processes

  d. Recommendation systems: Key challenges and solutions for data sparsity problem, data scale issue, and cold-start problem e. Coordinate gradient descent methods: Unconstrained convex mini- mization problems with differentiable objective function in network problems

  f. MapReduce: Data locality and dependency analysis; dependency- aware locality for MapReduce g. Distributed machine learning: Big data and big models for network big data; how to parallelize parameter updates on multiple work- ers; how to synchronize concurrent parameter updates performed by multiple workers h. Big graph: Big graph decomposition; real-time and large-scale graph processing; big data security

  4. Part 4: Emerging Applications of Networked Big Data: This part covers some emerging applications on the following: a. Intelligent mall shopping: Location-based mobile augmented real- ity applications; using network data to enable intelligent shopping; robust feature learning in cold-start heterogeneous-device localiza- tion; learning to query in the cold-start retailer content b. Network anomaly detection: How to efficiently use network big data to perform accurate anomaly detection c. Transportation: Advances of spatial network big data (SNBD) tech- niques; challenges posed by SNBD in transportation applications and the potential solutions

  d. Biomedical and social media domain: Graph as a representation schema for big data; graph-based models and analyses in social text mining, and bioinformatics and biomedical

  e. Smart manufacturing: Big data characteristics in manufacturing; data collection and data mining in manufacturing; applications of big data in manufacturing. This book presents the state-of-the-art solutions to the theoretical and prac- tical challenges stemming from the leverage of big data and its computational intelligence in supporting smart network operation, management, and optimiza-

  xii Preface

  of network big data, efficient collection and management of network big data, distributed and scalable online analytics for network big data, and emerging applications of network big data for computational intelligence.

  Targeted audiences

  : This book targets both academia and industry readers. Grad- uate students can select promising research topics from this book that are suitable for their thesis or dissertation research. Researchers will have a deep under- standing of the challenging issues and opportunities of network big data and can thus easily find an unsolved research problem to pursue. Industry engineers from IT companies, service providers, content providers, network operators, and equipment manufacturers can get to know the engineering design issues and cor- responding solutions after reading some practical schemes described in some chapters.

  We have required all chapter authors to provide as much technical detail as possible. Each chapter also includes references for readers’ further studies and investigations. If you have any comments or questions on certain chapters, please contact the chapter authors for more information.

  Thank you for reading this book. We wish that this book will help you with the scientific research and practical problems of network big data.

   Assad Abbas North Dakota State University Fargo, North Dakota Reem Y. Ali Department of Computer Science University of Minnesota Minneapolis, Minnesota Mehdi Bahrami Cloud Lab University of California, Merced Merced, California Hong Cao Analytics & Enterprise Intelligence Ernst & Young Advisory Pte. Ltd Singapore Xiaoyi Fan School of Computing Science Simon Fraser University Burnaby, British Columbia, Canada Vishrawas Gopalakrishnan Department of Computer Science and Engineering State University of New York Buffalo, New York

  Lin Gu Huazhong University of Science and Technology Wuhan, China Venkata M.V. Gunturi Department of Computer Science Indraprastha Institute of Information

  Technology Delhi, India Y. B. Guo Department of Mechanical Engineering University of Alabama Tuscaloosa, Alabama Chengyu Hu China University of Geosciences Wuhan, China Fei Hu Electrical and Computer Engineering University of Alabama Tuscaloosa, Alabama Chengqiang Huang University of Exeter Exeter, United Kingdom

  xiv Contributors

  Zhe Jiang Department of Computer Science University of Alabama Tuscaloosa, Alabama Vasileios Karyotis School of Electrical and Computer Engineering National Technical University of Athens (NTUA) Athens, Greece Samee U. Khan North Dakota State University Fargo, North Dakota Joonseok Lee Google Research Video Understanding Team, Machine

  Perception Mountain View, California Seunghak Lee Human Longevity, Inc., San Diego, California Weidong Li Department of Geography & Center for Environmental Sciences and

  Engineering University of Connecticut Storrs, Connecticut Nan Lin Department of Mathematics Washington University in St. Louis St. Louis, Missouri Jiangchuan Liu School of Computing Science Simon Fraser University Burnaby, British Columbia, Canada Z. Y. Liu Department of Mechanical Engineering University of Alabama Tuscaloosa, Alabama

  Yu Lu Electrical and Computer Engineering University of Alabama Tuscaloosa, Alabama Xiaoqiang Ma School of Electronic Information and Communications Huazhong University of Science and Technology Wuhan, China Immanuel Manohar Electrical and Computer Engineering University of Alabama Tuscaloosa, Alabama Norashidah Md. Din Universiti Tenaga Nasional Selangor, Malaysia Geyong Min University of Exeter Exeter, United Kingdom Ion Necoara Automatic Control and Systems Engineering Department University Politehnica Bucharest Bucharest, Romania Guo Ren China University of Geosciences Wuhan, China Shashi Shekhar Department of Computer Science University of Minnesota Minneapolis, Minnesota Mukesh Singhal Cloud Lab University of California, Merced Merced, California

  Contributors xv Eleni Stai

  School of Computer & Communication Sciences ´ Ecole Polytechnique F´ed´erale de

  Lausanne Lausanne, Switzerland Marcello Trovati Department of Computer Science Edge Hill University Ormskirk, United Kingdom Yulei Wu University of Exeter Exeter, United Kingdom Liqun Yu Department of Mathematics Washington University in St. Louis St. Louis, Missouri Zuo Yuan University of Exeter Exeter, United Kingdom Aziyati Yusoff Universiti Tenaga Nasional Selangor, Malaysia Salman Yussof Universiti Tenaga Nasional Selangor, Malaysia Deze Zeng China University of Geosciences Wuhan, China Aidong Zhang Department of Computer Science and Engineering State University of New York Buffalo, New York

  Chuanrong Zhang Department of Geography & Center for Environmental Sciences and

  Engineering University of Connecticut Storrs, Connecticut Tian Zhao Department of Computer Science University of Wisconsin-Milwaukee Milwaukee, Wisconsin Vincent W. Zheng Advanced Digital Sciences Center Singapore Michelle Mengxia Zhu Department of Computer Science Montclair State University Montclair, New Jersey Yujia Zhu University of Exeter Exeter, United Kingdom Albert Zomaya The University of Sydney Sydney, Australia Liudong Zuo Computer Science Department California State University,

  Dominguez Hills Carson, California

  

Chapter 1 A Survey of Big Data and Computational Intelligence in Networking Yujia Zhu, Yulei Wu, and Geyong Min University of Exeter Exeter, United Kingdom Albert Zomaya The University of Sydney Sydney, Australia Fei Hu University of Alabama Tuscaloosa, Alabama CONTENTS

  

  

  

  

  

1.2.1.2 Distributed and parallel data processing

   Big Data and Computational Intelligence in Networking

  

  

  

  

  

  

  

  

  

  

  

  

  

  

  

  

  

  

  

  

  

  

  

  

  

  

  

  

  

  

  

  

  With the rapid development of computing capabilities and storage techniques, there is no doubt that we are facing a series of opportunities and challenges brought about by the big data era. Simply put, big data not only injects wis- dom into decision-making and performance improvement but also poses a great threat to traditional data processing techniques. For example, NeteaseMusic is a popular musical software among Chinese teenagers and it is well known for a particular function, Daily Recommendation, which aims to provide users with the songs in which they may be interested. It tracks what songs indi- vidual users listen to most frequently and what pages they look at; it usually develops some relevant tags for each song, such as “sad,” “pleasure,” “Country music,” and “Quiet.” Besides these tags, it is able to utilize some initial infor-

  A Survey of Big Data and Computational Intelligence in Networking

  better understand the customers by taking advantage of the digital data men- tioned above. Customers can respond to each song with “like” or “dislike” by highlighting, or not, a heart tag beneath the song. This process is similar to a training process in machine learning. According to these valuable data, indicat- ing users’ preference NeteaseMusic can recommend songs to users with very high accuracy.

  The success of NeteaseMusic can showcase the benefits that big data brings— smarter decisions and better predictions. By utilizing it appropriately, we can explore some fresh areas so far dominated by intuition and gut. Big data spreads to every aspect in people’s daily lives, changing long-standing ideas about the value of experience, the nature of expertise, and the practice of management [1].

  Networking has become an indispensable part of the modern world, pro- viding convenient access to the Internet, remote communication, information exchange, etc. As mentioned above, big data also has a great influence on net- working. The amount of data has exploded in recent years, with a transformation from traditional simple-structured physical networks to complex-structured vir- tual networks. Considering network management, a novel method is to collect unstructured system logs (which can reach the magnitude of terabytes (TB) per day) and analyze these valuable data to evaluate system performance or tackle faults. So far, we have explored what opportunities networked big data brings, but the challenges are also significant. In the next chapter, a variety of challenges will be discussed.

  

  Networked big data, in fact, is not simply another way of saying “analytics” or large amount of data. In addition to the common features of veracity and value, networked big data has higher volume, velocity, and variety compared with other forms of big data.

  1. Volume: According to a recent report from IBM, 2.5 quintillion bytes of data are created every day [2]. The Web includes various sources of data like cookies, browsing history, and so on, which makes it an extraordinary data set. Many search engines like Google and Yahoo are responsible for dealing with billions of user access requests every day. When entering the keyword “big data” in Google Chrome, about 341,000,000 results are returned in 0.59 seconds (as of May 2017).

  2. Velocity: The large volume of data has increased the complexity of pro- cessing and storing, while for some applications, real-time or nearly real-time data analysis is much more important given the high speed of data creation. Given troubleshooting in networks, this has been an open

  Big Data and Computational Intelligence in Networking

  hardware components like routers or links down to virtual machines, there are a number of fault management tools such as HP Openview and Open- NMS, which focus on real-time fault localization or prediction. Thus, how to quickly and accurately handle faults in order to reduce human intervention is critical to making a company more competitive.

  3. Variety: Networked big data comes from various sources: for instance, messages, images or videos from social networks, event logs and con- figuration parameters from systems, measurements from sensors, and many more. Specifically speaking, every second, on average, around 6000 tweets are posted on Twitter, which corresponds to over 350,000 tweets sent per minute, 500 million tweets per day, and around 200 billion tweets per year [3]. On 4 October 2012, the first presidential debate between President Barack Obama and Governor Mitt Romney triggered more than 10 million tweets within 2 hours [4]. A vast number of users post tweets in various formats such as texts, images, or videos contributing to data diversity.

  A network application in the era of big data usually involves data sources from different geographically distributed data centers. Thus, in addition to the general nature of big data such as large volume, high velocity, and high variety, net- worked big data possesses many unique features, which are summarized below. These unique features also bring some research issues of networked big data, as shown in

  An emerging wave of Internet deployments, most notably the Internet of Things (IoT) [5] and the integration of IoT and fog computing [6], often depend on a distributed networking system to collect data from geographically distributed

  How to maximize big data transmission throughput? Big data Big data sources destination (sender)

  (receiver) Security and privacy issues How to minimize end-to-end delay? A Survey of Big Data and Computational Intelligence in Networking

  sources, such as sensors and data centers. For example, visualizing data collected within the IoT are geo-related and sparsely distributed. Internet geographic infor- mation system (GIS)-based solutions are required to cope with this challenge. Apache Hadoop [7] is well known for its distributed storage and processing of big data. Hadoop Distributed File System (HDFS) [8] is the core part of the storage system in Hadoop, which is a distributed file system that stores data on commodity machines, providing very high aggregate bandwidth across the clus- ters. HDFS was inspired by Google file system (GFS) [9], which is the most widely adopted mechanism for distributed file systems.

  

  In light of the way the networked big data are collected, it is straightforward to process these data in a distributed and parallel manner. There have been several well-known frameworks available for distributed big data processing, e.g., Apache Hadoop [7], Apache Storm [10], Apache Spark [11], and Apache Flink [12]. Hadoop is the first major big data processing framework that provides batch processing based on its MapReduce processing engine. Since it heavily leverages permanent storage, each task involves multiple instances of reading and writing operations. When using Hadoop, time should not be a significant factor. In contrast to batch processing, stream processing systems compute over data as it enters the system, and thus could well serve the processing with near real-time demands. Storm is the first major stream processing framework for big data analytics that focuses on extremely low latency, but does not provide a batch processing mode. Apache Spark provides a hybrid processing system, where it is a batch processing framework with stream processing capabilities. Spark focuses on speeding up batch processing workloads by offering full in-memory compu- tation and processing optimization. It provides a good candidate for those with diverse processing workloads. Apache Flink offers a stream processing frame- work with the support for traditional batch processing models. It treats batch processing as an extension of stream processing by reading a bounded data set off persistent storage.

  

  Under the distributed and decentralized control of large volume, heterogeneous, autonomous sources, big data seeks to explore the complex and ever-changing relationship among data. Apart from the great challenges that large volume poses to big data, how to effectively extract the relationships underneath data has attracted particular attention not only in academic but also in industrial fields. People establish friendship circles based on common hobbies, biological rela- tionships, and many more in human society, while these social connections also exist in the cyber world. For example, a majority of social networks recom-

  Big Data and Computational Intelligence in Networking

  and followers. Mining the correlations between users’ behavior and preference can reveal some underlying but valuable information for these social network operators. In this dynamic world, there is no doubt that these relationships evolve all the time, temporally and spatially. Another example is troubleshoot- ing in networks, where topology graphs can be regarded as spatial information while key performance indicators (KPI) or traffic records represented in time series can be regarded as temporal information. Kimura et al. [13] once claimed that log messages should be characterized as spatial-temporal data, since event logs are usually generated from distributed hosts, routers, or switches spanning across several geographical locations (spatial information) and include times- tamps indicating generating time (temporal information). Based on deep insights of spatial-temporal features of logs, they proposed two novel techniques, sta- tistical template extraction (STE) and log tensor factorization (LTF), to capture particular patterns of log messages which provide useful knowledge into root cause analysis.

  Traditional data mining algorithms usually require inputs to be homogeneous, that is to say, inputs should possess similar formats. However, as mentioned above, big data comes from heterogeneous sources, making traditional min- ing algorithms inapplicable to handle this problem in the forthcoming big data revolution. Most objects and data in the real world are of multiple types and interconnected, forming complex, heterogeneous but often semistructured information networks [14]. There are three main challenges that make mining heterogeneous networks a new game:

  1. Information propagates across heterogeneous nodes and links.

  2. Rich semantic information should be carefully and effectively parsed to extract the representative network structures.

  3. User demands should be taken into consideration to automatically select the right relationships. Sun and Han [14] proposed a novel semistructured heterogeneous network modeling focusing on the objects and the interactions among them. In this way, rich semantic information and valuable schema-level information can be retained. A directed graph with an object-type mapping function and a link-type mapping function is utilized to represent the heterogeneous information networks effectively.

  

  Intelligent computation techniques can be translated into “machine learning”

  A Survey of Big Data and Computational Intelligence in Networking

  programming explicit algorithms with good performance is difficult or unfea- sible. There is a large amount of research taking place to combine advanced machine learning techniques with networked big data analysis. Shang et al. [15] suggested a novel fault diagnosis model of computer networks based on rough set and back-propagation (BP) neural networks. They retrieved faults as a series of rules and reduced them to a minimum diagnosis rules by rough set theory; then a neural network was designed to learn from these rules in order to identify and localize faults quickly and accurately. Sankaran et al. [16] designed a hard- ware framework embedded with a machine learning coprocessor to emphasize intrusion detection in case of attacks unknown for the signature library. They chose a vector space model with a K-nearest neighbors (K-NN) classifier and a radial basis function (RBF)-based classifier, concluding that machine learning processors enable improvement in energy saving, processing speed, and detec- tion accuracy, especially dealing with big data sets. Chung et al. [17] addressed the significant advantages that deep neural networks (DNNs) demonstrate in pat- tern recognition applications, regardless of the biggest drawback of computation resource consumption and training time reaching a tenfold addition compared to traditional techniques. In order to cope with this, a data-parallel Hessian-free second-order optimization algorithm is used to train DNN and implemented by a large-scale speech task on Blue Gene/Q computer system due to the excellent interprocessor communication ability of this system. The results show that per- formance on Blue Gene/Q scales up to 4096 processes lossless in accuracy and that enables the training of DNN using billions of inputs in just a few hours.

  Networked big data possesses unstructured, semistructured, and structured formats. This nature hinders its representation and modeling, which are of paramount importance in the process of big data analytics. Several existing representation and modeling methods have been reported in the current litera- ture, e.g., graph representation, dynamic representation, and matrix representa- tion [18].

1.2.3.1 Graph representation Graphs consisting of nodes and edges can be applied to a variety of situations.

  For example, in social networks, nodes can be regarded as individuals and con- nected edges representing two users build a relationship. Network topology can be represented as nodes for hosts (or switches, routers) and edges for links among them. With a rapid increase in the amount of data, the corresponding scale of graphs dramatically expands to even millions of nodes and edges. Coping with such large graphs can be an extremely complex task considering efficiently

  Big Data and Computational Intelligence in Networking

  of techniques to simplify and refine graphs summed up in three categories: graph spectra [19], graph summarization, and graph compression.

  Graphs can be represented by an adjacency matrix and mapped into its spectrum domain, filtering out small elements by exploiting application-aware thresholds. Finally, a smaller graph can be obtained by means of an inverse trans- form. Sahneh et al. [20] utilized Laplacian spectra to represent transition rate graphs in spreading process modeling of complex networks.

  The original concept of graph summarization comes from the database com- munity [21] and aims to provide users with small and informative graphs based on attributes and relationships of interest.

  Shi et al. [22] addressed an Influence Graph Summarization (IGS) problem abstracting individual influence from complex citation networks. They proposed a summarization scheme based on matrix factorization techniques to handle this problem and tested it on real-world citation networks to prove that this method outperforms the previous method in visual effects and quantitative IGS objective optimization.

  Redundancies are hidden in nodes belonging to similar communities; graph compression in fact focuses on identifying communities. Lim et al. [23] discov- ered the importance of hubs or high-degree nodes disconnecting graphs, gradu- ally removed them from graphs and finally acquired a smaller giant connected component. This compression method proves more suitable for real-world issues than a conventional “caveman community.”

  

  Graph representation can be regarded as a static representation of valid net- work topology and properties, while in practice a large number of networks are dynamically characterized. How to efficiently represent dynamic features of net- works such as topology and configuration has received notable attention from researchers and practitioners. A majority of current approaches take advantage of visualization techniques. Burch et al. [24] chose a series of timesteps and rep- resented each stage as a single graph contributing to a dynamic and hierarchically organized graph set for visualization. Mashima et al. [25] visualized user traffic on radio station and TV-viewing patterns from an Internet Protocol Television (IPTV) service under the assistance of a geographic map metaphor in coping with large-scale dynamic data sets.

  

  Although many studies have been performed on networked big data processing, very few have addressed the problems of representing the various formats of data with a simple model and extracting the core data sets that are smaller but still contain valuable information. Kuang et al. [26] targeted these issues and pro-

  A Survey of Big Data and Computational Intelligence in Networking

  of tensor extension operators, the authors represented the various types of data as subtensors and merged them to a unified tensor. They further devised an incre- mental high-order singular value decomposition method (IHOSVD) to extract the core tensor, which is able to update the orthogonal bases and compute the new core tensor by recursively applying the incremental matrix decomposition algorithm. In comparison with the traditional approaches of data dimensionality reduction—e.g., principal component analysis (PCA) [27], incremental singu- lar value decomposition (SVD) [28], and dynamic tensor analysis (DTA) [29], which are mainly used for low dimension reduction—IHOSVD is particularly suitable for handling high-dimensional data and extracting the core data sets from streaming networked big data with a relatively low time consumption.

  Every day, people surf the Internet, browse websites, post messages, and share personal information like cookies through their own digital devices. This can pose a potential threat to privacy issues through identity theft, online banking abuse, password leakage, and so on. Information security suffers from the prolif- eration of devices connected to the Internet and connected with each other. Big data security issues are related to a set of risk areas, which includes information lifecycle, data creation, and collection processes. Despite complex initial char- acteristics, ultimately, the objective remains the same as traditional data types, devoted to preserving confidentiality, integrity, and availability. Due to user/data mobility requirements, conventional security measures such as firewalls are no longer suitable for big data scenarios. Cloud Security Alliance (CSA) [30], a non-profit organization with a mission to promote the use of best practices for providing security assurance within Cloud Computing and provide education on the uses of Cloud Computing to help secure all other forms of computing, has created a Big Data Working Group, which has focused on the major challenges to implement secure Big Data services. According to CSA, security and privacy challenges can be categorized according to four aspects: infrastructure security (e.g., secure distributed computation using MapReduce), data privacy (e.g., data mining that preserves privacy/granular access), data management (e.g., secure data provenance and storage), and integrity and reactive security (e.g., real-time monitoring of anomalies and attacks).

  When considering reasonable solutions to secure networked big data, in-depth inquiry should be focused on different data sources, complex relation- ships, and unstructured data formats. Encryption, masking, and tokenization are regular approaches for sensitive data protection. Modern security networking issues gradually expand from enterprise to the public Cloud. The internal behav- ior model of Cloud computing consists of a big data storage service retrieval layer (where users match and invoke resources from Cloud service packages according to specific demands), a data refinement task decomposition layer (where complex

  Big Data and Computational Intelligence in Networking

  leveraging rate), a Cloud computing optimization combination evaluation layer (which will optimize and integrate results under thoughtful resource utilizing plans to achieve the best Quality of Service (QoS) target), and a big data Cloud computing algorithm selection layer (where optimal execution strategy will be selected according to different requirements). Considering these features of Cloud computing systems, a four-level security framework was proposed by Cui [31], involving copyright authentication, virtual task decomposition, collab- orative network password protection, disaster recovery scheme development, and so on.

  First, as demonstrated in Ref. [32], wireless big data can be simply explained as a wide range of massive data generated, collected, and stored in wireless networks by wireless devices and users. Wireless big data has a significant influence over network deployment and service quality, and poses a big challenge to traditional solutions due to its dynamic data sets and stochastic nature. Lijun et al. [32] categorized wireless big data through two aspects: socioecological and specific application. From the socioecological perspective, big data can be explained as three types: primordial wireless big data (data sets generated by massive wireless network users such as wireless access behavior), derived wireless big data (spec- trum, transmission, access, and network data served for reliable communication such as resource allocation of transmission signals), and developing wireless big data (data collected from performance testing and evaluation processes). From the specific application perspective, categories can be simply described as cel- lular networks, Wi-Fi hotspots, smart grids, wireless sensor networks (WSNs), IoT, and so on.

  Researches have explored massive efficient data representation as a solution in wireless networking domain. He et al. [33] proposed a unified data analytic model based on the random matrix theory for tackling problems of dynamic data collected from multiple sources. A variety of data types such as signaling data, traffic data, and heterogeneous data are studied to evaluate model perfor- mance in mobile cellular networks. As demonstrated in Ref. [34], large-scale random matrices are applied to model distributed spectrum sensing and network monitoring in a multiple-input multiple-output (MIMO) system. According to specific features such as the large scale and the heterogeneous nature which indi- cate complex interrelationships among data, unstructured data should be taken into consideration as demonstrated in Ref. [34]. A number of subtensors are leveraged to transform various data types into unified tensor representation, then dimension reduction is implemented by SVD method, and, finally, a real case study is simulated on intelligent transportation to evaluate the performance of the proposed model.

  Considering separate temporal, spatial characteristics, and complicated

  A Survey of Big Data and Computational Intelligence in Networking

  and effective methods are studied in recent research, including time series anal- ysis, machine learning techniques, and game theory. The authors in Ref. [35] proved the potential value of time series analysis in traffic pattern recognition tasks. Regular behavior patterns are extracted to predict future patterns. Deep learning, developed from conventional neural networks, has become a power- ful tool in modeling high-level data representation. In Ref. [36], Apache Spark, a distributed framework supporting deep learning, is utilized to apply deep learning to wireless communication with millions of records. Game theory con- tributes to intelligent rational decision-making in many fields such as economics, psychology, computer science, and so on. The authors in Ref. [37] introduced a network management approach based on a multiple cognitive agent-based divide-and-conquer strategy and the Markovian game-theoretic model.

  Apart from data itself, one factor that cannot be neglected is the wire- less channel, which conveys information from end to end. Two commonly accepted methods are deterministic model and stochastic simulation. The for- mer requires enormous spectrum resources consumption, while the latter lacks precision assurance. It is important to seek a balancing point between them to build an accurate channel impulse response (CIR) environment. Network design and optimization is another key factor to achieve energy saving, scalability, and flexibility. Moreover, a wide range of applications such as user mobility analysis and smart grid, making use of wireless big data, can be found in the survey [32].

1.3 Network Architecture for Big Data Transmissions

  Without the foundation of a strong, robust, and flexible network, no reliable data can be obtained, resulting in big losses in academic and industrial fields. For example, business enterprises may lose clients due to data loss considering individual demands, and researchers cannot have access to big data centers dis- tributed across worldwide. A vast amount of network architecture is designed to ensure scalability and flexibility during the data transmission phase. In this chapter, various aspects related to this topic will be discussed.

  As mentioned above, networked big data usually comes from geographic dis- tributed sources. Hence, how to collect valuable network information efficiently and securely has been an open research problem recently. The main research field of this topic concentrates on WSNs or ad hoc networks utilizing sen- sors to capture useful information. Two examples of this are outlined below. Guo et al. [38] proposed a novel and secure mechanism for collecting informa- tion from Internet of Vehicles (IoV, an extension of IoT contributing to unified management in smart transportation), such as vehicle location, speed, and driv-

  Big Data and Computational Intelligence in Networking

  data collection), sink nodes (which are responsible for data transfer), and big data center (which is responsible for data integration and management). The main challenges for this ad hoc network are dynamic topological nodes mod- eling due to high-speed moving vehicles and users’ privacy preservation policy assurance. First, vehicles should be registered in a data center before connect- ing to the network, after which a single sign-on algorithm is utilized to associate authenticated vehicles with the big data center. Finally, a secure information col- lection scheme is applied with lots of operational functions, such as integrity, to protect messages against modification or destruction.

  Given a more general application scenario in WSNs, a traditional data col- lection process selects relay nodes (closer to the sink) to transfer the data, which are from the nodes beyond the communication range of sink node, to the sink node, resulting in faster energy consumption of the relay nodes that are easily disconnected from networks [39]. In order to tackle this, mobile data collectors (MDC), roaming over the whole sensing field while pausing at some points to collect data in short-range communication for energy saving, are brought into use. Three stages collaboratively contribute to data collection using MDC in WSN [40]:

  1. Discovery: Sensors identify the mobile collector existing in their commu- nication range.

  2. Data transfer: Ensure maximum data throughput in minimal contact time.

  3. Routing: The collected data from multiple nodes is forwarded toward the sink or base station. There are two common strategies in data collection protocols using MDC: mobile sink and mobile relay nodes protocols. The mobile sink is characterized by the moving sink which moves in a predesigned path and sends a message to sensors in its communication range. Sensors receiving the message will identify the sink and respond to it with collected data. The mobile relay nodes try to strike a bal- ance between the tour length of the mobile relay and the relay hop count of the sensors in data aggregation to save energy [41].