
The Expert’s Voice® in Spark

Big Data Analytics with Spark

A Practitioner’s Guide to Using Spark for Large-Scale Data Processing, Machine Learning, and Graph Analytics, and High-Velocity Data Stream Processing

Mohammed Guller

Copyright © 2015 by Mohammed Guller

This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. Exempted from this legal reservation are brief excerpts in connection with reviews or scholarly analysis or material supplied specifically for the purpose of being entered and executed on a computer system, for exclusive use by the purchaser of the work. Duplication of this publication or parts thereof is permitted only under the provisions of the Copyright Law of the Publisher’s location, in its current version, and permission for use must always be obtained from Springer. Permissions for use may be obtained through RightsLink at the Copyright Clearance Center. Violations are liable to prosecution under the respective Copyright Law.

ISBN-13 (pbk): 978-1-4842-0965-3
ISBN-13 (electronic): 978-1-4842-0964-6

Trademarked names, logos, and images may appear in this book. Rather than use a trademark symbol with every occurrence of a trademarked name, logo, or image, we use the names, logos, and images only in an editorial fashion and to the benefit of the trademark owner, with no intention of infringement of the trademark.

The use in this publication of trade names, trademarks, service marks, and similar terms, even if they are not identified as such, is not to be taken as an expression of opinion as to whether or not they are subject to proprietary rights.

While the advice and information in this book are believed to be true and accurate at the date of publication, neither the authors nor the editors nor the publisher can accept any legal responsibility for any errors or omissions that may be made. The publisher makes no warranty, express or implied, with respect to the material contained herein.

Managing Director: Welmoed Spahr
Lead Editor: Celestin John Suresh
Development Editor: Chris Nelson
Technical Reviewers: Sundar Rajan Raman and Heping Liu
Editorial Board: Steve Anglin, Louise Corrigan, Jim DeWolf, Jonathan Gennick, Robert Hutchinson, Michelle Lowman, James Markham, Susan McDermott, Matthew Moodie, Jeffrey Pepper, Douglas Pundick, Ben Renow-Clarke, Gwenan Spearing
Coordinating Editor: Jill Balzano
Copy Editor: Kim Burton-Weisman
Compositor: SPi Global
Indexer: SPi Global
Artist: SPi Global

Distributed to the book trade worldwide by Springer Science+Business Media New York, 233 Spring Street, 6th Floor, New York, NY 10013. Phone 1-800-SPRINGER, fax (201) 348-4505, e-mail orders-ny@springer-sbm.com, or visit www.springeronline.com. Apress Media, LLC is a California LLC and the sole member (owner) is Springer Science + Business Media Finance Inc (SSBM Finance Inc). SSBM Finance Inc is a Delaware corporation.

For information on translations, please e-mail rights@apress.com, or visit www.apress.com.

Apress and friends of ED books may be purchased in bulk for academic, corporate, or promotional use. eBook versions and licenses are also available for most titles. For more information, reference our Special Bulk Sales–eBook Licensing web page at www.apress.com/bulk-sales.

Any source code or other supplementary material referenced by the author in this text is available to readers at www.apress.com. For detailed information about how to locate your book’s source code, go to www.apress.com/source-code/.

  To my mother, who not only brought me into this world, but also raised me with unconditional love.

Contents at a Glance

About the Author
About the Technical Reviewers
Acknowledgments
Introduction

■ Chapter 1: Big Data Technology Landscape
■ Chapter 2: Programming in Scala
■ Chapter 3: Spark Core
■ Chapter 4: Interactive Data Analysis with Spark Shell
■ Chapter 5: Writing a Spark Application
■ Chapter 6: Spark Streaming
■ Chapter 7: Spark SQL
■ Chapter 8: Machine Learning with Spark
■ Chapter 9: Graph Processing with Spark
■ Chapter 10: Cluster Managers
■ Chapter 11: Monitoring

Bibliography
Index

Contents

About the Author
About the Technical Reviewers
Acknowledgments
Introduction

■ Chapter 1: Big Data Technology Landscape
    Hadoop
        HDFS (Hadoop Distributed File System)
        MapReduce
        Hive
    SequenceFile
    Avro
    Thrift
    Parquet
    ORC
    Messaging Systems
        Kafka
        ZeroMQ
    NoSQL
        Cassandra
        HBase
    Distributed SQL Query Engine
        Impala
        Presto
        Apache Drill
    Summary

■ Chapter 2: Programming in Scala
    Functional Programming (FP)
        Functions
    Scala Fundamentals
        Getting Started
        Everything Is an Expression
        Basic Types
        Variables
        Functions
        Classes
        Singletons
        Traits
        Operators
        Pattern Matching
        Option Type
        Collections
    A Standalone Scala Application
    Summary

■ Chapter 3: Spark Core
    Overview
        Key Features
        Ideal Applications
    High-level Architecture
        Executors
    Application Execution
    Data Sources
    Application Programming Interface (API)
        Resilient Distributed Datasets (RDD)
        Creating an RDD
    Lazy Operations
    Caching
        RDD Caching Methods
        RDD Caching Is Fault Tolerant
        Cache Memory Management
    Spark Jobs
    Broadcast Variables
    Summary

■ Chapter 4: Interactive Data Analysis with Spark Shell
    Getting Started
        Download
        Extract
        Run
    REPL Commands
    Using the Spark Shell as a Scala Shell
    Number Analysis
    Summary

■ Chapter 5: Writing a Spark Application
    Hello World in Spark
    Compiling and Running the Application
        sbt (Simple Build Tool)
        Compiling the Code
    Monitoring the Application
    Debugging the Application
    Summary

■ Chapter 6: Spark Streaming
    Spark Streaming Is a Spark Add-on
    High-Level Architecture
        Data Stream Sources
        Receiver
    Application Programming Interface (API)
        StreamingContext
        Discretized Stream (DStream)
    A Complete Spark Streaming Application
    Summary

■ Chapter 7: Spark SQL
    Introducing Spark SQL
        Integration with Other Spark Libraries
        Data Sources
        Hive Interoperability
    Performance
        Reduced Disk I/O
        Columnar Storage
        Skip Rows
        Query Optimization
    Applications
        ETL (Extract Transform Load)
        Distributed JDBC/ODBC SQL Query Engine
    Application Programming Interface (API)
        Key Abstractions
        Processing Data Programmatically with SQL/HiveQL
        Saving a DataFrame
    Built-in Functions
        Aggregate
        Date/Time
        String
    UDFs and UDAFs
    Interactive Analysis with Spark SQL JDBC Server

■ Chapter 8: Machine Learning with Spark
    The MLlib API
        Data Types
        Model Evaluation
    An Example MLlib Application
        Dataset
        Code
    Spark ML
        ML Dataset
        Estimator
        PipelineModel
        Grid Search
    An Example Spark ML Application
        Dataset
        Code
    Summary

■ Chapter 9: Graph Processing with Spark
    Graph Properties
    Data Abstractions
    Summary

■ Chapter 10: Cluster Managers

■ Chapter 11: Monitoring
    Monitoring a Spark Streaming Application
    Monitoring Spark SQL JDBC/ODBC Server
    Summary

Bibliography

About the Author

Mohammed Guller is the principal architect at Glassbeam, where he leads the development of advanced and predictive analytics products. He is a big data and Spark expert. He is frequently invited to speak at big data–related conferences. He is passionate about building new products, big data analytics, and machine learning.

Over the last 20 years, Mohammed has successfully led the development of several innovative technology products from concept to release. Prior to joining Glassbeam, he was the founder of TrustRecs.com, which he started after working at IBM for five years. Before IBM, he worked in a number of hi-tech start-ups, leading new product development.

Mohammed has a master’s degree in business administration from the University of California, Berkeley, and a master’s degree in computer applications from RCC, Gujarat University, India.

About the Technical Reviewers

Sundar Rajan Raman is a big data architect currently working for Bank of America. He has a bachelor of technology degree from the National Institute of Technology, Silchar, India. He is a seasoned Java and J2EE programmer with expertise in Hadoop, Spark, MongoDB, and big data analytics. He has worked at companies such as AT&T, Singtel, and Deutsche Bank. He is also a platform specialist with vast experience in SonicMQ, WebSphere MQ, and TIBCO, with the respective certifications.

His current focus is on big data architecture.

  I would like to thank my wife, Hema, and daughter, Shriya, for their patience during the review process.

Heping Liu has a PhD in engineering, with research focused on forecasting and intelligent optimization algorithms and their applications. Dr. Liu is an expert in big data analytics and machine learning. He has worked for a few startup companies, where he played a leading role in building forecasting, optimization, and machine learning models on big data infrastructure, and in designing and creating the big data infrastructure to support model development.

Dr. Liu has been active in the academic arena. He has published 20 academic papers, which have appeared in journals such as Applied Soft Computing and the Journal of the Operational Research Society. He has served as a reviewer for 20 top academic journals, including IEEE Transactions on Evolutionary Computation and Applied Soft Computing, and he has been an editorial board member of the International Journal of Business Analytics.

Acknowledgments

  Many people have contributed to this book directly or indirectly. Without the support, encouragement, and help that I received from various people, it would have not been possible for me to write this book. I would like to take this opportunity to thank those people.

First and foremost, I would like to thank my beautiful wife, Tarannum, and my three amazing kids, Sarah, Soha, and Sohail. Writing a book is an arduous task. Working full-time and writing a book at the same time meant that I was not spending much time with my family. During work hours, I was busy with work. Evenings and weekends were completely consumed by the book. I thank my family for providing me with all the support and encouragement. Occasionally, Soha and Sohail would come up with ingenious plans to get me to play with them, but for the most part, they let me work on the book when I should have been playing with them.

  Next, I would like to thank Matei Zaharia, Reynold Xin, Michael Armbrust, Tathagata Das, Patrick Wendell, Joseph Bradley, Xiangrui Meng, Joseph Gonzalez, Ankur Dave, and other Spark developers. They have not only created an amazing piece of technology, but also continue to rapidly enhance it. Without their invention, this book would not exist.

  Spark was new and few people knew about it when I first proposed using it at Glassbeam to solve some of the problems we were struggling with at that time. I would like to thank our VP of Engineering, Ashok Agarwal, and CEO, Puneet Pandit, for giving me the permission to proceed. Without the hands-on experience that I gained from embedding Spark in our product and using it on a regular basis, it would have been difficult to write a book on it.

  Next, I would like to thank my technical reviewers, Sundar Rajan Raman and Heping Liu. They painstakingly checked the content for accuracy, ran the examples to make sure that the code works, and provided helpful suggestions.

Finally, I would like to thank the people at Apress who worked on this book, including Chris Nelson, Jill Balzano, Kim Burton-Weisman, Celestin John Suresh, Nikhil Chinnari, Dhaneesh Kumar, and others. Jill Balzano coordinated all the book-related activities. As an editor, Chris Nelson’s contribution to this book is invaluable. I appreciate his suggestions and edits. This book became much better because of his involvement. My copy editor, Kim Burton-Weisman, read every sentence in the book to make sure it was written correctly and fixed the problematic ones. It was a pleasure working with the Apress team.

—Mohammed Guller
Danville, CA

Introduction

  This book is a concise and easy-to-understand tutorial for big data and Spark. It will help you learn how to use Spark for a variety of big data analytic tasks. It covers everything that you need to know to productively use Spark.

  One of the benefits of purchasing this book is that it will help you learn Spark efficiently; it will save you a lot of time. The topics covered in this book can be found on the Internet. There are numerous blogs, presentations, and YouTube videos covering Spark. In fact, the amount of material on Spark can be overwhelming. You could spend months reading bits and pieces about Spark at different places on the Web. This book provides a better alternative with the content nicely organized and presented in an easy-to-understand format.

  The content and the organization of the material in this book are based on the Spark workshops that I occasionally conduct at different big data–related conferences. The positive feedback given by the attendees for both the content and the flow motivated me to write this book.

  One of the differences between a book and a workshop is that the latter is interactive. However, after conducting a number of Spark workshops, I know the kind of questions people generally have and I have addressed those in the book. Still, if you have questions as you read the book, I encourage you to contact me via LinkedIn or Twitter. Feel free to ask any question. There is no such thing as a stupid question.

  Rather than cover every detail of Spark, the book covers important Spark-related topics that you need to know to effectively use Spark. My goal is to help you build a strong foundation. Once you have a strong foundation, it is easy to learn all the nuances of a new technology. In addition, I wanted to keep the book as simple as possible. If Spark looks simple after reading this book, I have succeeded in my goal.

  No prior experience is assumed with any of the topics covered in this book. It introduces the key concepts, step by step. Each section builds on the previous section. Similarly, each chapter serves as a stepping-stone for the next chapter. You can skip some of the later chapters covering the different Spark libraries if you don’t have an immediate need for that library. However, I encourage you to read all the chapters. Even though it may not seem relevant to your current project, it may give you new ideas.

  You will learn a lot about Spark and related technologies from reading this book. However, to get the most out of this book, type the examples shown in the book. Experiment with the code samples. Things become clearer when you write and execute code. If you practice and experiment with the examples as you read the book, by the time you finish reading it, you will be a solid Spark developer.

One of the resources that I find useful when I am developing Spark applications is the official Spark API (application programming interface) documentation, available at spark.apache.org/docs/latest/api. As a beginner, you may find it hard to understand, but once you have learned the basic concepts, you will find it very useful.

  Another useful resource is the Spark mailing list. The Spark community is active and helpful. Not only do the Spark developers respond to questions, but experienced Spark users also volunteer their time helping new users. No matter what problem you run into, chances are that someone on the Spark mailing list has solved that problem.

  And, you can reach out to me. I would love to hear from you. Feedback, suggestions, and questions are welcome.

—Mohammed Guller

CHAPTER 1

Big Data Technology Landscape

We are in the age of big data. Data has not only become the lifeblood of any organization, but is also growing

  exponentially. Data generated today is several magnitudes larger than what was generated just a few years ago. The challenge is how to get business value out of this data. This is the problem that big data–related technologies aim to solve. Therefore, big data has become one of the hottest technology trends over the last few years. Some of the most active open source projects are related to big data, and the number of these projects is growing rapidly. The number of startups focused on big data has exploded in recent years. Large established companies are making significant investments in big data technologies.

  Although the term “big data” is hot, its definition is vague. People define it in different ways. One definition relates to the volume of data; another definition relates to the richness of data. Some define big data as data that is “too big” by traditional standards; whereas others define big data as data that captures more nuances about the entity that it represents. An example of the former would be a dataset whose volume exceeds petabytes or several terabytes. If this data were stored in a traditional relational database (RDBMS) table, it would have billions of rows. An example of the latter definition is a dataset with extremely wide rows. If this data were stored in a relational database table, it would have thousands of columns. Another popular definition of big data is data characterized by three Vs: volume, velocity, and variety. I just discussed volume. Velocity means that data is generated at a fast rate. Variety refers to the fact that data can be unstructured, semi-structured, or multi-structured.

Standard relational databases could not easily handle big data. The core technology for these databases was designed several decades ago, when few organizations had petabytes or even terabytes of data. Today it is not uncommon for some organizations to generate terabytes of data every day. Not only the volume of data, but also the rate at which it is being generated, is exploding. Hence, there was a need for new technologies that could not only process and analyze large volumes of data, but also ingest large volumes of data at a fast pace.

Other key driving factors for the big data technologies include scalability, high availability, and fault tolerance at a low cost. Technology for processing and analyzing large datasets has been extensively researched and has long been available in the form of proprietary commercial products. For example, MPP (massively parallel processing) databases have been around for a while. MPP databases use a “shared-nothing” architecture, where data is stored and processed across a cluster of nodes. Each node comes with its own set of CPUs, memory, and disks. They communicate via a network interconnect. Data is partitioned across the cluster of nodes. There is no contention among the nodes, so they can all process data in parallel. Examples of such databases include Teradata, Netezza, Greenplum, ParAccel, and Vertica. Teradata was invented in the late 1970s, and by the 1990s, it was capable of processing terabytes of data. However, proprietary MPP products are expensive. Not everybody can afford them.

This chapter introduces some of the open source big data–related technologies. Although it may seem that the technologies covered in this chapter have been randomly picked, they are connected by a common theme. They are used with Spark, or Spark provides a better alternative to some of these technologies. As you start using Spark, you may run into these technologies. In addition, familiarity with these technologies will help you better understand Spark, which we will introduce in Chapter 3.

Hadoop

Hadoop was one of the first popular open source big data technologies. It is a scalable, fault-tolerant system for processing large datasets across a cluster of commodity servers. It provides a simple programming framework for large-scale data processing using the resources available across a cluster of computers. Hadoop is inspired by a system invented at Google to create an inverted index for its search product. Jeffrey Dean and Sanjay Ghemawat published papers describing the systems that they had created at Google. The first one, titled “MapReduce: Simplified Data Processing on Large Clusters,” is available at research.google.com/archive/mapreduce.html. The second one, titled “The Google File System,” is available at research.google.com/archive/gfs.html. Inspired by these papers, Doug Cutting and Mike Cafarella developed an open source implementation, which later became Hadoop.

  Many organizations have replaced expensive proprietary commercial products with Hadoop for processing large datasets. One reason is cost. Hadoop is open source and runs on a cluster of commodity hardware. You can scale it easily by adding cheap servers. High availability and fault tolerance are provided by Hadoop, so you don’t need to buy expensive hardware. Second, it is better suited for certain types of data processing tasks, such as batch processing and ETL (extract transform load) of large-scale data.

  Hadoop is built on a few important ideas. First, it is cheaper to use a cluster of commodity servers for both storing and processing large amounts of data than using high-end powerful servers. In other words, Hadoop uses scale-out architecture instead of scale-up architecture.

  Second, implementing fault tolerance through software is cheaper than implementing it in hardware. Fault-tolerant servers are expensive. Hadoop does not rely on fault-tolerant servers. It assumes that servers will fail and transparently handles server failures. An application developer need not worry about handling hardware failures. Those messy details can be left for Hadoop to handle.

  Third, moving code from one computer to another over a network is a lot more efficient and faster than moving a large dataset across the same network. For example, assume you have a cluster of 100 computers with a terabyte of data on each computer. One option for processing this data would be to move it to a very powerful server that can process 100 terabytes of data. However, moving 100 terabytes of data will take a long time, even on a very fast network. In addition, you will need very expensive hardware to process data with this approach. Another option is to move the code that processes this data to each computer in your 100-node cluster; it is a lot faster and more efficient than the first option. Moreover, you don’t need high-end servers, which are expensive.

  Fourth, writing a distributed application can be made easy by separating core data processing logic from distributed computing logic. Developing an application that takes advantage of resources available on a cluster of computers is a lot harder than developing an application that runs on a single computer. The pool of developers who can write applications that run on a single machine is several magnitudes larger than those who can write distributed applications. Hadoop provides a framework that hides the complexities of writing distributed applications. It thus allows organizations to tap into a much bigger pool of application developers.
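To make this separation concrete, the following is a minimal, single-machine sketch of the MapReduce programming model written in Scala, the language used throughout this book. It is illustrative only and does not use the Hadoop API; all names in it are made up. The map and reduce functions are the only logic a developer writes; the grouping step in between stands in for the “shuffle” that the Hadoop framework performs across a cluster, along with scheduling and fault handling.

// Illustrative sketch only: simulates the MapReduce programming model locally.
object WordCountSketch {
  // Map phase: turn one input line into (word, 1) pairs.
  def map(line: String): Seq[(String, Int)] =
    line.split("\\s+").filter(_.nonEmpty).map(w => (w.toLowerCase, 1)).toSeq

  // Reduce phase: combine all the counts emitted for a single word.
  def reduce(word: String, counts: Seq[Int]): (String, Int) = (word, counts.sum)

  def main(args: Array[String]): Unit = {
    val input = Seq("to be or not to be", "to do is to be")

    val mapped   = input.flatMap(map)    // what the map tasks produce
    val shuffled = mapped.groupBy(_._1)  // the "shuffle": group pairs by key
    val reduced  = shuffled.map { case (w, pairs) => reduce(w, pairs.map(_._2)) }

    reduced.toSeq.sortBy(p => -p._2).foreach { case (w, c) => println(s"$w\t$c") }
  }
}

Running this prints each word with its count (for example, “to 4”). The point is that the core data processing logic fits in two small functions, while distribution is handled elsewhere.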

Although people talk about Hadoop as a single product, it is not really a single product. It consists of three key components: a cluster manager, a distributed compute engine, and a distributed file system (see Figure 1-1).

Figure 1-1. Key conceptual Hadoop components

  Until version 2.0, Hadoop’s architecture was monolithic. All the components were tightly coupled and bundled together. Starting with version 2.0, Hadoop adopted a modular architecture, which allows you to mix and match Hadoop components with non-Hadoop technologies.

The concrete implementations of the three conceptual components shown in Figure 1-1 are YARN, MapReduce, and HDFS (see Figure 1-2).

Figure 1-2. Key Hadoop components

HDFS and MapReduce are covered in this chapter. YARN is covered in Chapter 10.

HDFS (Hadoop Distributed File System)

HDFS, as the name implies, is a distributed file system. It stores a file across a cluster of commodity servers. It was designed to store and provide fast access to big files and large datasets. It is scalable and fault tolerant.

HDFS is a block-structured file system. Just like Linux file systems, HDFS splits a file into fixed-size blocks, also known as partitions or splits. The default block size is 128 MB, but it is configurable. It should be clear from the block size that HDFS is not designed for storing small files. If possible, HDFS spreads out the blocks of a file across different machines. Therefore, an application can parallelize file-level read and write operations; reading or writing a large HDFS file spread across many disks on different computers is much faster than reading or writing a large file stored on a single disk.

  Distributing a file to multiple machines increases the risk of a file becoming unavailable if one of the machines in a cluster fails. HDFS mitigates this risk by replicating each file block on multiple machines. The default replication factor is 3. So even if one or two machines serving a file block fail, that file can still be read. HDFS was designed with the assumption that machines may fail on a regular basis. So it can handle failure of one or more machines in a cluster.
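The block and replica layout of a stored file can be inspected programmatically. The following is a short sketch using the standard Hadoop FileSystem API; the file path is hypothetical, and it assumes the client’s configuration files on the classpath point at an HDFS cluster.

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object BlockInfo {
  def main(args: Array[String]): Unit = {
    // Assumes core-site.xml/hdfs-site.xml on the classpath identify the cluster.
    val fs   = FileSystem.get(new Configuration())
    val path = new Path("/data/large-file.log") // hypothetical file

    val status = fs.getFileStatus(path)
    println(s"Block size:  ${status.getBlockSize} bytes")
    println(s"Replication: ${status.getReplication}")

    // One entry per block, listing the DataNodes holding its replicas.
    fs.getFileBlockLocations(status, 0, status.getLen).foreach { b =>
      println(s"offset=${b.getOffset} length=${b.getLength} hosts=${b.getHosts.mkString(", ")}")
    }
  }
}

With the default settings described above, each printed block would be at most 128 MB long and would list three hosts.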

An HDFS cluster consists of two types of nodes: NameNode and DataNode (see Figure 1-3). A NameNode manages the file system namespace. It stores all the metadata for a file. For example, it tracks file names, permissions, and file block locations. To provide fast access to the metadata, a NameNode stores all the metadata in memory. A DataNode stores the actual file content in the form of file blocks.

Figure 1-3. HDFS architecture

The NameNode periodically receives two types of messages from the DataNodes in an HDFS cluster.

  One is called Heartbeat and the other is called Blockreport. A DataNode sends a heartbeat message to inform the NameNode that it is functioning properly. A Blockreport contains a list of all the data blocks on a DataNode.

  When a client application wants to read a file, it first contacts a NameNode. The NameNode responds with the locations of all the blocks that comprise that file. A block location identifies the DataNode that holds data for that file block. A client then directly sends a read request to the DataNodes for each file block. A NameNode is not involved in the actual data transfer from a DataNode to a client.
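Client applications do not speak this protocol directly; the Hadoop FileSystem API hides the NameNode lookup and the DataNode reads behind an ordinary input stream. A minimal sketch, again with a hypothetical path:

import java.io.{BufferedReader, InputStreamReader}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object HdfsRead {
  def main(args: Array[String]): Unit = {
    val fs = FileSystem.get(new Configuration())

    // open() fetches block locations from the NameNode behind the scenes,
    // then streams bytes directly from the DataNodes serving each block.
    val in     = fs.open(new Path("/data/large-file.log")) // hypothetical file
    val reader = new BufferedReader(new InputStreamReader(in))
    try {
      // Print the first five lines of the file.
      Iterator.continually(reader.readLine()).takeWhile(_ != null).take(5).foreach(println)
    } finally reader.close()
  }
}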