
Hadoop in Practice

ALEX HOLMES

MANNING
SHELTER ISLAND

For online information and ordering of this and other Manning books, please visit
www.manning.com. The publisher offers discounts on this book when ordered in quantity.
For more information, please contact
Special Sales Department
Manning Publications Co.

20 Baldwin Road
PO Box 261
Shelter Island, NY 11964
Email: orders@manning.com

©2012 by Manning Publications Co. All rights reserved.

No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in
any form or by means electronic, mechanical, photocopying, or otherwise, without prior written
permission of the publisher.

Many of the designations used by manufacturers and sellers to distinguish their products are
claimed as trademarks. Where those designations appear in the book, and Manning
Publications was aware of a trademark claim, the designations have been printed in initial caps
or all caps.

Recognizing the importance of preserving what has been written, it is Manning’s policy to have
the books we publish printed on acid-free paper, and we exert our best efforts to that end.
Recognizing also our responsibility to conserve the resources of our planet, Manning books
are printed on paper that is at least 15 percent recycled and processed without the use of elemental chlorine.

Manning Publications Co.
20 Baldwin Road
PO Box 261
Shelter Island, NY 11964

Development editor: Cynthia Kane
Copyeditors: Bob Herbstman, Tara Walsh
Proofreader: Katie Tennant
Typesetter: Gordan Salinovic
Illustrator: Martin Murtonen
Cover designer: Marija Tudor

ISBN 9781617290237
Printed in the United States of America
1 2 3 4 5 6 7 8 9 10 – MAL – 17 16 15 14 13 12

To Michal, Marie, Oliver, Ollie, Mish, and Anch

brief contents

PART 1  BACKGROUND AND FUNDAMENTALS .............................. 1
     1  Hadoop in a heartbeat  3

PART 2  DATA LOGISTICS .......................................... 25
     2  Moving data in and out of Hadoop  27
     3  Data serialization—working with text and beyond  83

PART 3  BIG DATA PATTERNS ...................................... 137
     4  Applying MapReduce patterns to big data  139
     5  Streamlining HDFS for big data  169
     6  Diagnosing and tuning performance problems  194

PART 4  DATA SCIENCE ........................................... 251
     7  Utilizing data structures and algorithms  253
     8  Integrating R and Hadoop for statistics and more  285
     9  Predictive analytics with Mahout  305

PART 5  TAMING THE ELEPHANT .................................... 333
    10  Hacking with Hive  335
    11  Programming pipelines with Pig  359
    12  Crunch and other technologies  394
    13  Testing and debugging  410

contents

preface  xv
acknowledgments  xvii
about this book  xviii

PART 1  BACKGROUND AND FUNDAMENTALS .............................. 1

1  Hadoop in a heartbeat  3
   1.1  What is Hadoop?  4
   1.2  Running Hadoop  14
   1.3  Chapter summary  23

PART 2  DATA LOGISTICS .......................................... 25

2  Moving data in and out of Hadoop  27
   2.1  Key elements of ingress and egress  29
   2.2  Moving data into Hadoop  30
        TECHNIQUE 1   Pushing system log messages into HDFS with Flume  33
        TECHNIQUE 2   An automated mechanism to copy files into HDFS  43
        TECHNIQUE 3   Scheduling regular ingress activities with Oozie  48
        TECHNIQUE 4   Database ingress with MapReduce  53
        TECHNIQUE 5   Using Sqoop to import data from MySQL  58
        TECHNIQUE 6   HBase ingress into HDFS  68
        TECHNIQUE 7   MapReduce with HBase as a data source  70
   2.3  Moving data out of Hadoop  73
        TECHNIQUE 8   Automated file copying from HDFS  73
        TECHNIQUE 9   Using Sqoop to export data to MySQL  75
        TECHNIQUE 10  HDFS egress to HBase  78
        TECHNIQUE 11  Using HBase as a data sink in MapReduce  79
   2.4  Chapter summary  81

3  Data serialization—working with text and beyond  83
   3.1  Understanding inputs and outputs in MapReduce  84
   3.2  Processing common serialization formats  91
        TECHNIQUE 12  MapReduce and XML  91
        TECHNIQUE 13  MapReduce and JSON  95
   3.3  Big data serialization formats  99
        TECHNIQUE 14  Working with SequenceFiles  103
        TECHNIQUE 15  Integrating Protocol Buffers with MapReduce  110
        TECHNIQUE 16  Working with Thrift  117
        TECHNIQUE 17  Next-generation data serialization with MapReduce  120
   3.4  Custom file formats  127
        TECHNIQUE 18  Writing input and output formats for CSV  128
   3.5  Chapter summary  136

PART 3  BIG DATA PATTERNS ...................................... 137

4  Applying MapReduce patterns to big data  139
   4.1  Joining  140
        TECHNIQUE 19  Optimized repartition joins  142
        TECHNIQUE 20  Implementing a semi-join  148
   4.2  Sorting  155
        TECHNIQUE 21  Implementing a secondary sort  157
        TECHNIQUE 22  Sorting keys across multiple reducers  162
   4.3  Sampling  165
        TECHNIQUE 23  Reservoir sampling  165
   4.4  Chapter summary  168

5  Streamlining HDFS for big data  169
   5.1  Working with small files  170
        TECHNIQUE 24  Using Avro to store multiple small files  170
   5.2  Efficient storage with compression  178
        TECHNIQUE 25  Picking the right compression codec for your data  178
        TECHNIQUE 26  Compression with HDFS, MapReduce, Pig, and Hive  182
        TECHNIQUE 27  Splittable LZOP with MapReduce, Hive, and Pig  187
   5.3  Chapter summary  193

6  Diagnosing and tuning performance problems  194
   6.1  Measuring MapReduce and your environment  195
   6.2  Determining the cause of your performance woes  198
        TECHNIQUE 28  Investigating spikes in input data  200
        TECHNIQUE 29  Identifying map-side data skew problems  201
        TECHNIQUE 30  Determining if map tasks have an overall low throughput  203
        TECHNIQUE 31  Small files  204
        TECHNIQUE 32  Unsplittable files  206
        TECHNIQUE 33  Too few or too many reducers  208
        TECHNIQUE 34  Identifying reduce-side data skew problems  209
        TECHNIQUE 35  Determining if reduce tasks have an overall low throughput  211
        TECHNIQUE 36  Slow shuffle and sort  213
        TECHNIQUE 37  Competing jobs and scheduler throttling  215
        TECHNIQUE 38  Using stack dumps to discover unoptimized user code  216
        TECHNIQUE 39  Discovering hardware failures  218
        TECHNIQUE 40  CPU contention  219
        TECHNIQUE 41  Memory swapping  220
        TECHNIQUE 42  Disk health  222
        TECHNIQUE 43  Networking  224
   6.3  Visualization  226
        TECHNIQUE 44  Extracting and visualizing task execution times  227
   6.4  Tuning  229
        TECHNIQUE 45  Profiling your map and reduce tasks  230
        TECHNIQUE 46  Avoid the reducer  234
        TECHNIQUE 47  Filter and project  235
        TECHNIQUE 48  Using the combiner  236
        TECHNIQUE 49  Blazingly fast sorting with comparators  237
        TECHNIQUE 50  Collecting skewed data  242
        TECHNIQUE 51  Reduce skew mitigation  243
   6.5  Chapter summary  249

PART 4  DATA SCIENCE ........................................... 251

7  Utilizing data structures and algorithms  253
   7.1  Modeling data and solving problems with graphs  254
        TECHNIQUE 52  Find the shortest distance between two users  256
        TECHNIQUE 53  Calculating FoFs  263
        TECHNIQUE 54  Calculate PageRank over a web graph  269
   7.2  Bloom filters  275
        TECHNIQUE 55  Parallelized Bloom filter creation in MapReduce  277
        TECHNIQUE 56  MapReduce semi-join with Bloom filters  281
   7.3  Chapter summary  284

8  Integrating R and Hadoop for statistics and more  285
   8.1  Comparing R and MapReduce integrations  286
   8.2  R fundamentals  288
   8.3  R and Streaming  290
        TECHNIQUE 57  Calculate the daily mean for stocks  290
        TECHNIQUE 58  Calculate the cumulative moving average for stocks  293
   8.4  Rhipe—Client-side R and Hadoop working together  297
        TECHNIQUE 59  Calculating the CMA using Rhipe  297
   8.5  RHadoop—a simpler integration of client-side R and Hadoop  301
        TECHNIQUE 60  Calculating CMA with RHadoop  302
   8.6  Chapter summary  304

9  Predictive analytics with Mahout  305
   9.1  Using recommenders to make product suggestions  306
        TECHNIQUE 61  Item-based recommenders using movie ratings  311
   9.2  Classification  314
        TECHNIQUE 62  Using Mahout to train and test a spam classifier  321
   9.3  Clustering with K-means  325
        TECHNIQUE 63  K-means with a synthetic 2D dataset  327
   9.4  Chapter summary  332

PART 5  TAMING THE ELEPHANT .................................... 333

10  Hacking with Hive  335
    10.1  Hive fundamentals  336
    10.2  Data analytics with Hive  338
          TECHNIQUE 64  Loading log files  338
          TECHNIQUE 65  Writing UDFs and compressed partitioned tables  344
          TECHNIQUE 66  Tuning Hive joins  350
    10.3  Chapter summary  358

11  Programming pipelines with Pig  359
    11.1  Pig fundamentals  360
    11.2  Using Pig to find malicious actors in log data  362
          TECHNIQUE 67  Schema-rich Apache log loading  363
          TECHNIQUE 68  Reducing your data with filters and projection  368
          TECHNIQUE 69  Grouping and counting IP addresses  370
          TECHNIQUE 70  IP Geolocation using the distributed cache  375
          TECHNIQUE 71  Combining Pig with your scripts  378
          TECHNIQUE 72  Combining data in Pig  380
          TECHNIQUE 73  Sorting tuples  381
          TECHNIQUE 74  Storing data in SequenceFiles  382
    11.3  Optimizing user workflows with Pig  385
          TECHNIQUE 75  A four-step process to working rapidly with big data  385
    11.4  Performance  390
          TECHNIQUE 76  Pig optimizations  390
    11.5  Chapter summary  393

12  Crunch and other technologies  394
    12.1  What is Crunch?  395
    12.2  Finding the most popular URLs in your logs  401
          TECHNIQUE 77  Crunch log parsing and basic analytics  402
    12.3  Joins  405
          TECHNIQUE 78  Crunch's repartition join  405
    12.4  Cascading  407
    12.5  Chapter summary  409

13  Testing and debugging  410
    13.1  Testing  410
          TECHNIQUE 79  Unit Testing MapReduce functions, jobs, and pipelines  413
          TECHNIQUE 80  Heavyweight job testing with the LocalJobRunner  421
    13.2  Debugging user space problems  424
          TECHNIQUE 81  Examining task logs  424
          TECHNIQUE 82  Pinpointing a problem Input Split  429
          TECHNIQUE 83  Figuring out the JVM startup arguments for a task  433
          TECHNIQUE 84  Debugging and error handling  433
    13.3  MapReduce gotchas  437
          TECHNIQUE 85  MapReduce anti-patterns  438
    13.4  Chapter summary  441

appendix A  Related technologies  443
appendix B  Hadoop built-in ingress and egress tools  471
appendix C  HDFS dissected  486
appendix D  Optimized MapReduce join frameworks  493
index  503

preface
I first encountered Hadoop in the fall of 2008 when I was working on an internet
crawl and analysis project at Verisign. My team was making discoveries similar to those
that Doug Cutting and others at Nutch had made several years earlier regarding how
to efficiently store and manage terabytes of crawled and analyzed data. At the time, we
were getting by with our home-grown distributed system, but the influx of a new data
stream and requirements to join that stream with our crawl data couldn’t be supported by our existing system in the required timelines.
After some research we came across the Hadoop project, which seemed to be a
perfect fit for our needs—it supported storing large volumes of data and provided a
mechanism to combine them. Within a few months we’d built and deployed a MapReduce application encompassing a number of MapReduce jobs, woven together with
our own MapReduce workflow management system onto a small cluster of 18 nodes. It
was a revelation to observe our MapReduce jobs crunching through our data in minutes. Of course we couldn’t anticipate the amount of time that we’d spend debugging
and performance-tuning our MapReduce jobs, not to mention the new roles we took
on as production administrators—the biggest surprise in this role was the number of
disk failures we encountered during those first few months supporting production!
As our experience and comfort level with Hadoop grew, we continued to build
more of our functionality using Hadoop to help with our scaling challenges. We also
started to evangelize the use of Hadoop within our organization and helped kick-start
other projects that were also facing big data challenges.
The greatest challenge we faced when working with Hadoop (and specifically
MapReduce) was relearning how to solve problems with it. MapReduce is its own flavor of parallel programming, which is quite different from the in-JVM programming
that we were accustomed to. The biggest hurdle was the first one—training our brains
to think MapReduce, a topic which the book Hadoop in Action by Chuck Lam (Manning Publications, 2010) covers well.
After you’re used to thinking in MapReduce, the next challenge is typically related
to the logistics of working with Hadoop, such as how to move data in and out of HDFS,
and effective and efficient ways to work with data in Hadoop. These areas of Hadoop
haven’t received much coverage, and that’s what attracted me to the potential of this
book—that of going beyond the fundamental word-count Hadoop usages and covering some of the more tricky and dirty aspects of Hadoop.
As I’m sure many authors have experienced, I went into this project confidently
believing that writing this book was just a matter of transferring my experiences onto
paper. Boy, did I get a reality check, but not altogether an unpleasant one, because
writing introduced me to new approaches and tools that ultimately helped better my
own Hadoop abilities. I hope that you get as much out of reading this book as I did
writing it.

acknowledgments
First and foremost, I want to thank Michael Noll, who pushed me to write this book.
He also reviewed my early chapter drafts and helped mold the organization of the
book. I can’t express how much his support and encouragement has helped me
throughout the process.
I’m also indebted to Cynthia Kane, my development editor at Manning, who
coached me through writing this book and provided invaluable feedback on my work.
Among many notable “Aha!” moments I had while working with Cynthia, the biggest
one was when she steered me into leveraging visual aids to help explain some of the
complex concepts in this book.
I also want to say a big thank you to all the reviewers of this book: Aleksei Sergeevich, Alexander Luya, Asif Jan, Ayon Sinha, Bill Graham, Chris Nauroth, Eli Collins,
Ferdy Galema, Harsh Chouraria, Jeff Goldschrafe, Maha Alabduljalil, Mark Kemna,
Oleksey Gayduk, Peter Krey, Philipp K. Janert, Sam Ritchie, Soren Macbeth, Ted Dunning, Yunkai Zhang, and Zhenhua Guo.
Jonathan Seidman, the primary technical editor, did a great job reviewing the
entire book shortly before it went into production. Many thanks to Josh Wills, the creator of Crunch, who kindly looked over the chapter that covers that topic. And more
thanks go to Josh Patterson, who reviewed my Mahout chapter.
All of the Manning staff were a pleasure to work with, and a special shout-out goes
to Troy Mott, Katie Tennant, Nick Chase, Tara Walsh, Bob Herbstman, Michael Stephens, Marjan Bace, and Maureen Spencer.
Finally, a special thanks to my wife, Michal, who had to put up with a cranky husband
working crazy hours. She was a source of encouragement throughout the entire process.


about this book
Doug Cutting, Hadoop’s creator, likes to call Hadoop the kernel for big data, and I’d
tend to agree. With its distributed storage and compute capabilities, Hadoop is fundamentally an enabling technology for working with huge datasets. Hadoop, to me, provides a bridge between structured (RDBMS) and unstructured (log files, XML, text)
data, and allows these datasets to be easily joined together. This has evolved from traditional use cases, such as combining OLTP and log files, to more sophisticated uses,
such as using Hadoop for data warehousing (exemplified by Facebook) and the field
of data science, which studies and makes new discoveries about data.
This book collects a number of intermediary and advanced Hadoop examples and
presents them in a problem/solution format. Each of the 85 techniques addresses a
specific task you’ll face, like using Flume to move log files into Hadoop or using
Mahout for predictive analysis. Each problem is explored step by step and, as you work
through them, you’ll find yourself growing more comfortable with Hadoop and at
home in the world of big data.
This hands-on book targets users who have some practical experience with
Hadoop and understand the basic concepts of MapReduce and HDFS. Manning’s
Hadoop in Action by Chuck Lam contains the necessary prerequisites to understand
and apply the techniques covered in this book.
Many techniques in this book are Java-based, which means readers are expected to
possess an intermediate-level knowledge of Java. An excellent text for all levels of Java
users is Effective Java, Second Edition, by Joshua Bloch (Addison-Wesley, 2008).


Roadmap
This book has 13 chapters divided into five parts.
Part 1 contains a single chapter that’s the introduction to this book. It reviews
Hadoop basics and looks at how to get Hadoop up and running on a single host. It
wraps up with a walk-through on how to write and execute a MapReduce job.
Part 2, “Data logistics,” consists of two chapters that cover the techniques and
tools required to deal with data fundamentals, getting data in and out of Hadoop,
and how to work with various data formats. Getting data into Hadoop is one of the
first roadblocks commonly encountered when working with Hadoop, and chapter 2
is dedicated to looking at a variety of tools that work with common enterprise data
sources. Chapter 3 covers how to work with ubiquitous data formats such as XML
and JSON in MapReduce, before going on to look at data formats better suited to
working with big data.
Part 3 is called “Big data patterns,” and looks at techniques to help you work effectively with large volumes of data. Chapter 4 examines how to optimize MapReduce
join and sort operations, and chapter 5 covers working with a large number of small
files, and compression. Chapter 6 looks at how to debug MapReduce performance
issues, and also covers a number of techniques to help make your jobs run faster.
Part 4 is all about “Data science,” and delves into the tools and methods that help
you make sense of your data. Chapter 7 covers how to represent data such as graphs
for use with MapReduce, and looks at several algorithms that operate on graph data.
Chapter 8 describes how R, a popular statistical and data mining platform, can be integrated with Hadoop. Chapter 9 describes how Mahout can be used in conjunction
with MapReduce for massively scalable predictive analytics.
Part 5 is titled “Taming the elephant,” and examines a number of technologies
that make it easier to work with MapReduce. Chapters 10 and 11 cover Hive and Pig
respectively, both of which are MapReduce domain-specific languages (DSLs) geared
at providing high-level abstractions. Chapter 12 looks at Crunch and Cascading, which
are Java libraries that offer their own MapReduce abstractions, and chapter 13 covers
techniques to help write unit tests, and to debug MapReduce problems.
The appendixes start with appendix A, which covers instructions on installing both
Hadoop and all the other related technologies covered in the book. Appendix B covers low-level Hadoop ingress/egress mechanisms that the tools covered in chapter 2
leverage. Appendix C looks at how HDFS supports reads and writes, and appendix D
covers a couple of MapReduce join frameworks written by the author and utilized in
chapter 4.

Code conventions and downloads
All source code in listings or in text is in a fixed-width font like this to separate it
from ordinary text. Code annotations accompany many of the listings, highlighting
important concepts.


All of the text and examples in this book work with Hadoop 0.20.x (and 1.x), and
most of the code is written using the newer org.apache.hadoop.mapreduce MapReduce
APIs. The few examples that leverage the older org.apache.hadoop.mapred package are
usually the result of working with a third-party library or a utility that only works with
the old API.
All of the code used in this book is available on GitHub at https://github.com/alexholmes/hadoop-book as well as from the publisher’s website at www.manning.com/HadoopinPractice.
Building the code depends on Java 1.6 or newer, git, and Maven 3.0 or newer. Git is
a source control management system, and GitHub provides hosted git repository services. Maven is used for the build system.
You can clone (download) my GitHub repository with the following command:
$ git clone git://github.com/alexholmes/hadoop-book.git

After the sources are downloaded you can build the code:
$ cd hadoop-book
$ mvn package

This will create a Java JAR file, target/hadoop-book-1.0.0-SNAPSHOT-jar-with-dependencies.jar. Running the code is equally simple with the included bin/run.sh.
If you’re running on a CDH distribution, the scripts will run configuration-free. If
you’re running on any other distribution, you’ll need to set the HADOOP_HOME environment variable to point to your Hadoop installation directory.
The bin/run.sh script takes as the first argument the fully qualified Java class name
of the example, followed by any arguments expected by the example class. As an
example, to run the inverted index MapReduce code from chapter 1, you’d run the
following:
$ hadoop fs -mkdir /tmp
$ hadoop fs -put test-data/ch1/* /tmp/
# replace the path below with the location of your Hadoop installation
# this isn't required if you are running CDH3
export HADOOP_HOME=/usr/local/hadoop
$ bin/run.sh com.manning.hip.ch1.InvertedIndexMapReduce \
/tmp/file1.txt /tmp/file2.txt output

The previous code won’t work if you don’t have Hadoop installed. Please refer to
chapter 1 for CDH installation instructions, or appendix A for Apache installation
instructions.


Third-party libraries
I use a number of third-party libraries for the sake of convenience. They’re included
in the Maven-built JAR so there’s no extra work required to work with these libraries.
The following table contains a list of the libraries that are in prevalent use throughout
the code examples.
Common third-party libraries

Library: Apache Commons IO
Link: http://commons.apache.org/io/
Details: Helper functions to help work with input and output streams in Java. You’ll make frequent use of the IOUtils to close connections and to read the contents of files into strings.

Library: Apache Commons Lang
Link: http://commons.apache.org/lang/
Details: Helper functions to work with strings, dates, and collections. You’ll make frequent use of the StringUtils class for tokenization.

Datasets
Throughout this book you’ll work with three datasets to provide some variety for the
examples. All the datasets are small to make them easy to work with. Copies of the
exact data used are available in the GitHub repository in the directory https://github.com/alexholmes/hadoop-book/tree/master/test-data. I also sometimes have
data that’s specific to a chapter, which exists within chapter-specific subdirectories
under the same GitHub location.
NASDAQ FINANCIAL STOCKS

I downloaded the NASDAQ daily exchange data from Infochimps (see http://mng.bz/xjwc). I filtered this huge dataset down to just five stocks and their start-of-year values from 2000 through 2009. The data used for this book is available on
GitHub at https://github.com/alexholmes/hadoop-book/blob/master/test-data/
stocks.txt.
The data is in CSV form, and the fields are in the following order:
Symbol,Date,Open,High,Low,Close,Volume,Adj Close

APACHE LOG DATA

I created a sample log file in Apache Common Log Format (see http://mng.bz/L4S3) with some fake Class E IP addresses and some dummy resources and response codes. The file is available on GitHub at https://github.com/alexholmes/hadoop-book/blob/master/test-data/apachelog.txt.


NAMES

The government’s census was used to retrieve names from http://mng.bz/LuFB and
is available at https://github.com/alexholmes/hadoop-book/blob/master/test-data/
names.txt.

Getting help
You’ll no doubt have questions when working with Hadoop. Luckily, between the wikis
and a vibrant user community your needs should be well covered.
The main wiki is located at http://wiki.apache.org/hadoop/, and contains useful
presentations, setup instructions, and troubleshooting instructions.
The Hadoop Common, HDFS, and MapReduce mailing lists can all be found on
http://hadoop.apache.org/mailing_lists.html.
Search Hadoop is a useful website that indexes all of Hadoop and its ecosystem
projects, and it provides full-text search capabilities: http://search-hadoop.com/.
You’ll find many useful blogs you should subscribe to in order to keep on top of
current events in Hadoop. This preface includes a selection of my favorites:

■ Cloudera is a prolific writer of practical applications of Hadoop: http://www.cloudera.com/blog/.
■ The Hortonworks blog is worth reading; it discusses application and future Hadoop roadmap items: http://hortonworks.com/blog/.
■ Michael Noll is one of the first bloggers to provide detailed setup instructions for Hadoop, and he continues to write about real-life challenges and uses of Hadoop: http://www.michael-noll.com/blog/.
There are a plethora of active Hadoop Twitter users who you may want to follow,
including Arun Murthy (@acmurthy), Tom White (@tom_e_white), Eric Sammer
(@esammer), Doug Cutting (@cutting), and Todd Lipcon (@tlipcon). The Hadoop
project itself tweets on @hadoop.

Author Online
Purchase of Hadoop in Practice includes free access to a private web forum run by Manning Publications where you can make comments about the book, ask technical questions, and receive help from the author and other users. To access and subscribe to
the forum, point your web browser to www.manning.com/HadoopinPractice or
www.manning.com/holmes/. These pages provide information on how to get on the
forum after you are registered, what kind of help is available, and the rules of conduct
on the forum.
Manning’s commitment to our readers is to provide a venue where a meaningful
dialogue between individual readers and between readers and the author can take
place. It’s not a commitment to any specific amount of participation on the part of the
author, whose contribution to the book’s forum remains voluntary (and unpaid). We
suggest you try asking him some challenging questions, lest his interest stray!


The Author Online forum and the archives of previous discussions will be accessible from the publisher’s website as long as the book is in print.

About the author
ALEX HOLMES is a senior software engineer with over 15 years of experience developing large-scale distributed Java systems. For the last four years he has gained expertise
in Hadoop solving big data problems across a number of projects. He has presented at
JavaOne and Jazoon and is currently a technical lead at VeriSign.
Alex maintains a Hadoop-related blog at http://grepalex.com, and is on Twitter at
https://twitter.com/grep_alex.

About the cover illustration
The figure on the cover of Hadoop in Practice is captioned “A young man from Kistanja,
Dalmatia.” The illustration is taken from a reproduction of an album of Croatian traditional costumes from the mid-nineteenth century by Nikola Arsenovic, published by
the Ethnographic Museum in Split, Croatia, in 2003. The illustrations were obtained
from a helpful librarian at the Ethnographic Museum in Split, itself situated in the
Roman core of the medieval center of the town: the ruins of Emperor Diocletian’s
retirement palace from around AD 304. The book includes finely colored illustrations
of figures from different regions of Croatia, accompanied by descriptions of the costumes and of everyday life.
Kistanja is a small town located in Bukovica, a geographical region in Croatia. It is
situated in northern Dalmatia, an area rich in Roman and Venetian history. The word
momak in Croatian means a bachelor, beau, or suitor—a single young man who is of
courting age—and the young man on the cover, looking dapper in a crisp, white linen
shirt and a colorful, embroidered vest, is clearly dressed in his finest clothes, which
would be worn to church and for festive occasions—or to go calling on a young lady.
Dress codes and lifestyles have changed over the last 200 years, and the diversity by
region, so rich at the time, has faded away. It is now hard to tell apart the inhabitants
of different continents, let alone of different hamlets or towns separated by only a few
miles. Perhaps we have traded cultural diversity for a more varied personal life—certainly for a more varied and fast-paced technological life.
Manning celebrates the inventiveness and initiative of the computer business with
book covers based on the rich diversity of regional life of two centuries ago, brought
back to life by illustrations from old books and collections like this one.

Part 1

Background and fundamentals

Part 1 of this book contains chapter 1, which looks at Hadoop’s components
and its ecosystem. The chapter then provides instructions for installing a
pseudo-distributed Hadoop setup on a single host, and includes a system for you
to run all of the examples in the book. Chapter 1 also covers the basics of
Hadoop configuration, and walks you through how to write and run a MapReduce job on your new setup.

Hadoop in a heartbeat

This chapter covers
■ Understanding the Hadoop ecosystem
■ Downloading and installing Hadoop
■ Running a MapReduce job

We live in the age of big data, where the data volumes we need to work with on a
day-to-day basis have outgrown the storage and processing capabilities of a single
host. Big data brings with it two fundamental challenges: how to store and work
with voluminous data sizes, and more important, how to understand data and turn
it into a competitive advantage.
Hadoop fills a gap in the market by effectively storing and providing computational capabilities over substantial amounts of data. It’s a distributed system made up
of a distributed filesystem and it offers a way to parallelize and execute programs on
a cluster of machines (see figure 1.1). You’ve most likely come across Hadoop as it’s
been adopted by technology giants like Yahoo!, Facebook, and Twitter to address
their big data needs, and it’s making inroads across all industrial sectors.
Because you’ve come to this book to get some practical experience with
Hadoop and Java, I’ll start with a brief overview and then show you how to install
Hadoop and run a MapReduce job. By the end of this chapter you’ll have received a basic refresher on the nuts and bolts of Hadoop, which will allow you to move on to the more challenging aspects of working with Hadoop.1

Figure 1.1 The Hadoop environment: a distributed filesystem called HDFS provides storage, the computation tier uses a framework called MapReduce, and Hadoop runs on a server cloud of commodity hardware.
Let’s get started with a detailed overview of Hadoop.

1.1 What is Hadoop?
Hadoop is a platform that provides both distributed storage and computational capabilities. Hadoop was first conceived to fix a scalability issue that existed in Nutch,2 an
open source crawler and search engine. At the time Google had published papers that
described its novel distributed filesystem, the Google File System (GFS), and MapReduce, a computational framework for parallel processing. The successful implementation of these papers’ concepts in Nutch resulted in its split into two separate
projects, the second of which became Hadoop, a first-class Apache project.
In this section we’ll look at Hadoop from an architectural perspective, examine
how industry uses it, and consider some of its weaknesses. Once we’ve covered
Hadoop’s background, we’ll look at how to install Hadoop and run a MapReduce job.
Hadoop proper, as shown in figure 1.2, is a distributed master-slave architecture3
that consists of the Hadoop Distributed File System (HDFS) for storage and MapReduce for computational capabilities. Traits intrinsic to Hadoop are data partitioning and parallel computation of large datasets. Its storage and computational
capabilities scale with the addition of hosts to a Hadoop cluster, and can reach volume
sizes in the petabytes on clusters with thousands of hosts.
In the first step in this section we’ll examine the HDFS and MapReduce architectures.

1 Readers should be familiar with the concepts provided in Manning’s Hadoop in Action by Chuck Lam, and Effective Java by Joshua Bloch.
2 The Nutch project, and by extension Hadoop, was led by Doug Cutting and Mike Cafarella.
3 A model of communication where one process called the master has control over one or more other processes, called slaves.


Figure 1.2 High-level Hadoop architecture: a master node and multiple slave nodes, each running computation (MapReduce) and storage (HDFS) components. The MapReduce master is responsible for organizing where computational work should be scheduled on the slave nodes, and the HDFS master is responsible for partitioning the storage across the slave nodes and keeping track of where data is located. More slave nodes can be added for increased storage and processing capabilities.

1.1.1 Core Hadoop components
To understand Hadoop’s architecture we’ll start by looking at the basics of HDFS.
HDFS

HDFS is the storage component of Hadoop. It’s a distributed filesystem that’s modeled
after the Google File System (GFS) paper.4 HDFS is optimized for high throughput and works best when reading and writing large files (gigabytes and larger). To support
this throughput HDFS leverages unusually large (for a filesystem) block sizes and data
locality optimizations to reduce network input/output (I/O).
Scalability and availability are also key traits of HDFS, achieved in part due to data replication and fault tolerance. HDFS replicates files a configured number of times, is
tolerant of both software and hardware failure, and automatically re-replicates data
blocks on nodes that have failed.
Figure 1.3 shows a logical representation of the components in HDFS: the NameNode and the DataNode. It also shows an application that’s using the Hadoop filesystem library to access HDFS.
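To give a flavor of what such an application looks like, here’s a minimal sketch (not one of the book’s listings) that uses the org.apache.hadoop.fs.FileSystem API to copy a file stored in HDFS to standard output. The HdfsCat class name and its command-line argument are purely illustrative.

import java.io.InputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

// A minimal HDFS client sketch: streams the contents of an HDFS file to standard out.
public class HdfsCat {
  public static void main(String[] args) throws Exception {
    // Picks up fs.default.name from the Hadoop configuration on the classpath,
    // so the client talks to whichever cluster it's configured for.
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // Behind the scenes the client asks the NameNode for the file's block
    // locations, and then reads the blocks directly from the DataNodes.
    InputStream in = null;
    try {
      in = fs.open(new Path(args[0]));
      IOUtils.copyBytes(in, System.out, 4096, false);
    } finally {
      IOUtils.closeStream(in);
    }
  }
}

Packaged into the book’s JAR, a class like this could be launched with bin/run.sh in the same way as the other examples, passing an HDFS path such as /tmp/file1.txt as its argument.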
Figure 1.3 HDFS architecture: an HDFS client communicates with the master NameNode and slave DataNodes. The NameNode keeps in memory the metadata about the filesystem, such as which DataNodes manage the blocks for each file. Clients talk to the NameNode for metadata-related activities, and to DataNodes to read and write files; DataNodes communicate with each other for pipelined file reads and writes. Files are made up of blocks, and each file can be replicated multiple times, meaning there are many identical copies of each block (by default 3).

Now that you have a bit of HDFS knowledge, it’s time to look at MapReduce, Hadoop’s computation engine.

MAPREDUCE

MapReduce is a batch-based, distributed computing framework modeled after Google’s paper on MapReduce.5 It allows you to parallelize work over a large amount of raw data, such as combining web logs with relational data from an OLTP database to model how users interact with your website. This type of work, which could take days or longer using conventional serial programming techniques, can be reduced down to minutes using MapReduce on a Hadoop cluster.

4 See the Google File System, http://research.google.com/archive/gfs.html.
5 See MapReduce: Simplified Data Processing on Large Clusters, http://research.google.com/archive/mapreduce.html.
The MapReduce model simplifies parallel processing by abstracting away the complexities involved in working with distributed systems, such as computational parallelization, work distribution, and dealing with unreliable hardware and software. With
this abstraction, MapReduce allows the programmer to focus on addressing business
needs, rather than getting tangled up in distributed system complications.
MapReduce decomposes work submitted by a client into small parallelized map
and reduce workers, as shown in figure 1.4. The map and reduce constructs used in
MapReduce are borrowed from those found in the Lisp functional programming language, and use a shared-nothing model6 to remove any parallel execution interdependencies that could add unwanted synchronization points or state sharing.
6 A shared-nothing architecture is a distributed computing concept that represents the notion that each node is independent and self-sufficient.


Figure 1.4 A client submitting a job to MapReduce: the client submits a job to the Hadoop MapReduce master, which decomposes the job into map and reduce tasks and schedules them for remote execution on the slave nodes.
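To make the client’s role concrete, here is a hedged sketch of a driver that configures and submits a job using the org.apache.hadoop.mapreduce API used throughout this book. The word-count example and the WordCountDriver, WordCountMapper, and WordCountReducer class names are illustrative rather than taken from the book’s code; the mapper and reducer themselves are sketched after figures 1.5 and 1.7.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// The "client" from figure 1.4: it describes the job (mapper, reducer,
// input and output paths) and submits it to the cluster for execution.
public class WordCountDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "word count");
    job.setJarByClass(WordCountDriver.class);

    job.setMapperClass(WordCountMapper.class);    // sketched after figure 1.5
    job.setReducerClass(WordCountReducer.class);  // sketched after figure 1.7
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    // Blocks until all of the job's map and reduce tasks have completed.
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}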

The role of the programmer is to define map and reduce functions, where the map
function outputs key/value tuples, which are processed by reduce functions to produce the final output. Figure 1.5 shows a pseudo-code definition of a map function
with regards to its input and output.
Figure 1.5 A logical view of the map function: map(key1, value1) → list(key2, value2). The map function takes as input a key/value pair, which represents a logical record from the input data source. In the case of a file, this could be a line, or if the input source is a table in a database, it could be a row. The map function produces zero or more output key/value pairs for that one input pair. For example, if the map function is a filtering map function, it may only produce output if a certain condition is met. Or it could be performing a demultiplexing operation, where a single input key/value yields multiple key/value output pairs.
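As a concrete (if canonical) illustration of this signature, here’s a hedged sketch of a word-count mapper written against the org.apache.hadoop.mapreduce API; it’s illustrative rather than one of the book’s listings. The input key is the byte offset of a line in the file, the input value is the line itself, and the mapper emits a (word, 1) pair for every token on the line.

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// map(key1, value1) -> list(key2, value2)
public class WordCountMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {

  private static final IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    // One input record (a line) can yield zero or more output pairs:
    // here, one (word, 1) pair per whitespace-separated token.
    for (String token : value.toString().split("\\s+")) {
      if (token.length() > 0) {
        word.set(token);
        context.write(word, ONE);
      }
    }
  }
}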


Figure 1.6 MapReduce’s shuffle and sort. The shuffle and sort phases are responsible for two primary activities: determining the reducer that should receive the map output key/value pair (called partitioning), and ensuring that, for a given reducer, all its input keys are sorted. Map outputs for the same key (such as “hamster”) go to the same reducer, and are then combined together to form a single input record for the reducer; each reducer has all of its input keys sorted.

The power of MapReduce occurs in between the map output and the reduce input, in
the shuffle and sort phases, as shown in figure 1.6.
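The partitioning decision itself is simple. The following sketch shows logic equivalent to Hadoop’s default hash partitioning, which is what routes every occurrence of a key to the same reducer; the SketchPartitioner name is illustrative, and you only supply your own Partitioner when you need different routing.

import org.apache.hadoop.mapreduce.Partitioner;

// Every occurrence of the same key hashes to the same partition number,
// which is what guarantees that one reducer sees all values for that key.
public class SketchPartitioner<K, V> extends Partitioner<K, V> {
  @Override
  public int getPartition(K key, V value, int numReduceTasks) {
    // Mask off the sign bit so the modulo result is never negative.
    return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
  }
}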
Figure 1.7 shows a pseudo-code definition of a reduce function.

Figure 1.7 A logical view of the reduce function: reduce(key2, list(value2)) → list(key3, value3). The reduce function is called once per unique map output key, and all of the map output values that were emitted across all the mappers for “key2” are provided in a list. Like the map function, the reduce function can output zero to many key/value pairs. Reducer output can be written to flat files in HDFS, insert/update rows in a NoSQL database, or write to any data sink, depending on the requirements of the job.
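To round out the word-count sketch started after figure 1.5, here’s the matching reducer, again illustrative rather than one of the book’s listings: it sums the counts emitted for each word and writes a single (word, total) pair back out.

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// reduce(key2, list(value2)) -> list(key3, value3)
// Called once per unique map output key, with all of that key's values.
public class WordCountReducer
    extends Reducer<Text, IntWritable, Text, IntWritable> {

  private final IntWritable total = new IntWritable();

  @Override
  protected void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable count : values) {
      sum += count.get();
    }
    total.set(sum);
    // The framework writes this output via the job's OutputFormat, typically to HDFS.
    context.write(key, total);
  }
}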


Hadoop’s MapReduce architecture is similar to the master-slave model in HDFS. The main components of MapReduce are illustrated in its logical architecture, as shown in figure 1.8.

Figure 1.8 MapReduce logical architecture. MapReduce clients talk to the JobTracker to launch and manage jobs. The JobTracker coordinates activities across the slave TaskTracker processes; it accepts MapReduce job requests from clients and schedules map and reduce tasks on TaskTrackers to perform the work. The TaskTracker is a daemon process that spawns child processes to perform the actual map or reduce work: map tasks typically read their input from HDFS and write their output to the local disk, while reduce tasks read the map outputs over the network and write their outputs back to HDFS.

With some MapReduce and HDFS basics tucked under your belts, let’s take a look at the Hadoop ecosystem, and specifically, the projects that are covered in this book.

1.1.2 The Hadoop ecosystem
The Hadoop ecosystem is diverse and grows by the day. It’s impossible to keep track of
all of the various projects that interact with Hadoop in some form. In this book the
focus is on the tools that are currently receiving the greatest adoption by users, as
shown in figure 1.9.
MapReduce is not for the faint of heart, which means the goal for many of these
Hadoop-related projects is to increase the accessibility of Hadoop to programmers
and nonprogrammers. I cover all of the technologies listed in figure 1.9 in this book
and describe them in detail within their respective chapters. In addition, I include
descriptions and installation instructions for all of these technologies in appendix A.


Figure 1.9 Hadoop and related technologies: high-level languages (Crunch, Cascading, Pig), predictive analytics (RHadoop, RHIPE, R), and miscellaneous other tools, all layered on top of Hadoop (HDFS and MapReduce).

Let’s look at how to distribute these components across hosts in your environments.

1.1.3 Physical architecture
The physical architecture lays out where you install and execute various components.
Figure 1.10 shows an example of a Hadoop physical architecture involving Hadoop
and its ecosystem, and how they would be distributed across physical hosts. ZooKeeper
requires an odd-numbered quorum,7 so the recommended practice is to have at least
three of them in any reasonably sized cluster.
For Hadoop let’s extend the discussion of physical architecture to include CPU,
RAM, disk, and network, because they all have an impact on the throughput and performance of your cluster.
The term commodity hardware is often used to describe Hadoop hardware requirements. It’s true that Hadoop can run on any old servers you can dig up, but you still want
your cluster to perform well, and you don’t want to swamp your operations department
with diagnosing and fixing hardware issues. Therefore, commodity refers to mid-level
rack servers with dual sockets, as much error-correcting RAM as is affordable, and SATA
drives optimized for RAID storage. Using RAID, however, is strongly discouraged on the
DataNodes, because HDFS already has replication and error-checking built-in; but on
the NameNode it’s strongly recommended for additional reliability.

7 A quorum is a High Availability (HA) concept that represents the minimum number of members required for a system to still remain online and functioning.


Figure 1.10 Hadoop’s physical architecture. Client hosts run application code in conjunction with the Hadoop ecosystem projects; Pig, Hive, and Mahout are client-side projects that don’t need to be installed on your actual Hadoop cluster. A single primary master node runs the master HDFS, MapReduce, and HBase daemons (the NameNode, JobTracker, and HMaster). Running these masters on the same host is sufficient for small-to-medium Hadoop clusters, but with larger clusters it would be worth considering splitting them onto separate hosts due to the increased load they put on a single server. A secondary master hosts the SecondaryNameNode, which provides NameNode checkpoint management services, and ZooKeeper, which is used by HBase for metadata storage. The slave hosts run the slave daemons (DataNode, TaskTracker, and RegionServer); in addition, non-daemon software related to R (including Rhipe and RHadoop) needs to be installed on them. A reasonable question may be, why not split the Hadoop daemons onto separate hosts? If you were to do this, you would lose out on data locality (the ability to read from local disk), which is a key distributed system property of both the MapReduce and HBase slave daemons.

From a network topology perspective with regards to switches and firewalls, all of
the master and slave nodes must be able to open connections to each other. For
small clusters, all the hosts would run 1 gigabit network cards connected to a single, good-quality switch. For larger clusters look at 10 gigabit top-of-rack switches that have at least multiple 1 gigabit uplinks to dual-central switches. Client nodes also need to be able
to talk to all of the master and slave nodes, but if necessary that access can be from
behind a firewall that permits connection establishment only from the client side.


After reviewing Hadoop’s physical architecture you’ve likely developed a good idea
of who might benefit from using Hadoop. Let’s take a look at companies currently
using Hadoop, and in what capacity they’re using it.

1.1.4 Who’s using Hadoop?
Hadoop has a high level of penetration in high-tech companies, and is starting to
make inroads across a broad range of sectors, including the enterprise (Booz Allen
Hamilton, J.P. Morgan), government (NSA), and health care.
Facebook uses Hadoop, Hive, and HBase for data warehousing and real-time application serving.8 Their data warehousing clusters are petabytes in size with thousands
of nodes, and they use separate HBase-driven, real-time clusters for messaging and
real-time analytics.
Twitter uses Hadoop, Pig, and HBase for data analysis, visualization, social graph
analysis, and machine learning. Twitter LZO-compresses all of its data, and uses Protocol Buffers for serialization purposes, all of which are geared to optimizing the use of
its storage and computing resources.
Yahoo! uses Hadoop for data analytics, machine learning, search ranking, email
antispam, ad optimization, ETL,9 and more. Combined, it has over 40,000 servers running Hadoop with 170 PB of storage.
eBay, Samsung, Rackspace, J.P. Morgan, Groupon, LinkedIn, AOL, Last.fm, and
StumbleUpon are some of the other organizations that are also heavily invested in
Hadoop. Microsoft is also starting to work with Hortonworks to ensure that Hadoop
works on its platform.
Google, in its MapReduce paper, indicated that it used its version of MapReduce
to create its web index from crawl data.10 Google also highlights applications of
MapReduce to include activities such as a distributed grep, URL access frequency
(from log data), and a term-vector algorithm, which determines popular keywords
for a host.
The organizations that use Hadoop grow by the day, and if you work at a Fortune 500
company you almost certainly use a Hadoop cluster in some capacity. It’s clear that as
Hadoop continues to mature, its adoption will continue to grow.
As with all technologies, a key part to being able to work effectively with Hadoop is
to understand its shortcomings and design and architect your solutions to mitigate
these as much as possible.

8 See http://www.facebook.com/note.php?note_id=468211193919.
9 Extract, transform, and load (ETL) is the process by which data is extracted from outside sources, transformed to fit the project’s needs, and loaded into the target data sink. ETL is a common process in data warehousing.
10 In 2010 Google moved to a real-time indexing system called Caffeine: http://googleblog.blogspot.com/2010/06/our-new-search-index-caffeine.html.

1.1.5 Hadoop limitations

Common areas identified as weaknesses across HDFS and MapReduce include availability and security. All of their master processes are single points of failure, although you should note that there’s active work on High Availability versions in the community. Security is another area that has its wrinkles, and again another area that’s receiving focus.
HIGH AVAILABILITY

Until the Hadoop 2.x release, HDFS and MapReduce employed single-master models,
resulting in single points of failure.11 The Hadoop 2.x version will eventually bring both
NameNode and JobTracker High Availability (HA) support. The 2.x NameNode HA
design requires shared storage for NameNode metadata, which may require expensive
HA storage. It supports a single standby NameNode, preferably on a separate rack.
SECURITY

Hadoop does offer a security model, but by default it’s disabled. With the security
model disabled, the only security feature that exists in Hadoop is HDFS file and directory-level ownership and permissions. But it’s easy for malicious users to subvert and
assume other users’ identities. By default, all other Hadoop services are wide open,
allowing any user to perform any kind of operation, such as killing another user’s
MapReduce jobs.
Hadoop can be configured to run with Kerberos, a network authentication protocol, which requires Hadoop daemons to authenticate clients, both user and other
Hadoop components. Kerberos can be integrated with an organization’s existing
Active Directory, and therefore offers a single sign-on experience for users. Finally,
and most important for the government sector,