
The Expert's Voice® in Open Source

Pro Hadoop

Build scalable, distributed applications in the cloud

Jason Venner

  

Pro Hadoop

Jason Venner

Pro Hadoop

Copyright © 2009 by Jason Venner

All rights reserved. No part of this work may be reproduced or transmitted in any form or by any means,

electronic or mechanical, including photocopying, recording, or by any information storage or retrieval

system, without the prior written permission of the copyright owner and the publisher.

  ISBN-13 (pbk): 978-1-4302-1942-2

  ISBN-13 (electronic): 978-1-4302-1943-9 Printed and bound in the United States of America 9 8 7 6 5 4 3 2 1

Trademarked names may appear in this book. Rather than use a trademark symbol with every occurrence

of a trademarked name, we use the names only in an editorial fashion and to the benefit of the trademark

owner, with no intention of infringement of the trademark.

  

Java™ and all Java-based marks are trademarks or registered trademarks of Sun Microsystems, Inc., in the

US and other countries. Apress, Inc., is not affiliated with Sun Microsystems, Inc., and this book was written

without endorsement from Sun Microsystems, Inc.

Lead Editor: Matthew Moodie
Technical Reviewer: Sia Cyrus
Editorial Board: Clay Andres, Steve Anglin, Mark Beckner, Ewan Buckingham, Tony Campbell, Gary Cornell, Jonathan Gennick, Michelle Lowman, Matthew Moodie, Duncan Parkes, Jeffrey Pepper, Frank Pohlmann, Douglas Pundick, Ben Renow-Clarke, Dominic Shakeshaft, Matt Wade, Tom Welsh
Project Manager: Richard Dal Porto
Copy Editors: Marilyn Smith, Nancy Sixsmith
Associate Production Director: Kari Brooks-Copony
Production Editor: Laura Cheu
Compositor: Linda Weidemann, Wolf Creek Publishing Services
Proofreader: Linda Seifert
Indexer: Becky Hornyak
Artist: Kinetic Publishing Services
Cover Designer: Kurt Krames
Manufacturing Director: Tom Debolski

Distributed to the book trade worldwide by Springer-Verlag New York, Inc., 233 Spring Street, 6th Floor, New York, NY 10013.

For information on translations, please contact Apress directly at 2855 Telegraph Avenue, Suite 600, Berkeley, CA 94705.

Apress and friends of ED books may be purchased in bulk for academic, corporate, or promotional use.

eBook versions and licenses are also available for most titles. For more information, reference our Special Bulk Sales–eBook Licensing web page.

The information in this book is distributed on an "as is" basis, without warranty. Although every precaution has been taken in the preparation of this work, neither the author(s) nor Apress shall have any liability to any person or entity with respect to any loss or damage caused or alleged to be caused directly or indirectly by the information contained in this work.

  

The source code for this book is available to readers at the Apress web site. You will need to answer questions pertaining to this book in order to successfully download the code.

  

This book is dedicated to Joohn Choe.

He had the idea, walked me through much of the process,

trusted me to write the book, and helped me through the rough spots.

Contents at a Glance

About the Author
Acknowledgments
Introduction

Contents

About the Author
Acknowledgments
Introduction

Creating a Custom Mapper and Reducer
    After the Job Finishes
    Why Do the Mapper and Reducer Extend MapReduceBase?
    Summary

Tuning Factors
    Block Service Threads
    Server Pending Connections
    Storage Allocations
    Network I/O Tuning
    NameNode Recovery
    DataNode Decommissioning
Troubleshooting HDFS Failures
    DataNode or NameNode Pauses

The Reducer Dissected
    A Reducer That Uses Three Partitions
File Types for MapReduce Jobs
    Sequence Files
Compression
    Sequence File Compression
    JAR, Zip, and Tar Files

Running the Debugger on MapReduce Jobs
    Debugging a Task Running on a Cluster
    Summary

CHAPTER 8  Advanced and Alternate MapReduce Techniques
    Streaming: Running Custom MapReduce Jobs from the Command Line
        Streaming Command-Line Arguments
        Using Pipes
        Using Counters in Streaming and Pipes Jobs
    Alternative Methods for Accessing HDFS
        libhdfs
        Mounting an HDFS File System Using fuse_dfs
    Chaining: Efficiently Connecting Multiple Map and/or Reduce Steps
    Map-side Join: Sequentially Reading Data from Multiple Sorted Inputs
    Aggregation: A Framework for MapReduce Jobs that Count or Aggregate Data
        Aggregation Using Streaming
        Specifying the ValueAggregatorDescriptor Class via Configuration Parameters
    Side Effect Files: Map and Reduce Tasks Can Write Additional Output Files
    Handling Acceptable Failure Rates
        Skipping Bad Records
    Enabling the Capacity Scheduler

CHAPTER 9  Solving Problems with Hadoop
    Design Goals
    Design 1: Brute-Force MapReduce
        A Single Reduce Task
        Key Contents and Comparators
        A Helper Class for Keys
        The Mapper
        The Combiner
        The Reducer
        The Driver
        The Pluses and Minuses of the Brute-Force Design
    Design 2: Custom Partitioner for Segmenting the Address Space
        The Simple IP Range Partitioner
        Search Space Keys for Each Reduce Task That May Contain Matching Keys
        Helper Class for Keys Modifications
    Design 3: Future Possibilities
    Summary

CHAPTER 10  Projects Based On Hadoop and Future Directions
    Hadoop Core–Related Projects
        Hive: The Data Warehouse that Facebook Built
        Mahout: Machine Learning Algorithms
        ZooKeeper: A High-Performance Collaboration Service
        Thrift and Protocol Buffers
        CloudStore: A Distributed File System
        Greenplum: An Analytic Engine with SQL
    Hadoop in the Cloud
        Cloudera
    API Changes in Hadoop 0.20.0
        Vaidya: A Rule-Based Performance Diagnostic Tool for MapReduce Jobs
        Removal of LZO Compression Codecs and the API Glue
        New MapReduce Context APIs and Deprecation of the Old Parameter Passing APIs
        Zero-Configuration, Two-Node Virtual Cluster for Testing
    Summary

APPENDIX A  The JobConf Object in Detail
    JobConf Object in the Driver and Tasks
    Variable Expansion
    Constructors
        public JobConf(Class exampleClass)
        public JobConf(Configuration conf, Class exampleClass)
        public JobConf(Path config)
    Methods for Loading Additional Configuration Resources
        public void addResource(String name)
        public void addResource(Path file)
        public void reloadConfiguration()
        public String get(String name)
        public void set(String name, String value)
        public int getInt(String name, int defaultValue)
        public long getLong(String name, long defaultValue)
        public float getFloat(String name, float defaultValue)
        public boolean getBoolean(String name, boolean defaultValue)
        public Configuration.IntegerRanges getRange(String name, String defaultValue)
        public Collection<String> getStringCollection(String name)
        public String[] getStrings(String name, String... defaultValue)
        public Class<?> getClassByName(String name) throws ClassNotFoundException
        public Class<?>[] getClasses(String name, Class<?>... defaultValue)
        public Class<?> getClass(String name, Class<?> defaultValue)
        public <U> Class<? extends U> getClass(String name, Class<? extends U> defaultValue, Class<U> xface)
        public void setClass(String name, Class<?> theClass, Class<?> xface)
        public Path getLocalPath(String dirsProp, String pathTrailer) throws IOException
        public File getFile(String dirsProp, String pathTrailer) throws IOException
        public void deleteLocalFiles() throws IOException
        public Path getLocalPath(String pathString) throws IOException
    Methods for Accessing Classpath Resources
        public InputStream getConfResourceAsInputStream(String name)
        public Reader getConfResourceAsReader(String name)
    Methods for Controlling the Task Classpath
        public void setJar(String jar)
    Methods for Controlling the Task Execution Environment
        public void setUser(String user)
        public boolean getKeepFailedTaskFiles()
        public String getKeepTaskFilesPattern()
        public Path getWorkingDirectory()
        public int getNumTasksToExecutePerJvm()
        public InputFormat getInputFormat()
        public void setInputFormat(Class<? extends InputFormat> theClass)
        public void setOutputFormat(Class<? extends OutputFormat> theClass)
        public OutputCommitter getOutputCommitter()
        public void setOutputCommitter(Class<? extends OutputCommitter> theClass)
        public boolean getCompressMapOutput()
        public void setMapOutputCompressorClass(Class<? extends CompressionCodec> codecClass)
        public Class<? extends CompressionCodec> getMapOutputCompressorClass(Class<? extends CompressionCodec> defaultValue)
        public void setMapOutputKeyClass(Class<?> theClass)
        public Class<?> getMapOutputValueClass()
        public Class<?> getOutputKeyClass()
        public Class<?> getOutputValueClass()
    Methods for Controlling Output Partitioning and Sorting for the Reduce
        public RawComparator getOutputKeyComparator()
        public void setOutputKeyComparatorClass(Class<? extends RawComparator> theClass)
        public String getKeyFieldComparatorOption()
        public void setPartitionerClass(Class<? extends Partitioner> theClass)
        public void setKeyFieldPartitionerOptions(String keySpec)
        public RawComparator getOutputValueGroupingComparator()
        public void setOutputValueGroupingComparator(Class<? extends RawComparator> theClass)
        public Class<? extends Mapper> getMapperClass()
        public void setMapperClass(Class<? extends Mapper> theClass)
        public void setMapRunnerClass(Class<? extends MapRunnable> theClass)
        public Class<? extends Reducer> getReducerClass()
        public void setReducerClass(Class<? extends Reducer> theClass)
        public void setCombinerClass(Class<? extends Reducer> theClass)
        public boolean getSpeculativeExecution()
        public void setSpeculativeExecution(boolean speculativeExecution)
        public void setMapSpeculativeExecution(boolean speculativeExecution)
        public boolean getReduceSpeculativeExecution()
        public void setReduceSpeculativeExecution(boolean speculativeExecution)
        public void setNumMapTasks(int n)
        public int getNumReduceTasks()
        public int getMaxMapAttempts()
        public int getMaxReduceAttempts()
        public void setMaxTaskFailuresPerTracker(int noFailures)
        public int getMaxMapTaskFailuresPercent()
        public int getMaxReduceTaskFailuresPercent()
    Methods Providing Control Over Job Execution and Naming
        public void setJobName(String name)
        public void setSessionId(String sessionId)
        public void setJobPriority(JobPriority prio)
        public void setProfileEnabled(boolean newValue)
        public void setProfileParams(String value)
        public Configuration.IntegerRanges getProfileTaskRange(boolean isMap)
        public void setProfileTaskRange(boolean isMap, String newValue)
        public void setMapDebugScript(String mDbgScript)
        public void setReduceDebugScript(String rDbgScript)
        public void setJobEndNotificationURI(String uri)
        public void setQueueName(String queueName)
        void setMaxVirtualMemoryForTask(long vmem)
    Convenience Methods
        public void clear()
        public void writeXml(OutputStream out) throws IOException
        public void setClassLoader(ClassLoader classLoader)
    Methods Used to Pass Configurations Through SequenceFiles
        public void write(DataOutput out) throws IOException

INDEX

About the Author

Jason Venner is a software developer with more than 20 years of experience developing highly scaled, high-performance systems. Earlier, he worked primarily in the financial services industry, building high-performance check-processing systems. His more recent experience has been building the infrastructure to support highly utilized web sites. He has an avid interest in the biological sciences and is an FAA certificated flight instructor.

About the Technical Reviewer

Sia Cyrus's experience in computing spans many decades and areas of software development. During the 1980s, he specialized in database development in Europe. In the 1990s, he moved to the United States, where he focused on client/server applications. Since 2000, he has architected a number of middle-tier business processes. And most recently, he has been specializing in Web 2.0, Ajax, portals, and cloud computing.

Sia is an independent software consultant who is an expert in Java and development of Java enterprise-class applications. He has been responsible for innovative and generic software, holding a U.S. patent in database-driven user interfaces. Sia created a very successful configuration-based framework for the telecommunications industry, which he later converted to the Spring Framework. His passion could be entitled "Enterprise Architecture in Open Source."

  When not experimenting with new technologies, Sia enjoys playing ice hockey, especially with his two boys, Jason and Brandon.

  Acknowledgments

I would like to thank the people of Attributor.com, as they provided me the opportunity to learn Hadoop. They gracefully let my mistakes pass—and there were some large-scale mistakes—and welcomed my successes.

I would also like to thank Richard M. Stallman, one of the giants who support the world. I remember the days when I couldn't afford to buy a compiler and had to sneak time on the university computers, when only people who signed horrible NDAs and who worked at large organizations could read the Unix source code. His dedication and, yes, fanaticism have changed our world substantially for the better. Thank you, Richard.

Hadoop rides on the back, sweat, and love of Doug Cutting and many people of Yahoo! Inc. Thank you, Doug and the Yahoo! crew. All of the Hadoop users and contributors who help each other on the mailing lists are wonderful people. Thank you.

  I would also like to thank the Apress staff members who have applied their expertise to make this book into something readable.

  Introduction

This book is a concise guide to getting started with Hadoop and getting the most out of your Hadoop clusters. My early experiences with Hadoop were wonderful and stressful. While Hadoop supplied the tools to scale applications, it lacked documentation on how to use the framework effectively. This book provides that information. It enables you to rapidly and painlessly get up to speed with Hadoop. This is the book I wish was available to me when I started using Hadoop.

Who This Book Is For

  This book has three primary audiences: developers who are relatively new to Hadoop or MapReduce and must scale their applications using Hadoop; system administrators who must deploy and manage the Hadoop clusters; and application designers looking for a detailed understanding of what Hadoop will do for them. Hadoop experts will learn some new details and gain insights to add to their expertise.

How This Book Is Structured

This book provides step-by-step instructions and examples that will take you from just beginning to use Hadoop to running complex applications on large clusters of machines. Here's a brief rundown of the book's contents:

Chapter 1, Getting Started with Hadoop Core: This chapter introduces Hadoop Core and MapReduce applications. It then walks you through getting the software, installing it on your computer, and running the basic examples.

Chapter 2, The Basics of a MapReduce Job: This chapter explores what is involved in writing the actual code that performs the map and the reduce portions of a MapReduce job, and how to configure a job to use your map and reduce code.

Chapter 3, The Basics of Multimachine Clusters: This chapter walks you through the basics of creating a multimachine Hadoop cluster. It explains what the servers are, how the servers interact, basic configuration, and how to verify that your cluster is up and running successfully. You'll also find out what to do if a cluster doesn't start.

Chapter 4, HDFS Details for Multimachine Clusters: This chapter covers the details of the Hadoop Distributed File System (HDFS) and provides detailed guidance on the installation, running, troubleshooting, and recovery of your HDFS installations.


Chapter 5, MapReduce Details for Multimachine Clusters: This chapter gives you a detailed understanding of what a MapReduce job is and what the Hadoop Core framework actually does to execute your MapReduce job. You will learn how to set your job classpath and use shared libraries. It also covers the input and output formats used by MapReduce jobs.

Chapter 6, Tuning Your MapReduce Jobs: In this chapter, you will learn what you can tune, how to tell what needs tuning, and how to tune it. With this knowledge, you will be able to achieve optimal performance for your clusters.

Chapter 7, Unit Testing and Debugging: When your job is run across many machines, debugging becomes quite a challenge. Chapter 7 walks you through how to debug your jobs. The examples and unit testing framework provided in this chapter also help you know when your job is working as designed.

Chapter 8, Advanced and Alternate MapReduce Techniques: This chapter demonstrates how to use several advanced features of Hadoop Core: map-side joins, chain mapping, streaming, pipes, and aggregators. You will also learn how to configure your jobs to continue running when some input is bad. Streaming is a particularly powerful tool, as it allows scripts and other external programs to be used to provide the MapReduce functionality.

Chapter 9, Solving Problems with Hadoop: This chapter describes step-by-step development of a nontrivial MapReduce job, including the whys of the design decisions. The sample MapReduce job performs range joins, and uses custom comparator and partitioner classes.

Chapter 10, Projects Based on Hadoop and Future Directions: This chapter provides a summary of several projects that are being built on top of Hadoop Core: distributed column-oriented databases, distributed search, matrix manipulation, and machine learning. There are also references for training and support, and future directions for Hadoop Core. Additionally, this chapter provides a short summary of one of my favorite tools in the examples: a zero-configuration, two-node virtual cluster.

Appendix, The JobConf Object in Detail: The JobConf object is the heart of the application developer's interaction with Hadoop. This book's appendix goes through each method in detail.

Prerequisites

For those of you who are new to Hadoop, I strongly urge you to try Cloudera's open source distribution of Hadoop 0.18.3, which has bug fixes and some new features back-ported in, as well as added hooks to support the Scribe log file aggregation service.

The Cloudera folks have Amazon machine images (AMIs), Debian and RPM installer files, and an online configuration tool to generate configuration files. If you are struggling with Hadoop 0.19 issues, or some of the 0.18.3 issues are biting you, please shift to this distribution. It will reduce your pain.


The following are the stock Hadoop Core distributions at the time of this writing:

• Hadoop 0.18.3 is a good distribution, but has a couple of issues related to file descriptor leakage and reduce task stalls.

• Hadoop 0.19.0 should be avoided, as it has data corruption issues related to the append and sync changes.

• Hadoop 0.19.1 looks to be a reasonably stable release with many useful features.

• Hadoop 0.20.0 has some major API changes and is still unstable.

The examples in this book will work with Hadoop 0.19.0 and 0.19.1, and most of the examples will work with the Cloudera 0.18.3 distribution. Separate Eclipse projects are provided for each of these releases.

Downloading the Code

All of the examples presented in this book can be downloaded from the Source Code section of the Apress web site.

The sample code is designed to be imported into Eclipse as a complete project. There are several versions of the code, one for each designated version of Hadoop Core, and each version of the project includes that Hadoop Core release.

The src directory has the source code for the examples. The bulk of the examples are in the package com.apress.hadoopbook.examples, and subpackages are organized by chapter: ch2, ch5, ch7, and ch9, as well as jobconf and advancedtechniques. The test examples are under test/src in the corresponding package directory. The directory src/config contains the configuration files that are loaded as Java resources.

Three directories contain JAR or zip files that have specific licenses. The directory apache_licensed_lib contains the JARs and source zip files for Apache licensed items. The directory bsd_license contains the items that are provided under the BSD license. The directory other_licenses contains items that have other licenses. The relevant license files are also in these directories.

  A README.txt file has more details about the downloadable code.

  Contacting the Author

  Jason Venner can be contacted via e-mail at

CHAPTER 1

Getting Started with Hadoop Core

Applications frequently require more resources than are available on an inexpensive machine. Many organizations find themselves with business processes that no longer fit on a single cost-effective computer. A simple but expensive solution has been to buy specialty machines that have a lot of memory and many CPUs. This solution scales as far as what is supported by the fastest machines available, and usually the only limiting factor is your budget. An alternative solution is to build a high-availability cluster. Such a cluster typically attempts to look like a single machine, and typically requires very specialized installation and administration services. Many high-availability clusters are proprietary and expensive.

  A more economical solution for acquiring the necessary computational resources is cloud computing. A common pattern is to have bulk data that needs to be transformed, where the processing of each data item is essentially independent of other data items; that is, using a single-instruction multiple-data (SIMD) algorithm. Hadoop Core provides an open source framework for cloud computing, as well as a distributed file system.

  This book is designed to be a practical guide to developing and running software using Hadoop Core, a project hosted by the Apache Software Foundation. This chapter introduces Hadoop Core and details how to get a basic Hadoop Core installation up and running.

Introducing the MapReduce Model

Hadoop supports the MapReduce model, which was introduced by Google as a method of solving a class of petascale problems with large clusters of inexpensive machines. The model is based on two distinct steps for an application:

• Map: An initial ingestion and transformation step, in which individual input records can be processed in parallel.

• Reduce: An aggregation or summarization step, in which all associated records must be processed together by a single entity.

The core concept of MapReduce in Hadoop is that input may be split into logical chunks, and each chunk may be initially processed independently, by a map task. The results of these individual processing chunks can be physically partitioned into distinct sets, which are then sorted. Each sorted chunk is passed to a reduce task. Figure 1-1 illustrates how the MapReduce model works.

[Figure 1-1. The MapReduce model: records from the input dataset are divided into splits, each split is processed by a map task, the map output key/value pairs are shuffled and sorted into one partition per reduce task, and the reduce tasks write the records of the output dataset.]

A map task may run on any compute node in the cluster, and multiple map tasks may be running in parallel across the cluster. The map task is responsible for transforming the input records into key/value pairs. The output of all of the maps will be partitioned, and each partition will be sorted. There will be one partition for each reduce task. Each partition’s sorted keys and the values associated with the keys are then processed by the reduce task. There may be multiple reduce tasks running in parallel on the cluster.

The application developer needs to provide only four items to the Hadoop framework: the class that will read the input records and transform them into one key/value pair per record, a map method, a reduce method, and a class that will transform the key/value pairs that the reduce method outputs into output records.

My first MapReduce application was a specialized web crawler. This crawler received as input large sets of media URLs that were to have their content fetched and processed. The media items were large, and fetching them had a significant cost in time and resources. The job had several steps:

1. Ingest the URLs and their associated metadata.

2. Normalize the URLs.

3. Eliminate duplicate URLs.

4. Filter the URLs against a set of exclusion and inclusion filters.

5. Filter the URLs against a do not fetch list.

6. Filter the URLs against a recently seen set.

7. Fetch the URLs.

8. Fingerprint the content items.

9. Update the recently seen set.

10. Prepare the work list for the next application.

I had 20 machines to work with on this project. The previous incarnation of the application was very complex and used an open source queuing framework for distribution. It performed very poorly. Hundreds of work hours were invested in writing and tuning the application, and the project was on the brink of failure. Hadoop was suggested by a member of a different team.

After spending a day getting a cluster running on the 20 machines, and running the examples, the team spent a few hours working up a plan for nine map methods and three reduce methods. The goal was to have each map or reduce method take less than 100 lines of code. By the end of the first week, our Hadoop-based application was running substantially faster and more reliably than the prior implementation. Figure 1-2 illustrates its architecture. The fingerprint step used a third-party library that had a habit of crashing and occasionally taking down the entire machine.

  

[Figure 1-2. The architecture of my first MapReduce application: maps and reduces that ingest and normalize the input URLs and metadata, suppress duplicates, apply the filters and the do not fetch list, suppress recently seen items, fetch and fingerprint the content, update the recently seen dataset, and summarize the results.]

  The ease with which Hadoop distributed the application across the cluster, along with the ability to continue to run in the event of individual machine failures, made Hadoop one of my favorite tools.
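To give a flavor of how small these steps could be, here is a sketch of what the URL normalization step might look like as a map method, written against the 0.19-era org.apache.hadoop.mapred API used throughout this book. This is a reconstruction for illustration only, not the original application code; the class name and the tab-separated input layout are assumptions.

import java.io.IOException;
import java.net.URI;
import java.net.URISyntaxException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

/** Hypothetical "normalize the URLs" step: each input line holds a URL and,
 *  optionally, tab-separated metadata. The normalized URL becomes the key, so
 *  the following reduce can eliminate duplicates by key. */
public class NormalizeUrlMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {

  public void map(LongWritable offset, Text line,
                  OutputCollector<Text, Text> output, Reporter reporter)
      throws IOException {
    String[] parts = line.toString().split("\t", 2);
    try {
      // A real normalizer would be more careful, e.g., lowercase only the host.
      URI uri = new URI(parts[0].trim()).normalize();
      output.collect(new Text(uri.toASCIIString()),
                     new Text(parts.length > 1 ? parts[1] : ""));
    } catch (URISyntaxException e) {
      // Count and skip malformed records rather than failing the task.
      reporter.incrCounter("Normalize", "BAD_URL", 1);
    }
  }
}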

Both Google and Yahoo handle applications on the petabyte scale with MapReduce clusters. In early 2008, Google announced that it processes 20 petabytes of data a day with MapReduce.

Introducing Hadoop

  Hadoop is the Apache Software Foundation top-level project that holds the various Hadoop subprojects that graduated from the Apache Incubator. The Hadoop project provides and sup- ports the development of open source software that supplies a framework for the development of highly scalable distributed computing applications. The Hadoop framework handles the processing details, leaving developers free to focus on application logic.

Note: The Hadoop logo is a stuffed yellow elephant. And Hadoop happened to be the name of a stuffed yellow elephant owned by the child of the principal architect.

  

The Apache Hadoop project develops open-source software for reliable, scalable, distributed computing, including:

• Hadoop Core, our flagship sub-project, provides a distributed filesystem (HDFS) and support for the MapReduce distributed computing metaphor.

• HBase builds on Hadoop Core to provide a scalable, distributed database.

• Pig is a high-level data-flow language and execution framework for parallel computation. It is built on top of Hadoop Core.

• ZooKeeper is a highly available and reliable coordination system. Distributed applications use ZooKeeper to store and mediate updates for critical shared state.

• Hive is a data warehouse infrastructure built on Hadoop Core that provides data summarization, ad hoc querying and analysis of datasets.

The Hadoop Core project provides the basic services for building a cloud computing environment with commodity hardware, and the APIs for developing software that will run on that cloud. The two fundamental pieces of Hadoop Core are the MapReduce framework (the cloud computing environment) and the Hadoop Distributed File System (HDFS).

Note: Within the Hadoop Core framework, MapReduce is often referred to as mapred, and HDFS is often referred to as dfs.


The Hadoop Core MapReduce framework requires a shared file system. This shared file system does not need to be a system-level file system, as long as there is a distributed file system plug-in available to the framework. While Hadoop Core provides HDFS, HDFS is not required. In Hadoop JIRA (the issue-tracking system), item 4686 is a tracking ticket to separate HDFS into its own Hadoop project. In addition to HDFS, Hadoop Core supports the CloudStore file system and Amazon Simple Storage Service (S3); the framework comes with plug-ins for HDFS, CloudStore, and S3. Users are also free to use any distributed file system that is visible as a system-mounted file system, such as Network File System (NFS), Global File System (GFS), or Lustre.

When HDFS is used as the shared file system, Hadoop is able to take advantage of knowledge about which node hosts a physical copy of input data, and will attempt to schedule the task that is to read that data to run on that machine. This book mainly focuses on using HDFS as the file system.

Hadoop Core MapReduce

The Hadoop Core MapReduce environment provides the user with a sophisticated framework to manage the execution of map and reduce tasks across a cluster of machines. The user is required to tell the framework the following:

• The location(s) in the distributed file system of the job input

• The location(s) in the distributed file system for the job output

• The input format

• The output format

• The class containing the map function

• Optionally, the class containing the reduce function

• The JAR file(s) containing the map and reduce functions and any support classes

  If a job does not need a reduce function, the user does not need to specify a reducer class, and a reduce phase of the job will not be run. The framework will partition the input, and schedule and execute map tasks across the cluster. If requested, it will sort the results of the map task and execute the reduce task(s) with the map output. The final output will be moved to the output directory, and the job status will be reported to the user.
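To make that list concrete, the following is a minimal sketch of a driver that supplies each of those items through a JobConf object (covered in detail in the appendix). It uses the stock IdentityMapper and IdentityReducer classes from the Hadoop 0.19-era API so that it is self-contained; a real job would substitute its own map and reduce classes.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;
import org.apache.hadoop.mapred.lib.IdentityMapper;
import org.apache.hadoop.mapred.lib.IdentityReducer;

/** A minimal pass-through job: copies its input to its output, sorted by key. */
public class MinimalDriver {

  public static void main(String[] args) throws Exception {
    // The class argument identifies the JAR file that holds the map and reduce code.
    JobConf conf = new JobConf(MinimalDriver.class);
    conf.setJobName("minimal-pass-through");

    // Job input and output locations in the distributed file system.
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    // Input and output formats.
    conf.setInputFormat(TextInputFormat.class);
    conf.setOutputFormat(TextOutputFormat.class);

    // The classes containing the map and (optional) reduce functions.
    conf.setMapperClass(IdentityMapper.class);
    conf.setReducerClass(IdentityReducer.class);

    // Key/value types produced by the job (TextInputFormat supplies
    // LongWritable offsets as keys and Text lines as values).
    conf.setOutputKeyClass(LongWritable.class);
    conf.setOutputValueClass(Text.class);

    JobClient.runJob(conf); // submit the job and wait for it to finish
  }
}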

  MapReduce is oriented around key/value pairs. The framework will convert each record of input into a key/value pair, and each pair will be input to the map function once. The map output is a set of key/value pairs—nominally one pair that is the transformed input pair, but it is perfectly acceptable to output multiple pairs. The map output pairs are grouped and sorted by key. The reduce function is called one time for each key, in sort sequence, with the key and the set of values that share that key. The reduce method may output an arbitrary number of key/value pairs, which are written to the output files in the job output directory. If the reduce output keys are unchanged from the reduce input keys, the final output will be sorted.
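The shape of the reduce method mirrors this contract: the framework hands it one key, in sort order, together with an iterator over all of the values that share that key. The following summing reducer is a minimal sketch of my own (the class name is not from the book's examples):

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

/** Called once per key, in sort order, with all of the values for that key. */
public class SummingReducer extends MapReduceBase
    implements Reducer<Text, LongWritable, Text, LongWritable> {

  public void reduce(Text key, Iterator<LongWritable> values,
                     OutputCollector<Text, LongWritable> output,
                     Reporter reporter) throws IOException {
    long sum = 0;
    while (values.hasNext()) {
      sum += values.next().get(); // aggregate every value that shares this key
    }
    // Any number of output pairs may be emitted; here, exactly one per key.
    output.collect(key, new LongWritable(sum));
  }
}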


The framework provides two processes that handle the management of MapReduce jobs:

• TaskTracker manages the execution of individual map and reduce tasks on a compute node in the cluster.

• JobTracker accepts job submissions, provides job monitoring and control, and manages the distribution of tasks to the TaskTracker nodes.

Generally, there is one JobTracker process per cluster and one or more TaskTracker processes per node in the cluster. The JobTracker is a single point of failure, and the JobTracker will work around the failure of individual TaskTracker processes.

Note: One very nice feature of the Hadoop Core MapReduce environment is that you can add TaskTracker nodes to a cluster while a job is running and have the job spread out onto the new nodes.

The Hadoop Distributed File System

HDFS is a file system designed for use by MapReduce jobs that read input in large chunks, process it, and write potentially large chunks of output. HDFS does not handle random access particularly well. For reliability, file data is simply mirrored to multiple storage nodes. This is referred to as replication in the Hadoop community. As long as at least one replica of a data chunk is available, the consumer of that data will not know of storage server failures.

HDFS services are provided by two processes:

• NameNode handles management of the file system metadata, and provides management and control services.

• DataNode provides block storage and retrieval services.
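Client code does not talk to the NameNode and DataNode processes directly; it goes through the org.apache.hadoop.fs.FileSystem API, which hides the metadata lookups and block retrievals. As a rough sketch, assuming a running cluster whose configuration files are on the classpath, reading a text file from HDFS looks like this:

import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

/** Prints an HDFS text file, given its path as the first argument. */
public class HdfsReadExample {

  public static void main(String[] args) throws Exception {
    // Picks up the cluster settings (e.g., hadoop-site.xml) from the classpath.
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf); // the NameNode supplies the metadata

    FSDataInputStream in = fs.open(new Path(args[0])); // blocks come from DataNodes
    BufferedReader reader = new BufferedReader(new InputStreamReader(in));
    String line;
    while ((line = reader.readLine()) != null) {
      System.out.println(line);
    }
    reader.close();
  }
}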