
The Hadoop Performance Myth

  Why Best Practices Lead to Underutilized Clusters, and Which New Tools Can Help

  

Courtney Webster

  The Hadoop Performance Myth

by Courtney Webster

Copyright © 2016 O’Reilly Media, Inc. All rights reserved. Printed in the United States of America. Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472. O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles. For more information, contact our corporate/institutional sales department: 800-998-9938 or

  corporate@oreilly.com.

Editor: Nicole Tache
Production Editor: Kristen Brown
Copyeditor: Amanda Kersey
Interior Designer: David Futato
Cover Designer: Randy Comer
Illustrator: Rebecca Demarest

March 2016: First Edition

  Revision History for the First Edition

2016-03-15: First Release

The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. The Hadoop Performance Myth, the cover image, and related trade dress are trademarks of O’Reilly Media, Inc.

  While the publisher and the author have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the author disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights. 978-1-491-95544-4 [LSI]

  The Hadoop Performance Myth

  Hadoop is a popular (if not de facto) framework for processing large data sets through distributed computing. YARN allowed Hadoop to evolve from a MapReduce engine to a big data ecosystem that can run heterogeneous (MapReduce and non-MapReduce) applications simultaneously. This results in larger clusters with more users and workloads than ever before. Traditional recommendations encourage provisioning, isolation, and tuning to increase performance and avoid resource contention but result in highly underutilized clusters. Herein we’ll review the challenges of improving performance and utilization for today’s dynamic, multitenant clusters and how emerging tools help when best practices fall short.

  The Challenge of Predictable Performance

  Hadoop breaks down a large computational problem into tiny, modular pieces across a cluster of commodity hardware. Each computational piece could be run almost anywhere within the cluster and, as a result, could be a little faster or slower based on that machine’s specifications. Hadoop was designed to include redundancy to keep this variability from impacting performance. If a particular task is running slower than expected, Hadoop may launch the same computation on another copy of the target data. Whichever task completes first wins.

  Optimizing performance on Hadoop 1.0 wasn’t necessarily easy but had fewer variables to contend with than later versions. Version 1.0 only ran a specific workload (MapReduce) with known resource needs (CPU and I/O). And it was designed to address what it believed would be the primary performance challenge: machine variance.

Hadoop 2.0 was restructured to overcome the first release’s key limitations in scalability, availability, and utilization. Version 2.0 expands Hadoop’s capabilities beyond just MapReduce so that a cluster can run different types of workloads at the same time. A resource scheduler (YARN) ensures that jobs can start only if the system has the resources the job estimates it needs.

RESOURCE AND PERFORMANCE NEEDS FROM MIXED WORKLOADS

In addition to Hadoop 2.0 (YARN) supporting more users and organizations in a single cluster, mixed workloads demand different resources and performance goals than MapReduce-only jobs:

Workload                                         Resource constraints    Performance goals
Long-running services                            CPU time                Instantaneous availability
DAG-of-task (like MapReduce)                     CPU- or I/O-bound       Low scheduler latency
High-performance/throughput computing programs   CPU-bound               Consistent runtimes, high cluster utilization

Despite these enhancements, maintaining predictable cluster performance is a monumental challenge. The key contributors to the issue include:

Hadoop’s design, such as including redundancy to prevent single-machine failure

Running mixed workloads with different requirements, which introduces tremendous complexity

Resource schedulers, like YARN, that prioritize fault tolerance over efficiency

YARN’s ability to ensure that resources are available immediately before a job begins, but not to adjust resources while a job is running (and therefore not to prevent resource contention)

Lastly, more demand: more users, more jobs, more workloads

  Optimizing Cluster Performance

Today’s clusters are managing more users, organizations, and resource requests than ever before. This increased demand makes performance critical while, at the same time, the added complexity makes it more difficult to achieve. A marketplace of tools and tricks has been developed to optimize performance as clusters grow in size and complexity. We’ll begin by reviewing the most common optimization techniques — provisioning, isolation, and tuning — and their biggest drawbacks.

  Provisioning

When performance issues occur, classic logic tells us to scale the cluster: if you double your nodes, job time should be cut in half. This can be disheartening after you’ve painstakingly planned the cluster for projected data, resource, and performance needs. Adding nodes increases cost and complexity, leading not only to added capital expenses but also to additional expert staffing needs.

Scaling out (or up) can improve performance during peak times of resource contention, but results in an overbuilt system with dormant assets during off-peak times.

  Application Isolation

  Another technique to improve performance is to isolate a workload within its own Hadoop cluster, especially if it is a job under a critical completion deadline. There are a few cases when workload isolation may be sufficient. If you are a massive company with few monolithic applications and the resources for duplicate hardware, this may be a viable strategy. Cluster isolation could also control incompatible workloads (like HBase causing conflicts with MapReduce). Or, perhaps certain data requires isolation for security reasons. Outside of these scenarios, this technique takes us further down a road of expensive overprovisioning, not to mention a regression from traditional goals of a data-centric organization. Data siloed inside its own cluster becomes hard to access or plug into other workflows (requiring the company to utilize snapshots or old data).

  

DRAWBACKS OF PROVISIONING AND ISOLATION

Overprovisioning adds complexity and cost.

  Application isolation is expensive and results in siloed data. Overprovisioning and application isolation are meant to help during peak loads, but lead to overall underutilization.

  Tuning

Tuning is an essential part of maintaining a Hadoop cluster. Performance benchmarking and adjustment are bona fide methods to identify bad/inefficient code or poorly configured parameters. Cluster administrators must interpret system metrics and optimize for specific workloads (e.g., high CPU utilization versus high I/O). To know what to tune, Hadoop operators often rely on monitoring software for insight into cluster activity. Tools like Ganglia, Cloudera Manager, or Apache Ambari will give you near real-time statistics at the node level, and many provide after-the-fact reports for particular jobs. The more visibility you have into all cluster resources (for example, by also including a network monitoring tool), the better.

Good monitoring alerts you to errors that require immediate attention and helps identify problem areas where tuning can improve performance. Though not the focus of this report, there are myriad troubleshooting tools one can use to pinpoint troublesome jobs and inefficient code.

With reporting tools in place, tuning can start right after cluster configuration. Classic benchmarking workloads (TestDFSIO, TeraSort, PiTest, STREAM, NNBench, and MRBench) can be used to check configuration and baseline performance. Best-practices tuning guides recommend the following adjustments to optimize performance; a configuration sketch follows the checklist.

Tuning checklist

Number of mappers

If you find that mappers are only running for a few seconds, try to use fewer mappers that each run longer. Increase mapred.min.split.size to decrease the number of mappers allocated.

Mapper output

Try filtering out records on the mapper side. Use minimal data to form the map output key and map output value.

Number of reducers

Reduce tasks should run for five minutes or so and produce at least a block’s worth of data.

Combiners

Can you specify a combiner to cut the amount of data shuffled between the mappers and the reducers?

Compression

Can you enable map output compression to improve job execution time?

Custom serialization

Disks per node

Adjust the number of disks per node (mapred.local.dir, dfs.name.dir, dfs.data.dir) and test how scaling affects execution time.

JVM reuse

Consider enabling JVM reuse (mapred.job.reuse.jvm.num.tasks) for workloads with lots of short-lived tasks.

Memory management

Generally maximize memory for the shuffle, but give the map and reduce functions enough memory to operate. Make the mapred.child.java.opts property as large as possible for the amount of memory on the task nodes. Minimize disk spilling: one spill to disk is optimal. The MapReduce counter spilled_records is a useful metric, as it counts the total number of records that were spilled to disk during a job. Adjust memory allocation using the reference (TT = TaskTracker, DN = DataNode, OS = operating system):

Total Memory = Map Slots + Reduce Slots + TT + DN + Other Services + OS
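Several of these checklist items correspond directly to job configuration properties. The following Java sketch is a minimal illustration, not a recommended configuration: it uses the older mapred.* property names cited above (exact names vary by Hadoop version), the split size and heap values are arbitrary assumptions, and MyCombiner is a hypothetical placeholder class.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.compress.CompressionCodec;
    import org.apache.hadoop.io.compress.SnappyCodec;
    import org.apache.hadoop.mapreduce.Job;

    public class TuningSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();

            // Fewer, longer-running mappers: raise the minimum split size (here 256 MB)
            conf.setLong("mapred.min.split.size", 256L * 1024 * 1024);

            // Reuse JVMs for workloads with many short-lived tasks (-1 = unlimited reuse)
            conf.setInt("mapred.job.reuse.jvm.num.tasks", -1);

            // Compress intermediate map output to cut shuffle I/O
            // (Snappy assumes the native library is available on the nodes)
            conf.setBoolean("mapred.compress.map.output", true);
            conf.setClass("mapred.map.output.compression.codec",
                          SnappyCodec.class, CompressionCodec.class);

            // Give child tasks a larger heap, sized to the task nodes' memory
            conf.set("mapred.child.java.opts", "-Xmx2048m");

            Job job = Job.getInstance(conf, "tuning-sketch");
            // A combiner (a placeholder Reducer subclass) would cut shuffled data:
            // job.setCombinerClass(MyCombiner.class);

            // ... set mapper, reducer, and input/output paths as usual, then submit
        }
    }

Whether any one of these settings helps depends on the workload, which is why the checklist frames most of them as questions rather than defaults.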

  Data locality detection

After going through the classic tuning recommendations, you may want to check if data locality issues are impacting your cluster performance. Hadoop is designed to prioritize data locality (to process computational functions in the same node where the data is stored). In practice, this may not be as optimized as one would think, especially for large data clusters. Identifying these issues is not a task for the faint-hearted, as you’d likely have to comb through various logs to determine which tasks access which data nodes.
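Before diving into logs, a coarser first check is to read a job’s built-in locality counters. This is a minimal sketch, assuming the job has finished and is reachable through the standard MapReduce client API; the job ID is passed on the command line.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Cluster;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.JobCounter;
    import org.apache.hadoop.mapreduce.JobID;

    public class LocalityCheck {
        public static void main(String[] args) throws Exception {
            // Look up a job by the job ID passed as the first argument
            Cluster cluster = new Cluster(new Configuration());
            Job job = cluster.getJob(JobID.forName(args[0]));
            if (job == null) {
                System.err.println("Job not found: " + args[0]);
                return;
            }

            long dataLocal = job.getCounters()
                                .findCounter(JobCounter.DATA_LOCAL_MAPS).getValue();
            long rackLocal = job.getCounters()
                                .findCounter(JobCounter.RACK_LOCAL_MAPS).getValue();
            long totalMaps = job.getCounters()
                                .findCounter(JobCounter.TOTAL_LAUNCHED_MAPS).getValue();

            // A low data-local fraction hints that locality constraints are not being met
            System.out.printf("data-local maps: %d of %d (rack-local: %d)%n",
                              dataLocal, totalMaps, rackLocal);
        }
    }

A low data-local fraction tells you locality is suffering, but not why; pinpointing the offending tasks still means going back to the logs.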

  YARN-specific tuning

For Hadoop 2.0 implementations, you should also tune some of YARN’s parameters. Start by determining the resources you can assign to YARN by subtracting hardware requirements from the total CPU cores and memory. Don’t forget to allocate resources for services that don’t subscribe to YARN (like Impala, HBase RegionServer, and Solr) and task buffers (like HDFS caching). It’s also recommended to right-size YARN’s NodeManager and container allocations.
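As a rough illustration of that arithmetic (a plain sketch with assumed hardware and overhead numbers, not a prescription), the leftover capacity becomes the values for the standard yarn.nodemanager.resource.* properties in yarn-site.xml:

    public class YarnNodeBudget {
        public static void main(String[] args) {
            // Assumed worker node: 128 GB of RAM and 32 cores
            int totalMemMb = 128 * 1024;
            int totalCores = 32;

            // Assumed overhead: OS, DataNode, and NodeManager daemons, plus
            // non-YARN services such as HBase RegionServer or Impala
            int reservedMemMb = 24 * 1024;
            int reservedCores = 6;

            // What remains is what the NodeManager may offer to containers
            System.out.println("yarn.nodemanager.resource.memory-mb = "
                    + (totalMemMb - reservedMemMb));
            System.out.println("yarn.nodemanager.resource.cpu-vcores = "
                    + (totalCores - reservedCores));
        }
    }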

  

  The Limitations of Tuning

In some cases, tuning may be all that’s required to wring out additional performance and improve cluster utilization. But subscribed “best practices” built from benchmarking workloads like TeraSort may or may not map perfectly to live applications or dynamic clusters.

  More often than not, the complicated and retrospective nature of tuning does not solve performance problems:

  Tuning delivers diminishing returns

Monitoring programs like Ganglia can provide subminute updates on individual node and cluster-wide performance (load chart, CPU utilization, and memory usage). This provides a “what,” but does not help diagnose a “why.” Even with these tools, parsing job history files, tuning parameters, and measuring the impact of minor changes is time-consuming and requires expert-level staff. This can result in a rabbit hole that eventually delivers diminishing returns.

  You can’t tune what you can’t measure

  Most Hadoop monitoring tools report on some resources (like CPU and memory), but not all cluster resources (like network and disk I/O). You may need a complex combination of tools to report all the data required to identify root issues.

  Retrospection can’t guarantee better prediction

  The final nail in the coffin is that all tuning is retrospective. You can’t expect trail maps of known territory to help you navigate an uncharted route. Knowing where you’ve been doesn’t help you figure out where you’re going. For example, what happens for a new job with a new usage profile? What if an old job starts using more or less resources (CPU/memory/disk/network)? What if the demands of other nonjob services change?

  In clusters with dynamic activity (mixed workloads, multitenant applications, and variable ad hoc use), optimizing past issues simply does not future-proof cluster performance.

DRAWBACKS OF TUNING

  

Monitoring tools provide a limited “what” (coarse data that cannot provide visibility down to the process level) and not a “why.”

Parsing log files, modifying variables, and testing impact is time-consuming and requires expertise.

There are too many options to be adjusted on a cluster, and what works for one job may not work for the next.

You need a combination of tools to report on all cluster resources (CPU, memory, network, and disk I/O).

Retrospective analysis cannot future-proof performance.

  How Resource Managers Affect Performance and Utilization

At this point, you may be asking yourself, “Once I’ve tuned, isn’t my resource manager supposed to help coordinate requests and avoid resource contention? Surely the resource manager could provide consistent performance if it could perfectly allocate jobs.” The answer to this question may surprise you. YARN’s resource manager performs a vital function to ensure that jobs complete in a fault-tolerant (not necessarily high-performance) way. Before we can answer whether YARN can alleviate resource contention, we need to understand how the resource manager executes tasks and which resources it controls (and which it doesn’t), because both shape its impact on cluster performance.

  How YARN works

  A YARN cluster has a single resource manager that monitors the entire cluster’s resources. Applications submit requests, which include resources needed to complete the task (like memory and CPU) and other constraints (like preferred data locality), to the resource manager.

  The resource manager queries the system and finds a node manager with resources that meet or exceed the application’s request. Node managers (which run on all nodes in the cluster) monitor and launch the containers that will execute the job. If the node manager seems to have enough resources available, it is directed to launch a new container by the resource manager.

   The container then executes the process using its resources.
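To make that request flow concrete, here is a minimal sketch of how an application master might ask YARN for a container using the Hadoop 2.x AMRMClient API. The memory, core, and hostname values are illustrative assumptions, and in practice this code runs inside a launched application master (with its credentials and environment) rather than standalone.

    import org.apache.hadoop.yarn.api.records.FinalApplicationStatus;
    import org.apache.hadoop.yarn.api.records.Priority;
    import org.apache.hadoop.yarn.api.records.Resource;
    import org.apache.hadoop.yarn.client.api.AMRMClient;
    import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;
    import org.apache.hadoop.yarn.conf.YarnConfiguration;

    public class ContainerRequestSketch {
        public static void main(String[] args) throws Exception {
            // Client used by an application master to talk to the resource manager
            AMRMClient<ContainerRequest> amClient = AMRMClient.createAMRMClient();
            amClient.init(new YarnConfiguration());
            amClient.start();
            amClient.registerApplicationMaster("", 0, "");

            // Ask for one container: 2 GB of memory and 1 vcore, preferring the
            // (assumed) node that holds the input data as a locality hint
            Resource capability = Resource.newInstance(2048, 1);
            String[] preferredNodes = {"datanode-07.example.com"};
            ContainerRequest request =
                    new ContainerRequest(capability, preferredNodes, null,
                                         Priority.newInstance(0));
            amClient.addContainerRequest(request);

            // The resource manager answers on later allocate() heartbeats; granted
            // containers are then launched on node managers via an NMClient
            amClient.allocate(0.0f);

            amClient.unregisterApplicationMaster(FinalApplicationStatus.SUCCEEDED, "", "");
            amClient.stop();
        }
    }

Note that the request carries only an upper-bound estimate of memory and CPU; as discussed below, this is exactly where overallocation creeps in.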

  YARN prioritizes fault tolerance over performance

Just before it launches a job, YARN makes specific decisions about task execution that impact performance. For example, consider the concept of data locality. Data locality is critical for efficient, distributed data processing. If YARN can’t meet an application’s locality constraints for a specific node, it attempts to proceed with progressively less local data. It may request a container on a replica node, then a container within the same rack, and finally a container not within the same rack. Or, it can halt the request entirely.

  Another example is the practice of speculative execution. Once an application is running, the application manager can identify tasks that are slower than expected. It can’t affect the pace of the slow task and doesn’t diagnose the issue (e.g., hardware degradation). Instead, it requests that YARN launch another instance of the task on a copy of the data as a backup. Once one task completes, redundant tasks are killed. This speculative execution increases fault tolerance at the cost of efficiency, especially for busy clusters. Knowing when to use speculative execution (and when not to use it, like for reduce tasks) has an impact on overall cluster performance.
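Speculative execution can be toggled per job. Here is a minimal sketch using the standard Hadoop 2.x properties, keeping speculation for map tasks but disabling it for reduce tasks as suggested above; whether that trade-off pays off depends on how busy the cluster is.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class SpeculationSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();

            // Allow speculative (redundant) attempts for straggling map tasks...
            conf.setBoolean("mapreduce.map.speculative", true);
            // ...but not for reduce tasks, where duplicate attempts re-fetch the
            // whole shuffle output and can hurt a busy cluster
            conf.setBoolean("mapreduce.reduce.speculative", false);

            Job job = Job.getInstance(conf, "speculation-sketch");
            // ... set mapper, reducer, and input/output paths as usual, then submit
        }
    }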

Applications overestimate needed resources

In an ideal world, the application would request only the resources it needed.

  Theoretically, you could predict resources for particular workloads by testing each cluster variable independently. For example, you could test how performance changes based on:

Number of cores

Number of nodes

Load changes (size and intensity)

Gathering this data scientifically should allow you to perfectly estimate the resources you need. But you’d also have to repeat these experiments any time something in the cluster changed (versions, platforms, hardware, etc.). As a result, resource requests are typically tuned for worst-case scenarios (upper-limit thresholds).

  How YARN affects cluster utilization

YARN decides to start a job once it’s confirmed that it has enough resources to meet the job’s request. Once the job is running, YARN locks up all the requested resources regardless of whether or not they are needed. For example, if a job requests memory resources with an upper limit of 3 GB, the cluster allocates the entire 3 GB, even if the running job is only using 1.5 GB. This results in your monitoring tools reporting submaximal utilization, with the unused remainder sitting idle as dormant capacity.

  

  YARN can’t prevent resource contention

Finally, though YARN controls starting and stopping processes across its cluster, in-progress jobs are allowed uncontrolled competition for available resources. Additionally, YARN manages CPU and memory allocations, but these are not the only resources that can have contention. Managing other elements, like disk and network I/O, is planned but not currently supported. This uncontrolled contention puts overall performance in a fragile state where one unexpected snag could jeopardize the entire cluster. For dynamic, mixed-workload, and/or multitenant clusters, resource allocation alone cannot guarantee consistent performance with high utilization.

DRAWBACKS OF RESOURCE SCHEDULERS (LIKE YARN)

Speculative execution improves fault tolerance, but can have a negative impact on cluster efficiency (especially for busy clusters).

Applications poorly estimate the resources they need, typically requesting maximum (upper-limit) thresholds.

YARN locks up the maximum requested resources for a job, regardless of whether or not they are needed while the job is running.

YARN manages CPU and memory allocations, but does not currently support all resources (like disk and network I/O).

While YARN controls when jobs start and stop based on available resources, it cannot manage resource contention once jobs are active.

  Improving the performance of your cluster

  You can use other resource and cluster managers to help YARN work more effectively. While YARN controls Hadoop, Mesos was built to be a resource manager for an entire data center. It utilizes two-level scheduling (meaning that the requestor has an opportunity to reject the resource scheduler’s “offer” for job placement). Two-level schedulers are more scalable than monolithic schedulers (like YARN) and allow the framework to decide whether an offer is a good fit for the job. Mesos can schedule resources that YARN doesn’t (like network and disk I/O).

  

Running YARN on Mesos allows Mesos to elastically allocate resources to YARN (making it more dynamic), which should improve utilization of the entire data center. But unless Mesos revokes allocated resources, YARN still locks up maximum resource thresholds once a job is running, leaving resource contention and low utilization as persistent issues. Virtualized Hadoop (either private or public cloud) enables elastic scaling, which means dynamically adding (and then removing) nodes as needed. This could assist during peak load or times of resource contention. Just like YARN and Mesos, though, the hypervisor will assign new nodes to meet maximum resource requests. Adding more nodes does not lead to more fully utilized nodes.

  Clusters are drastically underutilized

We know that the tactics of existing resource managers lead to overprovisioning in their efforts to improve performance, but we haven’t yet specified the severity of the problem. Best-in-class solutions are simply not providing perfect performance with full utilization. Industry-wide, cluster utilization averages only 6–12%. The most efficient clusters (like Google and Twitter, which co-locate workloads) still only report up to 50% utilization. Trace analysis of Google’s cluster shows it allocating ~100% of its CPU and ~80% of its memory, but usage of these resources is much lower. Over the 29-day trace period, actual memory usage did not exceed 50% of capacity. CPU usage peaked at 60% but remained below 50% most of the time. Similar results were observed for a large production cluster at Twitter (on Mesos), which showed allocations of ~70% for CPU and ~80% for memory, with actual usage far lower.

  

The Need for Improved Resource Prediction and Real-Time Allocation Tools

  Traditional optimization techniques may temporarily improve performance but are expensive, time-consuming, and cannot manage the volatility of modern clusters. These performance improvements come with a cost of low utilization.

Reiss, Tumanov, et al. studied Google’s cluster as an example of a large cluster with heterogeneous workloads. They provide a nice summary of recommendations based on their observations in their 2012 SOCC report.

  We’d like to discuss two of those recommendations and which emerging products can help your cluster meet the need for efficient, consistent performance:

  Recommendation 1: better resource prediction

First, better prediction of resource needs would eliminate overallocation of resources. As Reiss states, “Resource schedulers need more sophisticated time-based models of resource usage.”

  Recommendation 2: real-time resource allocation

Secondly, managers must be able to dynamically adjust resources based on real-time usage (not allocation). “To achieve high utilization in the face of usage spikes, schedulers should not set aside resources but have resources that can be made available by stopping or mitigating fine-grained or low-priority workloads.”

In order to accomplish the latter recommendation, the manager must allow the user to set priorities in order for real-time allocation to function effectively. The manager must be able to dynamically decide where resources can be siphoned from and where they should be fed to. This can’t rely on human intervention: a problem that requires lots of tiny decisions per second requires a programmatic solution.

  We’d like to introduce two tools that meet both recommendations. One is focused on resource prediction, the other on real-time allocation.

Quasar: A performance-constrained, workload-profiling cluster manager

In the spring of 2014, Christina Delimitrou and Christos Kozyrakis from Stanford University published a report on their new cluster manager (named Quasar) that provides programmatic resource prediction. Quasar allows users to specify performance constraints, then it profiles incoming workloads to classify their resource needs. The resource classification is provided to a “greedy” scheduler that looks to allocate the least resources (minimum threshold instead of maximum threshold) to the job while still satisfying the performance target.

Their prediction/classification technique uses two models in tandem: wavelet transform prediction and workload pattern classification. The wavelet method decomposes a resource trace into individual wavelets and then predicts future needs based on each wavelet’s pattern. The workload classification method breaks a resource trace down into smaller models (e.g., spike, plateau, and sinusoidal). When a workload’s behavior changes, it is matched to a new model and resources are reallocated accordingly (albeit with a slow sampling rate of 10 minutes). On a 200-node cluster, Quasar achieved 62% overall utilization and met its performance constraints for batch and latency-critical workloads.

At this time, Quasar is not open source or commercialized for immediate adoption. Christos Kozyrakis joined Mesosphere in the fall of 2014, and some Quasar code was released in July 2015 as part of Mesos 0.23.0. Termed oversubscription, this version provides “experimental support” for launching tasks with resources that can be revoked at any time. It’s reported that other features may be reserved for Mesosphere’s DCOS (Data Center Operating System) Enterprise Edition.

Pepperdata: A real-time resource allocation performance optimizer

Founded in 2012, Pepperdata offers real-time, active performance optimization software that can be deployed on top of a big data cluster (like Hadoop or Spark). It’s compatible with Hadoop 1.0, YARN, and all major Hadoop distributions (Cloudera, Hortonworks, MapR, IBM BigInsights, Pivotal PHD, and Apache).

Pepperdata installs a node agent on every node in the cluster (with a low overhead of about 0.1%). It monitors hardware (CPU, memory, disk I/O, and network) in real time by process, job, and user; is aware of actual hardware usage across the entire cluster, second by second; and dynamically reshapes hardware usage in real time to adapt to the ever-changing conditions of a chaotic cluster.

Unlike most monitoring-only tools (which just report node metrics or provide tuning recommendations based on past behavior), Pepperdata dynamically allocates resources according to real-time usage. If YARN allocates 4 GB of memory to a particular job but only 2 GB are being used, Pepperdata re-informs YARN of the actual usage number and allows YARN to assign the unused capacity to other jobs.

  

Users specify simple priorities (for example, production gets 50% of the cluster, data science gets 25%, and so forth) without static partitioning. In times of nonpeak use, data science jobs can exceed those priority thresholds as needed. In times of resource contention, Pepperdata reallocates resources from lower-priority jobs to help high-priority jobs complete on time. It is reported that an average user can expect a 30–50% increase in throughput by adding Pepperdata to their cluster. Chartboost, a mobile-games platform, saw a throughput boost of 31% on its primary AWS cluster after installing Pepperdata software, which allowed the company to decrease its AWS node count.

  

  Conclusion

Traditional best practices can improve performance and may be enough for clusters with single workloads and consistent needs. More often, these methods are nonscalable stop-gaps that can’t deliver QoS for dynamic, mixed-workload, or multitenant clusters. The conservative actions of resource managers and the practice of overprovisioning may help with peak resource contention, but lead to drastic underutilization. In most clusters, 88–94% of the resources are left as dormant assets. Trace analysis of a large, heterogeneous cluster pointed to a need for better resource prediction and real-time resource allocation to improve performance and increase utilization. Pepperdata allows a Hadoop cluster to allocate resources in real time by re-informing YARN of actual (not theoretical) resource usage, leading to 30–50% higher throughput. Emerging technologies like Quasar enhance resource prediction and allow resource managers to provide “greedy” or “oversubscribed” scheduling to improve utilization. It will be exciting to see how these products and future developments lower the cost and improve the function of big data analysis for large clusters.

References

1. Reiss, C., Tumanov, A., Ganger, G.R., Katz, R.H., Kozuch, M.A. “Heterogeneity and Dynamicity of Clouds at Scale: Google Trace Analysis.” SOCC, 2012.

2. White, Tom. Hadoop: The Definitive Guide, 4th Edition. Sebastopol, CA: O’Reilly Media, 2015.

3. Ren-Chyan Chern, F. (2014, March 5). Hadoop Performance Tuning Best Practices [Weblog post].

4. “Pepperdata on a Highly Tuned Hadoop Cluster.” Pepperdata, June 2015.

5. “Hadoop Performance Tuning - A Pragmatic & Iterative Approach.” DHTechnologies, 2013.

6. Kopp, M. (2013, July 17). Top Performance Problems discussed at the Hadoop and Cassandra Summits [Weblog post].

7. Cloudera, 13 January 2016.

8. [Twitter University]. (2014, April 8). Improving Resource Efficiency with Apache Mesos [Video file].

9. “4 Warning Signs That Your Hadoop Cluster Isn’t Optimized... And How Pepperdata Can Help.” Pepperdata.

10.

11. “Now Big Data Works for Every Enterprise: Pepperdata Adds Missing Performance QoS to Hadoop.” Taneja Group, 2015.

12. Morgan, T.P. (2015, June 9). Mesos Brings The Google Way To The Global 2000 [Weblog post].

13. “Pepperdata Overview and Differentiators.” Pepperdata, 2014.

14. “Chartboost sees significant AWS savings with Pepperdata.” Pepperdata.

  About the Author

Courtney Webster is a reformed chemist in the Washington, D.C. metro area. She spent a few years after grad school programming robots to do chemistry and is now managing web and mobile applications for clinical research trials.