5.4.1 Terabyte Sort Benchmark
The Terabyte sort benchmark has its roots in sort benchmark tests conducted on computer systems since the 1980s. More recently, a Web site originally sponsored by Microsoft and one of its research scientists, Jim Gray, has conducted formal competitions each year, with the results presented at the annual SIGMOD (Special Interest Group on Management of Data) conference sponsored by the ACM (http://sortbenchmark.org). Several categories for sorting exist, including the Terabyte sort, which measures how fast a file of 1 Terabyte of data formatted in 100-byte records (10,000,000,000 total records) can be sorted. Two categories are allowed, called Daytona (a standard commercial computer system and software with no modifications) and Indy (a custom computer system with any type of modification). No restrictions exist on the size of the system, so the sorting benchmark can be conducted on as large a system as desired. The current 2009 record holder for the Daytona category is Yahoo!, using a Hadoop configuration with 1460 nodes with 8 GB RAM per node, 8000 Map tasks, and 2700 Reduce tasks, which sorted 1 TB in 62 seconds (O'Malley & Murthy, 2009). In 2008, using 910 nodes, Yahoo! performed the benchmark in 3 minutes 29 seconds. In 2008, LexisNexis, using the HPCC architecture on only a 400-node system, performed the Terabyte sort benchmark in 3 minutes 6 seconds. In 2009, LexisNexis, again using only a 400-node configuration, performed the Terabyte sort benchmark in 102 seconds.

110 A.M. Middleton
However, a fairer and more logical comparison of the capability of data-intensive computer system and software architectures using computing clusters would be to conduct this benchmark on the same hardware configuration. Other factors should also be evaluated, such as the amount of code required to perform the benchmark, which is a strong indication of programmer productivity, itself a significant performance factor in the implementation of data-intensive computing applications.

On August 8, 2009, a Terabyte Sort benchmark test was conducted on a development configuration located at LexisNexis Risk Solutions offices in Boca Raton, FL, in conjunction with and verified by Lawrence Livermore National Labs (LLNL). The test cluster included 400 processing nodes, each with two local 300 GB SCSI disk drives, dual Intel Xeon single-core processors running at 3.00 GHz, and 4 GB memory per node, all connected to a single Gigabit Ethernet switch with 1.4 Terabytes/sec throughput. Hadoop Release 0.19 was deployed to the cluster, and the standard Terasort benchmark written in Java included with the release was used for the benchmark. Hadoop required 6 minutes 45 seconds to create the test data, and the Terasort benchmark required a total of 25 minutes 28 seconds to complete the sorting test
as shown in Fig. 5.13. The HPCC system software deployed to the same platform, using standard ECL, required 2 minutes and 35 seconds to create the test data, and a total of 6 minutes and 27 seconds to complete the sorting test as shown in Fig. 5.14. Thus the Hadoop implementation using Java running on the same hardware configuration took 3.95 times longer than the HPCC implementation using ECL.

Fig. 5.13 Hadoop terabyte sort benchmark results

5 Data-Intensive Technologies for Cloud Computing 111

Fig. 5.14 HPCC terabyte sort benchmark results
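As context for the Hadoop run, the TeraGen/TeraSort programs shipped with the release are typically driven from the command line, with TeraGen taking the number of 100-byte rows to generate. The sketch below illustrates this; the examples jar name and HDFS paths are assumptions that vary by installation.

```shell
# 1 TB of 100-byte records -> 10,000,000,000 rows for TeraGen
ROWS=$((10**12 / 100))
echo "rows: $ROWS"

# Typical invocation shape (jar name and HDFS paths are assumptions):
#   bin/hadoop jar hadoop-0.19.0-examples.jar teragen  "$ROWS" /terasort-in
#   bin/hadoop jar hadoop-0.19.0-examples.jar terasort /terasort-in /terasort-out
```

TeraGen writes the synthetic input to the distributed file system, so the generation step (6 minutes 45 seconds above) is measured separately from the sort itself.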
The Hadoop version of the benchmark used hand-tuned Java code, including custom TeraSort, TeraInputFormat and TeraOutputFormat classes, with a total of 562 lines of code required for the sort. The HPCC system required only 10 lines of ECL code for the sort, a more than 50-fold reduction in the amount of code required.
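The headline ratios follow directly from the figures reported above, as a quick arithmetic check confirms:

```shell
# Total sort-test times reported in the text
HADOOP_S=$((25*60 + 28))   # 25 min 28 s = 1528 s
HPCC_S=$((6*60 + 27))      # 6 min 27 s  = 387 s

# Wall-clock ratio, Hadoop vs. HPCC
awk -v a="$HADOOP_S" -v b="$HPCC_S" 'BEGIN { printf "%.2f\n", a/b }'   # prints 3.95

# Code-size ratio: 562 lines of Java vs. 10 lines of ECL
awk 'BEGIN { printf "%.1f\n", 562/10 }'                                # prints 56.2
```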