
BIG DATA

ANALYTICS

A Practical Guide

for Managers

  


Kim H. Pries

Robert Dunnigan

  

MATLAB® and Simulink® are trademarks of The MathWorks, Inc. and are used with permission. The MathWorks does not warrant the accuracy of the text or exercises in this book. This book’s use or discussion of MATLAB® and Simulink® software or related products does not constitute endorsement or sponsorship by The MathWorks of a particular pedagogical approach or particular use of the MATLAB® and Simulink® software.

  CRC Press
  Taylor & Francis Group
  6000 Broken Sound Parkway NW, Suite 300
  Boca Raton, FL 33487-2742

  © 2015 by Taylor & Francis Group, LLC
  CRC Press is an imprint of Taylor & Francis Group, an Informa business

  No claim to original U.S. Government works

  Version Date: 20141024

  International Standard Book Number-13: 978-1-4822-3452-7 (eBook - PDF)

This book contains information obtained from authentic and highly regarded sources. Reasonable efforts have been made to publish reliable data and information, but the author and publisher cannot assume responsibility for the validity of all materials or the consequences of their use. The authors and publishers have attempted to trace the copyright holders of all material reproduced in this publication and apologize to copyright holders if permission to publish in this form has not been obtained. If any copyright material has not been acknowledged please write and let us know so we may rectify in any future reprint.

Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced, transmitted, or utilized in any form by any electronic, mechanical, or other means, now known or hereafter invented, including photocopying, microfilming, and recording, or in any information storage or retrieval system, without written permission from the publishers.

For permission to photocopy or use material electronically from this work, please access www.copyright.com (http://www.copyright.com/) or contact the Copyright Clearance Center, Inc. (CCC), 222 Rosewood Drive, Danvers, MA 01923, 978-750-8400. CCC is a not-for-profit organization that provides licenses and registration for a variety of users. For organizations that have been granted a photocopy license by the CCC, a separate system of payment has been arranged.

  

Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and are used only for identification and explanation without intent to infringe. Visit the Taylor & Francis Web site at http://www.taylorandfrancis.com and the CRC Press Web site at

  Contents

  Preface
  Acknowledgments
  Authors

  

Chapter 1 Introduction
  So What Is Big Data?
  Growing Interest in Decision Making
  What This Book Addresses
  The Conversation about Big Data
  Technological Change as a Driver of Big Data
  The Central Question: So What?
  Our Goals as Authors
  References

Chapter 2 The Mother of Invention’s Triplets: Moore’s Law, the Proliferation of Data, and Data Storage Technology
  Moore’s Law
  Parallel Computing, between and within Machines
  Quantum Computing
  Recap of Growth in Computing Power
  Storage, Storage Everywhere
  Grist for the Mill: Data Used and Unused
  Agriculture
  Automotive
  Marketing in the Physical World
  Online Marketing
  Asset Reliability and Efficiency
  Process Tracking and Automation
  Toward a Definition of Big Data
  Putting Big Data in Context
  Key Concepts of Big Data and Their Consequences
  Summary
  References


  

Chapter 3 Hadoop
  Power through Distribution
  Cost Effectiveness of Hadoop
  Not Every Problem Is a Nail
  Some Technical Aspects
  Troubleshooting Hadoop
  Running Hadoop
  Hadoop File System
  MapReduce
  Pig and Hive
  Installation
  Current Hadoop Ecosystem
  Hadoop Vendors
  Cloudera
  Amazon Web Services (AWS)
  Hortonworks
  IBM
  Intel
  MapR
  Microsoft
  Running Pig Latin Using Powershell
  Pivotal
  References

Chapter 4 HBase and Other Big Data Databases
  Evolution from Flat File to the Three V’s
  Flat File
  Hierarchical Database
  Relational Database
  Object-Oriented Databases
  Relational-Object Databases
  Transition to Big Data Databases
  What Is Different about HBase?
  What Is Bigtable?
  What Is MapReduce?
  What Are the Various Modalities for Big Data
  Graph Databases
  How Does a Graph Database Work?
  What Is the Performance of a Graph Database?
  Document Databases
  Key-Value Databases
  Column-Oriented Databases
  HBase
  Apache Accumulo
  References

  

Chapter 5 Machine Learning
  Machine Learning Basics
  Classifying with Nearest Neighbors
  Naive Bayes
  Support Vector Machines
  Improving Classification with Adaptive Boosting
  Regression
  Logistic Regression
  Tree-Based Regression
  K-Means Clustering
  Apriori Algorithm
  Frequent Pattern-Growth
  Principal Component Analysis (PCA)
  Singular Value Decomposition
  Neural Networks
  Big Data and MapReduce
  Data Exploration
  Spam Filtering
  Ranking
  Predictive Regression
  Text Regression
  Multidimensional Scaling
  Social Graphing
  References

Chapter 6 Statistics
  Statistics, Statistics Everywhere
  Standard Deviation: The Standard Measure of Dispersion
  The Power of Shapes: Distributions
  Distributions: Gaussian Curve
  Distributions: Why Be Normal?
  Distributions: The Long Arm of the Power Law
  The Upshot? Statistics Are Not Bloodless
  Fooling Ourselves: Seeing What We Want to See in the Data
  We Can Learn Much from an Octopus
  Hypothesis Testing: Seeking a Verdict
  Two-Tailed Testing
  Hypothesis Testing: A Broad Field
  Moving On to Specific Hypothesis Tests
  Regression and Correlation
  p Value in Hypothesis Testing: A Successful Gatekeeper?
  Specious Correlations and Overfitting the Data
  A Sample of Common Statistical Software Packages
  Minitab
  SPSS
  R
  SAS
  Big Data Analytics
  Hadoop Integration
  Angoss
  Statistica
  Capabilities
  Summary

  

Chapter 7 Google
  Big Data Giants
  Google
  Go
  Android
  Google Product Offerings
  Advertising and Campaign Performance
  Analysis and Testing
  Facebook
  Ning
  Non-United States Social Media
  Tencent
  Line
  Sina Weibo
  Odnoklassniki
  Vkontakte
  Nimbuzz
  Ranking Network Sites
  Negative Issues with Social Networks
  Amazon
  Some Final Words
  References

  

Chapter 8 Geographic Information Systems (GIS)
  GIS Implementations
  A GIS Example
  GIS Tools
  GIS Databases
  References

Chapter 9 Discovery
  Faceted Search versus Strict Taxonomy
  First Key Ability: Breaking Down Barriers
  Second Key Ability: Flexible Search and Navigation
  Underlying Technology
  The Upshot
  Summary
  References

Chapter 10 Data Quality
  Know Thy Data and Thyself
  Structured, Unstructured, and Semistructured Data
  Data Inconsistency: An Example from This Book
  How Data Can Fool Us
  Ambiguous Data
  Aging of Data or Variables
  Missing Variables May Change the Meaning
  Inconsistent Use of Units and Terminology
  Biases
  Sampling Bias
  Publication Bias
  Survivorship Bias
  Data as a Video, Not a Snapshot: Different Viewpoints as a Noise Filter
  What Is My Toolkit for Improving My Data?
  Ishikawa Diagram
  Interrelationship Digraph
  Force Field Analysis
  Data-Centric Methods
  Troubleshooting Queries from Source Data
  Troubleshooting Data Quality beyond the Source System
  Using Our Hidden Resources
  Summary
  References

  

Chapter 11 Benefits
  Data Serendipity
  Converting Data Dreck to Usefulness
  Sales
  Returned Merchandise
  Security
  Medical
  Travel
  Lodging
  Vehicle
  Meals
  Geographical Information Systems
  New York City
  Chicago CLEARMAP
  Baltimore
  San Francisco
  Los Angeles
  Tucson, Arizona, University of Arizona, and COPLINK
  Social Networking
  Education
  General Educational Data
  Legacy Data
  Grades and Other Indicators
  Testing Results
  Addresses, Phone Numbers, and More
  Concluding Comments
  References

  

Chapter 12 Concerns
  Logical Fallacies
  Affirming the Consequent
  Denying the Antecedent
  Ludic Fallacy
  Cognitive Biases
  Confirmation Bias
  Notational Bias
  Selection/Sample Bias
  Halo Effect
  Consistency and Hindsight Biases
  Congruence Bias
  Von Restorff Effect
  Data Serendipity
  Converting Data Dreck to Usefulness
  Sales
  Merchandise Returns
  Security
  CompStat
  Medical
  Travel
  Lodging
  Vehicle
  Meals
  Social Networking
  Education
  Making Yourself Harder to Track
  Misinformation
  Disinformation
  Reducing/Eliminating Profiles
  Social Media
  Self Redefinition
  Identity Theft
  Facebook
  Concluding Comments
  References

  

Chapter 13 Epilogue
  Michael Porter’s Five Forces Model
  Bargaining Power of Customers
  Bargaining Power of Suppliers
  Threat of New Entrants
  Others
  The OODA Loop
  Implementing Big Data
  Nonlinear, Qualitative Thinking
  Closing
  References

  Preface

  When we started this book, “big data” had not quite become a business buzzword. As we did our research, we realized the books we perused were either of the “Gee, whiz! Can you believe this?” class or incredibly abstruse. We felt the market needed explanation oriented toward managers who had to make potentially expensive decisions.

  We would like managers and implementors to know where to start when they decide to pursue the big data option. As we indicate, the marketplace for big data is much like that for personal computing in the early 1980s: full of consultants, products with bizarre names, and tons of hyperbole. Luckily, in the 2010s, much of the software is open source and extremely powerful. Big data consultancies exist to translate this “free” software into useful tools for the enterprise. Hence, nothing is really free.

  We also ensure our readers can understand both the benefits and the costs of big data in the marketplace, especially the dark side of data. By now, we think it is obvious that the US National Security Agency is an archetype for big data problem solving. Large-city police departments have their own statistical data tools and some of them ponder the usefulness of cell phone confiscation and investigation as well as the use of social media, which are public.

  As we researched, we found ourselves surprised at the size of well-known marketers such as Google and Amazon. Both of these enterprises have purchased companies and have grown themselves organically. Facebook continues to purchase companies (e.g., Oculus, the supplier of a potentially game-changing virtual reality system) and has over 1 billion users. Algorithmic analysis of colossal volumes of data yields information; information allows vendors to tickle our buying reflexes before we even know our own patterns.

  Previously, we thought Esri owned the geographical information systems market, but we found a variety of geographical information systems solutions, although the Esri product line is relatively mature and they serve large-city police departments across the United States. Database creators explore new ways of looking at and storing/retrieving data, methods going beyond the relational paradigm. New and old algorithmic methods called machine learning allow computers to sort and separate the useful data from the useless.

  We have grown to appreciate the open-source statistical language R over the years. R has become the statistical lingua franca for big data. Some of the major statistical vendors advertise their functional partnerships with R. We use the tool ourselves to generate many of our figures. We suspect R is now the most powerful generally available statistical tool on the planet.

  Let’s move on and see what we can learn about big data!

  MATLAB® is a registered trademark of The MathWorks, Inc. For product information, please contact:

  The MathWorks, Inc.
  3 Apple Hill Drive
  Natick, MA 01760-2098 USA
  Tel: 508 647 7000
  Fax: 508-647-7001
  E-mail: info@mathworks.com
  Web: www.mathworks.com

  Acknowledgments

  Kim H. Pries would like to acknowledge Janise Pries, the love of his life, for her support and editing skills. In addition, Robert Dunnigan supplied verbiage, chapters, Six Sigma expertise, and big data professionalism. As always, John Wyzalek and the Taylor & Francis team are key players in the production and publication of technical works such as this one.

  

  Robert Dunnigan thanks his wife, Flabia Dunnigan, and his son Robert III for their love and patience during the composition of this book. He would also like to thank Kim H. Pries for his depth of expertise in a broad array of technical subjects as well as his experience as an author. He skillfully navigated the process of proposing, developing, and finalizing what is a unique and practical offering in the field of big data literature. Robert would also like to thank his employer, The Kratos Group, for their interest and moral support during the writing of this book. Kratos is a remarkable company of which Robert is proud to be a part. Finally, thanks are due to Taylor & Francis for bringing this new perspective on big data to market.

  Authors

  Kim H. Pries has four college degrees: a bachelor of arts in history from the University of Texas at El Paso (UTEP), a bachelor of science in metallurgical engineering from UTEP, a master of science in engineering from UTEP, and a master of science in metallurgical engineering and materials science from Carnegie-Mellon University. In addition, he holds the following certifications:

  • APICS
    • Certified Production and Inventory Manager (CPIM)
  • American Society for Quality (ASQ)
    • Certified Reliability Engineer (CRE)
    • Certified Quality Engineer (CQE)
    • Certified Software Quality Engineer (CSQE)
    • Certified Six Sigma Black Belt (CSSBB)
    • Certified Manager of Quality/Operational Excellence (CMQ/OE)
    • Certified Quality Auditor (CQA)

  Pries worked as a computer systems manager, a software engineer for an electrical utility, and a scientific programmer under a defense contract; for Stoneridge, Incorporated (SRI), he has worked as the following:

  • Software manager
  • Engineering services manager
  • Reliability section manager
  • Product integrity and reliability director

  In addition to his other responsibilities, Pries has provided Six Sigma training for both UTEP and SRI, and cost reduction initiatives for SRI. Pries is also a founding faculty member of Practical Project Management. Additionally, in concert with Jon Quigley, Pries was a cofounder and principal with Value Transformation, LLC, a training, testing, cost improvement, and product development consultancy. Pries also holds Texas teacher certifications in:

  • Mathematics (8–12)
  • Mathematics (4–8)
  • Technology education (6–12)
  • Technology applications (EC–12)
  • Physics (8–12)
  • Generalist (4–8)
  • English Language Arts and Reading (8–12)
  • History (8–12)
  • Computer Science (8–12)
  • Science (8–12)
  • Special education (EC–12)

  He trained for Introduction to Engineering Design and Computer Science and Software Engineering with Project Lead the Way. He currently teaches biotechnology, computer science and software engineering, and introduction to engineering design at the beautiful Parkland High School in the Ysleta Independent School District of El Paso, Texas.

  Pries authored or coauthored the following books:

  • Six Sigma for the Next Millennium: A CSSBB Guidebook (Quality Press, 2005)

  • Six Sigma for the New Millennium: A CSSBB Guidebook, Second Edition (Quality Press, 2009)
  • Project Management of Complex and Embedded Systems: Ensuring Product Integrity and Program Quality (CRC Press, 2008), with Jon M. Quigley
  • Scrum Project Management (CRC Press, 2010), with Jon M. Quigley
  • Testing Complex and Embedded Systems (CRC Press, 2010), with Jon M. Quigley
  • … 2012), with Jon M. Quigley

  • Reducing Process Costs with Lean, Six Sigma, and Value Engineering Techniques (CRC Press, 2012), with Jon M. Quigley
  • A School Counselor’s Guide to Ethics (Counselor Connection Press, 2012), with Janise G. Pries

  • A School Counselor’s Guide to Techniques (Counselor Connection Press, 2012), with Janise G. Pries
  • A School Counselor’s Guide to Group Counseling (Counselor …

  • A School Counselor’s Guide to Practicum (Counselor Connection Press, 2013), with Janise G. Pries
  • A School Counselor’s Guide to Counseling Theories (Counselor Connection Press, 2013), with Janise G. Pries
  • A School Counselor’s Guide to Assessment, Appraisal, Statistics, and Research (Counselor Connection Press, 2013), with Janise G. Pries

  

  Robert Dunnigan is a manager with The Kratos Group and is based in Dallas, Texas. He holds a bachelor of science in psychology and in sociology with an anthropology emphasis from North Dakota State University. He also holds a master of business administration from INSEAD, “the business school for the world,” where he attended the Singapore campus.

  As a Peace Corps volunteer, Robert served over 3 years in Honduras developing agribusiness opportunities. As a consultant, he later worked on the Afghanistan Small and Medium Enterprise Development project in Afghanistan, where he traveled the country with his Afghan colleagues and friends seeking opportunities to develop a manufacturing sector in the country.

  Robert is an American Society for Quality certified Six Sigma Black Belt and a Scrum Alliance certified Scrum Master.

1 Introduction

SO WHAT IS BIG DATA?

  As a manager, you are expected to operate as a factotum. You need to be an industrial/organizational psychologist, a logician, a bean counter, and a representative of your company to the outside world. In other words, you are somewhat of a generalist who can dive into specifics. The specific technologies you encounter are becoming more complex, yet the differences between them and their predecessors are becoming more nuanced.

  You may have already guided your firm’s transition to other new technologies. Think of the Internet. In the decade and a half before this book was written, Internet presence went from being optional to being mandatory for most businesses. In the past decade, Internet presence went from being unidirectional to conversational. Once, your firm could hang out its online shingle with either information about its physical location, hours, and offerings if it were a brick-and-mortar business, or else its offerings and an automated payment system if it were an online business. Firms ranging from Barnes & Noble to your corner pizza chain bridged these worlds.

  A new buzzword arrived: Web 2.0. Despite much hyperbolic rhetoric, this designation described the real phenomenon of a reciprocal online … the archetypical Web 2.0 technology called social media could cause real damage to your firm. Two news stories involving Twitter broke as this introduction was in its final stages of refinement.

  First, Brendan Eich, the new CEO of the software organization Mozilla (creator of the Firefox browser), stepped down after news surfaced indicating he had donated money in support of Proposition 8, an anti–gay marriage initiative in California, some 6 years before (in 2008). An uproar erupted, largely on Twitter, which led Mr. Eich to resign. Voices in Mr. Eich’s defense from across the political spectrum (including Andrew Sullivan, the respected conservative columnist who is himself gay and a proponent of gay marriage rights, and Conor Friedersdorf of The Atlantic, who was also an outspoken opponent of Proposition 8) did not save Mr. Eich’s job. He was ousted.

  The second Twitter story began with a tweeted complaint from a customer with the Twitter handle @ElleRafter. US Airways responded with the typical reaction of a company facing such a complaint in the public forum of Twitter. They invited @ElleRafter to provide more information, along with a link. Unlike the typical Twitter response, however, the US Airways tweet included a pornographic photo involving the use of a toy US Airways aircraft. This does not appear to have been a premeditated act by the US Airways representative involved, but it caused substantial humiliating press coverage for the company.

  As the Internet spread and matured, it became a necessary forum for communication, as well as a dangerous tool whose potential for good or bad can pull in others by surprise or cause self-inflicted harm. Just as World War I generals were left to figure out how technology changed the field of battle, shifting the advantage from the offense to the defense, Internet technology left managers trying to cope with a new landscape filled with both promise and threats. Now, there is another new buzzword: big data.

  So, what is big data? Is it a fad? Is it empty jargon? Is it just a new name for growing capacity of the same databases that have been a part of our lives for decades? Or, is it something qualitatively different? What are the promises of big data? From which direction should a manager anticipate threats?

  The tendency of the media to hype new and barely understood phenomena makes it difficult to evaluate new technologies, along with the nature and extent of their significance. This book argues that big data is new and possesses strategic significance. The argument the authors make about big … and is itself comprehensible. Although it is comprehensible, it is not easy to use and it can deliver misleading or incorrect results. However, these erroneous results are not often random. They result from certain statistical and data-related phenomena. Knowing these phenomena are real and understanding how they function enable you as a manager to become a better user of your big data system.

  Like cell phones and e-mail, big data is a recent phenomenon that has emerged as a part of the panorama of our daily lives. When you shop … articles referencing database searches, and receive unsolicited coupons, you interact with big data. Many readers, as participants in a store’s loyalty program, possess a key fob featuring a bar code on one side and the logo of a favorite store on the other. One of the primary rationales of these programs, aside from decreasing your incentive to shop elsewhere, is to gather data on the company’s most important customers. Every time you swipe your key fob or enter your phone number into the keypad of the credit card machine while you are checking out at the cash register, you are tying a piece of identifying data (who you are) with which items you purchased, how many items you purchased, what time of day you were shopping, and other data. From these, analysts can determine whether you shop by brand or buy whatever is on sale, whether you are purchasing different items from before (suggesting a life change), and whether you have stopped making your large purchases in the store and now only drop in for quick items such as milk or sugar. In the latter case, that is a sign you switched to another retailer for the bulk of your shopping and coupons or some other intervention may be in order. Stores have long collected customer data, long before the age of big data, but they now possess the ability to pull in a greater variety of data and conduct more powerful analyses of the data.
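
  To make the last inference concrete, the following is a minimal sketch, in Python, of how an analyst might flag a loyalty-program member whose large purchases have given way to quick trips. It is an illustration only, not a method described by any retailer: the `Purchase` record, the `flag_possible_defectors` function, the cutoff date, and the 50 percent threshold are all hypothetical, and the sample figures are invented.

```python
from dataclasses import dataclass
from datetime import date
from statistics import mean


@dataclass(frozen=True)
class Purchase:
    """One checkout tied to a loyalty identifier (hypothetical record layout)."""
    customer_id: str
    day: date
    basket_total: float


def flag_possible_defectors(purchases, cutoff, drop_ratio=0.5):
    """Return sorted ids of customers whose average basket on or after
    `cutoff` is below `drop_ratio` times their average basket before it.

    Customers with no purchases on one side of the cutoff are not flagged;
    a real analysis would need to treat those cases separately.
    """
    earlier, recent = {}, {}
    for p in purchases:
        bucket = recent if p.day >= cutoff else earlier
        bucket.setdefault(p.customer_id, []).append(p.basket_total)

    flagged = []
    for customer_id, past_totals in earlier.items():
        recent_totals = recent.get(customer_id)
        if recent_totals and mean(recent_totals) < drop_ratio * mean(past_totals):
            flagged.append(customer_id)
    return sorted(flagged)


# Invented sample data: customer A shifts from large baskets to quick items,
# while customer B keeps shopping much as before.
sample = [
    Purchase("A", date(2014, 1, 5), 120.0),
    Purchase("A", date(2014, 1, 19), 135.0),
    Purchase("A", date(2014, 3, 2), 6.5),
    Purchase("A", date(2014, 3, 9), 4.0),
    Purchase("B", date(2014, 1, 7), 80.0),
    Purchase("B", date(2014, 3, 4), 75.0),
]

print(flag_possible_defectors(sample, cutoff=date(2014, 3, 1)))  # ['A']
```

  A flag of this kind is only a prompt for follow-up, such as a coupon; a single threshold cannot distinguish a customer who switched retailers from one who is simply traveling.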

  Big data also influences us in less obvious ways: it informs the obscure underpinnings of our society, such as manufacturing, transportation, and energy. Any industry developing enormous quantities of diverse data is ready for big data. In fact, these industries probably use big data already. The technological revolution occurring in data analytics enables more precise allocation of resources in our evolving economy, much as the revolution in navigational technology, from the superseded sextant to modern GPS devices, enabled ships to navigate open seas.

  Big data is much like the Internet: it has drawbacks, but its net value is positive. The debate on big data, like political debate, tends toward mis- … cal debates, almost never lies in those absolutes. As with a car, you do not start up a big data solution and let it motor along unguided; you drive it, you guide it, and you extract value from it.

  Data itself is now an asset, one for companies to secure and hoard, much as the Federal Reserve Bank of New York stockpiles gold (though, for the sake of accuracy, the Federal Reserve only stores gold for countries other than the United States). Companies invest in systems to organize and extract value from their data, just as they would a piece of land or reserve … IHS, Experian, and DataLogix, build entire businesses to collect, refine, and sell data. Companies in the business of data are diverse. IHS provides information about specific industries such as energy, whereas Experian and DataLogix provide personal information about individual consumers. These companies would not exist if the exchange of data were not lucrative. They would enjoy no profit motive if they could not use data to make more money than the cost of its generation, storage, and analysis.

  One of your authors was a devotee of Borders, the book retailer (and still keeps his loyalty program card on display as a memorial to the company). After the liquidation of Borders, he received an e-mail message from William Lynch, the chief executive of Barnes & Noble (another favorite store), stating in part, “As part of Borders ceasing operations, we acquired some of its assets including Borders brand trademarks and their customer list. The subject matter of your DVD and other video purchases will be part of the transferred information… If you would like to opt-out, we will ensure all your data we receive from Borders is disposed of in a secure and confidential manner.” The data that Borders accumulated were a real asset sold off after its bankruptcy.

  Data analysis has even entered popular culture in the form of Michael Lewis’s book Moneyball, as well as the eponymous movie. The story centers on Billy Beane, who used data to supplant intuition and turned the Oakland Athletics into a winning team. The relationship between data and decision making is, in fact, the key theme of this book.

GROWING INTEREST IN DECISION MAKING

  Any business book of value must answer a simple, two-word question: “So what?” So, why does big data matter? The answer is the confluence of two factors. The first is that the limitations of human intuition, also known as “gut feel,” have become widely recognized. The second is that big data technologies have reached the level of maturity necessary to make stunning computational feats affordable. Moreover, this computational ability is now visible to the general public. Facebook, Amazon.com, and search engines such as Bing, Yahoo!, and Google are prime examples. Even traditional “brick-and-mortar” stores match powerful websites with analytics that would have been unimaginable 20 years ago. Barnes & Noble, Wal-Mart, and Home Depot are excellent examples.

  Many prominent actors in psychology, marketing, and behavioral finance have pointed out the flaws in human decision making. Psychologist Daniel Kahneman won the Nobel Memorial Prize in Economic Sciences in 2002 for his work on the systematic flaws in the way people weigh risk and reward in arriving at decisions. Building on Kahneman’s work, a variety of scholars, including Dan Ariely, Ziv Carmon, and Cass Sunstein, demonstrated how hidden influencers and mental heuristics influence decision making. One of the authors had the pleasure of studying under Mr. Carmon at INSEAD and, during a class exercise, pointed out how much he preferred one ketchup sample to another, only to discover they came from the same bottle and were merely presented as being different. The difference between the two samples was nonexistent, but the difference in taste perception was quite real.

  In fact, Mr. Ariely, Mr. Carmon, and their coauthors won the following 2008 Ig Nobel award:

  

MEDICINE PRIZE. Dan Ariely of Duke University (USA), Rebecca L. Waber of MIT (USA), Baba Shiv of Stanford University (USA), and Ziv Carmon of INSEAD (Singapore) for demonstrating that high-priced fake medicine is more effective than low-priced fake medicine.¹

  The website states, “The Ig Nobel Prizes honor achievements that first make people laugh, and then makes them think. The prizes are intended to celebrate the unusual, honor the imaginative—and spur people’s inter-

  2