Getting Started with NoSQL Free ebook download Free ebook download

  

Getting Started with NoSQL

Your guide to the world and technology of NoSQL

Gaurav Vaish

  BIRMINGHAM - MUMBAI

  Getting Started with NoSQL

  Copyright © 2013 Packt Publishing All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews. Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book. Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

  First published: March 2013 Production Reference: 1150313 Published by Packt Publishing Ltd.

  Livery Place

  35 Livery Street Birmingham B3 2PB, UK.

  ISBN 978-1-84969-4-988

  www.packtpub.com william.kewley@kbbs.ie

  Cover Image by Will Kewley ( )

  Credits Author

  Gaurav Vaish Reviewer

  Satish Kowkuntla Acquisition Editor

  Robin de Jonh Commissioning Editor

  Maria D’souza Technical Editors

  Worrell Lewis Varun Pius Rodrigues Project Coordinator

  Amigya Khurana Proofreader

  Elinor Perry-Smith Indexer

  Rekha Nair Graphics

  Aditi Gajjar Production Coordinator

  Pooja Chiplunkar Cover Work

  Pooja Chiplunkar About the Author

Gaurav Vaish works as Principal Engineer with Yahoo! India. He works primarily

  in three domains—cloud, web, and devices including mobile, connected TV, and the like. His expertise lies in designing and architecting applications for the same. Gaurav started his career in 2002 with Adobe Systems India working in their engineering solutions group. In 2005, he started his own company Edujini Labs focusing on corporate training and collaborative learning. He holds a B. Tech. in Electrical Engineering with specialization in Speech Signal Processing from IIT Kanpur. He runs his personal blog at

  www.mastergaurav.com

  and

  www.m10v.com .

  This book would not have been complete without support from my wife, Renu, who was a big inspiration in writing. She ensured that after a day’s hard work at the office when I sat down to write the book, I was all charged up. At times, when I wanted to take a break off, she pushed me to completion by keeping a tab on the schedule. And she ensured me great food or a cup of tea whenever I needed it.

  This book would not have the details that I have been able to provide had it not been timely and useful inputs from Satish Kowkuntla, Architect at Yahoo! He ensured that no relevant piece of information was missed out. He gave valuable insights to writing the correct language keeping the reader in mind. Had it not been for him, you may not have seen the book in the shape that it is in.

  About the Reviewer Satish Kowkuntla is a software engineer by profession with over 20 years of

  experience in software development, design, and architecture. Satish is currently working as a software architect at Yahoo! and his experience is in the areas of web technologies, frontend technologies, and digital home technologies. Prior to Yahoo! Satish has worked in several companies in the areas of digital home technologies, system software, CRM software, and engineering CAD software. Much of his career has been in Silicon Valley.

   Support files, eBooks, discount offers and more

  You might want to visit to your book. Did you know that Packt offers eBook versions of every book published, with PDF

  

and as a print book customer, you are entitled to a discount on the eBook copy.

for more details.

   , you can also read a collection of free technical articles, sign up for a range of free newsletters and receive exclusive discounts and offers on Packt books and eBooks. TM

  

  Do you need instant solutions to your IT questions? PacktLib is Packt’s online digital book library. Here, you can access, read and search across Packt’s entire library of books.

  Why Subscribe?

  • Fully searchable across every book published by Packt • Copy and paste, print and bookmark content
  • On demand and accessible via web browser

  Free Access for Packt account holders

   PacktLib today and view nine entirely free books. Simply use your login credentials for immediate access.

  Dedicated to Renu Chandel, my wife.

  Table of Contents

  

  

  

  

  

  

  

  

  

  

  

  

  

  

  

  

  

  

  Table of Contents [

  

  

  

  

  

  

  

  

  

  

  

  

  

  

  

]

  

  

  

  

  

  

  

  

  

  

  

  

  

  Table of Contents

  

  

  

  

  

  

  

  

  

  

  

  

  

  

  

  

  

  

  

  

  

  

  

  

[ ]

  Preface

  This book takes a deep dive in NoSQL as technology providing a comparative study on the data models, the products in the market, and with RDBMS using scenario-driven case studies Relational databases have been used to store data for decades while SQL has been the de-facto language to interact with RDBMS. In the last few years, NoSQL has been a growing choice especially for large, web-scale applications. Non-relational databases provide the scale and speed that you may need for your application.

  However, making a decision to start with or switch to NoSQL requires more insights than a few benchmarks—knowing the options at hand, advantages and drawbacks, scenarios where it suits the most, and where it should be avoided are very critical to making a decision.

  This book is a from-the-ground-up guide that takes you from the very definition to a real-world application. It provides you step-by-step approach to design and implement a NoSQL application that will help you make clear decisions on database choice, database model choice, and the related parameters. The book is suited for a developer, an architect, as well as a CTO.

  What this book covers

  1, Overview and Architecture, gives you a head-start into NoSQL. It helps you

  Chapter

  understand what NoSQL is and is not, and also provides you with insights into the question – "Why NoSQL?"

Chapter 2 , Characteristics of NoSQL, takes a dig into the RDBMS problems that NoSQL attempts to solve and substantiates it with a concrete scenario.

  Preface

  Chapter 3, NoSQL Storage Types, explores various storage types available in the

  market today with a deep dive – comparing and contrasting them, and identifying what to use when.

  Chapter 4, Advantages and Drawbacks, brings out the advantages and drawbacks of

  using NoSQL by taking a scenario-based approach to understand the possibilities and limitations.

  

Chapter 5, Comparative Study of NoSQL Products, does a detailed comparative study of

ten NoSQL databases on about 25 parameters, both technical and non-technical. Chapter 6, Case Study, takes you through a simple application implemented using NoSQL. It covers various scenarios possible in the application and approaches that can be used with NoSQL database.

  , Taxonomy, introduces you to the common and not-so-common terms that

  Appendix

  we come across while dealing with NoSQL. It will also enable you to read through and understand the literature available on the Internet or otherwise.

  What you need for this book

  To run the examples in the book the following software will be required:

  • Operating System—Ubuntu or any other Linux variant is preferred
  • CouchDB will be required to take a dig into document store in Chapter 3,

  NoSQL Storage Types

  • Java SDK, Eclipse, Google App Engine SDK, and Objectify will be required to cover the examples of column-oriented databases in Chapter 3, NoSQL

  Storage Types

  • Redis will be required to cover the examples of key-value store in Chapter 3,

  NoSQL Storage Types

  • Neo4J will be required to cover the examples of graph store in Chapter 3,

  NoSQL Storage Types

  • MongoDB to run through the case study covered in Chapter 3, NoSQL

  Storage Types The latest versions are preferable.

  

[ ]

  Preface Who this book is for

  This book is a great resource for someone starting with NoSQL and indispensable literature for technology decision makers—be it architect, product manager or CTO. It is assumed that you have a background in RDBMS modeling and SQL and have had exposure to at least one of the programming languages—Java or JavaScript. It is also assumed that you have at least heard about NoSQL and are interested to explore the same but nothing beyond that. You are not expected to know the meaning and purpose of NoSQL—this book provides all inputs from the groundup. Whether you are a developer or an architect or a CTO of a company, this book is an indispensable resource for you to have in your library.

  Conventions

  In this book, you will find a number of styles of text that distinguish between different kinds of information. Here are some examples of these styles, and an explanation of their meaning. Code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles are shown as follows: "Do you remember the JOIN query that you wrote to collate the data across multiple tables to create your final view?" A block of code is set as follows:

  "_id": "98ef65e7-52e4-4466-bacc-3a8dc0c15c79", "firstName": "Gaurav", "lastName": "Vaish", "department": "f0adcbf5-7389-4491-9c42-f39a9d3d4c75", "homeAddress": { "_id": "fa62fd39-17f8-4a16-80d6-71a5b71d758d", "line1": "123, 45th Main" "city" : "NoSQLLand", "country": "India", "zipCode": "123456" }

  When we wish to draw your attention to a particular part of a code block, the relevant lines or items are set in bold:

  "_id": "98ef65e7-52e4-4466-bacc-3a8dc0c15c79", "firstName": "Gaurav", "lastName": "Vaish",

  

[ ]

  Preface "department": "f0adcbf5-7389-4491-9c42-f39a9d3d4c75", "homeAddress": { "_id": "fa62fd39-17f8-4a16-80d6-71a5b71d758d", "line1": "123, 45th Main" "city" : "NoSQLLand", "country": "India", "zipCode": "123456" }

  Any command-line input or output is written as follows:

  curl –X PUT –H "Content-Type: application/json" \ New terms

  and important words are shown in bold. Words that you see on the screen, in menus or dialog boxes for example, appear in the text like this: "clicking the Next button moves you to the next screen".

  Warnings or important notes appear in a box like this.

  Tips and tricks appear like this.

  Reader feedback

  Feedback from our readers is always welcome. Let us know what you think about this book—what you liked or may have disliked. Reader feedback is important for us to develop titles that you really get the most out of.

  feedback@packtpub.com

  To send us general feedback, simply send an e-mail to , and mention the book title via the subject of your message.

  If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide on www.packtpub.com/authors .

  Customer support

  Now that you are the proud owner of a Packt book, we have a number of things to help you to get the most from your purchase.

  

[ ]

  Preface Downloading the color images of this book

  We also provide you a PDF file that has color images of the screenshots/diagrams used in this book. The color images will help you better understand the changes in the output. You can download this file from http://www.packtpub.com/sites/

  default/files/downloads/5689_graphics.pdf .

  Errata

  Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books—maybe a mistake in the text or the code—we would be grateful if you would report this to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this book. If you find any errata, please report them by visiting http://www.packtpub.

  com/submit-errata

  , selecting your book, clicking on the errata submission form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded on our website, or added to any list of existing errata, under the Errata section of that title. Any existing errata can be viewed by selecting your title from http://www.packtpub.com/support .

  Piracy Piracy of copyright material on the Internet is an ongoing problem across all media.

  At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works, in any form, on the Internet, please provide us with the location address or website name immediately so that we can pursue a remedy.

  

   pirated material. We appreciate your help in protecting our authors, and our ability to bring you valuable content.

  Questions

  You can contact us at questions@packtpub.com if you are having a problem with any aspect of the book, and we will do our best to address it.

  

[ ]

  An Overview of NoSQL

  Now that you have got this book in your hand, you must be both excited and anxious about NoSQL. In this chapter, we get a head-start on:

  • What NoSQL is
  • What NoSQL is not
  • Why NoSQL
  • A list of NoSQL databases For over decades, relational databases have been used to store what we know as structured data. The data is sub-divided into groups, referred to as tables. The tables store well-defined units of data in terms of type, size, and other constraints. Each unit of data is known as column while each unit of the group is known as

  row

  . The columns may have relationships defined across themselves, for example parent-child, and hence the name relational databases. And because consistency is one of the critical factors, scaling horizontally is a challenging task, if not impossible. About a decade earlier, with the rise of large web applications, research has poured into handling data at scale. One of the outputs of these researches is non-relational database, in general referred to as NoSQL database. One of the main problems that a NoSQL database solves is scale, among others.

  Defining NoSQL

  According to Wikipedia:

  In computing, NoSQL (mostly interpreted as "not only SQL") is a broad class of database management systems identified by its non-adherence to the widely used relational database management system model; that is, NoSQL databases are not primarily built on tables, and as a result, generally do not use SQL for data manipulation.

  An Overview of NoSQL

  The NoSQL movement began in the early years of the 21st century when the world started its deep focus on creating web-scale database. By web-scale, I mean scale to cater to hundreds of millions of users and now growing to billions of connected devices including but not limited to mobiles, smartphones, internet TV, in-car devices, and many more. Although Wikipedia treats it as "not only SQL", NoSQL originally started off as a simple combination of two words—No and SQL—clearly and completely visible in the new term. No acronym. What it literally means is, "I do not want to use SQL". To elaborate, "I want to access database without using any SQL syntax". Why? We shall explore the in a while.

  Whatever be the root phrase, NoSQL today is the term used to address to the class of databases that do not follow relational database management system (RDBMS) principles, specifically being that of ACID nature, and are specifically designed to handle the speed and scale of the likes of Google, Facebook, Yahoo, Twitter, and many more.

  History

  Before we take a deep dive into it, let us set our context right by exploring some key landmarks in history that led to the birth of NoSQL. From Inktomi, probably the first true search engine, to Google, the present world leader, the computer scientists have well recognized the limitations of the traditional and widely used RDBMS specifically related to the issues of scalability, parallelization, and cost, also noting that the data set is minimally cross-referenced as compared to the chunked, transactional data, which is mostly fed to RDBMS. Specifically, if we just take the case of Google that gets billions of requests a month across applications that may be totally unrelated in what they do but related in how they deliver, the problem of scalability is to be solved at each layer—right from data access to final delivery. Google, therefore, had to work innovatively and gave birth to a new computing ecosystem comprising of:

  • GFS: Distributed filesystem
  • Chubby: Distributed coordination system
  • MapReduce: Parallel execution system
  • Big Data: Column oriented database

  

[ ]

Chapter 1 These systems were initially described in papers released from 2003 to 2006 listed

  as follows:

  • Google File System, 2003: http://research.google.com/archive/

  gfs.html

  • Chubby, 2006: http://research.google.com/archive/chubby.html
  • MapReduce, 2004: http://research.google.com/archive/

  mapreduce.html

  • Big Data, 2006: http://research.google.com/archive/bigtable.html These and other papers led to a spike in increased activities, specially in open source, around large scale distributed computing and some of the most amazing products were born. Some of the initial products that came up included: http://lucene.
  • Lucene: Java-based indexing and search engine (

  apache.org ) • Hadoop: For reliable, scalable, distributed computing ( http://hadoop. apache.org

  )

  • Cassandra: Scalable, multi-master database with no single point of failure

  ( http://cassandra.apache.org )

  • ZooKeeper: High performance coordination service for distributed

  http://zookeeper.apache.org

  applications ( )

  • Pig: High level dataflow language and execution framework for parallel computation ( http://pig.apache.org )

  What NoSQL is and what it is not

  Now that we have a fair idea on how this side of the world evolved, let us examine at what NoSQL is and what it is not. NoSQL is a generic term used to refer to any data store that does not follow the traditional RDBMS model—specifically, the data is non-relational and it does not use SQL as the query language. It is used to refer to the databases that attempt to solve the problems of scalability and availability against that of atomicity or consistency. NoSQL is not a database. It is not even a type of database. In fact, it is a term used to

  

filter out (read reject) a set of databases out of the ecosystem. There are several distinct

  family trees available. In Chapter 4, Advantages and Drawbacks, we explore various types of data models (or simply, database types) available under this umbrella.

  

[ ]

  An Overview of NoSQL

  Traditional RDBMS applications have focused on ACID transactions: • Atomicity: Everything in a transaction succeeds lest it is rolled back.

  • Consistency: A transaction cannot leave the database in an inconsistent state.
  • Isolation: One transaction cannot interfere with another.
  • Durability: A completed transaction persists, even after applications restart. Howsoever indispensible these qualities may seem, they are quite incompatible with availability and performance on applications of web-scale. For example, if a company like Amazon were to use a system like this, imagine how slow it would be. If I proceed to buy a book and a transaction is on, it will lock a part of the database, specifically the inventory, and every other person in the world will have to wait until I complete my transaction. This just doesn’t work! Amazon may use cached data or even unlocked records resulting in inconsistency. In an extreme case, you and I may end up buying the last copy of a book in the store with one of us finally receiving an apology mail. (Well, Amazon definitely has a much better system than this). The point I am trying to make here is, we may have to look beyond ACID to

  something called BASE, coined by Eric Brewer:

  • Basic availability: Each request is guaranteed a response—successful or failed execution.
  • Soft state: The state of the system may change over time, at times without any input (for eventual consistency).
  • Eventual consistency: The database may be momentarily inconsistent but will be consistent eventually.

  Eric Brewer also noted that it is impossible for a distributed computer system to provide consistency, availability and partition tolerance simultaneously. This is more commonly referred to as the CAP theorem. Note, however, that in cases like stock exchanges or banking where transactions are critical, cached or state data will just not work. So, NoSQL is, definitely, not a solution to all the database related problems

  

[ ]

Chapter 1 Why NoSQL? Looking at what we have explored so far, does it mean that we should look at NoSQL only when we start reaching the problems of scale? No. NoSQL databases have a lot more to offer than just solving the problems of scale

  which are mentioned as follows:

  • Schemaless data representation: Almost all NoSQL implementations offer schemaless data representation. This means that you don’t have to think too far ahead to define a structure and you can continue to evolve over time— including adding new fields or even nesting the data, for example, in case of JSON representation.
  • Development time: I have heard stories about reduced development time because one doesn’t have to deal with complex SQL queries. Do you

  JOIN

  remember the query that you wrote to collate the data across multiple tables to create your final view?

  • Speed: Even with the small amount of data that you have, if you can deliver in milliseconds rather than hundreds of milliseconds—especially over mobile and other intermittently connected devices—you have much higher probability of winning users over.
  • Plan ahead for scalability: You read it right. Why fall into the ditch and then try to get out of it? Why not just plan ahead so that you never fall into one. Or in other words, your application can be quite elastic—it can handle sudden spikes of load. Of course, you win users over straightaway.

  List of NoSQL Databases

  The buzz around NoSQL still hasn’t reached its peak, at least to date. We see more offerings in the market over time. The following table is a list of some of the more mature, popular, and powerful NoSQL databases segregated by data model used:

  Document Key-Value

  XML Column Graph MongoDB Redis BaseX BigTable Neo4J CouchDB Membase eXist Hadoop / FlockDB

  HBase RavenDB Voldemort Cassandra InfiniteGraph Terrastore MemcacheDB SimpleDB Cloudera

  

[ ]

  An Overview of NoSQL

  This list is by no means comprehensive, nor does it claim to be. One of the positive points about this list is that most of the databases in the list are open source and community driven.

  Chapter 4 , Advantages and Drawbacks, provides an in-depth study of the various popular data models used in NoSQL databases.

  , Case Study, does an exhaustive comparison of some of these databases

  Chapter 6

  along various key parameters including, but not limited to, data model, language, performance, license, price, community, resources, extensibility, and many more.

  Summary

  In this chapter, we learned about the fundamentals of NoSQL—what it is all about and more critically, what it is not. We took a splash in the history to appreciate the reasons and science behind it. You are recommended to explore the web for historical events around this to take a deep dive in appreciating it.

  NoSQL is not a solution to each and every application. It is worth noting that most of the products do throw away the traditional ACID nature giving way to BASE infrastructure. Having said that, some products standout—CouchDB and Neo4j, for example, are ACID compliant NoSQL databases.

  Adopting NoSQL is not only a technological change but also change in mindset, behaviour and thought process meaning that if you plan to hire a developer to work with NoSQL, he/she must understand the new models.

  In the next chapter, we will have a quick look at the taxonomy and jack up our vocabulary before we dive deeply into NoSQL.

  

[ ] Characteristics of NoSQL

  For decades, software engineers have been developing applications with relational databases in mind. The literature, architectures, frameworks, and toolkits have all been written keeping in mind the relational structure between the entities. The famous entity-relationship diagrams, or more commonly known as ER

  diagrams

  , form the basis for database design. And for quite some time now, engineers have used object-relational mapping (O/RM) tools to help them model relationships—is-a, has, one-to-one, one-to-many, many-to-many, et al.—between the objects that the software architects are great at defining.

  With the new scenarios and problems at hand for the new applications, specifically for web or mobile-based social applications with a lot of user generated content, people realized that NoSQL databases would be a stronger fit than RDBMS databases. In this chapter, we explore the traditional approach towards database, the challenges presented thereby, and the solutions provided by NoSQL for these challenges. We substantiate the ecosystem with a simple application as an example.

  Application

  ACME Foods is a grocery shop that wants to automate its inventory management. In this simplistic case, the process involves keeping an up-to-date status of its inventory and escalating to procurement, if levels are low.

  Characteristics of NoSQL RDBMS approach

  The traditional approach—using RDBMS—takes the following route:

  • Identify actors: The first step in the traditional approach is to identify various actors in the application. The actors can be internal or external to the application.
  • Define models: Once the actors are identified, the next step is to create models. Typically, there is many-to-one mapping between actors and models, that is, one model may represent multiple actors.
  • Define entities: Once the models and the object-relationships—by way of inheritance and encapsulation—are defined, the next step is to define the database entities. This requires defining the tables, columns, and column types. Special care has to be taken noting that databases allow null values for any column types, whereas programming languages may not allow, databases may have different size constraints as compared to really required, or a language allows, and much more.
  • Define relationships: One of more important steps is to be able to well define the relationship between the entities. The only way to define relationships across tables is by using foreign keys. The entity relationships correspond to inheritance, one-to-one, one-to-many, many-to-many, and other object relationships.
  • Program database and application: Once these are ready, engineers program database in PL/SQL (for most databases) or PL/pgSQL (for PostgreSQL) while software engineers develop the application.
  • Iterate: Engineers may provide feedback to the architects and designers about the existing limitations and required enhancements in the models, entities, and relationships.

  Mapping the steps to our example as follows:

  • Few of the actors identified include buyer, employee, purchaser, administrator, office address, shipping address, supplier address, item in inventory, and supplier.

  UserProfile

  • They may be mapped to a model and there may be subclasses as required— Administrator and PointOfSalesUser . Some of the other models include Department , Role , Product , Supplier , Address ,

  PurchaseOrder Invoice , and .

  • Simplistically, a database table may map each actor to a model.

  

[ ]

  • Foreign keys will be used to define the object relationships—one-to-many

  Department UserProfile many-to-many Role

  between and , between and UserProfile , and PurchaseOrder and Product .

  • One would need simple SQL queries to access basic information while queries collating data across tables will need complex JOINs.
  • Based on the inputs received later in time, one or more of these may need to be updated. New models and entities may evolve over time.

  At a high level, the following entities and their relationships can be identified:

  

[ ]

  Characteristics of NoSQL

  A department contains one or more users. A user may execute one or more sales orders each of which contains one or more products and updates the inventory. Items in inventory are provided by suppliers, which are notified if inventory level drops below critical levels. Representational class diagram may be closer to the one shown in the next figure:

  These actors, models, entities, and relationships are only representative. In the real application, the definitions will

be more elaborate and relationships more dense.

  

[ ]

Chapter 2 Let us take a quick look at the code that will take us there. To start with, the models may shape as follows:

  class UserProfile { int _id; UserType type; String firstName; String lastName; Department department; Collection<Role> roles; Address homeAddress; Address officeAddress; } class Address { String _id; String line1; String line2; String city; Country country; String zipCode; } enum Country { Australia, Bahrain, Canada, India, USA }

  The SQL statements used to create the tables for the previous models are:

  CREATE TABLE Address( _id INT NOT NULL AUTO_INCREMENT, line1 VARCHAR(64) NOT NULL, line2 VARCHAR(64), city VARCHAR(32) NOT NULL, country VARCHAR(24) NOT NULL, /* Can be normalized */ zipCode VARCHAR(8) NOT NULL, PRIMARY_KEY (_id) );

  

[ ]

  Characteristics of NoSQL [ ]

  CREATE TABLE UserProfile( _id INT NOT NULL AUTO_INCREMENT, firstName VARCHAR(32) NOT NULL, lastName VARCHAR(32) NOT NULL DEFAULT '', departmentId INT NOT NULL, homeAddressId INT NOT NULL, officeAddressId INT NOT NULL, PRIMARY_KEY (_id), FOREIGN_KEY (officeAddressId) REFERENCES Address(_id), FOREIGN_KEY (homeAddressId) REFERENCES Address(_id) );

The previous definitions are only

representative but give an idea of what it requires to work in RDMBS world.

  Challenges

  The aforementioned approach sounds great, however, it has a set of challenges. Let us explore some of the possibilities that ACME Foods has or may encounter in future:

  • The technical team faces a churn and key people maintaining the database—schema, programmability, business continuity process a.k.a. availability, and other aspects—leave. The company has a new engineering team and, irrespective of its expertise, has to quickly ramp up with existing entities, relationships, and code to maintain.
  • The company wishes to expand their web presence and enable online orders.

  This requires either creating new user-related entities or enhancing the current entities.

  • The company acquires another company and now needs to integrate the two database systems. This means refining models and entities. Critically, the database table relationships have to be carefully redefined.
  • The company grows big and has to handle hundreds of millions of queries a day across the country. More so, it receives a few million orders. To scale, it has tied up with thousands of suppliers across locations and must provide away to integrate the systems.
  • The company ties up with a few or several customer facing companies and intends to supply services to them to increase their sales. For this, it must integrate with multiple systems and also ensure that its application must be able to scale up to the combined needs of these companies, especially when multiple simultaneous orders are received in depleting inventory.

  • The company plans to provide API integration for aggregators to retrieve and process their data. More importantly, it must ensure that the API must be forward compatible meaning that in future if it plans to change their internal database schema for whatever reasons, it must—if at all—minimally impact the externally facing API and schema for data-exchange.
  • The company plans to leverage social networking sites, such as Facebook,

  Twitter, and FourSquare. For this, it seeks to not only use the simple widgets provided but also gather, monitor, and analyze statistics gathered. The preceding functional requirements can be translated into the following technical requirements as far as the database is concerned:

  • Schema flexibility: This will be needed during future enhancements and integration with external applications —outbound or inbound. RDBMS are quite inflexible in their design. More often than not, adding a column is an absolute no-no, especially if the table has some data and the reason lies in the constraint of having a default value for the new column and that the existing rows, by default, will have that default value. As a result you have to scan through the records and update the values as required, even if it can be automated. It may not be complex always, but frowned upon especially when the number of rows is large or number of columns to add is sufficiently large. You end up creating new tables and increase complexity by introducing relationships across the tables.
  • Complex queries: Traditionally, the tables are designed denormalized which means that the developers end up writing complex so-called JOIN queries which are not only difficult to implement and maintain but also take substantial database resources to execute.
  • Data update: Updating data across tables is probably one of the more complex scenarios especially if they are to be a part of the transaction. Note that keeping the transaction open for a long duration hampers the performance. One also has to plan for propagating the updates to multiple nodes across the system. And if the system does not support multiple masters or writing to multiple nodes simultaneously, there is a risk of node-failure and the entire application moving to read-only mode.

  

[ ]

  Characteristics of NoSQL

  • Scalability: More often than not, the only scalability that may be required is for read operations. However, several factors impact this speed as operations grow. Some of the key questions to ask are:

  ° What is the time taken to synchronize the data across physical database instances? °

  What is the time taken to synchronize the data across datacenters? °

  What is the bandwidth requirement to synchronize data? Is the data exchanged optimized? ° What is the latency when any update is synchronized across servers? Typically, the records will be locked during an update.

  NoSQL approach

  NoSQL-based solutions provide answers to most of the challenges that we put up. Note that if ACME Grocery is very confident that it will not shape up as we discussed earlier, we do not need NoSQL. If ACME Grocery does not intend to grow, integrate, or provide integration with other applications, surely, the RDBMS will suffice. But that is not how anyone would like the business to work in the long term. So, at some point in time, sooner or later, these questions will arise. Let us see what NoSQL has to offer against each technical question that we have: • Schema flexibility: Column-oriented databases ( http://en.wikipedia.

  org/wiki/Column-oriented_DBMS ) store data as columns as opposed to

  rows in RDBMS. This allows flexibility of adding one or more columns as required, on the fly. Similarly, document stores that allow storing semi- structured data are also good options.

  • Complex queries: NoSQL databases do not have support for relationships or foreign keys. There are no complex queries. There are no JOIN statements. Is that a drawback? How does one query across tables? It is a functional drawback, definitely. To query across tables, multiple queries must be executed. Database is a shared resource, used across application servers and must not be released from use as quickly as possible. The options involve combination of simplifying queries to be executed, caching data, and performing complex operations in application tier.

  

[ ]

Chapter 2 A lot of databases provide in-built entity-level caching. This means that as

  and when a record is accessed, it may be automatically cached transparently by the database. The cache may be in-memory distributed cache for performance and scale.

  • Data update: Data update and synchronization across physical instances are difficult engineering problems to solve. Synchronization across nodes within a datacenter has a different set of requirements as compared to synchronizing across multiple datacenters. One would want the latency within a couple of milliseconds or tens of milliseconds at the best. NoSQL solutions offer great synchronization options. MongoDB ( http://www.mongodb.org/display/DOCS/

  Sharding+Introduction ), for example, allows concurrent http://www.mongodb.org/display/DOCS/

  updates across nodes (

  

How+does+concurrency+work ), synchronization with conflict resolution and

  eventually, consistency across the datacenters within an acceptable time that would run in few milliseconds. As such, MongoDB has no concept of isolation. Note that now because the complexity of managing the transaction may be moved out of the database, the application will have to do some hard work. An example of this is a two-phase commit while implementing transactions ( http://docs.mongodb.org/manual/tutorial/

  perform-two-phase-commits/ ).

  Do not worry or get scared. A plethora of databases offer Multiversion

  concurrency control

  (MCC)to achieve transactional consistency

  http://en.wikipedia.org/wiki/Multiversion_concurrency_control

  ( ).

  http://www.infoq.com/

  Surprisingly, eBay does not use transactions at all (

  

interviews/dan-pritchett-ebay-architecture ). Well, as Dan Pritchett

  ( http://www.addsimplicity.com/ ), Technical Fellow at eBay puts it, eBay.com does not use transactions. Note that PayPal does use transactions.

  • Scalability: NoSQL solutions provider greater scalability for obvious reasons. A lot of complexity that is required for transaction oriented RDBMS does not exist in ACID non-compliant NoSQL databases. Interestingly, since NoSQL do not provide cross-table references and there are no JOIN queries possible, and because one cannot write a single query to collate data across multiple tables, one simple and logical solution is to—at times—duplicate the data across tables. In some scenarios, embedding the information within the primary entity—especially in one-to-one mapping cases—may be a great idea.

  [ ]

  Characteristics of NoSQL

  Revisiting our earlier case of Address and UserProfile , if we use the document store, we can use JSON format to structure the data so that we do not need cross- table queries at all. An example of how the data may look like is given as follows:

  //UserProfile { "_id": "98ef65e7-52e4-4466-bacc-3a8dc0c15c79", "firstName": "Gaurav", "lastName": "Vaish", "department": "f0adcbf5-7389-4491-9c42-f39a9d3d4c75", "homeAddress": { "_id": "fa62fd39-17f8-4a16-80d6-71a5b71d758d", "line1": "123, 45th Main" "city" : "NoSQLLand", "country": "India", "zipCode": "123456" } }

  We explore various NoSQL database classes—based on data models provided—in Chapter 3, NoSQL Storage Types.

  It is not that the new companies start with NoSQL straightaway. One can start with RDBMS and migrate to NoSQL—just keep in mind that it is not going to be trivial. Or better still, start with NoSQL. Even better, start with a mix of RDBMS and NoSQL. As we will see later, there are scenarios where it may be best to have a mix of the two databases.

  A big case in consideration here is that of Netflix. The company moved from Oracle RDBMS to Apache Cassandra ( http://www.slideshare.net/hluu/netflix-

  moving-to-cloud

  ), and they could achieve over a million writes per second. Yes! That is 1,000,000 writes per second ( http://techblog.netflix.com/2011/11/

  benchmarking-cassandra-scalability-on.html ) across the cluster with over

  10,000 writes per second per node while maintaining the average latency at less than 0.015 milliseconds! And the total cost of setting it all up and running on Amazon EC2 Cloud was at around $60 per hour—not per node but for a cluster of 48 nodes. Per node cost is only $1.25 per hour inclusive of the storage capacity of 12.8 Terra-bytes, network read bandwidth of 22 Mbps, and write bandwidth of 18.6Mbps.

  

[ ]

Chapter 2 The preceding case-in-hand should not undermine

  the power of and features provided by Oracle RDBMS database. I have always considered it as one of the best

commercial solutions available in RDBMS space.