Design of Parallel Systems
18.8 Design of Parallel Systems
So far this chapter has concentrated on parallelization of data storage and of query processing. Since large-scale parallel database systems are used primarily for storing large volumes of data, and for processing decision-support queries on those data, these topics are the most important in a parallel database system. Parallel loading of data from external sources is an important requirement, if we are to handle large volumes of incoming data.
A large parallel database system must also address these availability issues: • Resilience to failure of some processors or disks.
• Online reorganization of data and schema changes. We consider these issues here.
816 Chapter 18 Parallel Databases
With a large number of processors and disks, the probability that at least one processor or disk will malfunction is significantly greater than in a single- processor system with one disk. A poorly designed parallel system will stop functioning if any component (processor or disk) fails. Assuming that the proba- bility of failure of a single processor or disk is small, the probability of failure of the system goes up linearly with the number of processors and disks. If a single processor or disk would fail once every 5 years, a system with 100 processors would have a failure every 18 days.
Therefore, large-scale parallel database systems, such as Teradata, and IBM Informix XPS , are designed to operate even if a processor or disk fails. Data are replicated across at least two processors. If a processor fails, the data that it stored can still be accessed from the other processors. The system keeps track of failed processors and distributes the work among functioning processors. Requests for data stored at the failed site are automatically routed to the backup sites that store a replica of the data. If all the data of a processor A are replicated at a single processor B, B will have to handle all the requests to A as well as those to itself, and that will result in B becoming a bottleneck. Therefore, the replicas of the data of a processor are partitioned across multiple other processors.
When we are dealing with large volumes of data (ranging in the terabytes), simple operations, such as creating indices, and changes to schema, such as adding a column to a relation, can take a long time—perhaps hours or even days. Therefore, it is unacceptable for the database system to be unavailable while such operations are in progress. Most database systems allow such operations to be performed online , that is, while the system is executing other transactions.
Consider, for instance, online index construction . A system that supports this feature allows insertions, deletions, and updates on a relation even as an index is being built on the relation. The index-building operation therefore cannot lock the entire relation in shared mode, as it would have done otherwise. Instead, the process keeps track of updates that occur while it is active and incorporates the changes into the index being constructed. (Most database systems today support online index construction, since this feature is very important even for non-parallel database systems.)
In recent years, a number of companies have developed new parallel database products, including Netezza, DATA llegro (which was acquired by Microsoft), Greenplum, and Aster Data. Each of these products runs on systems containing tens to thousands of nodes, with each node running an instance of an underlying database; Each product manages the partitioning of data, as well as parallel processing of queries, across the database instances.
Netezza, Greenplum and Aster Data use P ostgre SQL as the underlying data- base; DATA llegro originally used Ingres as the underlying database system, but moved to SQL S erver subsequent to its acquisition by Microsoft. By building on top of an existing database system, these systems are able to leverage the data stor- age, query processing, and transaction management features of the underlying database, leaving them free to focus on data partitioning (including replication for fault tolerance), fast interprocessor communication, parallel query process- ing, and parallel-query optimization. Another benefit of using a public domain
18.9 Parallelism on Multicore Processors 817
database such as P ostgre SQL is that the software cost per node is very low; in contrast commercial databases have a significant per-processor cost.
It is also worth mentioning that Netezza and DATA llegro actually sell data warehouse “appliances”, which include hardware and software, allowing cus- tomers to build parallel databases with minimal effort.
Parts
» Indian Institute of Technology, Bombay
» Data Mining and Information Retrieval
» Structure of Relational Databases
» Database Schema When we talk about a database, we must differentiate between the database
» Basic Structure of SQL Queries
» Modification of the Database
» • Embedded SQL : Like dynamic SQL , embedded SQL provides a means by
» Advanced Aggregation Features**
» The Cartesian-Product Operation
» The Tuple Relational Calculus
» The Entity-Relationship Model
» • For an n-ary relationship set with an arrow on one of its edges, the primary
» Entity-Relationship Design Issues
» Representation of Generalization
» Alternative Notations for Modeling Data
» Other Aspects of Database Design
» Features of Good Relational Designs
» Atomic Domains and First Normal Form
» Decomposition Using Functional Dependencies
» BCNF Decomposition Algorithm
» Decomposition Using Multivalued Dependencies
» Application Programs and User Interfaces
» Overview of Physical Storage Media
» Magnetic Disk and Flash Storage
» Organization of Records in Files
» Comparison of Ordered Indexing and Hashing
» Implementation of Pipelining
» Evaluation Algorithms for Pipelining
» Transformation of Relational Expressions
» (A, r ), the number of distinct values that appear in the relation r for attribute
» Advanced Topics in Query Optimization**
» Transaction Atomicity and Durability
» Transaction Isolation and Atomicity
» Implementation of Isolation Levels
» Transactions as SQL Statements
» Weak Levels of Consistency in Practice
» Concurrency in Index Structures**
» Failure with Loss of Nonvolatile Storage
» Early Lock Release and Logical Undo Operations
» Centralized and Client – Server Architectures
» Parallelism on Multicore Processors
» Recovery and Concurrency Control
» Distributed Query Processing
» Heterogeneous Distributed Databases
» Partitioning and Retrieving Data
» Transactions and Replication
» Decision-Tree Construction Algorithm
» Relevance Ranking Using Terms
» Synonyms, Homonyms, and Ontologies
» Crawling and Indexing the Web
» Information Retrieval: Beyond Ranking of Pages
» Structured Types and Inheritance in SQL
» Array and Multiset Types in SQL
» Application Program Interfaces to XML
» Native Storage within a Relational Database
» Other Issues in Application Development
» Representation of Geographic Data
» Transaction-Processing Monitors
» Real-Time Transaction Systems
» PostgreSQL Implementation of MVCC
» Database Design and Querying Tools
» Database Administration Tools
» Business Intelligence Features
Show more