A collection of independent databases on non databases on non-- networked computers
Chapter 13: Chapter 13: Chapter 13: Chapter 13: Distributed Databases Distributed Databases Modern Database Management Modern Database Management
7
7 th th
Edition Edition
7
Objectives Objectives Objectives Objectives
Definition of terms Definition of terms Definition of terms Definition of terms
Explain business conditions driving distributed databases Explain business conditions driving distributed databases
Describe salient characteristics of distributed database Describe salient characteristics of distributed database Describe salient characteristics of distributed database Describe salient characteristics of distributed database environments environments
Explain advantages and risks of distributed databases Explain advantages and risks of distributed databases
E Explain strategies and options for distributed database Explain strategies and options for distributed database E l i l i t t t t i i d d ti ti f f di t ib t d d t b di t ib t d d t b design design
Discuss synchronous and asynchronous data replication Discuss synchronous and asynchronous data replication Discuss synchronous and asynchronous data replication Discuss synchronous and asynchronous data replication and partitioning and partitioning
Discuss optimized query processing in distributed Discuss optimized query processing in distributed databases databases databases databases
Explain salient features of several distributed database Explain salient features of several distributed database management systems management systems
Chapter 13 © 2005 by Prentice Hall © 2005 by Prentice Hall
Definitions Definitions Definitions Definitions
A A A A single single single single
Distributed Database: Distributed Database: Distributed Database: Distributed Database:
logical database logical database that is spread that is spread physically across computers in multiple physically across computers in multiple physically across computers in multiple physically across computers in multiple locations that are connected by a data locations that are connected by a data communications link communications link
A collection A collection A collection A collection
Decentralized Database: Decentralized Database: Decentralized Database: Decentralized Database:
of of databases on non databases on non-- independent independent networked computers networked computers networked computers networked computers
They are NOT the same thing! They are NOT the same thing!
Chapter 13 © 2005 by Prentice Hall © 2005 by Prentice Hall
Reasons for Reasons for Reasons for Reasons for
Distributed Database Distributed Database
Business unit autonomy and distribution Business unit autonomy and distribution
Data sharing Data sharing
Data communication costs Data communication costs Data communication costs Data communication costs
Data communication reliability and costs Data communication reliability and costs
Multiple application vendors Multiple application vendors
Database recovery Database recovery Database recovery Database recovery
Transaction and analytic processing Transaction and analytic processing y y p p g g
Chapter 13 © 2005 by Prentice Hall © 2005 by Prentice Hall Figure 13-1 – Distributed database environments (adapted from Bell and Grimson, 1992) Chapter 13 © 2005 by Prentice Hall © 2005 by Prentice Hall
Distributed Database Options Distributed Database Options Distributed Database Options Distributed Database Options
Homogeneous -- Same DBMS at each node Homogeneous Same DBMS at each node
Autonomous -- I ndependent DBMSs Autonomous I ndependent DBMSs
Non--autonomous Non autonomous -- Central, coordinating DBMS Central, coordinating DBMS
Easy to manage, difficult to enforce Easy to manage, difficult to enforce
Heterogeneous Heterogeneous -- Different DBMSs at different H t H t Diff Different DBMSs at different Diff t DBMS t DBMS t diff t diff t t nodes nodes
Systems – Systems S t S t – With full or partial DBMS functionality With f ll With f ll With full or partial DBMS functionality ti l DBMS f ti l DBMS f ti ti lit lit
Gateways Gateways -- Simple paths are created to other Simple paths are created to other databases without the benefits of one logical databases without the benefits of one logical databases without the benefits of one logical databases without the benefits of one logical database database
Difficult to manage, preferred by independent Difficult to manage, preferred by independent organizations organizations
Chapter 13 © 2005 by Prentice Hall © 2005 by Prentice Hall
Distributed Database Options (cont.) Distributed Database Options (cont.) p ( ) p ( )
- Supports some or all functionality of Supports some or all functionality of one logical database one logical database
Systems Systems
Full DBMS Functionality Full DBMS Functionality -- All distributed DB functions All distributed DB functions
Partial Partial--Multi database Multi database -- Some distributed DB functions Some distributed DB functions
Federated Federated -- Supports local databases for unique data requests Supports local databases for unique data requests
Loose I ntegration Loose I ntegration -- Local dbs have their own schemas Local dbs have their own schemas
Tight I ntegration Tight I ntegration -- Local dbs use common schema Local dbs use common schema
Tight I ntegration Tight I ntegration -- Local dbs use common schema Local dbs use common schema
Unfederated Unfederated -- Requires all access to go through a central, Requires all access to go through a central, coordinating module coordinating module Chapter 13 © 2005 by Prentice Hall © 2005 by Prentice Hall
Homogeneous, Non Homogeneous, Non Homogeneous Non-- Homogeneous Non
Autonomous Database Autonomous Database
Data is distributed across all the nodes Data is distributed across all the nodes
Same DBMS at each node Same DBMS at each node
All data is managed by the distributed All data is managed by the distributed All data is managed by the distributed All data is managed by the distributed DBMS (no exclusively local data) DBMS (no exclusively local data)
All access is through one, global schema All access is through one, global schema
The global schema is the The global schema is the The global schema is the The global schema is the of all the of all the of all the of all the
union union union union local schema local schemaChapter 13 © 2005 by Prentice Hall © 2005 by Prentice Hall
Figure 13-2: Homogeneous Database Identical DBMSs
Typical Heterogeneous Typical Heterogeneous Typical Heterogeneous Typical Heterogeneous Environment Environment
Data distributed across all the nodes Data distributed across all the nodes
Different DBMSs may be used at each Different DBMSs may be used at each node node node node
Local access is done using the local DBMS Local access is done using the local DBMS
and schema and schema d d h h Remote access is done using the global Remote access is done using the global Remote access is done using the global Remote access is done using the global schema schema Chapter 13 © 2005 by Prentice Hall © 2005 by Prentice Hall
Figure 13-3: Typical Heterogeneous Environment Non-identical DBMSs
Major Objectives Major Objectives Major Objectives Major Objectives
Location Transparency Location Transparency
User does not have to know the location of the User does not have to know the location of the data data
Data requests automatically forwarded to Data requests automatically forwarded to appropriate sites appropriate sites
Local Autonomy Local Autonomy
Local site can operate with its database when Local site can operate with its database when network connections fail network connections fail
Each site controls its own data, security, logging, Each site controls its own data, security, logging, recovery recovery
Chapter 13 © 2005 by Prentice Hall © 2005 by Prentice Hall
Significant Trade Significant Trade--Offs Offs g g
Synchronous Synchronous
Some data inconsistency is tolerated Some data inconsistency is tolerated y y
NOTE: all this assumes replicated data (to be discussed later)
faster response time faster response time
Less overhead Less overhead
Lower data integrity Lower data integrity
Data update propagation is delayed Data update propagation is delayed
Distributed Database Distributed Database
Distributed Database Distributed Database
High overhead High overhead slow response times slow response times g g
Good for data integrity Good for data integrity
Data updates are immediately applied to all copies Data updates are immediately applied to all copies throughout network throughout network oug ou e o oug ou e o
All copies of the same data are always identical All copies of the same data are always identical
All copies of the same data are always identical All copies of the same data are always identical
Asynchronous Asynchronous
Advantages of Advantages of
Distributed Database over Distributed Database overCentralized Databases Centralized Databases Centralized Databases Centralized Databases
I ncreased reliability/ availability I ncreased reliability/ availability
Local control over data Local control over data
Modular growth Modular growth
Modular growth Modular growth
Lower communication costs Lower communication costs
Faster response for certain queries Faster response for certain queries
Chapter 13 © 2005 by Prentice Hall © 2005 by Prentice Hall
Disadvantages of Disadvantages of Disadvantages of Disadvantages of
Distributed Database Distributed Database Compared to Compared toCentralized Databases Centralized Databases
S f d l S f d l
Software cost and complexity Software cost and complexity
Processing overhead Processing overhead
Processing overhead Processing overhead
Data integrity exposure Data integrity exposure
Slower response for certain queries Slower response for certain queries
Chapter 13 © 2005 by Prentice Hall © 2005 by Prentice Hall
Options for Options for Options for Options for
Distributing a Database Distributing a Database g g Data replication Data replication
Copies of data distributed to different sites Copies of data distributed to different sites C C i i f d t di t ib t d t f d t di t ib t d t diff diff t it t it
Horizontal partitioning Horizontal partitioning
Different rows of a table distributed to different sites Different rows of a table distributed to different sites Diff Diff t t f f t bl di t ib t d t t bl di t ib t d t diff diff t it t it
Vertical partitioning Vertical partitioning
Diff Diff Different columns of a table distributed to different Different columns of a table distributed to different t t l l f f t bl di t ib t d t t bl di t ib t d t diff diff t t sites sites
Combinations of the above Combinations of the above Combinations of the above Combinations of the above
Chapter 13 © 2005 by Prentice Hall © 2005 by Prentice Hall
Data Replication Data Replication
Advantages: Advantages:
Reliability Reliability y y
Fast response Fast response
May avoid complicated distributed transaction May avoid complicated distributed transaction integrity routines (if replicated data is refreshed at integrity routines (if replicated data is refreshed at ( f ( f l l d d d d f f h d h d scheduled intervals) scheduled intervals)
Decouples nodes (transactions proceed even if Decouples nodes (transactions proceed even if Decouples nodes (transactions proceed even if Decouples nodes (transactions proceed even if some nodes are down) some nodes are down)
Reduced network traffic at prime time (if updates Reduced network traffic at prime time (if updates p p ( ( p p can be delayed) can be delayed)
Chapter 13 © 2005 by Prentice Hall © 2005 by Prentice Hall
Data Replication (cont.) Data Replication (cont.) Data Replication (cont ) Data Replication (cont )
Disadvantages: Disadvantages:
Additional requirements for storage space Additional requirements for storage space Additional requirements for storage space Additional requirements for storage space
Additional time for update operations Additional time for update operations
Complexity and cost of updating Complexity and cost of updating Complexity and cost of updating Complexity and cost of updating
I ntegrity exposure of getting incorrect data I ntegrity exposure of getting incorrect data if replicated data is not updated if if if replicated data is not updated li li t d d t t d d t i i t t d t d d t d simultaneously simultaneously
Therefore, better when used for non--volatile Therefore, better when used for non volatile
(read--only) data (read only) data
Chapter 13 © 2005 by Prentice Hall © 2005 by Prentice HallTypes of Data Replication Types of Data Replication yp yp p p
P P Push Replication – – Push Replication h R h R li li ti ti
updating site sends changes updating site sends changes updating site sends changes updating site sends changes to other sites to other sites
Pull Replication – Pull Replication receiving sites control when receiving sites control when update messages will be update messages will be update messages will be update messages will be processed processed
- –
Chapter 13 © 2005 by Prentice Hall © 2005 by Prentice Hall
Types of Push Replication Types of Push Replication Types of Push Replication Types of Push Replication
Broadcast update orders without requiring Broadcast update orders without requiring
Chapter 13 © 2005 by Prentice Hall © 2005 by Prentice Hall
Update messages stored in message queue until Update messages stored in message queue until processed by receiving site processed by receiving site
Done through use of triggers Done through use of triggers
Done through use of triggers Done through use of triggers
Broadcast update orders without requiring Broadcast update orders without requiring confirmation confirmation
Near Real Near Real--Time Replication Time Replication --
Snapshot Replication Snapshot Replication -- p p p p
Dynamic vs. shared update ownership Dynamic vs. shared update ownership
Dynamic vs. shared update ownership Dynamic vs. shared update ownership
Full or differential (incremental) snapshots Full or differential (incremental) snapshots
Master collects updates in log Master collects updates in log Master collects updates in log Master collects updates in log
Changes periodically sent to master site Changes periodically sent to master site
processed by receiving site processed by receiving site
I ssues for Data Replication I ssues for Data Replication
Data timeliness – Data timeliness – high tolerance for out high tolerance for out--of of--date date data may be required data may be required data may be required data may be required
DBMS capabilities DBMS capabilities – – if DBMS cannot support if DBMS cannot support multi--node queries, replication may be necessary multi multi node queries replication may be necessary multi node queries, replication may be necessary node queries replication may be necessary
Performance implications – Performance implications – refreshing may refreshing may cause performance problems for busy nodes cause performance problems for busy nodes cause performance problems for busy nodes cause performance problems for busy nodes
Network heterogeneity – Network heterogeneity – complicates replication complicates replication
N t N t Network communication capabilities – Network communication capabilities k k i i ti ti biliti biliti – complete complete l t l t
refreshes place heavy demand on refreshes place heavy demand on telecommunications telecommunications telecommunications telecommunicationsChapter 13 © 2005 by Prentice Hall © 2005 by Prentice Hall
H i t l P titi i H i t l P titi i Horizontal Partitioning Horizontal Partitioning
Unions across partitions Unions across partitions ease of query ease of query
No data replication No data replication backup vulnerability backup vulnerability
inconsistent inconsistent d d access speed access speed
Accessing data across partitions Accessing data across partitions
Disadvantages Disadvantages
Different rows of a table at different sites Different rows of a table at different sites
Only relevant data is available Only relevant data is available security security
better performance better performance
Local access optimization Local access optimization
Data stored close to where it is used Data stored close to where it is used efficiency efficiency
Advantages Advantages -- g g
Chapter 13 © 2005 by Prentice Hall © 2005 by Prentice Hall
Vertical Partitioning Vertical Partitioning
Different columns of a table at different Different columns of a table at different
sites sites Advantages and disadvantages are the Advantages and disadvantages are the Advantages and disadvantages are the Advantages and disadvantages are the same as for horizontal partitioning except same as for horizontal partitioning except that combining data across partitions is that combining data across partitions is that combining data across partitions is that combining data across partitions is more difficult because it requires joins more difficult because it requires joins (instead of unions) (instead of unions)
Chapter 13 © 2005 by Prentice Hall © 2005 by Prentice Hall
Figure 13-6
Distributed processing system for a manufacturing company Distributed processing system for a manufacturing company
Chapter 13 © 2005 by Prentice Hall © 2005 by Prentice Hall
Five Distributed Database Five Distributed Database Five Distributed Database Five Distributed Database
Organizations Organizations g g Centralized database, distributed access Centralized database, distributed access Replication with periodic snapshot update Replication with periodic snapshot update Replication with near real Replication with near real time Replication with near real Replication with near real--time time time synchronization of updates synchronization of updates
Partitioned, one logical database Partitioned, one logical database Partitioned, independent, nonintegrated Partitioned, independent, nonintegrated Partitioned independent nonintegrated Partitioned independent nonintegrated segments segments
Chapter 13 © 2005 by Prentice Hall © 2005 by Prentice Hall
Factors in Choice of Factors in Choice of Factors in Choice of Factors in Choice of
Distributed Strategy Distributed Strategy gy gy
Funding, autonomy, security Funding, autonomy, security
Site data referencing patterns Site data referencing patterns
Growth and expansion needs Growth and expansion needs Growth and expansion needs Growth and expansion needs
Technological capabilities Technological capabilities
Costs of managing complex technologies Costs of managing complex technologies
Need for reliable service Need for reliable service Need for reliable service Need for reliable service
Chapter 13 © 2005 by Prentice Hall © 2005 by Prentice Hall
Table 13-1: Distributed Design Strategies Chapter 13 © 2005 by Prentice Hall © 2005 by Prentice Hall
Distributed DBMS Distributed DBMS requires requires
Distributed database Distributed database distributed DBMS distributed DBMS
Functions of a distributed DBMS: Functions of a distributed DBMS:
Locate data with a Locate data with a distributed data dictionary distributed data dictionary
Determine location from which to retrieve data and Determine location from which to retrieve data and process query components process query components
DBMS translation between nodes with different local DBMS translation between nodes with different local DBMSs (using DBMSs (using ))
middleware middleware
Data consistency (via Data consistency (via multiphase commit protocols multiphase commit protocols ))
Global primary key control Global primary key control
Scalability Scalability
Security, concurrency, query optimization, failure recovery Security, concurrency, query optimization, failure recovery
Chapter 13 © 2005 by Prentice Hall © 2005 by Prentice Hall
Figure 13-10: Distributed DBMS architecture Chapter 13 © 2005 by Prentice Hall © 2005 by Prentice Hall
Local Transaction Steps Local Transaction Steps
1.2. Distributed DBMS checks distributed Distributed DBMS checks distributed data repository for location of data. data repository for location of data. data repository for location of data data repository for location of data Finds that it is Finds that it is local local
3.
3 3.
3 Distributed DBMS sends request to local Distributed DBMS sends request to local Distributed DBMS sends request to local Distributed DBMS sends request to local
4. Local DBMS processes request Local DBMS processes request l l S S 5.
5. Local DBMS sends results to application Local DBMS sends results to application pp pp
Chapter 13 © 2005 by Prentice Hall © 2005 by Prentice Hall
Figure 13-10: Distributed DBMS Architecture (cont.) (showing local transaction steps)
2
1
3
5
4 Local transaction – all data stored locally Chapter 13 © 2005 by Prentice Hall © 2005 by Prentice Hall
Global Transaction Steps Global Transaction Steps
1.Local DBMS at emote site sends es lts to dist ib ted Local DBMS at emote site sends es lts to dist ib ted 6.
Chapter 13 © 2005 by Prentice Hall © 2005 by Prentice Hall
8. Distributed DBMS at originating site sends results to Distributed DBMS at originating site sends results to
originating site originating site 8.
7. Remote distributed DBMS sends results back to Remote distributed DBMS sends results back to
7 Remote distributed DBMS sends results back to Remote distributed DBMS sends results back to 7.
7
DBMS at remote site DBMS at remote site
6. Local DBMS at remote site sends results to distributed Local DBMS at remote site sends results to distributed
5. Local DBMS at remote site processes request Local DBMS at remote site processes request
1. Application makes request to distributed DBMS Application makes request to distributed DBMS
its local DBMS if necessary, and sends request to local its local DBMS if necessary, and sends request to local DBMS DBMS DBMS DBMS 5.
4. Distributed DBMS at remote site translates request for Distributed DBMS at remote site translates request for
3. Distributed DBMS routes request to remote site Distributed DBMS routes request to remote site 4.
3. Distributed DBMS routes request to remote site Distributed DBMS routes request to remote site 3.
location of data. Finds that it is location of data. Finds that it is remote remote 3.
2. Distributed DBMS checks distributed data repository for Distributed DBMS checks distributed data repository for
2 Distributed DBMS checks distributed data repository for Distributed DBMS checks distributed data repository for 2.
2
application application
Figure 13-10: Distributed DBMS architecture (cont.) (showing global transaction steps)
2
3
1
6
3
7
4
6
7
8
5 Global transaction – some data is at remote site(s) Chapter 13 © 2005 by Prentice Hall © 2005 by Prentice Hall Distributed DBMS Distributed DBMS Transparency Objectives Transparency Objectives Location T anspa enc Location T anspa enc
Location Transparency Location Transparency
User/ application does not need to know where data resides User/ application does not need to know where data resides
R li ti T R li ti T
Replication Transparency Replication Transparency
User/ application does not need to know about duplication User/ application does not need to know about duplication F il T F il T
Failure Transparency Failure Transparency
Either all or none of the actions of a transaction are Either all or none of the actions of a transaction are committed committed committed committed
Each site has a transaction manager Each site has a transaction manager
Logs transactions and before and after images Logs transactions and before and after images
Logs transactions and before and after images Logs transactions and before and after images
Concurrency control scheme to ensure data integrity Concurrency control scheme to ensure data integrity
Requires special Requires special commit protocol commit protocol
Chapter 13 © 2005 by Prentice Hall © 2005 by Prentice Hall
Two--Phase Commit Two Phase Commit Two Two Phase Commit Phase Commit
Prepare Phase Prepare Phase Prepare Phase Prepare Phase
Coordinator receives a commit request Coordinator receives a commit request
Coordinator instructs all resource managers to Coordinator instructs all resource managers to get ready to “go either way” on the get ready to “go either way” on the t transaction. Each resource manager writes all transaction. Each resource manager writes all t ti ti E E h h it it ll ll
updates from that transaction to its own updates from that transaction to its own
physical log physical log physical log physical log Coordinator receives replies from all resource Coordinator receives replies from all resource managers. I f all are ok, it writes commit to managers. I f all are ok, it writes commit to managers I f all are ok it writes commit to managers I f all are ok it writes commit to its own log; if not then it writes rollback to its its own log; if not then it writes rollback to its log log log log
Chapter 13 © 2005 by Prentice Hall © 2005 by Prentice Hall
Two--Phase Commit (cont.) Two Phase Commit (cont.)
Commit Phase Commit Phase
Coordinator then informs each resource manager of Coordinator then informs each resource manager of its decision and broadcasts a message to either its decision and broadcasts a message to either commit or rollback (abort). I f the message is commit, commit or rollback (abort). I f the message is commit, commit or rollback (abort) commit or rollback (abort) I f the message is commit I f the message is commit then each resource manager transfers the update then each resource manager transfers the update from its log to its database from its log to its database from its log to its database from its log to its database
A failure during the commit phase puts a transaction A failure during the commit phase puts a transaction “in limbo.” This has to be tested for and handled with “in limbo.” This has to be tested for and handled with timeouts or polling timeouts or polling
Chapter 13 © 2005 by Prentice Hall © 2005 by Prentice Hall
Concurrency Control Concurrency Control
Concurrency Transparency Concurrency Transparency Design goal for distributed database Design goal for distributed database
Timestamping Timestamping
Timestamping Timestamping
Concurrency control mechanism Concurrency control mechanism
Alternative to locks in distributed databases Alternative to locks in distributed databases
Chapter 13 © 2005 by Prentice Hall © 2005 by Prentice Hall
Query Optimization Query Optimization
I n a query involving a multi I n a query involving a multi--site join and, possibly, a site join and, possibly, a distributed database with replicated files the distributed distributed database with replicated files the distributed distributed database with replicated files, the distributed distributed database with replicated files, the distributed DBMS must decide where to access the data and how to DBMS must decide where to access the data and how to proceed with the join. Three step process: proceed with the join. Three step process:
1
1 Query decomposition Query decomposition -- rewritten and simplified rewritten and simplified
2
2 Data localization Data localization -- query fragmented so that query fragmented so that
fragments reference data at only one site fragments reference data at only one site fragments reference data at only one site fragments reference data at only one site
3
3 Global optimization Global optimization --
Order in which to execute query fragments Order in which to execute query fragments
Order in which to execute query fragments Order in which to execute query fragments
Data movement between sites Data movement between sites
Where parts of the query will be executed Where parts of the query will be executed
Chapter 13 © 2005 by Prentice Hall © 2005 by Prentice Hall
Evolution of Distributed DBMS Evolution of Distributed DBMS
“Unit of Work” -- All of a transaction’s steps. “Unit of Work” All of a transaction’s steps.
Remote Unit of Work Remote Unit of Work
SQL statements originated at one location can be SQL statements originated at one location can be SQL statements originated at one location can be SQL statements originated at one location can be
executed as a single unit of work on a single executed as a single unit of work on a single
remote DBMS remote DBMS remote DBMS remote DBMSChapter 13 © 2005 by Prentice Hall © 2005 by Prentice Hall
Evolution of Distributed DBMS Evolution of Distributed DBMS Evolution of Distributed DBMS Evolution of Distributed DBMS
(cont.) (cont.)
Distributed Unit of Work Distributed Unit of Work
Different statements in a unit of work may refer to Different statements in a unit of work may refer to Different statements in a unit of work may refer to Different statements in a unit of work may refer to different remote sites different remote sites
All databases in a single SQL statement must be All databases in a single SQL statement must be at a single site at a single site ll
Distributed Request Distributed Request
A single SQL statement may refer to tables in A single SQL statement may refer to tables in more than one remote site more than one remote site
May not support replication transparency or failure May not support replication transparency or failure May not support replication transparency or failure May not support replication transparency or failure transparency transparency
Chapter 13 © 2005 by Prentice Hall © 2005 by Prentice Hall