COP 5725 Fall 2012 Database Management Systems
University of Florida, CISE Department, Prof. Daisy Zhe Wang

External Sorting
- External Merge Sort
- Internal Sort
- Optimizations

Why Sort?
- A classic problem in computer science!
- Data requested in sorted order
  - e.g., find students in increasing gpa order
- Sorting is the first step in bulk loading a B+ tree index.
- Sorting is useful for eliminating duplicate copies in a collection of records. (Why?)
- The sort-merge join algorithm involves sorting.
- Problem: sort 1GB of data with 1MB of RAM.
2-Way Sort: Requires 3 Buffers
- Pass 1: Read a page, sort it, write it.
  - only one buffer page is used
- Pass 2, 3, …, etc.: merge pairs of runs.
  - three buffer pages used
[Figure: main memory buffers INPUT 1 and INPUT 2 feeding OUTPUT, with pages read from and written back to disk]

Two-Way External Merge Sort
- Each pass we read + write each page in the file.
- N pages in the file => the number of passes = ⌈log₂ N⌉ + 1
- So total cost is: 2N (⌈log₂ N⌉ + 1)
- Idea: Divide and conquer: sort subfiles and merge
[Figure: two-way merge sort of a 7-page input file, showing 1-page, 2-page, 4-page, and 8-page runs across Passes 0-3]
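The scheme above can be sketched in Python. This is a simulation, not code from the slides: each "page" is modeled as a small list, and the names `PAGE_SIZE`, `two_way_external_sort`, and `merge_runs` are chosen here for illustration.

```python
PAGE_SIZE = 2  # assumed page capacity, in records

def two_way_external_sort(pages):
    """Sort a 'file' given as a list of pages (each page a list of records)."""
    # Pass 0: sort each page individually, producing 1-page runs.
    runs = [[sorted(p)] for p in pages]   # each run is a list of pages
    passes = 1
    # Passes 1, 2, ...: repeatedly merge pairs of runs.
    while len(runs) > 1:
        merged = []
        for i in range(0, len(runs), 2):
            if i + 1 < len(runs):
                merged.append(merge_runs(runs[i], runs[i + 1]))
            else:
                merged.append(runs[i])    # odd run out, carried forward
        runs = merged
        passes += 1
    return runs[0], passes

def merge_runs(r1, r2):
    """Merge two sorted runs (simulating the INPUT 1 / INPUT 2 / OUTPUT buffers)."""
    a = [rec for page in r1 for rec in page]
    b = [rec for page in r2 for rec in page]
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] <= b[j]:
            out.append(a[i]); i += 1
        else:
            out.append(b[j]); j += 1
    out.extend(a[i:]); out.extend(b[j:])
    # Re-split the merged record stream into pages.
    return [out[k:k + PAGE_SIZE] for k in range(0, len(out), PAGE_SIZE)]

# The 7-page file from the slide figure: expect ceil(log2(7)) + 1 = 4 passes.
file_pages = [[3, 4], [6, 2], [9, 4], [8, 7], [5, 6], [3, 1], [2]]
run, passes = two_way_external_sort(file_pages)
```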
General External Merge Sort
- More than 3 buffer pages. How can we utilize them?
- To sort a file with N pages using B buffer pages:
  - Pass 0: use B buffer pages. Produce ⌈N/B⌉ sorted runs of B pages each.
  - Pass 1, 2, …, etc.: merge B-1 runs at a time.
[Figure: B-1 INPUT buffers and one OUTPUT buffer in main memory, merging runs between disk reads and writes]
- Number of passes: 1 + ⌈log_{B-1} ⌈N/B⌉⌉
- Cost = 2N * (# of passes)
- E.g., with 5 buffer pages, to sort a 108-page file:
  - Pass 0: ⌈108/5⌉ = 22 sorted runs of 5 pages each (last run is only 3 pages)
  - Pass 1: ⌈22/4⌉ = 6 sorted runs of 20 pages each (last run is only 8 pages)
  - Pass 2: 2 sorted runs, 80 pages and 28 pages
  - Pass 3: Sorted file of 108 pages
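The pass count and I/O cost above are easy to check with a short calculation. A sketch (the function names are my own):

```python
from math import ceil

def num_passes(N, B):
    """1 + ceil(log_{B-1}(ceil(N/B))): Pass 0 plus the (B-1)-way merge passes."""
    runs = ceil(N / B)            # sorted runs after Pass 0
    passes = 1
    while runs > 1:               # each merge pass combines B-1 runs at a time
        runs = ceil(runs / (B - 1))
        passes += 1
    return passes

def sort_cost(N, B):
    """Total I/O cost: read + write every page once per pass."""
    return 2 * N * num_passes(N, B)

# The slide's example: 108-page file, 5 buffer pages.
# Pass 0 -> 22 runs, Pass 1 -> 6, Pass 2 -> 2, Pass 3 -> 1: four passes.
```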
Internal Sort Algorithm (I): Quick Sort
- Quicksort is a fast way to sort in memory.
  - On average O(n log n)
  - In practice, faster than other O(n log n) algorithms
  - Sequential and localized memory references work well with a cache
  - Worst case O(n²)
- Generates runs of B pages each
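A minimal in-place quicksort sketch in Python, with random pivots to make the O(n²) worst case unlikely (this is a generic illustration, not the DBMS's actual implementation):

```python
import random

def quicksort(a, lo=0, hi=None):
    """In-place quicksort with random pivots; expected O(n log n)."""
    if hi is None:
        hi = len(a) - 1
    if lo >= hi:
        return
    pivot = a[random.randint(lo, hi)]   # random pivot avoids adversarial input
    i, j = lo, hi
    while i <= j:                       # Hoare-style partition
        while a[i] < pivot:
            i += 1
        while a[j] > pivot:
            j -= 1
        if i <= j:
            a[i], a[j] = a[j], a[i]
            i += 1
            j -= 1
    quicksort(a, lo, j)                 # recurse on both halves
    quicksort(a, i, hi)
```

In a DBMS, Pass 0 would apply this to each B-page chunk of the file to produce the initial sorted runs.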
Internal Sort Algorithm (II): Heap Sort
- An alternative is "tournament sort" (a.k.a. "heapsort")
  - Average and worst case are both O(n log n)
  - No need to worry about the worst-case scenario, which could be caused by a malicious attack
- Generates runs with an average length of 2B
  - longer runs -> fewer passes!
Heap Sort (Cont.)
- Top: Read in B-2 blocks
- Fill: Find the smallest record in the current set greater than the largest value in the output buffer
- If one exists:
  - add it to the end of the output buffer
  - fill the moved record's slot with the next value from the input buffer; if the input buffer is empty, refill it
- Else:
  - end the current run and start a new run
- goto Fill
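The run-generation loop above (replacement selection) can be sketched in Python with a heap for the current set. Buffer sizes are simplified to record counts, and the name `replacement_selection` is my own:

```python
import heapq
import itertools

def replacement_selection(records, memory_size):
    """Generate sorted runs via replacement selection (tournament sort).

    With a current set of `memory_size` records, runs average about twice
    that length on randomly ordered input (the snowplow argument).
    """
    it = iter(records)
    current = list(itertools.islice(it, memory_size))  # the current set
    heapq.heapify(current)

    runs, run = [], []
    frozen = []                          # records held back for the next run
    while current:
        smallest = heapq.heappop(current)
        run.append(smallest)             # move record to the output buffer
        try:
            nxt = next(it)
            if nxt >= smallest:
                heapq.heappush(current, nxt)   # can still go into this run
            else:
                frozen.append(nxt)             # must wait for the next run
        except StopIteration:
            pass
        if not current:                  # current set exhausted: end the run
            runs.append(run)
            run = []
            current = frozen
            heapq.heapify(current)
            frozen = []
    if run:
        runs.append(run)
    return runs

# Demo: 1,000 random records with a 10-record current set.
import random
random.seed(0)
data = [random.randrange(1000) for _ in range(1000)]
runs = replacement_selection(data, 10)
```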
More on Heapsort
- Fact: the average length of a run is 2B
- The "snowplow" analogy: Imagine a snowplow moving around a circular track on which snow falls at a steady rate. At any instant, there is a certain amount of snow S on the track. Some falling snow lands in front of the plow, some behind. During the next revolution of the plow, all of this is removed, plus 1/2 of what falls during that revolution. Thus, the plow removes 2S amount of snow.
Increase Performance for External Merge Sort
- Decrease # of passes
  - quicksort -> heapsort (longer runs)
  - use indexes that contain all necessary fields
- Decrease I/O cost
  - Blocked I/O
  - Double Buffering
Blocked I/O
- So far, we do I/O a page at a time (input/output).
- Increase I/O efficiency by reading a block of pages sequentially!
- Suggests we should make each buffer (input/output) be a block of pages.
  - But this will reduce fan-out during merge passes: only ⌊(B − b)/b⌋ runs can be merged at a time.
  - In practice, most files are still sorted in 2-3 passes.
- With block size b = 32, the initial pass produces runs of size 2B.
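The combined effect can be sketched numerically: assuming Pass 0 produces runs of 2B pages (heapsort) and each merge pass combines ⌊(B − b)/b⌋ runs (blocked I/O), the pass count comes out as follows. The function name is my own, and slide tables may differ slightly in their assumptions:

```python
from math import ceil

def optimized_passes(N, B, b=32):
    """# of passes with heapsort runs of 2B pages and b-page blocked I/O."""
    fan_in = (B - b) // b          # runs merged at a time: floor((B - b)/b)
    runs = ceil(N / (2 * B))       # replacement selection: runs of ~2B pages
    passes = 1
    while runs > 1:
        runs = ceil(runs / fan_in)
        passes += 1
    return passes
```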
Double Buffering
- To reduce CPU wait time for an I/O request to complete, we can prefetch into a `shadow block'.
  - Potentially more passes; in practice, most files are still sorted in 2-3 passes.
  - In reality, the CPU can also work on other processes while waiting.
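A minimal Python sketch of the idea: a helper thread fills the "shadow" buffer while the consumer processes the current one. The name `double_buffered_read` is my own, and a real DBMS does this with asynchronous disk I/O rather than threads:

```python
import threading
import queue

def double_buffered_read(blocks):
    """Yield blocks while the next one is prefetched into a shadow buffer."""
    q = queue.Queue(maxsize=1)           # holds one prefetched shadow block

    def reader():
        for b in blocks:                 # simulates the I/O filling a block
            q.put(b)
        q.put(None)                      # end-of-file marker

    threading.Thread(target=reader, daemon=True).start()
    while (block := q.get()) is not None:
        yield block                      # CPU works on this block meanwhile
```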
Sorting Records!
- Sorting has become a blood sport!
  - Parallel sorting is the name of the game ...
- Sort 1M records of size 100 bytes
  - Typical DBMS: 15 minutes
  - World record: 3.5 seconds
    - 12-CPU SGI machine, 96 disks, 2GB of RAM
- New benchmarks proposed:
  - Minute Sort: How many can you sort in 1 minute?
  - Dollar Sort: How many can you sort for $1.00?
Using B+ Trees for Sorting
- Scenario: Table to be sorted has a B+ tree index on the sorting column(s).
- Idea: Can retrieve records in order by traversing the leaf pages.
- Is this a good idea?
- Cases to consider:
  - B+ tree is clustered: Good idea!
  - B+ tree is not clustered: Could be a very bad idea!

Clustered B+ Tree Used for Sorting
- Cost: traverse from the root to the left-most leaf, then retrieve all leaf pages (Alternative 1).
- If Alternative 2 is used: additional cost of retrieving data records, but each page is fetched just once.
- Always better than external sorting!
[Figure: index (directs search) over the data entries ("sequence set"), which point to the data records]

Unclustered B+ Tree Used for Sorting
- Alternative (2) for data entries; each data entry contains the rid of a data record. In general, one I/O per data record!
- p: # of records per page
- B=1,000 and block size=32 for sorting
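The cost arithmetic behind the comparison can be sketched as follows, assuming Pass 0 produces runs of B pages, a blocked-I/O fan-in of ⌊(B − b)/b⌋, and one I/O per record for the unclustered case (function names are my own):

```python
from math import ceil

def sorting_cost(N, B=1000, b=32):
    """External sort I/O cost: 2N per pass; Pass 0 makes ceil(N/B) runs,
    then each pass merges floor((B - b)/b) runs at a time (blocked I/O)."""
    runs, passes = ceil(N / B), 1
    fan_in = (B - b) // b
    while runs > 1:
        runs = ceil(runs / fan_in)
        passes += 1
    return 2 * N * passes

def unclustered_cost(N, p):
    """Unclustered B+ tree retrieval: roughly one I/O per data record,
    with N data pages of p records each."""
    return p * N

for N in (100, 1_000, 10_000, 100_000, 1_000_000):
    print(N, sorting_cost(N), [unclustered_cost(N, p) for p in (1, 10, 100)])
```

Even for p = 1, the unclustered approach only ties external sorting on tiny files; for realistic p it is far worse.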
Summary
- External sorting is important; a DBMS may dedicate part of the buffer pool to sorting!
- External merge sort minimizes disk I/O cost:
  - Pass 0: produces sorted runs of size B (# buffer pages). Later passes: merge runs.
  - # of runs merged at a time depends on B and on block size.
  - Larger block size means less I/O cost per page.
  - Larger block size means smaller # of runs merged.
  - In practice, most files are sorted in 2-3 passes.

Summary, cont.
- Choice of internal sort algorithm may matter:
  - Quicksort: Quick!
  - Heap/tournament sort: slower (2x), but longer runs
- The best sorts are wildly fast:
  - Despite 40+ years of research, we're still improving!
- Clustered B+ tree is good for sorting; unclustered tree is usually very bad.
Number of Passes of External Sort

N               B=3   B=5   B=9   B=17  B=129  B=257
100               7     4     3     2      1      1
1,000            10     5     4     3      2      2
10,000           13     7     5     4      2      2
100,000          17     9     6     5      3      3
1,000,000        20    10     7     5      3      3
10,000,000       23    12     8     6      4      3
100,000,000      26    14     9     7      4      4
1,000,000,000    30    15    10     8      5      4
Heap Sort Example
[Figure: two snapshots of heap sort run generation, showing records flowing from the input buffer through the current set to the output buffer]
Number of Passes of Optimized Sort
(block size b = 32; initial pass produces runs of size 2B)

N               B=1,000  B=5,000  B=10,000
100                   1        1         1
1,000                 1        1         1
10,000                2        2         1
100,000               3        2         2
1,000,000             3        2         2
10,000,000            4        3         3
100,000,000           5        3         3
1,000,000,000         5        4         3
[Figure: Double Buffering with B main memory buffers and a k-way merge; each INPUT block (INPUT 1 … INPUT k) and the OUTPUT block is paired with a shadow block (INPUT 1' … INPUT k', OUTPUT') of block size b for prefetching]
External Sorting vs. Unclustered Index

N            Sorting      p=1          p=10           p=100
100          200          100          1,000          10,000
1,000        2,000        1,000        10,000         100,000
10,000       40,000       10,000       100,000        1,000,000
100,000      600,000      100,000      1,000,000      10,000,000
1,000,000    8,000,000    1,000,000    10,000,000     100,000,000
10,000,000   80,000,000   10,000,000   100,000,000    1,000,000,000