File Organization and Storage Structures

  o Storage of data
    • – Primary Storage = Main Memory
      • Fast • Volatile • Expensive
    • – Secondary Storage = Files on disks or tapes
      • Non-Volatile

  Secondary Storage is preferred for storing data

CS3462 Introduction to Database Systems

Basic Concepts

  o Information is stored in data files
  o Each file is a sequence of records
  o Each record consists of one or more fields

  Sno   Lname  Position  NIN        Bno
  SL21  White  Manager   WK440211B  B5
  SG37  Beech  Snr Asst  WL432514C  B3
  SG14  Ford   Deputy    WL220658D  B3

Logical Record Vs Physical Record

  o Logical record
    • – “A record”
    • – Eg. The record of a staff member (SG37).
  o Physical record
    • – The unit of transfer between disk and primary storage.
    • – “A page”, “A block”

  Generally, a physical record consists of more than one logical record

  [Figure: the Staff records below stored on two pages (Page 1 and Page 2)]

  Sno   Lname  Position   NIN        Bno
  SL21  White  Manager    WK440211B  B5
  SG37  Beech  Snr Asst   WL432514C  B3
  SG14  Ford   Deputy     WL220658D  B3
  SA9   Howe   Assistant  WM532187D  B7
  SG5   Brand  Manager    WK588932E  B3
  SL41  Lee    Assistant  WA290573K  B5

File Organization & Access Method

  o File Organization means the physical arrangement of data in a file into records and pages on secondary storage
    • – Eg. Ordered files, indexed sequential files etc.
  o Access Method means the steps involved in storing and retrieving records from a file.
    • – Eg. Using an indexed access method to retrieve a record from an indexed sequential file.

Heap Files

  o Heap files are files of unordered records
  o Quick insertion (no particular ordering)
    • – When a new record is created, it is put in the last page of the file if there is sufficient space. Otherwise a new page is added to the file.
  o Slow retrieval (only linear search is possible)
    • – Pages are read from the file until the required record is found.
  o To delete a record, the record is marked as deleted. Space is reclaimed during periodic reorganization.
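The heap file behaviour described above can be sketched in a few lines of Python. This is only an illustrative sketch, not a real storage engine: the `HeapFile` class, its method names, and the page capacity of 2 records are assumptions made for the example.

```python
from dataclasses import dataclass, field

PAGE_CAPACITY = 2  # assumed number of record slots per page


@dataclass
class HeapFile:
    # Each page is a list of [key, deleted_flag] slots, kept in arrival order.
    pages: list = field(default_factory=list)

    def insert(self, key):
        # Put the record in the last page if it has room, otherwise add a new page.
        if not self.pages or len(self.pages[-1]) >= PAGE_CAPACITY:
            self.pages.append([])
        self.pages[-1].append([key, False])

    def find(self, key):
        # Linear search: read pages in order until the record is found.
        for page_no, page in enumerate(self.pages):
            for stored_key, deleted in page:
                if stored_key == key and not deleted:
                    return page_no
        return None

    def delete(self, key):
        # Only mark the record as deleted; the slot is not reused yet.
        for page in self.pages:
            for slot in page:
                if slot[0] == key and not slot[1]:
                    slot[1] = True
                    return True
        return False

    def reorganize(self):
        # Periodic reorganization: rewrite the file without the deleted records.
        live = [key for page in self.pages for key, deleted in page if not deleted]
        self.pages = []
        for key in live:
            self.insert(key)
```

Insertion touches only the last page, while `find` may have to read every page, which is the trade-off the slide summarizes as quick insertion and slow retrieval.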

Ordered Files

  o Ordered Files: Records are sorted on field(s) => Key
  o Allow Binary Searching

  Suppose one page stores one record. To search for SG37, search the middle page (6/2 = 3) first. We find that this page holds SG14, not SG37. Since SG37 is greater than SG14, we next search the middle page of the half of the file that follows page 3, and so on.

  o Inserting a record
    • – If the appropriate page is full, the whole file may have to be reorganized => Time consuming
    • – Solution: use a temporary unsorted file (transaction file). Merge it into the sorted file periodically.
  o Rarely used unless they come with an index => Indexed Sequential File
  o Both Heap Files and Ordered Files are also called Sequential Files.
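The binary search walk-through above can be traced with a short Python sketch. The page layout (one record per page, six pages) follows the slide; the exact sorted order of the six staff numbers and the `sort_key` rule (letter prefix first, then the number) are assumptions, chosen so that page 3 holds SG14 as the slide states.

```python
import re


def sort_key(sno):
    # Assumed ordering: letter prefix first, then the numeric part as an integer.
    prefix, number = re.fullmatch(r"([A-Z]+)(\d+)", sno).groups()
    return prefix, int(number)


def binary_search(pages, target):
    """Return (page_number, pages_visited); page numbers are 1-based."""
    low, high = 1, len(pages)
    visited = []
    while low <= high:
        mid = (low + high) // 2          # middle page of the current range
        visited.append(mid)
        key = pages[mid - 1]
        if key == target:
            return mid, visited
        if sort_key(target) > sort_key(key):
            low = mid + 1                # continue in the pages after mid
        else:
            high = mid - 1               # continue in the pages before mid
    return None, visited


# One record per page, in the assumed key order.
pages = ["SA9", "SG5", "SG14", "SG37", "SL21", "SL41"]
print(binary_search(pages, "SG37"))
```

With this layout the search for SG37 reads pages 3, 5 and 4, compared with four page reads for a linear scan from the start of the file.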

Direct Files

  o Direct Files are also called Hash Files or Random Files
  o No need to write records sequentially
  o Use a hash function to calculate the number of the page (bucket) in which a record should be located
  o Eg., use the division-remainder calculation method: bucket_no = Record_key mod 3
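As a small illustration of the division-remainder method, the sketch below computes bucket numbers for the staff numbers used in these slides. The slides do not say how a staff number such as SG37 becomes a numeric Record_key; taking the digits of the staff number is an assumption made here.

```python
NUM_BUCKETS = 3


def bucket_no(sno):
    # Assumption: Record_key is the numeric part of the staff number.
    record_key = int("".join(ch for ch in sno if ch.isdigit()))
    return record_key % NUM_BUCKETS


for sno in ["SL21", "SG37", "SG14", "SA9", "SG5", "SL41"]:
    print(sno, "-> bucket", bucket_no(sno))
```

Under this assumption SG14, SG5 and SL41 all hash to bucket 2, so several keys can compete for the same bucket.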

  o Problem: If a new record SL41 is created, which bucket should it go to?
  o Collision Management
    • – Open addressing, Unchained overflow, Chained overflow, Multiple hashing

  Open Addressing

  o Upon a collision, the system performs a linear search to find the first available slot.
  o When the last bucket has been searched, the search starts again from the first bucket.
  o SL41 will be inserted to: Bucket 1
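The open addressing result above (SL41 ends up in bucket 1) can be reproduced with a short sketch. Two things are assumptions that the slides do not state: each bucket holds at most 2 records, and the hash key is the numeric part of the staff number. The function names are illustrative.

```python
NUM_BUCKETS = 3
SLOTS_PER_BUCKET = 2  # assumed bucket capacity


def bucket_no(sno):
    # Assumption: the hash key is the numeric part of the staff number.
    return int("".join(ch for ch in sno if ch.isdigit())) % NUM_BUCKETS


def insert_open_addressing(buckets, sno):
    """Insert sno and return the number of the bucket that received it."""
    home = bucket_no(sno)
    for step in range(NUM_BUCKETS):
        # Search linearly from the home bucket, wrapping round to bucket 0.
        candidate = (home + step) % NUM_BUCKETS
        if len(buckets[candidate]) < SLOTS_PER_BUCKET:
            buckets[candidate].append(sno)
            return candidate
    raise RuntimeError("no free slot in any bucket")


buckets = [[] for _ in range(NUM_BUCKETS)]
for sno in ["SL21", "SG37", "SG14", "SA9", "SG5"]:
    insert_open_addressing(buckets, sno)

placed = insert_open_addressing(buckets, "SL41")
print("SL41 stored in bucket", placed)
```

SL41 hashes to bucket 2, which is full; the search wraps to bucket 0, which is also full, and stops at bucket 1.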

  Unchained Overflow

  o An overflow area is maintained for collisions.
  o SL41 will be inserted to: Bucket 3

  Chained Overflow

  o Each bucket has a synonym pointer
  o Value of the synonym pointer:
    • – Zero: no collision occurred
    • – Non-zero: the overflow bucket used
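A minimal sketch of chained overflow with synonym pointers follows. It assumes primary buckets 0 to 2, an overflow area whose buckets are numbered from 3 upward, a capacity of 2 records per bucket, and the numeric part of the staff number as the hash key; none of these details are fixed by the slides, and the class and function names are hypothetical.

```python
NUM_BUCKETS = 3
SLOTS_PER_BUCKET = 2  # assumed bucket capacity


def bucket_no(sno):
    # Assumption: the hash key is the numeric part of the staff number.
    return int("".join(ch for ch in sno if ch.isdigit())) % NUM_BUCKETS


class Bucket:
    def __init__(self):
        self.records = []
        self.synonym = 0  # 0 = no collision; otherwise the overflow bucket number


def insert_chained(buckets, sno):
    """Insert sno and return the number of the bucket that received it."""
    current = bucket_no(sno)
    while True:
        bucket = buckets[current]
        if len(bucket.records) < SLOTS_PER_BUCKET:
            bucket.records.append(sno)
            return current
        if bucket.synonym == 0:
            # First collision on this bucket: allocate an overflow bucket.
            buckets.append(Bucket())
            bucket.synonym = len(buckets) - 1
        current = bucket.synonym  # follow the synonym pointer


def find_chained(buckets, sno):
    current = bucket_no(sno)
    while True:
        bucket = buckets[current]
        if sno in bucket.records:
            return current
        if bucket.synonym == 0:
            return None           # end of the chain, record not present
        current = bucket.synonym


buckets = [Bucket() for _ in range(NUM_BUCKETS)]
for sno in ["SL21", "SG37", "SG14", "SA9", "SG5"]:
    insert_chained(buckets, sno)
print("SL41 stored in bucket", insert_chained(buckets, "SL41"))
```

Because overflow buckets are numbered from 3, a synonym pointer of zero can safely mean "no collision", and a lookup only follows the chain of the home bucket instead of scanning the whole overflow area.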

  Multiple Hashing

  o Upon collision, apply a second hashing function to produce a new hash address in an overflow area.

Limitation (of Hashing)

  Inappropriate for some retrievals:

  • – Based on pattern matching, eg. Find all students with ID like 98xxxxxx.
  • – Involving ranges of values, eg. Find all students from 50100000 to 50199999.
  • – Based on a field other than the hash field

Indexes

  Index: A data structure that allows particular records in a file to be located more quickly ~ Index in a book

  TERMINOLOGY
  Data file: a file containing the logical records
  Index file: a file containing the index records
  Indexing field: the field used to order the index records
  Key: One or more fields which can uniquely identify a record (eg. No 2 students have the same student ID).

  An index can be sparse or dense:
  Sparse: record for only some of the search key values in the index file (eg. Staff Ids: CS001, EE001, MA001). Applicable to ordered data files only.
  Dense: record for every search key value. (eg. Staff Ids: CS001, CS002, .. CS089, EE001, EE002, ..)
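The difference between a sparse and a dense index can be made concrete with a small sketch. The page contents below are an assumed layout built from the staff IDs mentioned above; the sparse index keeps only the first key of each page, while the dense index keeps every key.

```python
from bisect import bisect_right

# Ordered data file, split into pages (assumed layout).
pages = [
    ["CS001", "CS002", "CS089"],
    ["EE001", "EE002"],
    ["MA001"],
]

# Sparse index: one entry per page (the first key on that page).
sparse_index = [page[0] for page in pages]

# Dense index: one entry per search key value.
dense_index = {key: page_no for page_no, page in enumerate(pages) for key in page}


def lookup_sparse(key):
    # Find the last index entry <= key, then scan that single page.
    pos = bisect_right(sparse_index, key) - 1
    if pos < 0:
        return None
    return pos if key in pages[pos] else None


def lookup_dense(key):
    # Every key has its own index entry, so no page scan is needed.
    return dense_index.get(key)


print(lookup_sparse("CS089"), lookup_dense("CS089"))
```

The sparse lookup only works because the data file is ordered: the index entry tells us which page could hold the key, and the page itself must then be searched.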

  TYPES OF INDEXES
  Primary Index: An index ordered in the same way as the data file, which is sequentially ordered according to a key. (The indexing field is equal to this key.)
  Secondary Index: An index that is defined on a non-ordering field of the data file. (The indexing field need not contain unique values).

  A data file can be associated with at most one primary index plus several secondary indexes

Indexed Sequential Files

  What are Indexed Sequential Files? = A sorted data file with a primary index

  Advantage of an Indexed Sequential File
  Allows both sequential processing and individual record retrieval through the index.

  Structure of an Indexed Sequential File
  o A primary storage area
  o A separate index or indexes
  o An overflow area
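To contrast the primary and secondary index types described above, here is a small sketch over the Staff records used earlier in these slides. The data file is assumed to be ordered on Sno; the primary index is built on Sno and the secondary index on Position, a non-ordering field whose values are not unique. The in-memory dictionaries stand in for index files and are purely illustrative.

```python
from collections import defaultdict

# Data file ordered on Sno (assumed order); each record is (Sno, Lname, Position).
data_file = [
    ("SA9", "Howe", "Assistant"),
    ("SG5", "Brand", "Manager"),
    ("SG14", "Ford", "Deputy"),
    ("SG37", "Beech", "Snr Asst"),
    ("SL21", "White", "Manager"),
    ("SL41", "Lee", "Assistant"),
]

# Primary index on the ordering key: each Sno maps to exactly one record.
primary_index = {sno: pos for pos, (sno, _, _) in enumerate(data_file)}

# Secondary index on a non-ordering field: one value may map to many records.
secondary_index = defaultdict(list)
for pos, (_, _, position) in enumerate(data_file):
    secondary_index[position].append(pos)


def find_by_sno(sno):
    pos = primary_index.get(sno)
    return None if pos is None else data_file[pos]


def find_by_position(position):
    return [data_file[pos] for pos in secondary_index.get(position, [])]


print(find_by_sno("SG37"))
print(find_by_position("Manager"))
```

The records can still be processed sequentially by reading `data_file` in order, while either index gives direct access to individual records.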

B+-Trees

  In a B+-Tree, data or indexes are stored in a hierarchy of nodes

  o B => Balanced
  o Consistent access time (for each access, the same number of nodes is searched)

  TERMINOLOGY
  Degree (Order): The maximum number of children allowed per parent.
  Depth: The maximum number of levels between the root node and a leaf node in the tree.

  [Figure: B+-Tree; label: Point to data]

  In practice, each node in the tree is actually a page, so we can store many pointers and keys. Eg. For a page size of 4KB, the B+-Tree can be of order 512.

  Access time depends more often upon depth than on breadth => Shallow trees are preferred.

  RULES

  o The root (if not a leaf node) must have at least 2 children
  o For a tree of order n, each node (except root and leaf) must have between n/2 and n pointers and children. If n/2 is not an integer, the result is rounded up.
  o For a tree of order n, the number of key values in a leaf node must be between (n-1)/2 and (n-1). If (n-1)/2 is not an integer, the result is rounded up.
  o The number of key values contained in a nonleaf node is 1 less than the number of pointers.
  o The tree must always be balanced: every path from the root node to a leaf must have the same length.
  o Leaf nodes are linked in order of key values.
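The occupancy rules above reduce to a few lines of arithmetic. The sketch below evaluates them for a given order and also estimates how many key values a full tree can hold; the function names are illustrative, and `max_leaf_keys` assumes every node is completely full and counts levels including both the root and the leaves.

```python
from math import ceil


def nonleaf_pointer_bounds(order):
    # Nodes other than the root and leaves: between n/2 (rounded up) and n pointers.
    return ceil(order / 2), order


def nonleaf_key_bounds(order):
    # A nonleaf node holds one key fewer than it has pointers.
    low, high = nonleaf_pointer_bounds(order)
    return low - 1, high - 1


def leaf_key_bounds(order):
    # Leaf nodes: between (n-1)/2 (rounded up) and n-1 key values.
    return ceil((order - 1) / 2), order - 1


def max_leaf_keys(order, levels):
    # Assumption: a completely full tree; `levels` counts root and leaf levels.
    return order ** (levels - 1) * (order - 1)


print(nonleaf_pointer_bounds(512), leaf_key_bounds(512))
print(max_leaf_keys(512, 3))
```

For order 512, a full tree with only three levels already holds over 133 million key values in its leaves, which is why shallow, wide trees keep the number of page reads per access small.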

  Balancing can be costly to maintain.

  [Figures: Example: Adding SG14; Example: Adding SA9 (two slides)]

Summary

  o Basic concepts (Files, Records, Fields)
  o Primary storage vs secondary storage
  o Logical record vs physical record
  o File Organization (and access methods)
    • Heap files
    • Ordered Files (Binary Search)
    • Direct Files (Hashing)
    • Indexes
    • Indexed Sequential Files
    • B+-Trees