Some General Issues Concerning Indexing
18.6 Some General Issues Concerning Indexing
18.6.1 Logical versus Physical Indexes
In the earlier discussion, we have assumed that the index entries <K, Pr> (or <K, P>) always include a physical pointer Pr (or P) that specifies the physical record address on disk as a block number and offset. This is sometimes called a physical index, and it has the disadvantage that the pointer must be changed if the record is moved to another disk location. For example, suppose that a primary file organization is based on linear hashing or extendible hashing; then, each time a bucket is split, some records are allocated to new buckets and hence have new physical addresses. If there was a secondary index on the file, the pointers to those records would have to be found and updated, which is a difficult task.
To remedy this situation, we can use a structure called a logical index, whose index entries are of the form <K, Kp>. Each entry has one value K for the secondary indexing field matched with the value Kp of the field used for the primary file organization. By searching the secondary index on the value of K, a program can locate the corresponding value of Kp and use this to access the record through the primary file organization. Logical indexes thus introduce an additional level of indirection between the access structure and the data. They are used when physical record addresses are expected to change frequently. The cost of this indirection is the extra search based on the primary file organization.
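The difference between the two kinds of entries can be illustrated with a minimal Python sketch. All names here (`disk`, `primary`, `physical_index`, `logical_index`, `move_record`) and the sample data are hypothetical; plain dictionaries stand in for the disk, the primary file organization, and the secondary index.

```python
# "Disk": physical address (block, offset) -> record.
disk = {
    (0, 0): {"ssn": "111", "dept": "Research"},
    (0, 1): {"ssn": "222", "dept": "Admin"},
    (1, 0): {"ssn": "333", "dept": "Research"},
}

# Primary file organization on ssn (the field Kp): Kp -> physical address.
primary = {"111": (0, 0), "222": (0, 1), "333": (1, 0)}

# Secondary index on dept as a physical index: entries <K, P>.
physical_index = {"Research": [(0, 0), (1, 0)], "Admin": [(0, 1)]}

# The same secondary index as a logical index: entries <K, Kp>.
logical_index = {"Research": ["111", "333"], "Admin": ["222"]}


def lookup_physical(k):
    """Follow stored physical pointers; a stale pointer yields None."""
    return [disk.get(p) for p in physical_index.get(k, [])]


def lookup_logical(k):
    """Find Kp in the secondary index, then search the primary organization."""
    return [disk[primary[kp]] for kp in logical_index.get(k, [])]


def move_record(kp, new_addr):
    """Simulate a bucket split: relocate a record and update only the
    primary organization, leaving both secondary indexes untouched."""
    disk[new_addr] = disk.pop(primary[kp])
    primary[kp] = new_addr


move_record("333", (2, 0))
via_logical = lookup_logical("Research")    # still finds both records
via_physical = lookup_physical("Research")  # second pointer is now stale
```

After the move, the logical index still reaches both records at the price of one extra lookup in `primary`, while the physical index holds a pointer that would have to be found and repaired.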
18.6.2 Discussion
In many systems, an index is not an integral part of the data file but can be created and discarded dynamically. That is why it is often called an access structure. Whenever we expect to access a file frequently based on some search condition involving a particular field, we can request the DBMS to create an index on that field. Usually, a secondary index is created to avoid physical ordering of the records in the data file on disk.
The main advantage of secondary indexes is that, theoretically at least, they can be created in conjunction with virtually any primary record organization. Hence, a secondary index could be used to complement other primary access methods such as ordering or hashing, or it could even be used with mixed files. To create a B+-tree secondary index on some field of a file, we must go through all records in the file to create the entries at the leaf level of the tree. These entries are then sorted and filled according to the specified fill factor; simultaneously, the other index levels are created. It is more expensive and much harder to create primary indexes and clustering indexes dynamically, because the records of the data file must be physically sorted on disk in order of the indexing field. However, some systems allow users to create these indexes dynamically on their files by sorting the file during index creation.
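The creation procedure for a B+-tree secondary index can be sketched roughly as follows. This is a simplified illustration, not a production bulk loader: the function name `bulk_load`, its parameters, and the sample records are assumptions, nodes are plain Python lists, record pointers are list positions, and the upper levels are built one after another rather than simultaneously with the leaves.

```python
import math


def bulk_load(records, field, leaf_capacity=4, fanout=4, fill_factor=0.75):
    """Build the levels of a B+-tree-like index bottom-up.

    Returns a list of levels: levels[0] is the leaf level and
    levels[-1] is the root level.
    """
    # 1. Go through all records and create <key, record pointer> entries.
    entries = sorted((rec[field], rid) for rid, rec in enumerate(records))

    # 2. Pack the sorted entries into leaves, filling each leaf only up to
    #    the fill factor so later insertions have room.
    per_leaf = max(1, math.floor(leaf_capacity * fill_factor))
    leaves = [entries[i:i + per_leaf] for i in range(0, len(entries), per_leaf)]
    levels = [leaves]

    # 3. Build the upper levels: each internal entry pairs the smallest key
    #    of a child node with that child's position in the level below.
    per_node = max(2, math.floor(fanout * fill_factor))
    child_keys = [leaf[0][0] for leaf in leaves]
    while len(child_keys) > 1:
        nodes = [
            [(child_keys[j], j) for j in range(i, min(i + per_node, len(child_keys)))]
            for i in range(0, len(child_keys), per_node)
        ]
        levels.append(nodes)
        child_keys = [node[0][0] for node in nodes]
    return levels


# Hypothetical sample file with ten records.
records = [{"salary": s} for s in [50, 20, 40, 10, 30, 70, 60, 90, 80, 100]]
levels = bulk_load(records, "salary")
```

With a leaf capacity of 4 and a fill factor of 0.75, each leaf receives three entries, so the ten records produce four leaves, two internal nodes above them, and a single root.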
It is common to use an index to enforce a key constraint on an attribute. While searching the index to insert a new record, it is straightforward to check at the same time whether another record in the file (and hence in the index tree) has the same key attribute value as the new record. If so, the insertion can be rejected.
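A minimal sketch of this check is shown below, assuming a sorted Python list searched with `bisect` as a stand-in for the index tree. The class name `UniqueIndex` and its methods are hypothetical.

```python
import bisect


class UniqueIndex:
    """Sorted-list stand-in for an index that enforces a key constraint."""

    def __init__(self):
        self._keys = []  # key values, kept sorted
        self._ptrs = []  # record pointers, parallel to _keys

    def insert(self, key, ptr):
        # The search that locates the insertion point also reveals
        # whether the key is already present.
        pos = bisect.bisect_left(self._keys, key)
        if pos < len(self._keys) and self._keys[pos] == key:
            raise KeyError(f"duplicate key {key!r}: insertion rejected")
        self._keys.insert(pos, key)
        self._ptrs.insert(pos, ptr)

    def keys(self):
        return list(self._keys)


idx = UniqueIndex()
idx.insert("222", 0)
idx.insert("111", 1)
try:
    idx.insert("222", 2)  # violates the key constraint
    rejected = False
except KeyError:
    rejected = True
```

The duplicate test costs nothing beyond the search that the insertion needs anyway, which is why an index is a natural place to enforce uniqueness.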
If an index is created on a nonkey field, duplicates occur; handling of these duplicates is an issue the DBMS product vendors have to deal with and affects data storage as well as index creation and management. Data records for the duplicate key may be contained in the same block or may span multiple blocks where many duplicates are possible. Some systems add a row id to the record so that records with duplicate keys have their own unique identifiers. In such cases, the B+-tree index may regard a <key, Row_id> combination as the de facto key for the index, turning the index into a unique index with no duplicates. The deletion of a key K from such an index would involve deleting all occurrences of that key K; hence the deletion algorithm has to account for this.
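The composite-key idea can be sketched with a sorted list of `(key, row_id)` tuples standing in for the leaf level of the index. The function names and sample data are hypothetical; the point is that all entries for one key K are adjacent, so deleting K becomes a range deletion.

```python
import bisect


def insert_entry(index, key, row_id):
    """Insert a <key, row_id> entry; the pair is unique even if key is not."""
    bisect.insort(index, (key, row_id))


def delete_key(index, key):
    """Delete every occurrence of key and return the affected row ids."""
    # (key,) sorts before any (key, row_id), so this finds the first entry.
    lo = bisect.bisect_left(index, (key,))
    hi = lo
    while hi < len(index) and index[hi][0] == key:
        hi += 1
    removed = [row_id for _, row_id in index[lo:hi]]
    del index[lo:hi]
    return removed


dept_index = []
for key, row_id in [("Research", 3), ("Admin", 1), ("Research", 1),
                    ("Research", 7), ("Sales", 2)]:
    insert_entry(dept_index, key, row_id)

removed_rows = delete_key(dept_index, "Research")
```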
In actual DBMS products, deletion from B+-tree indexes is also handled in various ways to improve performance and response times. Deleted records may simply be marked as deleted, and the corresponding index entries may be left in place until a garbage collection process reclaims the space in the data file; the index is rebuilt online after garbage collection.
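One way to picture this deferred deletion is the sketch below. It is an illustration under simplifying assumptions, not a description of any particular product: the class `LazyFile` is hypothetical, a dictionary stands in for the B+-tree, and the rebuild happens in one step rather than online.

```python
class LazyFile:
    """Data file with deletion markers and a deferred index rebuild."""

    def __init__(self):
        self.records = []  # each slot is [key, payload, deleted_flag]
        self.index = {}    # key -> slot number (stand-in for a B+-tree)

    def insert(self, key, payload):
        self.index[key] = len(self.records)
        self.records.append([key, payload, False])

    def delete(self, key):
        # Only mark the record; the index entry is not removed yet.
        self.records[self.index[key]][2] = True

    def lookup(self, key):
        slot = self.index.get(key)
        if slot is None or self.records[slot][2]:
            return None  # absent, or present but marked as deleted
        return self.records[slot][1]

    def garbage_collect(self):
        # Reclaim the space of marked records, then rebuild the index.
        self.records = [r for r in self.records if not r[2]]
        self.index = {r[0]: slot for slot, r in enumerate(self.records)}


f = LazyFile()
for key in ["a", "b", "c"]:
    f.insert(key, key.upper())
f.delete("b")
stale_entry_present = "b" in f.index  # index entry survives the delete
found_before_gc = f.lookup("b")       # but lookups honor the marker
f.garbage_collect()
```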
A file that has a secondary index on every one of its fields is often called a fully inverted file. Because all indexes are secondary, new records are inserted at the end of the file; therefore, the data file itself is an unordered (heap) file. The indexes are usually implemented as B+-trees, so they are updated dynamically to reflect insertion or deletion of records. Some commercial DBMSs, such as Software AG’s Adabas, use this method extensively.
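As a rough illustration, a fully inverted file can be modeled as a heap list plus one secondary index per field. The class `FullyInvertedFile` is hypothetical, and dictionaries stand in for the B+-trees mentioned above.

```python
from collections import defaultdict


class FullyInvertedFile:
    """Heap file with a secondary index on every field."""

    def __init__(self, fields):
        self.heap = []  # records in arrival order (unordered file)
        self.indexes = {f: defaultdict(list) for f in fields}

    def insert(self, record):
        rid = len(self.heap)
        self.heap.append(record)  # new records go at the end of the file
        for field, index in self.indexes.items():
            index[record[field]].append(rid)  # update every index
        return rid

    def find(self, field, value):
        return [self.heap[rid] for rid in self.indexes[field].get(value, [])]


emp = FullyInvertedFile(["name", "dept"])
emp.insert({"name": "Smith", "dept": "Research"})
emp.insert({"name": "Wong", "dept": "Admin"})
emp.insert({"name": "Zelaya", "dept": "Research"})
research_names = [r["name"] for r in emp.find("dept", "Research")]
```

Every insertion touches all of the indexes, which is the maintenance cost paid for being able to search on any field.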
We referred to the popular IBM file organization called ISAM in Section 18.2. Another IBM method, the virtual storage access method (VSAM), is somewhat similar to the B+-tree access structure and is still being used in many commercial systems.
18.6.3 Column-Based Storage of Relations
There has been a recent trend to consider a column-based storage of relations as an alternative to the traditional way of storing relations row by row. Commercial relational DBMSs have offered B+-tree indexing on primary as well as secondary keys as an efficient mechanism to support access to data by various search criteria and the ability to write a row or a set of rows to disk at a time to produce write-optimized systems. For data warehouses (to be discussed in Chapter 29), which are read-only databases, the column-based storage offers particular advantages for read-only queries. Typically, the column-store RDBMSs consider storing each column of data individually and afford performance advantages in the following areas:
Vertically partitioning the table column by column, so that a two-column table can be constructed for every attribute and thus only the needed columns can be accessed
Use of column-wise indexes (similar to the bitmap indexes discussed in Section 18.5.2) and join indexes on multiple tables to answer queries without having to access the data tables
Use of materialized views (see Chapter 5) to support queries on multiple columns
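The first of these areas, vertical partitioning, can be sketched in a few lines of Python. The function `to_columns`, the `id` attribute used as a row identifier, and the sample rows are assumptions made for illustration.

```python
rows = [
    {"id": 1, "name": "Smith", "dept": "Research", "salary": 30000},
    {"id": 2, "name": "Wong", "dept": "Research", "salary": 40000},
    {"id": 3, "name": "Zelaya", "dept": "Admin", "salary": 25000},
]


def to_columns(rows, key="id"):
    """Build one two-column table (row id, value) per non-key attribute."""
    columns = {}
    for attr in rows[0]:
        if attr != key:
            columns[attr] = [(r[key], r[attr]) for r in rows]
    return columns


columns = to_columns(rows)

# A query on salary reads only the salary column, never name or dept.
total_salary = sum(value for _, value in columns["salary"])
```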
Column-wise storage of data affords additional freedom in the creation of indexes, such as the bitmap indexes discussed earlier. The same column may be present in multiple projections of a table and indexes may be created on each projection. To store the values in the same column, strategies for data compression, null-value suppression, dictionary encoding techniques (where distinct values in the column are assigned shorter codes), and run-length encoding techniques have been devised. MonetDB/X100, C-Store, and Vertica are examples of such systems. Further discussion on column-store DBMSs can be found in the references mentioned in this chapter’s Selected Bibliography.
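Two of the encoding techniques named above, dictionary encoding and run-length encoding, can be sketched as follows. The function names and the sample column are hypothetical, and real column stores use far more compact physical layouts than Python lists.

```python
from itertools import groupby


def dictionary_encode(column):
    """Assign each distinct value a short integer code, in order of first use."""
    codes = {}
    encoded = [codes.setdefault(value, len(codes)) for value in column]
    return codes, encoded


def run_length_encode(values):
    """Collapse each run of equal adjacent values into (value, run length)."""
    return [(value, sum(1 for _ in run)) for value, run in groupby(values)]


def run_length_decode(runs):
    """Expand (value, run length) pairs back into the original sequence."""
    return [value for value, count in runs for _ in range(count)]


dept_column = ["Research", "Research", "Research", "Admin", "Admin", "Research"]
codes, encoded = dictionary_encode(dept_column)
runs = run_length_encode(encoded)
```

The two techniques compose naturally: the dictionary replaces long strings with small codes, and run-length encoding then shrinks the repeated codes that arise when a column is sorted or has few distinct values.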