
  

Data-Centric Systems and Applications

Series editors

  M.J. Carey S. Ceri

  Editorial Board

  A. Ailamaki S. Babu P. Bernstein J.C. Freytag

  A. Halevy J. Han

  D. Kossmann

  I. Manolescu

  G. Weikum K.-Y. Whang J.X. Yu

  


  Minos Garofalakis Johannes Gehrke Rajeev Rastogi Editors

Data Stream Management

Processing High-Speed Data Streams

Editors

Minos Garofalakis
School of Electrical and Computer Engineering
Technical University of Crete
Chania, Greece

Johannes Gehrke
Microsoft Corporation
Redmond, WA, USA

Rajeev Rastogi
Amazon India
Bangalore, India

ISSN 2197-9723
ISSN 2197-974X (electronic)
Data-Centric Systems and Applications
ISBN 978-3-540-28607-3
ISBN 978-3-540-28608-0 (eBook)
DOI 10.1007/978-3-540-28608-0
Library of Congress Control Number: 2016946344
Springer Heidelberg New York Dordrecht London
© Springer-Verlag Berlin Heidelberg 2016

The fourth chapter in Part IV is published with kind permission of © 2004 Association for Computing Machinery, Inc. All rights reserved.

This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, express or implied, with respect to the material contained herein or for any errors or omissions that may have been made.

Printed on acid-free paper

Contents

Data Stream Management: A Brave New World
  Minos Garofalakis, Johannes Gehrke, and Rajeev Rastogi

Part I Foundations and Basic Stream Synopses

Data-Stream Sampling: Basic Techniques and Results
  Peter J. Haas

Quantiles and Equi-depth Histograms over Streams
  Michael B. Greenwald and Sanjeev Khanna

Join Sizes, Frequency Moments, and Applications

Top-k Frequent Item Maintenance over Streams
  Moses Charikar

Distinct-Values Estimation over Data Streams
  Phillip B. Gibbons

The Sliding-Window Computation Model and Results
  Mayur Datar and Rajeev Motwani

Part II Mining Data Streams

Clustering Data Streams

Temporal Dynamics of On-Line Information Streams
  Jon Kleinberg

Part III Advanced Topics

Sketch-Based Multi-Query Processing over Data Streams
  Alin Dobra, Minos Garofalakis, Johannes Gehrke, and Rajeev Rastogi

Approximate Histogram and Wavelet Summaries of Streaming Data
  S. Muthukrishnan and Martin Strauss

Stable Distributions in Streaming Computations
  Graham Cormode and Piotr Indyk

Tracking Queries over Distributed Streams
  Minos Garofalakis

Part IV System Architectures and Languages

STREAM: The Stanford Data Stream Management System
  N. Laptev, B. Mozafari, H. Mousavi, H. Thakkar, H. Wang, K. Zeng, and Carlo Zaniolo

Hancock: A Language for Analyzing Transactional Data Streams
  Corinna Cortes, Kathleen Fisher, Daryl Pregibon, Anne Rogers, and Frederick Smith

Sensor Network Integration with Streaming Database Systems
  Daniel Abadi, Samuel Madden, and Wolfgang Lindner

Part V Applications

Stream Processing Techniques for Network Management

Adaptive, Automatic Stream Mining
  Spiros Papadimitriou, Anthony Brockwell, and Christos Faloutsos

Conclusions and Looking Forward
  Minos Garofalakis, Johannes Gehrke, and Rajeev Rastogi

  

Data Stream Management: A Brave New World

Minos Garofalakis, Johannes Gehrke, and Rajeev Rastogi

1 Introduction

Traditional data-management systems software is built on the concept of persistent data sets that are stored reliably in stable storage and queried/updated several times throughout their lifetime. For several emerging application domains, however, data arrives and needs to be processed on a continuous (24 × 7) basis, without the benefit of several passes over a static, persistent data image. Such continuous data streams arise naturally, for example, in the network installations of large Telecom and Internet service providers, where detailed usage information (Call-Detail-Records (CDRs), SNMP/RMON packet-flow data, etc.) from different parts of the underlying network needs to be continuously collected and analyzed for interesting trends. Other applications that generate rapid, continuous and large volumes of stream data include transactions in retail chains, ATM and credit card operations in banks, financial tickers, Web server log records, etc. In most such applications, the data stream is actually accumulated and archived in a database-management system of a (perhaps, off-site) data warehouse, often making access to the archived data prohibitively expensive. Further, the ability to make decisions and infer interesting patterns on-line (i.e., as the data stream arrives) is crucial for several mission-critical tasks that can have significant dollar value for a large corporation (e.g., telecom fraud detection). As a result, recent years have witnessed an increasing interest in designing data-processing algorithms that work over continuous data streams, i.e., algorithms that provide results to user queries while looking at the relevant data items only once and in a fixed order (determined by the stream-arrival pattern).

Fig. 1 ISP network monitoring data streams

  

Example 1 (Application: ISP Network Monitoring) To effectively manage the operation of their IP-network services, large Internet Service Providers (ISPs), like AT&T and Sprint, continuously monitor the operation of their networking infrastructure at dedicated Network Operations Centers (NOCs). This is truly a large-scale monitoring task that relies on continuously collecting streams of usage information from hundreds of routers, thousands of links and interfaces, and blisteringly-fast sets of events at different layers of the network infrastructure (ranging from fiber-cable utilizations to packet forwarding at routers, to VPNs and higher-level transport constructs). These data streams can be generated through a variety of network-monitoring tools (e.g., Cisco's NetFlow [10]). Figure 1 depicts an example ISP monitoring setup, with an NOC tracking NetFlow measurement streams from four edge routers in the network, R1–R4. The figure also depicts a small fragment of the streaming data tables retrieved from routers R1 and R2, containing simple summary information for IP sessions. In real life, such streams are truly massive, comprising hundreds of attributes and billions of records—for instance, AT&T collects over one terabyte of NetFlow measurement data from its production network each day!

Typically, this measurement data is periodically shipped off to a backend data warehouse for off-line analysis (e.g., at the end of the day). Unfortunately, such off-line analyses are painfully inadequate when it comes to critical network-management tasks, where reaction in (near) real-time is absolutely essential. Such tasks include, for instance, detecting malicious/fraudulent users, DDoS attacks, or Service-Level Agreement (SLA) violations, as well as real-time traffic engineering. For such tasks, it is crucial to process and analyze these continuous network-measurement streams in real time and in a single pass over the data (as it is streaming into the NOC), while, of course, remaining within the resource (e.g., CPU and memory) constraints of the NOC. (Recall that these data streams are truly massive, and there may be hundreds or thousands of analysis queries to be executed over them.)

  This volume focuses on the theory and practice of data stream management, and the difficult, novel challenges this emerging domain introduces for data- management systems. The collection of chapters (contributed by authorities in the field) offers a comprehensive introduction to both the algorithmic/theoretical foun- dations of data streams and the streaming systems/applications built in different domains. In the remainder of this introductory chapter, we provide a brief summary of some basic data streaming concepts and models, and discuss the key elements of a generic stream query processing architecture. We then give a short overview of the contents of this volume.

2 Basic Stream Processing Models

When dealing with structured, tuple-based data streams (as in Example 1), the streaming data can essentially be seen as rendering massive relational table(s) through a continuous stream of updates (that, in general, can comprise both insertions and deletions). Thus, the processing operations users would want to perform over continuous data streams naturally parallel those in conventional database, OLAP, and data-mining systems. Such operations include, for instance, relational selections, projections, and joins, GROUP-BY aggregates and multi-dimensional data analyses, and various pattern discovery and analysis techniques. For several of these data manipulations, the high-volume and continuous (potentially, unbounded) nature of real-life data streams introduces novel, difficult challenges which are not addressed in current data-management architectures. And, of course, such challenges are further exacerbated by the typical user/application requirements for continuous, near real-time results for stream operations. As a concrete example, consider some example queries that a network administrator may want to support over the ISP monitoring architecture depicted in Fig. 1.

• To analyze frequent traffic patterns and detect potential Denial-of-Service (DoS) attacks, an example analysis query could be: Q1: "What are the top-100 most frequent IP (source, destination) pairs observed at router R1 over the past week?" This is an instance of a top-k (or, "heavy-hitters") query—viewing the R1 stream as a (dynamic) relational table, it can be expressed using the standard SQL query language as follows:

Q1: SELECT   ip_source, ip_dest, COUNT(*) AS frequency
    FROM     R1
    GROUP BY ip_source, ip_dest
    ORDER BY COUNT(*) DESC
    LIMIT    100


• To correlate traffic patterns across different routers (e.g., for the purpose of dynamic packet routing or traffic load balancing), example queries might include: Q2: "How many distinct IP (source, destination) pairs have been seen by both R1 and R2, but not R3?", and Q3: "Count the number of session pairs in R1 and R2 where the source-IP in R1 is the same as the destination-IP in R2." Q2 and Q3 are examples of (multi-table) set-expression and join-aggregate queries, respectively; again, they can both be expressed in standard SQL terms over the R1–R3 tables:

  Q2: SELECT COUNT(*) FROM ((SELECT DISTINCT ip_source, ip_dest FROM R1

  INTERSECT

SELECT DISTINCT ip_source, ip_dest FROM R2

) EXCEPT

SELECT DISTINCT ip_source, ip_dest FROM R3)

  Q3: SELECT COUNT(*) FROM R1, R2 WHERE R1.ip_source = R2.ip_dest

  A data-stream processing engine turns the paradigm of conventional database systems on its head: Databases typically have to deal with a stream of queries over a static, bounded data set; instead, a stream processing engine has to effectively process a static set of queries over continuous streams of data. Such stream queries can be (i) continuous, implying the need for continuous, real-time monitoring of the query answer over the changing stream, or (ii) ad-hoc query processing requests interspersed with the updates to the stream. The high data rates of streaming data might outstrip processing resources (both CPU and memory) on a steady or intermit- tent (i.e., bursty) basis; in addition, coupled with the requirement for near real-time results, they typically render access to secondary (disk) storage completely infeasi- ble.

  In the remainder of this section, we briefly outline some key data-stream man- agement concepts and discuss basic stream-processing models.

2.1 Data Streaming Models

An equivalent view of a relational data stream is that of a massive, dynamic, one-dimensional vector A[1...N]—this is essentially the multi-dimensional frequency distribution of the stream linearized into a single array using standard techniques (e.g., row- or column-major ordering). As a concrete example, Fig. 2 depicts the stream vector A for the problem of monitoring active IP network connections between source/destination IP addresses. The specific dynamic vector has 2^64 entries, capturing the up-to-date frequencies for specific (source, destination) pairs observed in IP connections that are currently active. The size N of the streaming vector A is defined as the product of the attribute domain size(s), which can easily grow very large, especially for multi-attribute relations. (Note that streaming algorithms typically do not require a priori knowledge of N.) The dynamic vector A is rendered through a continuous stream of updates, where the j-th update has the general form ⟨k, c[j]⟩ and effectively modifies the k-th entry of A with the operation A[k] ← A[k] + c[j].

Fig. 2 Example dynamic vector modeling streaming network data

We can define three generic data streaming models [9] based on the nature of these updates:

• Time-Series Model. In this model, the j-th update is ⟨j, A[j]⟩ and updates arrive in increasing order of j; in other words, we observe the entries of the streaming vector A by increasing index. This naturally models time-series data streams, such as the series of measurements from a temperature sensor or the volume of NASDAQ stock trades over time. Note that this model poses a severe limitation on the update stream, essentially prohibiting updates from changing past (lower-index) entries in A.
• Cash-Register Model. Here, the only restriction we impose on the j-th update ⟨k, c[j]⟩ is that c[j] ≥ 0; in other words, we only allow increments to the entries of A but, unlike the Time-Series model, multiple updates can increment a given entry of A over the stream. This is a natural model for streams where data is just inserted/accumulated over time, such as streams monitoring the total packets exchanged between two IP addresses or the collection of IP addresses accessing a web server. In the relational case, a Cash-Register stream naturally captures the case of an append-only relational table, which is quite common in practice (e.g., the fact table in a data warehouse [1]).
• Turnstile Model. In this, most general, streaming model, no restriction is imposed on the j-th update ⟨k, c[j]⟩, so that c[j] can be either positive or negative; thus, we have a fully dynamic situation, where items can be continuously inserted and deleted from the stream. For instance, note that our example stream for monitoring active IP network connections (Fig. 2) is a Turnstile stream, as connections can be initiated or terminated between any pair of addresses at any point in the stream. (A technical constraint often imposed in this case is that A[k] ≥ 0 always holds for every entry—this is referred to as the strict Turnstile model [9].) A minimal code sketch of the three update models appears right after this list.
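The following minimal Python sketch (an illustration added here, not part of the original text; the class name StreamVector, the model strings, and the use of a (source, destination) pair in place of a linearized index k are all choices made for this example) maintains the vector A as a sparse dictionary and checks the restriction each model places on an update ⟨k, c⟩.

    from collections import defaultdict

    class StreamVector:
        """Sparse view of the streaming vector A under the three update models."""

        def __init__(self, model):
            assert model in ("time-series", "cash-register", "turnstile")
            self.model = model
            self.A = defaultdict(int)   # only non-zero entries are materialized
            self.last_index = 0         # used by the time-series model

        def update(self, k, c):
            if self.model == "time-series":
                # Entries arrive in increasing index order; past entries never change.
                assert k > self.last_index, "time-series updates must move forward"
                self.last_index = k
            elif self.model == "cash-register":
                # Only increments are allowed, but the same entry may be hit many times.
                assert c >= 0, "cash-register updates must be non-negative"
            # Turnstile: no restriction; c may be negative (deletions).
            self.A[k] += c
            if self.model == "turnstile" and self.A[k] < 0:
                # The *strict* turnstile model additionally requires A[k] >= 0 at all times.
                raise ValueError("entry became negative: not a strict turnstile stream")

    # Example: the IP-connection monitor of Fig. 2 is a turnstile stream.
    ip_counts = StreamVector("turnstile")
    ip_counts.update(("10.1.0.7", "12.30.1.5"), +1)   # connection opened
    ip_counts.update(("10.1.0.7", "12.30.1.5"), -1)   # connection closed

Note that this exact, dictionary-based representation uses space proportional to the number of non-zero entries of A; the synopses discussed later exist precisely because that is too much for massive streams.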

  

The above streaming models are obviously given in increasing order of generality: algorithmic techniques designed for the most general, Turnstile model are, thus, also applicable in the other two models. On the other hand, the weaker streaming models rely on assumptions that can be valid in certain application scenarios, and often allow for more efficient algorithmic solutions in cases where Turnstile solutions are inefficient and/or provably hard.

Our generic goal in designing data-stream processing algorithms is to compute functions (or, queries) on the vector A at different points during the lifetime of the stream (continuous or ad-hoc). For instance, it is not difficult to see that the example queries Q1–Q3 mentioned earlier in this section can be trivially computed over stream vectors similar to that depicted in Fig. 2, assuming that the complete vector(s) are available; similarly, other types of processing (e.g., data mining) can be easily carried out over the full frequency vector(s) using existing algorithms. This, however, is an unrealistic assumption in the data-streaming setting: The main challenge in the streaming model of query computation is that the size of the stream vector, N, is typically huge, making it impractical (or, even infeasible) to store or make multiple passes over the entire stream. The typical requirement for such stream processing algorithms is that they operate in small space and small time, where "space" refers to the working space (or, state) maintained by the algorithm and "time" refers to both the processing time per update (e.g., to appropriately modify the state of the algorithm) and the query-processing time (to compute the current query answer). Furthermore, "small" is understood to mean a quantity significantly smaller than Θ(N) (typically, poly-logarithmic in N).

2.2 Incorporating Recency: Time-Decayed and Windowed Streams

Streaming data naturally carries a temporal dimension and a notion of "time". The conventional data streaming model discussed thus far (often referred to as landmark streams) assumes that the streaming computation begins at a well defined starting point t0 (at which the streaming vector is initialized to all zeros), and at any time t takes into account all streaming updates between t0 and t. In many applications, however, it is important to be able to downgrade the importance (or, weight) of older items in the streaming computation. For instance, in the statistical analysis of trends or patterns over financial data streams, data that is more than a few weeks old might naturally be considered "stale" and irrelevant. Various time-decay models have been proposed for streaming data, with the key differentiation lying in the relationship between an update's weight and its age (e.g., exponential or polynomial decay [3]). The sliding-window model [6] is one of the most prominent and intuitive time-decay models; it essentially considers only a window of the most recent updates seen in the stream thus far—updates outside the window are automatically "aged out" (e.g., given a weight of zero). The definition of the window itself can be either time-based (e.g., updates seen over the last W time units) or count-based (e.g., the last W updates). The key limiting factor in this streaming model is, naturally, the size of the window W: the goal is to design query processing techniques whose space/time requirements are significantly sublinear in (ideally, poly-logarithmic in) the window size W.
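As a simple illustration (added for this edition of the text and not part of the original; the class name and parameters are made up), the following Python snippet maintains a count-based sliding-window sum exactly by buffering the last W items. Its O(W) space is precisely the problem the sliding-window literature attacks: the techniques surveyed in the Datar–Motwani chapter approximate such window aggregates in space poly-logarithmic in W.

    from collections import deque

    class SlidingWindowSum:
        """Exact sum over the last W stream items (count-based window). Space: O(W)."""

        def __init__(self, window_size):
            self.W = window_size
            self.buffer = deque()
            self.total = 0

        def observe(self, value):
            self.buffer.append(value)
            self.total += value
            if len(self.buffer) > self.W:
                # The oldest item falls outside the window and is "aged out".
                self.total -= self.buffer.popleft()

        def query(self):
            return self.total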

Fig. 3 General stream query processing architecture

3 Querying Data Streams: Synopses and Approximation

A generic query processing architecture for streaming data is depicted in Fig. 3. In contrast to conventional database query processors, the assumption here is that a stream query-processing engine is allowed to see the data tuples in relations only once and in the fixed order of their arrival as they stream in from their respective source(s). Backtracking over a stream and explicit access to past tuples are impossible; furthermore, the order of tuple arrival for each streaming relation is arbitrary, and duplicate tuples can occur anywhere over the duration of the stream. Moreover, in the most general turnstile model, the stream rendering each relation can comprise tuple deletions as well as insertions.

Consider a (possibly, complex) aggregate query Q over the input streams and let N denote an upper bound on the total size of the streams (i.e., the size of the complete stream vector(s)). Our data-stream processing engine is allowed a certain amount of memory, typically orders of magnitude smaller than the total size of its inputs. This memory is used to continuously maintain concise synopses/summaries of the streaming data (Fig. 3). The two key constraints imposed on such stream synopses are:

(1) Single Pass—the synopses are easily maintained, during a single pass over the streaming tuples in the (arbitrary) order of their arrival; and,
(2) Small Space/Time—the memory footprint, as well as the time required to update and query the synopses, is "small" (e.g., poly-logarithmic in N).

In addition, two highly desirable properties for stream synopses are:

(3) Delete-proof—the synopses can handle both insertions and deletions in the update stream (i.e., general turnstile streams); and,
(4) Composable—the synopses can be built independently on different parts of the stream and composed/merged in a simple (and, ideally, lossless) fashion to obtain a synopsis of the complete stream.

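To see how a concrete synopsis can satisfy all four properties, here is a toy AMS ("tug-of-war") sketch for the second frequency moment F2 = Σk A[k]^2, written as a hedged sketch rather than a faithful implementation of any chapter's algorithm: the class name, the use of a generic hash in place of a proper four-wise independent family, and the fixed number of counters are all assumptions of this example.

    import hashlib

    class AMSSketch:
        """Toy AMS (tug-of-war) sketch estimating F2 of the stream vector A."""

        def __init__(self, num_counters=64, seed=0):
            self.counters = [0] * num_counters
            self.seed = seed

        def _sign(self, row, key):
            # +/-1 pseudo-random hash of (row, key); the formal guarantees assume a
            # four-wise independent hash family, which this stand-in only approximates.
            digest = hashlib.blake2b(f"{self.seed}:{row}:{key}".encode(), digest_size=1).digest()
            return 1 if digest[0] & 1 else -1

        def update(self, key, count):
            # Property (1): single pass. Property (3): delete-proof (count may be negative).
            for row in range(len(self.counters)):
                self.counters[row] += self._sign(row, key) * count

        def estimate_f2(self):
            # Each squared counter estimates F2; averaging several counters reduces variance.
            return sum(c * c for c in self.counters) / len(self.counters)

        def merge(self, other):
            # Property (4): composable; sketches built with the same seed add entrywise.
            assert self.seed == other.seed and len(self.counters) == len(other.counters)
            merged = AMSSketch(len(self.counters), self.seed)
            merged.counters = [a + b for a, b in zip(self.counters, other.counters)]
            return merged

    # Property (2): the state is just num_counters integers, independent of the stream length.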

At any point in time, the engine can process the maintained synopses in order to obtain an estimate of the query result (in a continuous or ad-hoc fashion). Given that the synopsis construction is an inherently lossy compression process, excluding very simple queries, these estimates are necessarily approximate—ideally, with some guarantees on the approximation error. These guarantees can be either deterministic (e.g., the estimate is always guaranteed to be within ε relative/absolute error of the accurate answer) or probabilistic (e.g., the estimate is within ε error of the accurate answer except for some small failure probability δ). The properties of such ε- or (ε, δ)-estimates are typically demonstrated through rigorous analyses using known algorithmic and mathematical tools (including sampling theory [2, 11], tail inequalities [7, 8], and so on). Such analyses typically establish a formal tradeoff between the space and time requirements of the underlying synopses and estimation algorithms, and their corresponding approximation guarantees.
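As a small, generic illustration of how a probabilistic (ε, δ) guarantee is usually obtained (a standard boosting construction, not something spelled out in this chapter; the function name and the constant 8 are illustrative), one runs several independent copies of a basic estimator that is within ε of the truth with constant probability and returns their median; a Chernoff-bound argument then drives the failure probability below δ.

    import math
    import statistics

    def boost_to_eps_delta(run_basic_estimator, delta):
        """Median of O(log(1/delta)) independent copies of an estimator that is
        eps-accurate with probability >= 3/4 is eps-accurate with probability >= 1 - delta."""
        copies = max(1, math.ceil(8 * math.log(1.0 / delta)))
        return statistics.median(run_basic_estimator(i) for i in range(copies))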

Several classes of stream synopses are studied in the chapters that follow, along with a number of different practical application scenarios. An important point to note here is that there really is no "universal" synopsis solution for data stream processing: to ensure good performance, synopses are typically purpose-built for the specific query task at hand. For instance, we will see different classes of stream synopses with different characteristics (e.g., random samples and AMS sketches) for supporting queries that rely on multiset/bag semantics (i.e., the full frequency distribution), such as range/join aggregates, heavy-hitters, and frequency moments (e.g., example queries Q1 and Q3 above). On the other hand, stream queries that rely on set semantics, such as estimating the number of distinct values (i.e., set cardinality) in a stream or a set expression over a stream (e.g., query Q2 above), can be more effectively supported by other classes of synopses (e.g., FM sketches and distinct samples). A comprehensive overview of synopsis structures and algorithms for massive data sets can be found in the recent survey of Cormode et al. [4].
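For set-semantics queries such as Q2, the Flajolet–Martin (FM) sketch mentioned above is the classic tool. The toy implementation below only sketches the idea (the class name, hash choice, and the use of plain averaging over independent copies instead of stochastic averaging are assumptions of this example; the chapter on distinct-value estimation develops the real machinery).

    import hashlib

    class FMSketch:
        """Toy Flajolet-Martin sketch: approximate number of distinct items in a stream."""

        PHI = 0.77351  # correction factor from the FM analysis

        def __init__(self, num_copies=64, seed=0):
            self.bitmaps = [0] * num_copies
            self.seed = seed

        def _hash(self, copy, item):
            digest = hashlib.blake2b(f"{self.seed}:{copy}:{item}".encode(), digest_size=8).digest()
            return int.from_bytes(digest, "big")

        @staticmethod
        def _trailing_zeros(x):
            return (x & -x).bit_length() - 1 if x else 63

        def add(self, item):
            # Duplicates hash identically, so re-inserting an item never changes the sketch.
            for copy in range(len(self.bitmaps)):
                self.bitmaps[copy] |= 1 << self._trailing_zeros(self._hash(copy, item))

        def estimate_distinct(self):
            # R = average position of the lowest unset bit; distinct count ~ 2^R / PHI.
            total = 0
            for bitmap in self.bitmaps:
                r = 0
                while bitmap & (1 << r):
                    r += 1
                total += r
            return 2 ** (total / len(self.bitmaps)) / self.PHI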

  

4 This Volume: An Overview

The collection of chapters in this volume (contributed by authorities in the field) offers a comprehensive introduction to both the algorithmic/theoretical foundations of data streams and the streaming systems/applications built in different domains. The authors have also taken special care to ensure that each chapter is, for the most part, self-contained, so that readers wishing to focus on specific streaming techniques and aspects of data-stream processing, or read about particular streaming systems/applications, can move directly to the relevant chapter(s).

Part I focuses on basic algorithms and stream synopses (such as random samples and different sketching structures) for landmark and sliding-window streams, and some key stream processing tasks (including the estimation of quantiles, norms, join-aggregates, top-k values, and the number of distinct values). The chapters in Part II survey existing techniques for basic stream mining tasks, such as clustering, decision-tree classification, and the discovery of frequent itemsets and temporal dynamics. The chapters in Part III present algorithms and synopses for more complex queries and analytics, and techniques for querying distributed streams. The chapters in Part IV focus on the system and language aspects of data stream processing through comprehensive surveys of existing system prototypes and language designs. Part V then presents some representative applications of streaming techniques in different domains, including network management, financial analytics, time-series analysis, and publish/subscribe systems. Finally, we conclude this volume with an overview of current data streaming products and novel application domains (e.g., cloud computing, big data analytics, and complex event processing), and discuss some future directions in the field.

References

1. S. Chaudhuri, U. Dayal, An overview of data warehousing and OLAP technology. ACM SIGMOD Record 26(1) (1997)
2. W.G. Cochran, Sampling Techniques, 3rd edn. (Wiley, New York, 1977)
3. E. Cohen, M.J. Strauss, Maintaining time-decaying stream aggregates. J. Algorithms 59(1), 19–36 (2006)
4. G. Cormode, M. Garofalakis, P.J. Haas, C. Jermaine, Synopses for massive data: samples, histograms, wavelets, sketches. Found. Trends Databases 4(1–3) (2012)
5. C. Cranor, T. Johnson, O. Spatscheck, V. Shkapenyuk, GigaScope: a stream database for network applications, in Proc. of the 2003 ACM SIGMOD Intl. Conference on Management of Data, San Diego, California (2003)
6. M. Datar, A. Gionis, P. Indyk, R. Motwani, Maintaining stream statistics over sliding windows. SIAM J. Comput. 31(6), 1794–1813 (2002)
7. M. Mitzenmacher, E. Upfal, Probability and Computing: Randomized Algorithms and Probabilistic Analysis (Cambridge University Press, Cambridge, 2005)
8. R. Motwani, P. Raghavan, Randomized Algorithms (Cambridge University Press, Cambridge, 1995)
9. S. Muthukrishnan, Data streams: algorithms and applications. Found. Trends Theor. Comput. Sci. 1(2) (2005)
10. NetFlow services and applications. Cisco Systems white paper (1999)
11. C.-E. Särndal, B. Swensson, J. Wretman, Model Assisted Survey Sampling. Springer Series in Statistics (Springer, New York, 1992)

  Part I Foundations and Basic Stream Synopses

Data-Stream Sampling: Basic Techniques and Results

Peter J. Haas
IBM Almaden Research Center, San Jose, CA, USA

1 Introduction

Perhaps the most basic synopsis of a data stream is a sample of elements from the stream. A key benefit of such a sample is its flexibility: the sample can serve as input to a wide variety of analytical procedures and can be reduced further to provide many additional data synopses. If, in particular, the sample is collected using random sampling techniques, then the sample can form a basis for statistical inference about the contents of the stream. This chapter surveys some basic sampling and inference techniques for data streams. We focus on general methods for materializing a sample; later chapters provide specialized sampling methods for specific analytic tasks.

  To place the results of this chapter in context and to help orient readers having a limited background in statistics, we first give a brief overview of finite-population sampling and its relationship to database sampling. We then outline the specific data-stream sampling problems that are the subject of subsequent sections.

1.1 Finite-Population Sampling

Database sampling techniques have their roots in classical statistical methods for "finite-population sampling" (also called "survey sampling"). These latter methods are concerned with the problem of drawing inferences about a large finite population from a small random sample of population elements; see [ ] for comprehensive discussions. The inferences usually take the form either of testing some hypothesis about the population—e.g., that a disproportionate number of smokers in the population suffer from emphysema—or estimating some parameters of the population—e.g., total income or average height. We focus primarily on the use of sampling for estimation of population parameters.

The simplest and most common sampling and estimation schemes require that the elements in a sample be "representative" of the elements in the population. The notion of simple random sampling (SRS) is one way of making this concept precise. To obtain an SRS of size k from a population of size n, a sample element is selected randomly and uniformly from among the n population elements, removed from the population, and added to the sample. This sampling step is repeated until k sample elements are obtained. The key property of an SRS scheme is that each of the (n choose k) possible subsets of k population elements is equally likely to be produced.

Other "representative" sampling schemes besides SRS are possible. An important example is simple random sampling with replacement (SRSWR). (Sometimes, to help distinguish between the two schemes more clearly, SRS is called simple random sampling without replacement.) The SRSWR scheme is almost identical to SRS, except that each sampled element is returned to the population prior to the next random selection; thus a given population element can appear multiple times in the sample. When the sample size is very small with respect to the population size, the SRS and SRSWR schemes are almost indistinguishable, since the probability of sampling a given population element more than once is negligible. The mathematical theory of SRSWR is a bit simpler than that of SRS, so the former scheme is sometimes used as an approximation to the latter when analyzing estimation algorithms based on SRS. Other representative sampling schemes besides SRS and SRSWR include the "stratified" and "Bernoulli" schemes discussed in Sect. [ ]. As will become clear in the sequel, certain non-representative sampling methods are also useful in the data-stream setting.
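The difference between SRS and SRSWR is easy to see in code; the short snippet below (purely illustrative, with a made-up population) draws both kinds of samples using Python's standard library.

    import random

    population = ["Smith", "Abbas", "Raman", "Chen", "Gupta"]
    k = 3

    srs = random.sample(population, k)       # SRS: without replacement, no repeats
    srswr = random.choices(population, k=k)  # SRSWR: with replacement, repeats possible

    print("SRS   :", srs)
    print("SRSWR :", srswr)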

Of equal importance to sampling methods are techniques for estimating population parameters from sample data. We discuss this topic in Sect. [ ], and content ourselves here with a simple example to illustrate some of the basic issues involved. Suppose we wish to estimate the total income θ of a population of size n based on an SRS of size k, where k is much smaller than n. For this simple example, a natural estimator is obtained by scaling up the total income s of the individuals in the sample: θ̂ = (n/k)s; e.g., if the sample comprises 1 % of the population, then scale up the total income of the sample by a factor of 100. For more complicated population parameters, such as the number of distinct ZIP codes in a population of magazine subscribers, the scale-up formula may be much less obvious. In general, the choice of estimation method is tightly coupled to the method used to obtain the underlying sample.

Even for our simple example, it is important to realize that our estimate is random, since it depends on the particular sample obtained. For example, suppose (rather unrealistically) that our population consists of three individuals, say Smith, Abbas, and Raman, whose respective incomes are $10,000, $50,000, and $1,000,000. The total income for this population is $1,060,000. If we take an SRS of size k = 2—and hence estimate the income for the population as 1.5 times the income for the sampled individuals—then the outcome of our sampling and estimation exercise would follow one of the scenarios given in Table 1.

Table 1 Possible scenarios, along with probabilities, for a sampling and estimation exercise

  Sample           Sample income   Est. pop. income   Scenario probability
  {Smith, Abbas}   $60,000         $90,000            1/3
  {Smith, Raman}   $1,010,000      $1,515,000         1/3
  {Abbas, Raman}   $1,050,000      $1,575,000         1/3

Each of the scenarios is equally likely, and the expected value (also called the "mean value") of our estimate is computed as

  expected value = (1/3) · (90,000) + (1/3) · (1,515,000) + (1/3) · (1,575,000) = 1,060,000,

which is equal to the true answer. In general, it is important to evaluate the accuracy (degree of systematic error) and precision (degree of variability) of a sampling and estimation scheme. The bias, i.e., expected error, is a common measure of accuracy, and, for estimators with low bias, the standard error is a common measure of precision. The bias of our income estimator is 0 and the standard error is computed as the square root of the variance (expected squared deviation from the mean) of our estimator:

  SE = [(1/3) · (90,000 − 1,060,000)^2 + (1/3) · (1,515,000 − 1,060,000)^2 + (1/3) · (1,575,000 − 1,060,000)^2]^(1/2) ≈ 687,000.
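The calculation above can be replayed mechanically. The short script below (illustrative only, using exactly the incomes from the example) enumerates all SRSs of size k = 2, scales each sample total by n/k, and recovers the same expected value and standard error.

    from itertools import combinations
    from math import sqrt

    incomes = {"Smith": 10_000, "Abbas": 50_000, "Raman": 1_000_000}
    n, k = len(incomes), 2

    # One estimate (n/k) * s for each equally likely sample of size k.
    estimates = [(n / k) * sum(incomes[p] for p in sample)
                 for sample in combinations(incomes, k)]

    mean = sum(estimates) / len(estimates)
    se = sqrt(sum((e - mean) ** 2 for e in estimates) / len(estimates))

    print(mean)  # 1,060,000: the estimator is unbiased
    print(se)    # ~687,000: the standard error computed above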

For more complicated population parameters and their estimators, there are often no simple formulas for gauging accuracy and precision. In these cases, one can sometimes resort to techniques based on subsampling, that is, taking one or more random samples from the initial population sample. Well known subsampling techniques for estimating bias and standard error include the "jackknife" and "bootstrap" methods; see [ ].

In general, the accuracy and precision of a well designed sampling-based estimator should increase as the sample size increases. We discuss these issues further in Sect. [ ].

   .

1.2 Database Sampling

Although database sampling overlaps heavily with classical finite-population sampling, the database setting differs from the classical setting in several important respects:

• Scarce versus ubiquitous data. In the classical setting, samples are usually expensive to obtain and data is hard to come by, and so sample sizes tend to be small. In database sampling, the population size can be enormous (terabytes of data), and samples are relatively easy to collect, so that sample sizes can be relatively large [ ]. The emphasis in the database setting is on the sample as a flexible, lossy, compressed synopsis of the data that can be used to obtain quick approximate answers to user queries.
• Different sampling schemes. As a consequence of the complex storage formats and retrieval mechanisms that are characteristic of modern database systems, many sampling schemes that were unknown or of marginal interest in the classical setting are central to database sampling. For example, the classical literature pays relatively little attention to Bernoulli sampling schemes (described in Sect. [ ] below), but such schemes are very important for database sampling because they can be easily parallelized across data partitions [ ]; a minimal sketch of this idea appears right after this list. As another example, tuples in a relational database are typically retrieved from disk in units of pages or extents. This fact strongly influences the choice of sampling and estimation schemes, and indeed has led to the introduction of several novel methods [ ]. As a final example, estimates of the answer to an aggregation query involving select–project–join operations are often based on samples drawn individually from the input base relations [ ], a situation that does not arise in the classical setting.
• No domain expertise. In the classical setting, sampling and estimation are often carried out by an expert statistician who has prior knowledge about the population being sampled. As a result, the classical literature is rife with sampling schemes that explicitly incorporate auxiliary information about the population, as well as "model-based" schemes [ , Chap. 5] in which the population is assumed to be a sample from a hypothesized "super-population" distribution. In contrast, database systems typically must view the population (i.e., the database) as a black box, and so cannot exploit these specialized techniques.
• Auxiliary synopses. In contrast to a classical statistician, a database designer often has the opportunity to scan each population element as it enters the system, and therefore has the opportunity to maintain auxiliary data synopses, such as an index of "outlier" values or other data summaries, which can be used to increase the precision of sampling and estimation algorithms. If available, knowledge of the query workload can be used to guide synopsis creation; see [ ] for examples of the use of workloads and synopses to increase precision.
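As a minimal illustration of the second point above (the function name, partitioning, and parameters are invented for this sketch), Bernoulli sampling flips an independent coin with probability q for each tuple; because tuples are treated independently, each data partition can be sampled in isolation and the per-partition samples simply concatenated.

    import random

    def bernoulli_sample(tuples, q, seed=None):
        """Keep each tuple independently with probability q (Bernoulli sampling)."""
        rng = random.Random(seed)
        return [t for t in tuples if rng.random() < q]

    # Parallelizes trivially: sample each partition independently, then concatenate.
    partitions = [range(0, 500), range(500, 1000)]
    sample = [t for i, part in enumerate(partitions)
              for t in bernoulli_sample(part, q=0.01, seed=i)]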

  

Early work on database sampling [ ] focused on methods for obtaining samples from various kinds of data structures, as well as on the maintenance of sample views and the use of sampling to provide approximate query answers within specified time constraints. A number of authors subsequently investigated the use of sampling in query optimization, primarily in the context of estimating the size of query results [ ]. Attention then shifted to the use of sampling to construct data synopses for providing quick approximate answers to decision-support queries, and to online aggregation, which can be viewed as a precursor to modern data-stream sampling techniques. Online-aggregation algorithms take, as input, streams of data generated by random scans of one or more (finite) relations, and produce continually-refined estimates of answers to aggregation queries over the relations, along with precision measures. The user aborts the query as soon as the running estimates are sufficiently precise; although the data stream is finite, query processing usually terminates long before the end of the stream is reached. Recent work on database sampling includes extensions of online aggregation methodology [ ], application of bootstrapping ideas to facilitate approximate answering of very complex aggregation queries [ ], and development of techniques for sampling-based discovery of correlations, functional dependencies, and other data relationships for purposes of query optimization and data integration [ ].

Collective experience has shown that sampling can be a very powerful tool, provided that it is applied judiciously. In general, sampling is well suited to very quickly identifying pervasive patterns and properties of the data when a rough approximation suffices; for example, industrial-strength sampling-enhanced query engines can speed up some common decision-support queries by orders of magnitude [ ]. On the other hand, sampling is poorly suited for finding "needles in haystacks" or for producing highly precise estimates. The needle-in-haystack phenomenon appears in numerous guises. For example, precisely estimating the selectivity of a join that returns very few tuples is an extremely difficult task, since a random sample from the base relations will likely contain almost no elements of the join result [ ]. As another example, sampling can perform poorly when data values are highly skewed. For example, suppose we wish to estimate the average of the values in a data set that consists of 10^6 values equal to 1 and five values equal to 10^8. The five outlier values are the needles in the haystack: if, as is likely, these values are not included in the sample, then the sampling-based estimate of the average value will be low by orders of magnitude. Even when the data is relatively well behaved, some population parameters are inherently hard to estimate from a sample. One notoriously difficult parameter is the number of distinct values in a population [ ].

  

Problems arise both when there is skew in the data-value frequencies and when there are many data values, each appearing a small number of times. In the former scenario, those values that appear few times in the database are the needles in the haystack; in the latter scenario, the sample is likely to contain no duplicate values, in which case accurate assessment of a scale-up factor is impossible. Other challenging population parameters include the minimum or maximum data value; see [ ]. Researchers continue to develop new methods to deal with these problems, typically by exploiting auxiliary data synopses and workload information.

2 Fortunately, for query optimization purposes it often suffices to know that a join result is “small” without knowing exactly how small.
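A quick simulation (illustrative only; the sample size and seed are arbitrary choices) makes the skew problem above tangible: with 10^6 values of 1 and five values of 10^8, even a fairly large SRS almost always misses the outliers, and the sampling-based estimate of the average is low by orders of magnitude.

    import random

    random.seed(7)
    data = [1] * 1_000_000 + [100_000_000] * 5
    true_avg = sum(data) / len(data)            # roughly 501

    sample = random.sample(data, 1_000)         # SRS of 1,000 elements
    estimated_avg = sum(sample) / len(sample)   # almost certainly exactly 1

    print(true_avg, estimated_avg)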


1.3 Sampling from Data Streams

Data-stream sampling problems require the application of many ideas and techniques from traditional database sampling, but also need significant new innovations, especially to handle queries over infinite-length streams. Indeed, the unbounded nature of streaming data represents a major departure from the traditional setting. We give a brief overview of the various stream-sampling techniques considered in this chapter.

Our discussion centers around the problem of obtaining a sample from a window, i.e., a subinterval of the data stream, where the desired sample size is much