COP 5725 Fall 2012 Database Management Systems

  University of Florida, CISE Department
  Prof. Daisy Zhe Wang
  Adapted slides from Prof. Jeff Ullman

  Data Warehousing and Data Mining

  • Warehousing
  • OLAP
  • Data Mining

  Introduction

  • Increasingly, organizations are analyzing current and historical data to identify useful patterns and support business strategies.
  • Emphasis is on complex, interactive, exploratory analysis of very large datasets created by integrating data from across all parts of an enterprise; data is fairly static.
  • On-Line Analytic Processing (OLAP) vs. On-Line Transaction Processing (OLTP)

  OLTP

  • Most database operations involve On-Line Transaction Processing (OLTP).

  • – Short, simple, frequent queries and/or modifications, each involving a small number of tuples.
  • – Examples: Answering queries from a Web interface, sales at cash registers, selling airline tickets.

  OLAP

  • Of increasing importance are On-Line Analytic Processing (OLAP) queries.

  • – Few, but complex queries --- may run for hours.
  • – Queries do not depend on having an absolutely up-to-date database.

OLAP Examples

  1. Amazon analyzes purchases by its customers to come up with an individual screen with products of likely interest to the customer.

Common Architecture

  • Databases at store branches handle OLTP queries.
  • Local store databases copied to a central warehouse overnight.
  • Analysts use the warehouse for OLAP and data mining.
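  A minimal sketch of the overnight copy step, assuming hypothetical table names Store_Sales and Warehouse_Sales (not from the slides); real deployments typically use bulk-load/ETL tools rather than plain SQL:

  -- Hypothetical nightly load: append yesterday's branch sales to the warehouse.
  INSERT INTO Warehouse_Sales (pid, locid, timeid, sales)
  SELECT pid, locid, timeid, sales
  FROM Store_Sales
  WHERE timeid = :yesterday;  -- parameter supplied by the load job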

  Three Complementary Trends

  • Data Warehousing: Consolidate data from many sources in one large repository.
    • – Loading, periodic synchronization of replicas.
    • – Semantic integration.

  • OLAP:
    • – Complex SQL queries and views.
    • – Queries based on spreadsheet-style operations and “multidimensional” view of data.
    • – Interactive and “online” queries.

  • Data Mining: Exploratory search for interesting trends and anomalies.

  Data Warehousing

  [Figure: data from multiple SOURCES is EXTRACTed and TRANSFORMed into the DATA WAREHOUSE, which has an associated Metadata Repository and SUPPORTS the analyses below.]

  • Integrated data spanning long time periods, often augmented with summary information.
  • Several gigabytes to terabytes common.
  • Interactive response times expected for complex queries; ad-hoc updates uncommon.

  Warehousing Issues

  • Semantic Integration: When getting data from multiple sources, must eliminate mismatches, e.g., different currencies, schemas.
  • Heterogeneous Sources: Must access data from a variety of source formats and repositories.
  • Load, Refresh, Purge: Must load data, periodically refresh it, and purge too-old data.
  • Metadata Management: Must keep track of source, loading time, and other information for all data in the warehouse.

  Multidimensional Data Model

  • Collection of numeric measures, which depend on a set of dimensions.
    • – E.g., measure Sales, dimensions Product (key: pid), Location (locid), and Time (timeid).

  The fact data as a relation (pid, timeid, locid, sales):

    pid  timeid  locid  sales
    11   1       1      25
    11   2       1      8
    11   3       1      15
    12   1       1      30
    12   2       1      20
    12   3       1      50
    13   1       1      8
    13   2       1      10
    13   3       1      10
    11   1       2      35
    ...

  The slice locid=1 is shown, viewed as a pid × timeid array:

    pid \ timeid    1    2    3
    11              25   8    15
    12              30   20   50
    13              8    10   10
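  In ROLAP terms (see the next slide), the slice locid=1 is just a selection on the fact relation; a small illustrative query, assuming the relation Sales(pid, timeid, locid, sales) above:

  SELECT pid, timeid, sales
  FROM Sales
  WHERE locid = 1;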

  MOLAP vs. ROLAP Systems

  • MOLAP systems: Multidimensional data are stored physically in a (disk-resident, persistent) multidimensional array.
  • ROLAP systems: Multidimensional data are stored as relations.
    • – The main relation, which relates dimensions to a measure, is called the fact table. Each dimension can have additional attributes and an associated dimension table.
    • – E.g., Sales(pid, locid, timeid, sales) or Sales(bar, beer, drinker, time, price).

  MOLAP and Data Cubes

  • Keys of dimension tables are the dimensions of a hypercube.
    • – Example: for the Sales(bar, beer, drinker, time, price) data, the four dimensions are bar, beer, drinker, and time.
  • Dependent attributes (e.g., price) appear at the points of the cube.

  Visualization - Data Cubes

  [Figure: a cube with bar and beer axes and price values in the cells; time would be a 4th dimension.]

  Data Cube Marginals

  • The data cube also includes aggregation (typically SUM) along the margins of the cube.
  • The marginals include aggregations over one dimension, two dimensions, …

  Visualization - Data Cube w/ Aggregation

  [Figure: the bar/beer/price cube extended with an extra margin slice along each dimension holding the aggregated values.]

ROLAP and Star Schemas

  • A star schema is a common organization for data at a warehouse. It consists of:

  1. Fact table: a very large accumulation of facts such as sales. Often “insert-only.”

  2. Dimension tables: smaller, generally static information about the entities involved in the facts.

  Example: Star Schema

  • Suppose we want to record in a warehouse information about every beer sale: the bar, the brand of beer, the drinker who bought the beer, the day, the time, and the price charged.
  • The fact table is a relation:

  Sales(bar, beer, drinker, day, time, price)

  Example, Continued

  • The dimension tables include information about the bar, beer, and drinker “dimensions”:

  Bars(bar, addr, license) Beers(beer, manf) Drinkers(drinker, addr, phone)
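  A minimal DDL sketch of this star schema; the table and column names come from the slides, but the types and key declarations are assumptions:

  CREATE TABLE Bars     (bar VARCHAR(50) PRIMARY KEY, addr VARCHAR(100), license INT);
  CREATE TABLE Beers    (beer VARCHAR(50) PRIMARY KEY, manf VARCHAR(50));
  CREATE TABLE Drinkers (drinker VARCHAR(50) PRIMARY KEY, addr VARCHAR(100), phone VARCHAR(20));

  -- Fact table: one row per sale, referencing the dimension tables.
  CREATE TABLE Sales (
    bar     VARCHAR(50) REFERENCES Bars,
    beer    VARCHAR(50) REFERENCES Beers,
    drinker VARCHAR(50) REFERENCES Drinkers,
    day     DATE,
    time    TIME,
    price   DECIMAL(6,2)
  );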

Visualization – Star Schema

  

  [Figure: the fact table Sales in the center; its dimension attributes point to the dimension tables Bars, Beers, Drinkers, and Time, etc., and its dependent attributes (e.g., price) sit alongside.]

  Dimensions and Dependent Attributes

  • Two classes of fact-table attributes:

  1. Dimension attributes: the key of a dimension table.

  2. Dependent attributes: a value determined by the dimension attributes of the tuple.

  Dimension Hierarchies

  • For each dimension, the set of values can be organized in a hierarchy:

  [Figure: example hierarchies: pname rolls up to category; date rolls up to week and to month, month to quarter, quarter to year; city rolls up to state, state to country.]

  OLAP Queries

  • Influenced by SQL and by spreadsheets.
  • A common operation is to aggregate a measure over one or more dimensions.
    • – Find total sales.
    • – Find total sales for each city, or for each state.
    • – Find top five products ranked by total sales.
  • Roll-up: Aggregating at different levels of a dimension hierarchy.
    • – E.g., given total sales by city, we can roll-up to get sales by state (see the SQL sketch below).
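  A sketch of that roll-up in SQL, assuming the Sales(pid, locid, timeid, sales) fact table and the Locations(locid, city, state, country) dimension table that appear later in these slides:

  -- Total sales by city.
  SELECT L.city, SUM(S.sales)
  FROM Sales S, Locations L
  WHERE S.locid = L.locid
  GROUP BY L.city;

  -- Rolled up to total sales by state.
  SELECT L.state, SUM(S.sales)
  FROM Sales S, Locations L
  WHERE S.locid = L.locid
  GROUP BY L.state;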

  OLAP Queries (Contd.)

  • Drill-down: The inverse of roll-up.
    • – E.g., given total sales by state, can drill-down to get total sales by city.
    • – E.g., can also drill-down on a different dimension to get total sales by product for each year/quarter, etc.

  • Pivoting: Aggregation on selected dimensions.
    • – E.g., pivoting on Location and Time yields this cross-tabulation:

            WI    CA    Total
    1995    63    81    144
    1996    38    107   145
    1997    75    35    110
    Total   176   223   339

  • Slicing and Dicing: Equality and range selections on one or more dimensions.

  Comparison with SQL Queries

  • The cross-tabulation obtained by pivoting can also be computed with a collection of SQL queries:

  SELECT SUM (S.sales)
  FROM Sales S, Times T, Locations L
  WHERE S.timeid = T.timeid AND S.locid = L.locid
  GROUP BY T.year, L.state

  SELECT SUM (S.sales)
  FROM Sales S, Times T
  WHERE S.timeid = T.timeid
  GROUP BY T.year

  SELECT SUM (S.sales)
  FROM Sales S, Locations L
  WHERE S.locid = L.locid
  GROUP BY L.state

  The CUBE Operator

  • Generalizing the previous example, if there are k dimensions, we have 2^k possible SQL queries that can be generated through pivoting on a subset of dimensions.
  • CUBE pid, locid, timeid BY SUM Sales
    • – Equivalent to rolling up Sales on all eight subsets of the set {pid, locid, timeid};
    • – each roll-up corresponds to an SQL query of the form:

  SELECT SUM (S.sales)
  FROM Sales S
  GROUP BY grouping-list
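  In ANSI SQL:1999 (supported by, e.g., PostgreSQL, Oracle, and SQL Server) the same computation can be written directly with GROUP BY CUBE, which produces all eight groupings in one statement; a sketch:

  SELECT S.pid, S.locid, S.timeid, SUM(S.sales)
  FROM Sales S
  GROUP BY CUBE (S.pid, S.locid, S.timeid);

  In the result, NULLs in the grouping columns mark the dimensions that have been rolled up.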

  Design Issues

  • Fact table in BCNF; dimension tables un-normalized.
    • – Dimension tables are small; updates/inserts/deletes are rare. So, anomalies are less important than query performance.

  [Figure: star schema with fact table SALES(pid, timeid, locid, sales) and dimension tables PRODUCTS(pid, pname, category, price), LOCATIONS(locid, city, state, country), and TIMES(timeid, date, week, month, quarter, year, holiday_flag).]

  • This kind of schema is very common in OLAP applications, and is called a star schema ; computing the join of all these relations is called a star join .
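  As an illustration, a star join over this schema (table and column names from the figure above) might look like:

  SELECT P.category, L.state, T.year, SUM(S.sales)
  FROM Sales S, Products P, Locations L, Times T
  WHERE S.pid = P.pid
    AND S.locid = L.locid
    AND S.timeid = T.timeid
  GROUP BY P.category, L.state, T.year;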

  ROLAP Techniques

  1. New indexing techniques: Bitmap indexes, Join indexes

  2. Array representations, compression
  3. Pre-computation of aggregations (i.e., materialized views), etc.
  4. We are going to cover in more detail:

  • Bitmap indexes
  • Materialized views

  Bitmap Index

  For each key value of a dimension table, create a bit-vector telling which tuples of the fact table have that value.

    custid  name  sex  rating |  M F  |  1 2 3 4 5
    112     Joe   M    3      |  1 0  |  0 0 1 0 0
    115     Ram   M    5      |  1 0  |  0 0 0 0 1
    119     Sue   F    5      |  0 1  |  0 0 0 0 1
    112     Woo   M    4      |  1 0  |  0 0 0 1 0

  • 1 bit for each possible value (here: sex in {M, F}, rating in 1..5).
  • Many queries can be answered using bit-vector ops!
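  For example, a conjunctive selection on the two indexed columns can be answered by ANDing the corresponding bit-vectors and counting the 1-bits; a sketch, assuming the rows above are stored in a (hypothetical) Customers table:

  SELECT COUNT(*)
  FROM Customers
  WHERE sex = 'M' AND rating = 5;
  -- AND-ing the M bit-vector 1101 with the rating-5 bit-vector 0110 gives 0100,
  -- so only the second tuple (Ram) qualifies.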

  Example OLAP Query

  • Often, OLAP queries begin with a “ star join ”: the natural join of the fact table with all or most of the dimension tables.
  • Example:

  SELECT bar, beer, SUM(price)
  FROM Sales NATURAL JOIN Bars NATURAL JOIN Beers
  WHERE addr = ’Palo Alto’ AND manf = ’Anheuser-Busch’
  GROUP BY bar, beer;

Materialized Views

  • Store the answers to several useful queries (views) in the warehouse itself.
  • A direct execution of an OLAP query from the fact and the dimension tables could take too long (even with bitmap indexes)
  • If we create a materialized view that contains enough information, we may be able to answer our query much faster.

  Materialized Views (cont.)

  • A view whose tuples are stored in the database is said to be materialized .
    • – Provides fast access, like a (very high-level) cache.
    • – Need to maintain the view as the underlying tables change.
    • – Ideally, we want incremental view maintenance algorithms.
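  In DBMSs that support them natively (e.g., PostgreSQL, Oracle), a materialized view can be declared and refreshed directly; a sketch using the beer-sales star schema above (the view name BarBeerSales is hypothetical; the next slides instead build an ordinary CREATE VIEW called BABMS):

  CREATE MATERIALIZED VIEW BarBeerSales AS
  SELECT bar, beer, SUM(price) AS sales
  FROM Sales NATURAL JOIN Bars NATURAL JOIN Beers
  GROUP BY bar, beer;

  -- Recompute the stored result after the base tables change (PostgreSQL-style syntax).
  REFRESH MATERIALIZED VIEW BarBeerSales;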

  Example OLAP Query

  SELECT bar, beer, SUM(price)
  FROM Sales NATURAL JOIN Bars NATURAL JOIN Beers
  WHERE addr = ’Palo Alto’ AND manf = ’Anheuser-Busch’
  GROUP BY bar, beer;

  Example --- Continued

  • Here is a materialized view that could help:

  CREATE VIEW BABMS(bar, addr, beer, manf, sales) AS
    SELECT bar, addr, beer, manf, SUM(price) sales
    FROM Sales NATURAL JOIN Bars NATURAL JOIN Beers
    GROUP BY bar, addr, beer, manf;

  • Since bar -> addr and beer -> manf, there is no real grouping. We need addr and manf in the SELECT.

  Example --- Concluded

  • Here’s our query using the materialized view BABMS:

  SELECT bar, beer, sales
  FROM BABMS
  WHERE addr = ’Palo Alto’ AND manf = ’Anheuser-Busch’;

  Views in DW for OLAP Queries

  • OLAP queries are typically aggregate queries.
    • – Precomputation is essential for interactive response times.
    • – The CUBE is in fact a collection of aggregate queries, and precomputation is especially important.
    • – Lots of work on what is best to precompute given a limited amount of space to store precomputed results.
  • Warehouses can be thought of as a collection of periodically updated tables and periodically maintained views.

  View Modification (Evaluate On Demand)

  View:

  CREATE VIEW RegionalSales(category, sales, state) AS
  SELECT P.category, S.sales, L.state
  FROM Products P, Sales S, Locations L
  WHERE P.pid = S.pid AND S.locid = L.locid

  Query:

  SELECT R.category, R.state, SUM(R.sales)
  FROM RegionalSales AS R
  GROUP BY R.category, R.state

  Modified Query:

  SELECT R.category, R.state, SUM(R.sales)
  FROM ( SELECT P.category, S.sales, L.state
         FROM Products P, Sales S, Locations L
         WHERE P.pid = S.pid AND S.locid = L.locid ) AS R
  GROUP BY R.category, R.state

  View Materialization (Precomputation)

  • Suppose we precompute RegionalSales and store it with a clustered B+ tree index on [category, state, sales].
    • – Then, the previous query can be answered by an index-only scan:

  SELECT R.category, R.state, SUM(R.sales)
  FROM RegionalSales R
  GROUP BY R.category, R.state

  Two further queries over the precomputed view:

  SELECT R.state, SUM(R.sales)
  FROM RegionalSales R
  WHERE R.category = “Laptop”
  GROUP BY R.state

  (Index on precomputed view is great!)

  SELECT R.state, SUM(R.sales)
  FROM RegionalSales R
  WHERE R.state = “Wisconsin”
  GROUP BY R.category

  (Index is less useful: must scan the entire leaf level.)

  Data Mining Definition

  Data mining is the exploration and analysis of large quantities of data in order to discover valid, novel, potentially useful, and ultimately understandable patterns in data.

  Example pattern (Census Bureau Data): If (relationship = husband), then (gender = male). 99.6%

  

Data Mining Definition (Cont.)

  Data mining is the exploration and analysis of large quantities of data in order to discover valid, novel, potentially useful, and ultimately understandable patterns in data.

  Valid: The patterns hold in general.

  Novel: We did not know the pattern beforehand.

  Useful: We can devise actions from the patterns.

  Understandable: We can interpret and comprehend the patterns.

  Why Use Data Mining Today?

  Human analysis skills are inadequate:

  • – Volume and dimensionality of the data
  • – High data growth rate
  • – 3V in Big Data: Volume, Velocity, Variety

Availability of:

  • – Data
  • – Storage
  • – Computational power
  • – Off-the-shelf software
  • – Expertise

Preprocessing and Mining

  [Figure: Original Data → (Data Integration and Selection) → Target Data → (Preprocessing) → Preprocessed Data → (Model Construction) → Patterns → (Interpretation) → Knowledge.]

  Example Application: Sports

IBM Advanced Scout analyzes NBA game statistics

  • – Shots blocked
  • – Assists
  • – Fouls
  • Google: “IBM Advanced Scout”

  Advanced Scout

  • Example pattern: An analysis of the data from a game played between the New York Knicks and the Charlotte Hornets revealed that “When Glenn Rice played the shooting guard position, he shot 5/6 (83%) on jump shots.”

  • Pattern is interesting:

  The average shooting percentage for the Charlotte Hornets during that game was 54%.

  Data Mining Techniques

  • Supervised learning
    • – Classification and regression

  • Unsupervised learning
    • – Clustering

  • Dependency modeling
    • – Associations, summarization, causality

  • Outlier and deviation detection
  • Trend analysis and change detection

  Market Basket Analysis (Example Dependency Modeling)

  • Consider a shopping cart filled with several items.
  • Market basket analysis tries to answer the following questions:
    • – Who makes purchases?
    • – What do customers buy together?
    • – In what order do customers purchase items?

  Market Basket Analysis

  Given:

  • A database of customer transactions
  • Each transaction is a set of items
  • Example: Transaction with TID 111 contains items {Pen, Ink, Milk, Juice}

    TID  CID  Date    Item   Qty
    111  201  5/1/99  Pen    2
    111  201  5/1/99  Ink    1
    111  201  5/1/99  Milk   3
    111  201  5/1/99  Juice  6
    112  105  6/3/99  Pen    1
    112  105  6/3/99  Ink    1
    112  105  6/3/99  Milk   1
    113  106  6/5/99  Pen    1
    113  106  6/5/99  Milk   1
    114  201  7/1/99  Pen    2
    114  201  7/1/99  Ink    2
    114  201  7/1/99  Juice  4

  Market Basket Analysis (Contd.)

  • Co-occurrences
    • – 80% of all customers purchase items X, Y and Z together.

  • Association rules
    • – 60% of all customers who purchase X and Y also buy Z.

  • Sequential patterns
    • – 60% of customers who first buy X also purchase Y within three weeks.

  Confidence and Support

  We prune the set of all possible association rules using two interestingness measures:

  • Confidence of a rule:
    • – X ⇒ Y has confidence c if P(Y|X) = c
  • Support of a rule:
    • – X ⇒ Y has support s if P(XY) = s

  We can also define the support of an itemset (a co-occurrence) XY:

  • – XY has support s if P(XY) = s
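  A sketch of computing these measures in SQL, assuming the transactions are stored in a table Purchases(tid, cid, date, item, qty) like the example table above (the table name is an assumption):

  -- Support of the itemset {Pen, Milk}: fraction of transactions containing both items.
  SELECT COUNT(DISTINCT p1.tid) * 1.0 /
         (SELECT COUNT(DISTINCT tid) FROM Purchases) AS support
  FROM Purchases p1, Purchases p2
  WHERE p1.tid = p2.tid
    AND p1.item = 'Pen' AND p2.item = 'Milk';

  -- Confidence of Pen => Milk: support({Pen, Milk}) / support({Pen}).
  SELECT COUNT(DISTINCT p2.tid) * 1.0 / COUNT(DISTINCT p1.tid) AS confidence
  FROM Purchases p1
  LEFT JOIN Purchases p2
    ON p1.tid = p2.tid AND p2.item = 'Milk'
  WHERE p1.item = 'Pen';

  On the example data, both queries return 0.75, matching the {Pen} => {Milk} rule on the next slide.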

  Example

  Examples (over the transactions table above):

  • {Pen} => {Milk}: Support: 75%, Confidence: 75%
  • {Ink} => {Pen}: Support: 75%, Confidence: 100%

  Example

  • Find all itemsets with support >= 75%?

  Example

  • Can you find all association rules with support >= 50%?

  Market Basket Analysis: Applications

  • Sample Applications
    • – Direct marketing
    • – Floor/shelf planning
    • – Web site layout
    • – Cross-selling

Frequent Itemset Algorithms

  • Applications
  • More abstract problem
  • Breadth-first search

Applications of Frequent Itemsets

  • Market Basket Analysis
  • Association Rules
  • Classification (especially: text, rare classes)
  • Seeds for construction of Bayesian Networks
  • Collaborative filtering

  Problem

  Abstract:
  • A set of items {1, 2, …, k}
  • A database of transactions (itemsets) D = {T1, T2, …, Tn}, Tj ⊆ {1, 2, …, k}
  • GOAL: Find all itemsets that appear in at least x transactions.
  • (“appear in” == “are subsets of”)  I ⊆ T: T supports I
  • For an itemset I, the number of transactions it appears in is called the support of I. x is called the minimum support.

  Concrete:
  • I = {milk, bread, cheese, …}
  • D = { {milk, bread, cheese}, {bread, cheese, juice}, … }
  • GOAL: Find all itemsets that appear in at least 1000 transactions.
  • {milk, bread, cheese} supports {milk, bread}

  Problem (Contd.)

  Definitions:
  • An itemset is frequent if it is a subset of at least x transactions. (FI.)
  • An itemset is maximally frequent if it is frequent and it does not have a frequent superset. (MFI.)
  • GOAL: Given x, find all frequent (maximally frequent) itemsets (to be stored in the FI (MFI)).
  • Obvious relationship: MFI ⊆ FI

  Example:
  D = { {1,2,3}, {1,2,3}, {1,2,3}, {1,2,4} }, minimum support x = 3
  • {1,2} is frequent; Support({1,2}) = 4
  • {1,2,3} is maximal frequent
  • All maximal frequent itemsets: {1,2,3}

The Itemset Lattice

  [Figure: the lattice of all itemsets over {1,2,3,4}, from {} at the top down to {1,2,3,4} at the bottom.]

  Frequent Itemsets

  [Figure: the same lattice with the frequent itemsets near the top and the infrequent itemsets near the bottom.]

  Breadth First Search: 1-Itemsets, 2-Itemsets, 3-Itemsets

  [Figure: three snapshots of the lattice as breadth-first search examines 1-itemsets, then 2-itemsets, then 3-itemsets; nodes are marked frequent, infrequent, currently examined, or don’t know.]

  The Apriori Principle: I infrequent ⇒ (I ∪ {x}) infrequent.

  Breadth First Search: Remarks

  • We prune infrequent itemsets and avoid counting them.
  • To find an itemset with k items, we need to count all 2^k subsets.
  • Breadth first search uses the Apriori algorithm:

  Next, we show how to implement A-Priori in SQL

  Finding Frequent Pairs

  • The simplest case is when we only want to find “frequent pairs” of items.
  • Assume data is in a relation Baskets(basket, item).
  • The support threshold s is the minimum number of baskets in which a pair must appear before we are interested.

  Frequent Pairs in SQL

  SELECT b1.item, b2.item
  FROM Baskets b1, Baskets b2
  WHERE b1.basket = b2.basket
    AND b1.item < b2.item
  GROUP BY b1.item, b2.item
  HAVING COUNT(*) >= s;

  • Look for two Baskets tuples with the same basket and different items; requiring the first item to precede the second keeps us from counting the same pair twice.
  • The GROUP BY creates a group for each pair of items that appears in at least one basket; the HAVING clause throws away pairs that do not appear at least s times.

  A-Priori Trick --- (1)

  • Straightforward implementation involves a join of a huge Baskets relation with itself.
  • The a-priori algorithm speeds the query by recognizing that a pair of items {i, j} cannot have support s unless both {i} and {j} do.

  A-Priori Trick --- (2)

  • Use a materialized view to hold only information about frequent items (items that appear in at least s baskets):

  INSERT INTO Baskets1(basket, item)
  SELECT * FROM Baskets
  WHERE item IN (
      SELECT item
      FROM Baskets
      GROUP BY item
      HAVING COUNT(*) >= s );

  A-Priori Algorithm

  1. Materialize the view Baskets1.

  • Computing Baskets1 is cheap, since it doesn’t involve a join.
  • Baskets1 probably has many fewer tuples than Baskets.
    • – Running time shrinks with the square of the number of tuples involved in the join.
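  2. (Presumably) run the frequent-pairs query from the earlier slide, but over Baskets1 instead of Baskets:

  SELECT b1.item, b2.item
  FROM Baskets1 b1, Baskets1 b2
  WHERE b1.basket = b2.basket
    AND b1.item < b2.item
  GROUP BY b1.item, b2.item
  HAVING COUNT(*) >= s;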

  Two Observations

  • Monotonic function: if x >= y, f(x) >= f(y).
  • Antimonotonic function: if x >= y, f(x) <= f(y).
  • Apriori can be applied to any constraint P that is antimonotone (e.g., support > const).
    • – Start from the empty set.
    • – Prune supersets of sets that do not satisfy P.
  • Apriori can also be applied to a monotone constraint Q (e.g., sum > const).
    • – Start from the set of all items instead of the empty set.
    • – Prune subsets of sets that do not satisfy Q.

  Negative Pruning an Antimonotone P

  [Figure: three snapshots of the itemset lattice over {1,2,3,4}; once an itemset is found infrequent, all of its supersets are pruned without being examined.]

  Negative Pruning a Monotone Q

  [Figure: the itemset lattice with pruning running in the opposite direction; subsets of sets that do not satisfy Q are pruned.]

  The New Problem

  New Goal:

  • Given constraints P and Q, with P antimonotone (support) and Q monotone (statistical constraint).
  • Find all itemsets that satisfy both P and Q.

  Recent solutions:

  • Newer algorithms can handle both P and Q.

  Conceptual Illustration of Problem

  [Figure: the itemset lattice from {} down to D; itemsets near the top satisfy P (all their subsets satisfy P), itemsets near the bottom satisfy Q (all their supersets satisfy Q), and the itemsets satisfying both P & Q lie in the overlap.]

Summary

  • Decision support is an emerging, rapidly growing subarea of databases.
  • Involves the creation of large, consolidated data repositories called data warehouses.
  • Warehouses exploited using sophisticated analysis techniques: complex SQL queries and OLAP “multidimensional” queries (influenced by both SQL and spreadsheets).
  • New techniques for database design, indexing, view maintenance, and interactive querying need to be supported.

  Summary (Cont.)

  • Data Mining
    • – Supervised
    • – Unsupervised
    • – Dependency modeling
    • – Outlier detection
    • – Trend analysis and prediction