Problems with Operational Data

Most operational databases have problems that limit their usefulness for all but the simplest BI applications. Figure 13-4 lists the major problem categories.

First, although data that are critical for successful operations must be complete and accurate, data that are only marginally necessary need not be. For example, some operational systems gather customer demographic data during the ordering process. But, because such data are not needed to fill, ship, or bill orders, the quality of the demographic data suffers.

Problematic data are termed dirty data. Examples are a value of “G” for customer sex and a value of “213” for customer age. Other examples are a value of “999-999-9999” for a U.S. phone number, a part color of “gren,” and an e-mail address of “WhyMe@somewhereelseintheuniverse.who.” All of these values pose problems for reporting and data mining purposes.
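As a minimal sketch of how such domain checks might be written, here is a Python/pandas fragment; the table, column names, and allowed values are assumptions for illustration only:

```python
import pandas as pd

# Hypothetical customer extract containing the kinds of dirty values
# described above; column names and allowed domains are assumptions.
customers = pd.DataFrame({
    "sex":   ["M", "F", "G"],
    "age":   [34, 213, 58],
    "phone": ["212-555-0181", "999-999-9999", "415-555-0132"],
})

# Flag values that fall outside their allowed domains before the data
# are used for reporting or data mining.
bad_sex   = ~customers["sex"].isin(["M", "F"])
bad_age   = ~customers["age"].between(0, 120)
bad_phone = customers["phone"].eq("999-999-9999")

print(customers[bad_sex | bad_age | bad_phone])
```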

Figure 13-4 Problems of Using Transaction Data for Business Intelligence
• Dirty data
• Missing values
• Inconsistent data
• Data not integrated
• Wrong format
  – Too fine
  – Not fine enough
• Too much data
  – Too many attributes
  – Too much volume

Purchased data often contain missing elements. In fact, most data vendors state the percentage of missing values for each attribute in the data they sell. An organization buys such data because, for some uses, some data are better than no data at all. This is especially true for data items whose values are difficult to obtain, such as the number of adults in a household, household income, dwelling type, and the education of the primary income earner. Some missing data are not too much of a problem for reporting applications. For data mining applications, however, a few missing or erroneous data points can actually be worse than no data at all, because they bias the analysis.
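A minimal sketch of computing the kind of per-attribute missing-value percentage a vendor might quote, using pandas; the data and column names are made up:

```python
import numpy as np
import pandas as pd

# Hypothetical purchased household data; NaN marks a missing value.
households = pd.DataFrame({
    "adults_in_household": [2, np.nan, 1, np.nan],
    "household_income":    [85000, 62000, np.nan, np.nan],
    "dwelling_type":       ["house", "apartment", "house", None],
})

# Percentage of missing values per attribute -- the figure that data
# vendors typically quote for the data they sell.
print(households.isna().mean().mul(100).round(1))
```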

Inconsistent data, the third problem in Figure 13-4, are particularly common for data that have been gathered over time. When an area code changes, for example, the phone number for a given customer before the change will differ from the customer’s phone number after the change. Part codes can change, as can sales territories. Before such data can be used, they must be recoded for consistency over the period of the study.
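A minimal sketch of one way such recoding might be done in Python; the one-to-one area-code mapping is a deliberate simplification (real area-code splits and overlays are messier), and all values are illustrative:

```python
import pandas as pd

# Hypothetical, simplified mapping from an old area code to its replacement.
area_code_map = {"415": "628"}

phones = pd.Series(["415-555-0132", "628-555-0199", "212-555-0181"])

def recode(phone: str) -> str:
    # Rewrite the area code so numbers recorded before and after the
    # change are expressed consistently over the period of the study.
    area, rest = phone.split("-", 1)
    return f"{area_code_map.get(area, area)}-{rest}"

print(phones.map(recode))
```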

Some data inconsistencies occur because of the nature of the business activity. Consider a Web-based order entry system used by customers around the world. When the Web server records the time of an order, which time zone does it use? The server’s system clock time is irrelevant to an analysis of customer behavior. Any standard time, such as Coordinated Universal Time (UTC), is also meaningless. Somehow, Web server time must be adjusted to the time zone of the customer.
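A minimal sketch of adjusting a server-recorded UTC timestamp to the customer’s local time, using Python’s standard zoneinfo module; the customer’s time zone is assumed to come from elsewhere, such as a profile or a geolocation service:

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo  # Python 3.9+

# Order timestamp as recorded by the Web server, in UTC.
server_time = datetime(2024, 3, 15, 2, 30, tzinfo=timezone.utc)

# Hypothetical lookup: the customer's IANA time zone, obtained separately.
customer_zone = ZoneInfo("Asia/Tokyo")

# Convert to the customer's local time before analyzing ordering behavior.
local_time = server_time.astimezone(customer_zone)
print(local_time)       # 2024-03-15 11:30:00+09:00
print(local_time.hour)  # 11 -- the hour that matters for behavioral analysis
```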

Another problem is nonintegrated data. Suppose, for example, that an organization wants to report on customer order and payment behavior. Unfortunately, order data are stored in a Microsoft Dynamics CRM system, whereas payment data are recorded in an Oracle PeopleSoft financial management database. To perform the analysis, the data must somehow be integrated.
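A minimal sketch of what that integration might look like once both systems’ data have been exported to flat tables, using pandas; the table layouts, column names, and the shared customer_id key are assumptions:

```python
import pandas as pd

# Hypothetical extracts: orders from the CRM system and payments from the
# financial management system. Column names are assumptions.
orders = pd.DataFrame({
    "customer_id": [101, 101, 102],
    "order_total": [250.0, 75.0, 310.0],
})
payments = pd.DataFrame({
    "customer_id": [101, 102],
    "amount_paid": [325.0, 310.0],
})

# Integrate the two sources on the shared customer key so that order and
# payment behavior can be analyzed together.
combined = (
    orders.groupby("customer_id", as_index=False)["order_total"].sum()
          .merge(payments, on="customer_id", how="left")
)
print(combined)
```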

The next problem is that data can be inappropriately formatted. First, data can be too fine. For example, suppose that we want to analyze the placement of graphics and controls on an order entry Web page. It is possible to capture the customers’ clicking behavior in what is termed click-stream data. However, click-stream data include everything the customer does. In the middle of the order stream there may be data for clicks on the news, e-mail, instant chat, and the weather. Although all of these data might be useful for a study of consumer computer behavior, they will be overwhelming if all we want to know is how customers respond to an ad located on the screen. Because the data are too fine, the data analysts must throw millions and millions of clicks away before they can proceed.
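A minimal sketch of discarding the irrelevant clicks, assuming the click-stream has already been loaded into a table; the column names and page labels are made up for illustration:

```python
import pandas as pd

# Hypothetical click-stream extract; pages and columns are assumptions.
clicks = pd.DataFrame({
    "session_id": [1, 1, 1, 2, 2],
    "page":       ["order_entry", "news", "order_entry", "weather", "order_entry"],
    "element":    ["ad_banner", "headline", "submit_button", "forecast", "ad_banner"],
})

# Keep only clicks on the order entry page before studying ad response.
order_clicks = clicks[clicks["page"] == "order_entry"]
print(order_clicks)
```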

Data can also be too coarse. A file of order totals cannot be used for a market basket analysis, which identifies items that are commonly purchased together. Market basket analyses require item-level data; we need to know which items were purchased with which others. This doesn’t mean the order total data are useless; they may be adequate for other analyses, but they won’t do for a market basket analysis.

If the data are too fine, they can be made coarser by summing and combining. An analyst or a computer program can easily sum and combine such data. If the data are too coarse, however, they cannot be separated into their constituent parts.
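A minimal sketch of rolling fine-grained, item-level data up to order totals; the tables and column names are assumptions, and nothing in the result lets us go back the other way:

```python
import pandas as pd

# Hypothetical item-level order lines; column names are assumptions.
order_lines = pd.DataFrame({
    "order_id": [1, 1, 2, 2, 2],
    "item":     ["milk", "bread", "milk", "eggs", "cereal"],
    "price":    [3.50, 2.25, 3.50, 4.00, 5.75],
})

# Item-level data can always be summed up to order totals...
order_totals = order_lines.groupby("order_id", as_index=False)["price"].sum()
print(order_totals)

# ...but order totals alone cannot be split back into the item detail
# needed for a market basket analysis.
```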

The final problem listed in Figure 13-4 concerns data volume. We can have an excess of columns, rows, or both. Suppose that we want to know which attributes influence customers’ responses to a promotion. Between customer data stored within the organization and customer data that can be purchased (see Figure 13-5), we might have a hundred or more different attributes, or columns, to consider. How do we select among them? Because of a phenomenon called the curse of dimensionality, the more attributes there are, the easier it is to build a model that fits the sample data but that is worthless as a predictor. For this and other reasons, the number of attributes should be reduced, and one of the major activities in data mining concerns the efficient and effective selection of variables.
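As one illustration of trimming the attribute list, here is a minimal sketch using scikit-learn’s univariate feature selection on made-up data; the data, the choice of k = 10, and the f_classif scoring function are all assumptions, not a recommendation:

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

# Hypothetical example: 200 customers, 100 candidate attributes, and a 0/1
# promotion-response label. The values are random and purely illustrative.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 100))
y = rng.integers(0, 2, size=200)

# Keep only the 10 attributes most associated with the response,
# reducing the risk of a model that merely fits noise in the sample.
selector = SelectKBest(score_func=f_classif, k=10)
X_reduced = selector.fit_transform(X, y)
print(X_reduced.shape)  # (200, 10)
```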

Figure 13-5 AmeriLINK Sells Data on 230+ Million Americans
• Name, Address, Phone
• Age, Gender
• Ethnicity, Religion
• Income
• Education
• Marital Status, Life Stage
• Height, Weight, Hair and Eye Color
• Spouse’s Name, Birth Date, etc.
• Kids’ Names and Birth Dates
• Voter Registration
• Home Ownership
• Vehicles
• Magazine Subscriptions
• Catalog Orders
• Hobbies
• Attitudes

Finally, we may have too many instances, or rows, of data. Suppose that we want to analyze click-stream data on CNN.com. How many clicks does this site receive per month? Millions upon millions! To meaningfully analyze such data, we need to reduce the number of instances. A good solution to this problem is statistical sampling. However, developing a reliable sample requires specialized expertise and information system tools.
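A minimal sketch of drawing a simple random sample of rows with pandas; the table and the 1 percent sampling fraction are assumptions, and, as noted above, a production sampling design calls for statistical expertise beyond this:

```python
import pandas as pd

# Hypothetical click-stream table with millions of rows; a small stand-in here.
clicks = pd.DataFrame({"click_id": range(1_000_000)})

# Draw a 1% simple random sample so that analysis is tractable.
# A fixed random_state makes the sample reproducible.
sample = clicks.sample(frac=0.01, random_state=42)
print(len(sample))  # roughly 10,000 rows
```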