27.7.5 Approaches to Web Content Analysis

The two main approaches to Web content analysis are (1) agent based (IR view) and (2) database based (DB view).

The agent-based approach involves the development of sophisticated artificial intelligence systems that can act autonomously or semi-autonomously on behalf of a particular user to discover and process Web-based information. Generally, agent-based Web analysis systems can be placed into the following three categories:

Intelligent Web agents are software agents that search for relevant information using characteristics of a particular application domain (and possibly a user profile) to organize and interpret the discovered information. For example, an intelligent agent might retrieve product information from a variety of vendor sites using only general information about the product domain.

Information Filtering/Categorization is another technique that uses Web agents to categorize Web documents. These Web agents use methods from information retrieval, together with semantic information based on the links among various documents, to organize documents into a concept hierarchy.

Personalized Web agents are another type of Web agent that uses the personal preferences of users to organize search results or to discover information of interest to a particular user. These preferences could be learned from previous user choices, or from other individuals who are considered to have similar preferences to the user.
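As a rough illustration of the personalized category, the following Python sketch learns a term-frequency profile from a user's previous choices and uses it to reorder search results. The function names (`tokenize`, `build_profile`, `rerank`), the sample strings, and the simple overlap score are all assumptions made for this example. Real agents would use richer models of preference.

```python
from collections import Counter

def tokenize(text):
    """Lowercase a string and keep its purely alphabetic words."""
    return [w for w in text.lower().split() if w.isalpha()]

def build_profile(chosen_docs):
    """Count term occurrences across documents the user chose earlier."""
    profile = Counter()
    for doc in chosen_docs:
        profile.update(tokenize(doc))
    return profile

def rerank(results, profile):
    """Sort results by summed profile weight of their distinct terms."""
    def score(doc):
        return sum(profile[t] for t in set(tokenize(doc)))
    # sorted() is stable, so ties keep their original search order.
    return sorted(results, key=score, reverse=True)

# Hypothetical history and result list, for illustration only.
history = ["cheap flights to lisbon", "lisbon hotel deals"]
results = ["python tutorial", "lisbon travel guide", "hotel booking tips"]
ranked = rerank(results, build_profile(history))
```

Here the result mentioning "lisbon" moves to the top because that term appears most often in the assumed history. Preferences borrowed from similar users could be folded in by merging their profiles into the same counter.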

The database-based approach aims to infer the structure of a Web site, or to transform a Web site so that it is organized as a database, making better information management and querying on the Web possible. This approach to Web content analysis primarily tries to model the data on the Web and integrate it so that more sophisticated queries than keyword-based search can be performed. This could be achieved by finding the schema of Web documents or by building a Web document warehouse, a Web knowledge base, or a virtual database.

The database-based approach may use a model such as the Object Exchange Model (OEM), which represents semistructured data by a labeled graph. The data in the OEM is viewed as a graph, with objects as the vertices and labels on the edges. Each object has an object identifier and a value that is either atomic (such as an integer, string, GIF image, or HTML document) or complex, in the form of a set of object references.
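The labeled-graph view described for the OEM can be sketched in a few lines of Python. This is only an illustrative data structure, not a standard OEM implementation: the class names `OEMObject` and `OEMGraph`, the `follow` helper, the `&n` identifiers, and the sample data are assumptions. A list of (label, OID) pairs stands in for the set of object references so that results come back in a predictable order.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple, Union

Atomic = Union[int, float, str, bytes]
# A complex value: labeled references of the form (edge label, target OID).
Complex = List[Tuple[str, str]]

@dataclass
class OEMObject:
    oid: str
    value: Union[Atomic, Complex]

    def is_atomic(self) -> bool:
        return not isinstance(self.value, list)

class OEMGraph:
    def __init__(self) -> None:
        self.objects: Dict[str, OEMObject] = {}

    def add(self, oid: str, value: Union[Atomic, Complex]) -> None:
        self.objects[oid] = OEMObject(oid, value)

    def children(self, oid: str, label: str) -> List[OEMObject]:
        """Objects reachable from oid over edges carrying the given label."""
        obj = self.objects[oid]
        if obj.is_atomic():
            return []
        return [self.objects[t] for (lbl, t) in obj.value if lbl == label]

    def follow(self, oid: str, path: List[str]) -> list:
        """Follow a sequence of edge labels and return the values reached."""
        frontier = [self.objects[oid]]
        for label in path:
            frontier = [c for o in frontier for c in self.children(o.oid, label)]
        return [o.value for o in frontier]

# Hypothetical data: a root object referencing two "book" objects.
g = OEMGraph()
g.add("&1", [("book", "&2"), ("book", "&3")])
g.add("&2", [("title", "&4"), ("year", "&5")])
g.add("&3", [("title", "&6")])
g.add("&4", "Database Systems")
g.add("&5", 2010)
g.add("&6", "Web Mining")

titles = g.follow("&1", ["book", "title"])
years = g.follow("&1", ["book", "year"])
```

The second book has no `year` edge, which is acceptable here: no schema is declared up front, so objects with the same label need not share the same structure.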

The main focus of the database-based approach has been on the use of multilevel databases and Web query systems. A multilevel database at its lowest level is a database containing primitive semistructured information stored in various Web repositories, such as hypertext documents. At the higher levels, metadata or generalizations are extracted from lower levels and organized in structured collections such as relational or object-oriented databases. In a Web query system, information about the content and structure of Web documents is extracted and organized using database-like techniques. Query languages similar to SQL can then be used to search and query Web documents. These languages combine structural queries, based on the organization of hypertext documents, with content-based queries.
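To make the combination of structural and content-based queries concrete, the sketch below loads a tiny, made-up collection of pages into an in-memory SQLite database and asks for pages that are linked from a given page (structure) and mention a given word (content). The table names `document` and `link`, their columns, and the sample rows are assumptions for illustration. Plain SQL stands in here for the SQL-like Web query languages mentioned above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE document (url TEXT PRIMARY KEY, title TEXT, body TEXT);
    CREATE TABLE link (src TEXT, dst TEXT);
""")

# Extracted content: one row per (hypothetical) Web document.
conn.executemany("INSERT INTO document VALUES (?, ?, ?)", [
    ("a.html", "Home", "welcome to the index page"),
    ("b.html", "Courses", "database systems and query languages"),
    ("c.html", "News", "campus events this week"),
    ("d.html", "Research", "semistructured database models"),
])

# Extracted structure: one row per hyperlink between documents.
conn.executemany("INSERT INTO link VALUES (?, ?)", [
    ("a.html", "b.html"), ("a.html", "c.html"), ("b.html", "d.html"),
])

# Structural condition: the page is linked from a.html.
# Content condition: the page body mentions "database".
rows = conn.execute("""
    SELECT d.url, d.title
    FROM link AS l JOIN document AS d ON d.url = l.dst
    WHERE l.src = ? AND d.body LIKE ?
    ORDER BY d.url
""", ("a.html", "%database%")).fetchall()
```

Only `b.html` satisfies both conditions: `d.html` also mentions "database" but is not linked directly from `a.html`. In terms of the multilevel picture, the two tables play the role of a higher, structured level derived from the raw hypertext below it.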