Text-centric vs. data-centric XML retrieval

Preliminary draft © 2008 Cambridge UP. 10 XML retrieval

                   content only   full structure   improvement
precision at 5     0.2000         0.3265           63.3%
precision at 10    0.1820         0.2531           39.1%
precision at 20    0.1700         0.1796            5.6%
precision at 30    0.1527         0.1531            0.3%

Table 10.4 A comparison of content-only and full-structure search in INEX 2003/2004.

language-model-based system (cf. Chapter 12) that is evaluated on a subset of CAS topics from INEX 2003 and 2004. The evaluation metric is precision at k as defined in Chapter 8 (page 161). The discretization function used for the evaluation maps highly relevant elements (roughly corresponding to the 3E elements defined for Q) to 1 and all other elements to 0. The content-only system treats queries and documents as unstructured bags of words. The full-structure model ranks elements that satisfy structural constraints higher than elements that do not. For instance, for the query in Figure 10.3 an element that contains the phrase "summer holidays" in a section will be rated higher than one that contains it in an abstract.

The table shows that structure helps increase precision at the top of the results list. There is a large increase of precision at k = 5 and at k = 10. There is almost no improvement at k = 30. These results demonstrate the benefits of structured retrieval. Structured retrieval imposes additional constraints on what to return, and documents that pass the structural filter are more likely to be relevant. Recall may suffer because some relevant documents will be filtered out, but for precision-oriented tasks structured retrieval is superior.
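The improvement column of Table 10.4 follows from the two precision columns. The sketch below is illustrative only: it copies the published precision values, and the function names `precision_at_k` and `relative_improvement` are assumptions introduced here, not part of any INEX tooling. It shows how precision at k is computed from 0/1 relevance judgments and how the relative gain of the full-structure run over the content-only run is derived.

```python
# Precision values copied from Table 10.4 (INEX 2003/2004 runs):
# k -> (content-only precision, full-structure precision)
TABLE_10_4 = {
    5: (0.2000, 0.3265),
    10: (0.1820, 0.2531),
    20: (0.1700, 0.1796),
    30: (0.1527, 0.1531),
}


def precision_at_k(relevance, k):
    """Fraction of the top k results judged relevant.

    `relevance` is a ranked list of 0/1 judgments, i.e. the output of a
    discretization step that maps highly relevant elements to 1 and all
    others to 0. Ranks beyond the end of the list count as non-relevant.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    return sum(relevance[:k]) / k


def relative_improvement(baseline, improved):
    """Percentage gain of `improved` over `baseline`."""
    return (improved - baseline) / baseline * 100.0


for k, (content_only, full_structure) in TABLE_10_4.items():
    gain = relative_improvement(content_only, full_structure)
    print(f"precision at {k}: {gain:.1f}% improvement")
```

The computed gains agree with the published improvement column up to rounding, and they make the trend explicit: the benefit of structure is concentrated at the top of the ranking.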

10.5 Text-centric vs. data-centric XML retrieval

In the type of structured retrieval we cover in this chapter, XML structure serves as a framework within which we match the text of the query with the text of the XML documents. This exemplifies a system that is optimized for text-centric XML. While both text and structure are important, we give higher priority to text. We do this by adapting unstructured retrieval methods to handling additional structural constraints. The premise of our approach is that XML document retrieval is characterized by (i) long text fields (e.g., sections of a document), (ii) inexact matching, and (iii) relevance-ranked results. Relational databases do not deal well with this use case.

In contrast, data-centric XML mainly encodes numerical and non-text attribute-value data. When querying data-centric XML, we want to impose exact match conditions in most cases. This puts the emphasis on the structural aspects of XML documents and queries. An example is:


    Find employees whose salary is the same this month as it was 12 months ago.

This query requires no ranking. It is purely structural and an exact matching of the salaries in the two time periods is probably sufficient to meet the user’s information need.

Text-centric approaches are appropriate for data that are essentially text documents, marked up as XML to capture document structure. This is becoming a de facto standard for publishing text databases since most text documents have some form of interesting structure (paragraphs, sections, footnotes, etc.). Examples include assembly manuals, issues of journals, Shakespeare’s collected works and newswire articles.

Data-centric approaches are commonly used for data collections with complex structures that mainly contain non-text data. A text-centric retrieval engine will have a hard time with proteomic data in bioinformatics or with the representation of a city map that (together with street names and other textual descriptions) forms a navigational database.

Two other types of queries that are difficult to handle in a text-centric structured retrieval model are joins and ordering constraints. The query for employees with unchanged salary requires a join. The following query imposes an ordering constraint:

    Retrieve the chapter of the book Introduction to algorithms that follows the chapter Binomial heaps.

This query relies on the ordering of elements in XML, in this case the ordering of chapter elements underneath the book node. There are powerful query languages for XML that can handle numerical attributes, joins and ordering constraints. The best known of these is XQuery, a language proposed for standardization by the W3C. It is designed to be broadly applicable in all areas where XML is used. Due to its complexity, it is challenging to implement an XQuery-based ranked retrieval system with the performance characteristics that users have come to expect in information retrieval.
This is currently one of the most active areas of research in XML retrieval.

Relational databases are better equipped to handle many structural constraints, particularly joins (but ordering is also difficult in a database framework: the tuples of a relation in the relational calculus are not ordered). For this reason, most data-centric XML retrieval systems are extensions of relational databases (see the references in Section 10.6). If text fields are short, exact matching meets user needs, and retrieval results in the form of unordered sets are acceptable, then using a relational database for XML retrieval is appropriate.
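The two structural queries discussed in this section, the salary join and the chapter-ordering constraint, can be made concrete with a small sketch. It uses Python's standard `xml.etree.ElementTree` module rather than XQuery, and everything about the data is an assumption made for illustration: the element and attribute names (`payroll`, `employee`, `salary`, `month`, `book`, `chapter`, `title`), the toy documents, and the helper functions `unchanged_salary` and `chapter_after`. Both queries use exact matching and return unranked results.

```python
import xml.etree.ElementTree as ET

# Hypothetical data-centric documents; the schema is invented for this sketch.
PAYROLL_XML = """<payroll>
  <employee id="e1">
    <salary month="2007-03">5000</salary>
    <salary month="2008-03">5000</salary>
  </employee>
  <employee id="e2">
    <salary month="2007-03">4200</salary>
    <salary month="2008-03">4500</salary>
  </employee>
</payroll>"""

BOOK_XML = """<book title="Introduction to algorithms">
  <chapter title="B-trees"/>
  <chapter title="Binomial heaps"/>
  <chapter title="Fibonacci heaps"/>
</book>"""


def unchanged_salary(payroll, this_month, year_ago):
    """Return ids of employees whose salary is identical in both months.

    This is the join: each employee's salary records for two periods are
    paired up and compared with an exact-match condition.
    """
    matches = []
    for employee in payroll.iter("employee"):
        by_month = {s.get("month"): int(s.text) for s in employee.findall("salary")}
        now, before = by_month.get(this_month), by_month.get(year_ago)
        if now is not None and now == before:
            matches.append(employee.get("id"))
    return matches


def chapter_after(book, title):
    """Return the title of the chapter that follows the named chapter.

    Relies on document order: findall() yields the chapter children of the
    book node in the order in which they appear in the XML.
    """
    chapters = book.findall("chapter")
    for current, following in zip(chapters, chapters[1:]):
        if current.get("title") == title:
            return following.get("title")
    return None  # named chapter is missing or is the last chapter


payroll = ET.fromstring(PAYROLL_XML)
book = ET.fromstring(BOOK_XML)
print(unchanged_salary(payroll, "2008-03", "2007-03"))
print(chapter_after(book, "Binomial heaps"))
```

Neither function computes a relevance score: an employee either satisfies the join condition or does not, and at most one chapter follows the named one. That is the sense in which such queries are structural rather than text-centric.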

10.6 References and further reading