
27.6 Evaluation Measures of Search Relevance

Without proper evaluation techniques, one cannot compare and measure the relevance of different retrieval models and IR systems.


Evaluation techniques of IR systems measure the topical relevance and user relevance. Topical relevance measures the extent to which the topic of a result matches the topic of the query. Mapping one’s information need with “perfect” queries is a cognitive task, and many users are not able to effectively form queries that would retrieve results more suited to their information need. Also, since a major chunk of user queries are informational in nature, there is no fixed set of right answers to show to the user. User relevance is a term used to describe the “goodness” of a retrieved result with regard to the user’s information need. User relevance includes other implicit factors, such as user perception, context, timeliness, the user’s environment, and current task needs. Evaluating user relevance may also involve subjective analysis and study of user retrieval tasks to capture some of the properties of implicit factors involved in accounting for users’ bias for judging performance.

In Web information retrieval, no binary classification decision is made on whether a document is relevant or nonrelevant to a query (whereas the Boolean (or binary) retrieval model uses this scheme, as we discussed in Section 27.2.1). Instead, a ranking of the documents is produced for the user. Therefore, some evaluation measures focus on comparing different rankings produced by IR systems. We discuss some of these measures next.

27.6.1 Recall and Precision

Recall and precision metrics are based on the binary relevance assumption (whether each document is relevant or nonrelevant to the query). Recall is defined as the number of relevant documents retrieved by a search divided by the total number of existing relevant documents. Precision is defined as the number of relevant documents retrieved by a search divided by the total number of documents retrieved by that search. Figure 27.5 is a pictorial representation of the terms retrieved vs. relevant and shows how search results relate to four different sets of documents.

Figure 27.5 Retrieved vs. relevant search results. (The figure is a 2 × 2 matrix of retrieved (yes/no) versus relevant (yes/no), partitioning the documents into hits (TP), false alarms (FP), false negatives (FN), and correct rejections (TN).)

The notation for Figure 27.5 is as follows:

TP: true positive

FP: false positive

FN: false negative

TN: true negative

The terms true positive, false positive, false negative, and true negative are generally used in any type of classification task to compare the given classification of an item with the desired correct classification. Using the term hits for the documents that truly or “correctly” match the user request, we can define:

Recall = |Hits|/|Relevant|

Precision = |Hits|/|Retrieved|
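As a concrete illustration of these two set-based measures, the following Python sketch computes them from the retrieved and relevant document sets; the function and variable names are illustrative, not part of any particular IR system.

def recall_precision(retrieved, relevant):
    """Set-based recall and precision for one query.

    retrieved: set of document ids returned by the search
    relevant:  set of document ids judged relevant to the query
    """
    hits = retrieved & relevant                 # true positives
    recall = len(hits) / len(relevant)          # |Hits| / |Relevant|
    precision = len(hits) / len(retrieved)      # |Hits| / |Retrieved|
    return recall, precision

# Hypothetical example: 3 of the 4 retrieved documents are relevant,
# and the collection contains 10 relevant documents in total.
r, p = recall_precision({10, 2, 3, 5}, {10, 2, 3, 7, 8, 16, 20, 21, 22, 23})
print(r, p)   # 0.3 0.75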

Recall and precision can also be defined in a ranked retrieval setting. The recall at rank position i for document d_i^q (denoted by r(i)), where d_i^q is the retrieved document at position i for query q, is the fraction of relevant documents from d_1^q to d_i^q in the result set for the query. Let the set of relevant documents from d_1^q to d_i^q in that set be S_i with cardinality |S_i|, and let |D_q| be the number of relevant documents for the query (so |S_i| ≤ |D_q|). Then:

Recall r(i) = |S_i|/|D_q|

The precision at rank position i or document d_i^q (denoted by p(i)) is the fraction of documents from d_1^q to d_i^q in the result set that are relevant:

Precision p(i) = |S_i|/i
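Both ranked measures can be computed in a single pass over the result list. The sketch below assumes the ranked document ids and the full relevant set D_q are given; the function name is illustrative.

def precision_recall_at_ranks(ranked_list, relevant):
    """p(i) and r(i) for every rank position i = 1 .. len(ranked_list).

    ranked_list: document ids in rank order (d_1^q, d_2^q, ...)
    relevant:    set of all relevant document ids D_q for the query
    """
    p, r = [], []
    hits = 0                                # |S_i|: relevant docs seen so far
    for i, doc in enumerate(ranked_list, start=1):
        if doc in relevant:
            hits += 1
        p.append(hits / i)                  # p(i) = |S_i| / i
        r.append(hits / len(relevant))      # r(i) = |S_i| / |D_q|
    return p, r

Applied to the ranking in Table 27.2, this reproduces the Precision(i) and Recall(i) columns shown there.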

Table 27.2 illustrates the p(i), r(i), and average precision (discussed in the next section) metrics. It can be seen that recall can be increased by presenting more results to the user, but this approach runs the risk of decreasing the precision.

Table 27.2 Precision and Recall for Ranked Retrieval

Doc. No.   Rank Position i   Relevant   Precision(i)   Recall(i)
10          1                Yes        100%           10%
 2          2                Yes        100%           20%
 3          3                Yes        100%           30%
 5          4                No          75%           30%
17          5                No          60%           30%
34          6                No          50%           30%
            7                Yes         57.1%         40%
33          8                Yes         62.5%         50%
45          9                No          55.6%         50%
16         10                Yes         60%           60%


In this example, the number of relevant documents for the query is 10. The rank position and the relevance of each individual document are shown. The precision and recall values can be computed at each position within the ranked list, as shown in the last two columns.

27.6.2 Average Precision

Average precision is computed based on the precision at each relevant document in the ranking. This measure is useful for computing a single precision value to compare different retrieval algorithms on a query q.

P_avg = ( Σ_{d_i^q ∈ D_q} p(i) ) / |D_q|

Consider the sample precision values of relevant documents in Table 27.2. The average precision (P_avg value) for the example in Table 27.2 is (P(1) + P(2) + P(3) + P(7) + P(8) + P(10))/6 = 79.93 percent (only relevant documents are considered in this calculation). Many good algorithms tend to have high top-k average precision for small values of k, with correspondingly low values of recall.
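A minimal sketch of this calculation, using only the relevance pattern from Table 27.2; the function name and the divisor argument are illustrative choices (the example above divides by the number of relevant documents that appear in the ranking).

def average_precision(ranked_relevance, divisor=None):
    """Average of p(i) taken at each relevant document in the ranking.

    ranked_relevance: list of booleans; True means the document at rank
                      position 1, 2, 3, ... is relevant
    divisor:          what to divide the summed precisions by; the example
                      above uses the number of relevant documents that
                      appear in the ranking
    """
    precisions = []
    hits = 0
    for i, rel in enumerate(ranked_relevance, start=1):
        if rel:
            hits += 1
            precisions.append(hits / i)     # p(i) at each relevant document
    if not precisions:
        return 0.0
    return sum(precisions) / (divisor if divisor is not None else len(precisions))

# Relevance pattern from Table 27.2 (rank positions 1 through 10):
pattern = [True, True, True, False, False, False, True, True, False, True]
print(average_precision(pattern))           # ~0.799, about 79.9 percent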

27.6.3 Recall/Precision Curve

A recall/precision curve can be drawn based on the recall and precision values at each rank position, where the x-axis is the recall and the y-axis is the precision. Instead of using the precision and recall at each rank position, the curve is commonly plotted using recall levels r(i) at 0 percent, 10 percent, 20 percent, …, 100 percent. The curve usually has a negative slope, reflecting the inverse relationship between precision and recall.
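The sketch below produces the points of such a curve at the standard recall levels. It uses the common convention of interpolated precision (taking, at each recall level, the maximum precision achieved at or beyond that level), a convention from the IR literature that is not spelled out above; the function name is illustrative.

def precision_at_recall_levels(p, r, levels=None):
    """Points of a recall/precision curve at standard recall levels.

    p, r:   the p(i) and r(i) lists for a ranked result list
    levels: recall levels on the x-axis (default 0.0, 0.1, ..., 1.0)

    At each level x, the precision is taken as the maximum p(i) over all
    rank positions whose recall r(i) is at least x; levels beyond the
    recall actually reached get 0.0.
    """
    if levels is None:
        levels = [x / 10 for x in range(11)]
    curve = []
    for level in levels:
        candidates = [pi for pi, ri in zip(p, r) if ri >= level]
        curve.append((level, max(candidates) if candidates else 0.0))
    return curve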

27.6.4 F-Score

F-score (F) is the harmonic mean of the precision (p) and recall (r) values. High precision is achieved almost always at the expense of recall and vice versa. It is a matter of the application’s context whether to tune the system for high precision or high recall. F-score is a single measure that combines precision and recall to compare different result sets:

F = 2pr / (p + r)

One of the properties of harmonic mean is that the harmonic mean of two numbers tends to be closer to the smaller of the two. Thus F is automatically biased toward the smaller of the precision and recall values. Therefore, for a high F-score, both precision and recall must be high.
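A small sketch of this measure and of the harmonic mean’s pull toward the smaller value (the function name is illustrative):

def f_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# The harmonic mean is pulled toward the smaller of the two values:
print(f_score(0.9, 0.1))   # 0.18, much closer to 0.1 than to 0.9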
