Web Search Engines: Key Concepts and Evaluation Metrics

What is Web Search?

Web search provides access to heterogeneous, distributed information that is publicly available on the World Wide Web.

  • Key Elements: User, Web, Crawler/Spider, Indexer, Ads, Query Processor

The spider builds the corpus, collecting web pages recursively. The indexer creates inverted indexes, with various policies regarding which words are indexed, capitalization, and support for Unicode. The query processor serves query results. The front end handles query reformulation, word stemming, and optimization for Booleans. The back end finds matching documents and ranks them.

Query Processing

Query processing involves a semantic analysis of the query, determining the language, filtering words, and identifying specific query types (personalities, cities, medical information, stock quotes, news, company information). It also determines the user’s location, the target location of the query, remembers previous queries, and maintains a user profile.


The Web

Characters in a language are encoded such that each character is paired with a number.

There are 16,000-51,000 content types. What to do with them? How to parse them? The Apache Tika toolkit detects and extracts metadata and text content from various documents using existing parser libraries. Tika unifies these parsers under a single interface. Tika is useful for search engine indexing.


Search Engine Evaluation

Precision and Recall

Precision = #(relevant data items retrieved) / #(all retrieved items)

Recall = #(relevant items retrieved) / #(all relevant items)

RelevantNonrelevant
RetrievedTPFP
Not RetrievedFNTN

The accuracy of a search engine is defined as the fraction of these classifications that are correct: (TP + TN) / (TP + FP + FN + TN)


dOmkgaSBpIGmgmwb+Pxa8pKi4HKWNAAAAAElFTkSuQmCC


You can get high recall (but low precision) by retrieving all documents for all queries. In a good system, precision decreases as the number of documents retrieved (or recall) increases.


Three Means: Arithmetic, Geometric, Harmonic

The geometric mean is the nth root of the product of n numbers.

The harmonic mean tends strongly towards the least element of the list, making it useful in analyzing search engine results.

To find the harmonic mean of a set of n numbers:

  1. Add the reciprocals of the numbers in the set.
  2. Divide the sum by n.
  3. Take the reciprocal of the result.

Example: 3, 6, 9, 12

  • Arithmetic mean: 7.5
  • Geometric mean: nth-root(3*6*9*12) = 4th-root(1944) = 6.64
  • Harmonic mean: (1/3 + 1/6 + 1/9 + 1/12) = 0.17, 1 / 0.17 = 5.88

The harmonic mean of the precision and the recall is often used as an aggregated performance score for the evaluation of algorithms and systems. This is called the F-score.

+Bz2kM7Sgzp4sAAAAAElFTkSuQmCC


Mean Average Precision (MAP)

Mean Average Precision (MAP) for a set of queries is the mean of the average precision scores for each query.

g6jxxDA6GB0EDPGgjQ7ll18WBoIDQQGhi+Bv4Xo1DTm5pryVkAAAAASUVORK5CYII=

Summarize rankings from multiple queries by averaging average precision. This is the most commonly used measure in research papers. It assumes the user is interested in finding many relevant documents for each query and requires many relevance judgments in the text collection. Note: if a relevant document never gets retrieved, we assume the precision corresponding to that relevant document to be zero. Each query counts equally.

Problems with Precision and Recall

Precision and recall should be averaged over large document collections and query ensembles. They need human relevance assessments, and assessments have to be binary. They are heavily skewed by collection/authorship.


Discounted Cumulative Gain (DCG)

The premise of DCG is that highly relevant documents appearing lower in a search result list should be penalized, as the graded relevance value is reduced logarithmically, proportional to the position of the result.

The DCG accumulated at a particular rank position p is defined as:
f9Hp2Tc85i3XAAAAABJRU5ErkJggg==

where rel at i is the graded relevance of the result at position i.

  • Gain is accumulated starting at the top of the ranking and may be reduced, or discounted, at lower ranks.
  • Typical discount is 1/log(rank).
    • With base 2, the discount at rank 4 is 1/2, and at rank 8, it is 1/3.
  • An alternative formulation of DCG places stronger emphasis on retrieving relevant documents.


XDNSRKgM89AAAAAElFTkSuQmCC

(alternative form)

We want high weights for high-rank documents because searchers are likely to inspect them, and low weights for low-rank documents that searchers are unlikely to ever see. The discount factor is commonly chosen as log2(rank+1) and is used to divide the relevance grade. Using a logarithm for the position penalty makes the decay effect more gradual compared to using the position itself.


Search Engine Testing

Search engines have test collections of queries and hand-ranked results.

Recall is difficult to measure on the web. Search engines often use precision at top k positions or measures that reward you more for getting rank 1 right than for getting rank 10 right.

Search engines also use non-relevance-based measures, such as click-through on the first result (not reliable if you look at a single click-through, but pretty reliable in the aggregate). They also conduct studies of user behavior in the lab and A/B testing.


Google’s 4-Step Evaluation

  1. Precision Evaluations: People use the guidelines to rate search results manually.
  2. Side-by-side experiments: People are shown two different sets of search results and asked which they prefer.
  3. Live Traffic Experiments: The algorithm is altered for a small number of users.
  4. Full Launch: Final analysis by Google engineers.