Archive for the ‘Search Engines’ Category

The BM25 Weighting Scheme

Monday, June 2nd, 2008

http://en.wikipedia.org/wiki/BM25

http://xapian.org/docs/bm25.html

Okapi BM25 is a ranking function used by search engines to rank matching documents according to their relevance to a given search query. It is based on the probabilistic retrieval framework developed in the 1970s and 1980s by Stephen E. Robertson, Karen Spärck Jones, and others.

The name of the actual ranking function is BM25. To set the right context, however, it is usually referred to as “Okapi BM25”, since the Okapi information retrieval system, implemented at London’s City University in the 1980s and 1990s, was the first system to implement this function.

BM25 and its newer variants, such as BM25F (a version of BM25 that can take document structure and anchor text into account), represent state-of-the-art retrieval functions for document retrieval tasks such as Web search.
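To make the ranking idea concrete, here is a minimal sketch of BM25 scoring in Python. It is not taken from the pages linked above. The function name `bm25_scores`, the defaults `k1=1.2` and `b=0.75`, and the "+1" inside the logarithm (a common variant that keeps the IDF non-negative) are all assumptions made for illustration. Documents and the query are assumed to be already tokenized into lists of terms.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Return one BM25 score per document (illustrative sketch only).

    query: list of terms; docs: list of non-empty token lists.
    k1 and b are the usual free parameters; the defaults are assumed.
    """
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs

    # Document frequency: in how many documents each term occurs.
    df = Counter()
    for d in docs:
        df.update(set(d))

    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            # IDF with +1 inside the log so it never goes negative (assumed variant).
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            # Term frequency saturates via k1 and is normalized by document length via b.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = [
    ["okapi", "bm25", "ranking"],
    ["levenshtein", "edit", "distance"],
    ["bm25", "bm25", "weighting", "scheme"],
]
scores = bm25_scores(["bm25"], docs)
```

In this toy collection the second document does not contain the query term and scores zero, while the third document, which mentions the term twice, scores higher than the first.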

Levenshtein Distance

Monday, June 2nd, 2008

http://en.wikipedia.org/wiki/Levenshtein_distance

In information theory and computer science, the Levenshtein distance is a metric for measuring the amount of difference between two sequences (i.e., the so-called edit distance). The Levenshtein distance between two strings is the minimum number of operations needed to transform one string into the other, where an operation is an insertion, deletion, or substitution of a single character. A generalization of the Levenshtein distance, the Damerau-Levenshtein distance, also allows the transposition of two adjacent characters as an operation.

The metric is named after Vladimir Levenshtein, who considered this distance in 1965.[1] It is often used in applications that need to determine how similar, or different, two strings are, such as spell checkers.
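The definition above can be sketched with the standard dynamic-programming approach (often called the Wagner-Fischer algorithm), keeping only two rows of the table at a time. The function name `levenshtein` is chosen here for illustration. This version covers insertion, deletion, and substitution only, not the transposition used by Damerau-Levenshtein.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn string a into string b."""
    # Keep the shorter string as the inner dimension to save memory.
    if len(a) < len(b):
        a, b = b, a

    # previous[j] = distance between the empty prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance between a[:i] and the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


print(levenshtein("kitten", "sitting"))  # prints 3
```

A spell checker could use such a function to rank dictionary words by their distance to a misspelled input, suggesting the closest ones first.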

tf–idf (term frequency–inverse document frequency)

Monday, June 2nd, 2008

http://xapian.org/docs/bm25.html

The tf–idf weight (term frequency–inverse document frequency) is a weight often used in information retrieval and text mining. This weight is a statistical measure used to evaluate how important a word is to a document in a collection or corpus. The importance increases proportionally to the number of times a word appears in the document but is offset by the frequency of the word in the corpus. Variations of the tf–idf weighting scheme are often used by search engines as a central tool in scoring and ranking a document’s relevance given a user query.
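As a rough illustration of the weighting described above, the following sketch computes one common tf–idf variant: term frequency as the count divided by document length, and inverse document frequency as the natural logarithm of the number of documents over the number of documents containing the term. Many other variants exist, and the function name `tf_idf` and the tokenized toy documents are assumptions made for this example.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return, for each tokenized document, a dict mapping term -> tf-idf weight.

    Uses tf = count / document length and idf = ln(N / df); this is one
    common formulation among several, chosen here for simplicity.
    """
    n_docs = len(docs)

    # Document frequency: in how many documents each term occurs.
    df = Counter()
    for d in docs:
        df.update(set(d))

    weights = []
    for d in docs:
        counts = Counter(d)
        weights.append({
            term: (count / len(d)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


docs = [
    ["search", "engine", "ranking"],
    ["search", "engine", "index"],
    ["search", "query"],
]
weights = tf_idf(docs)
```

Here "search" occurs in every document, so its weight is zero everywhere, while "ranking", which occurs in only one document, receives a higher weight in that document than the more widespread "engine".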