Advanced IR Models 1 Advanced Information Retrieval Models ( 高级信息检索模型 ) Ben He benhe@gucas.ac.cn Modern Information Retrieval Graduate University of Chinese Academy of Sciences Advanced IR Models 2 Summary ● 1960s – early 1990s ● Boolean model ● Vector space model ● Models based on Fuzzy set ● Latent semantic indexing ● 1990s – present ● Probabilistic model (in narrow sense) ， ( 狭义 ) 概率模型 – Okapi BM25, 1994 ● Language model ( 语言模型 ), 1998 ● Divergence from Randomness (DFR) models ( 随即距离模型 ), 2002 In general sense, all the above three models belong to the category of probabilistic models Advanced IR Models 3 Probabilistic View of IR ● A collection is the population ( 总体 ): texts, paragraphs, phrases, or a bag of words ● A document is a sample ( 样本 ) from the population ● A query is a sample from the relevant documents to express the user’s information need Advanced IR Models 4 Okapi's BM25 ● Okapi: an IR system developed at the City University in London ● BM25: A classical probabilistic model proposed in early 1990s by Stephen E. Robertson ● Still widely recognized as one of the best IR models at present Advanced IR Models 5 Probabilistic Interpretation of Relevance ● Given a user query q and a document d, the probabilistic model (in narrow sense) tries to estimate the probability that the user will find the document d interesting (i.e., relevant) ● The probabilistic ranking principle (PRP): P(d ,R∣Q)= P(Q∣d ,R) P(R) P(Q) P(Q|d, R): probability a user would form this query Q given document d is relevant (R) P(R): prior probability that a document is relevant P(Q): constant for all documents; may be discarded in ranking Advanced IR Models 6 Probabilistic Ranking Principle (PRP) ● “If a reference retrieval system’s response to each request is a ranking of the documents in the collection in order of decreasing probability of usefulness to the user (. . . ) then the overall effectiveness will be the best that is obtainable on the basis of the data”. Robertson 1977 Advanced IR Models 7 Basic Probabilistic Weighting ● The basic weighting function of probabilistic model may be expressed as follows: w(x)=log2 P(x∣R) P(0∣̄R) P(x∣̄R) P(0∣R) where x is a vector of information of the document, 0 is a reference vector representing a zero-weighting document Robertson S.E., Van Rijsbergen C.J. & Porter M.F. Probabilistic models of indexing and searching. Information Retrieval Research (pp.35–56), 1981. Advanced IR Models 8 Robertson Sparck-Jones (RSJ) Formula ● If p = P(term present | R) and q = (term present | R), the resulting function is given as follows: w=log 2 p(1−q) q(1− p) “with a suitable estimation method, this becomes the RSJ formula: ” w(1)=log2 (r+ 0.5)/(R−r+ 0.5) (df −r+ 0.5)/(N −df −R+ r+ 0.5) R: number of relevant docs, r: number of relevant docs containing the term ● S. E. Robertson and S. Walker. Some simple effective approximations to the 2-Poisson model for probabilistic weighted retrieval. ACM SIGIR 1994. Okapi's IDF Advanced IR Models 9 The 2-Poisson Model ● The RSJ formula considers only the presence/absence of query term t in documents ● A better approach is to take within-document frequency (tf) into account ● The 2-Poisson model: assume Poisson distribution of t in ● the entire document collection ● the elite set: relevant documents, usually approximated by documents containing t Harter S.P. A probabilistic approach to automatic keyword indexing. Journal of the American Society for Information Science 1975; 26:197–206 and 280–289 Advanced IR Models 10 The 2-Poisson Model ● By considering tf, the RSJ formula is replaced by: w=log p(1−q) q(1− p)=log ptf q0 qtf p0 ● Combining the 2-Poisson model with the above formula, we obtain: λ and μ are the Poisson means for tf in the elite and non-elite sets for t respectively, p' = (doc elite for t | R), and q' is the corresponding probability for R. ● Estimation problem: 4 parameters, none of which has direct evidence ● eliteness being a hidden variable Advanced IR Models 11 A Rough Model for Term Frequency ● A rearrangement of the 2-Poisson model: μ is smaller than λ. As tf → ∞, (μ/λ)^tf goes to zero, so those components drop out. e^(μ−λ) will be small, so the approximation is: w=log p' (1−q' ) q' (1− p' )∼log p(1−q) q(1− p) ● Frequent words in the collection? Advanced IR Models 12 Estimation of the 2-Poisson Model ● Basic requirements ● It is zero for tf = 0; ● It increases monotonically with tf ; ● but to an asymptotic ( 渐近的 ) maximum; ● which approximates to the RSJ weight that would be given to a direct indicator of eliteness Advanced IR Models 13 A Simple Formulation ● Based on the above requirements, a simple formulation of the 2-Poisson model is: w= tf k 1+ tf w(1) ● Basic requirements ● It is zero for tf = 0; ● It increases monotonically with tf ; ● but to an asymptotic maximum; ● which approximates to the RSJ weight that would be given to a direct indicator of eliteness Advanced IR Models 14 Query Term Weighting ● Follows a similar formulation of weighting: w= qtf k 3+ qtf ⋅ tf k 1+ tf ⋅w(1) ● Takes the relative importance of a term in the query into consideration. Advanced IR Models 15 Document Length ● The urn model ● Documents: urns ● Term t: balls with the same color ● The tf distribution concerns the probability of observing tf balls with color t in d ● The Poisson approximation of the tf distribution assumes a uniform prior distribution ● P(d)=1/N ● Unrealistic in practice Advanced IR Models 16 Okapi's TF (S. E. Robertson 1990s) TF= k 1⋅tf tf + k 1(1−b+ b l d E(l)) ● Okapi's TF ● Normalizes tf by a saturation function ld: doc length E(l): expected doc length Advanced IR Models 17 The BM25 Formula w(t ,d )= qtf k 3+ qtf ⋅ k 1⋅tf tf + k 1(1−b+ b l d E(l)) ⋅w(1) w(1): the RJS weight with R=r=0 (no relevance information available) qtf: query term frequency tf: term frequency ld: document length N: number of documents df: document frequency E(l): the expected document length, estimated by the average document length k1, k3, b: free-parameters w(1)=log2 N −df + 0.5 df + 0.5 Advanced IR Models 18 Experiments ● BM25 has led to the best run in a number of TREC tasks ● TREC-4, 5 ad-hoc tasks ● Web tracks ● TREC-2009 relevance feedback track ● etc. ● It has proven effectiveness on large-scale datasets Advanced IR Models 19 Probabilistic Model (BM25) ● Pros ● A (somewhat) theoretical model ● Two-Poisson assumption is appropriate in most IR applications ● Cons ● Requires parameter tuning (b, k1 , k3 . . . ) ● Parameter sensitivity Even though, it is widely recognized as one of the best IR models in real world applications Advanced IR Models 20 Probabilistic Model: Reading List ● Harter S.P. A probabilistic approach to automatic keyword indexing. Journal of the American Society for Information Science 1975; 26:197–206 and 280–289 ● Robertson S.E., Van Rijsbergen C.J. & Porter M.F. Probabilistic models of indexing and searching. Information Retrieval Research (pp.35–56). Butterworths London, 1981 ● Robertson S.E. and Sparck Jones K. Relevance weighting of search terms. Journal of the American Society for Information Science 1976; 27:129–146 ● S. E. Robertson and S. Walker. Some simple effective approximations to the 2-Poisson model for probabilistic weighted retrieval. ACM SIGIR 1994. ● S. E. Robertson, S. Walker, M. M. Beaulieu, M. Gatford and A. Payne. Okapi at TREC-4. 1995.