Longman Dictionary of Contemporary English

更新时间:2023-05-07 09:04:35 阅读: 评论:0

Multinomial Randomness Models for Retrieval with
Document Fields
Vassilis Plachouras1and Iadh Ounis2
1Yahoo!Rearch,Barcelona,Spain
2University of Glasgow,Glasgow,UK
,ounis@dcs.gla.ac.uk Abstract.Documentfields,such as the title or the headings of a document,
offer a way to consider the structure of documents for retrieval.Most of the pro-
pod approaches in the literature employ either a linear combination of scores
assigned to differentfields,or a linear combination of frequencies in the term
frequency normalisation component.In the context of the Divergence From Ran-
domness framework,we have a sound opportunity to integrate documentfields
in the probabilistic randomness model.This paper introduces novel probabilis-
tic models for incorporatingfields in the retrieval process using a multinomial
randomness model and its information theoretic approximation.The evaluation
results from experiments conducted with a standard TREC Web test collection
show that the propod models perform as well as a state-of-the-artfield-bad
weighting model,while at the same time,they are theoretically founded and more
extensible than currentfield-bad models.
1Introduction
Documentfields provide a way to incorporate the structure of a document in Information Retrieval(IR)models.In the context of HTML documents,the documentfields may correspond to the contents of particular HTML tags,such as the title,or the heading tags.The anchor text of the incoming hyperlinks can also be en as a documentfield. In the ca of email documents,thefields
may correspond to the contents of the email’s subject,date,or to the email address of the nder[9].It has been shown that using documentfields for Web retrieval improves the retrieval effectiveness[17,7].
The text and the distribution of terms in a particularfield depend on the function of thatfield.For example,the titlefield provides a conci and short description for the whole document,and terms are likely to appear once or twice in a given title[6].The anchor textfield also provides a conci description of the document,but the number of terms depends on the number of incoming hyperlinks of the document.In addition, anchor texts are not always written by the author of a document,and hence,they may enrich the document reprentation with alternative terms.
The combination of evidence from the differentfields in a retrieval model requires special attention.Robertson et al.[14]pointed out that the linear combination of scores, which has been the approach mostly ud for the combination offields,is difficult to interpret due to the non-linear relation between the assigned scores and the term frequencies in each of thefields.Hawking et al.[5]showed that the term frequency G.Amati,C.Carpineto,and G.Romano(Eds.):ECIR2007,LNCS4425,pp.28–39,2007.
c Springer-Verlag Berlin Heidelberg2007
Multinomial Randomness Models for Retrieval with Document Fields 29
normalisation applied to each field depends on the nature of the corresponding field.Zaragoza et al.[17]introduced a field-bad version of BM25,called BM25F,which applies term frequency normalisation and weighting of the fields independently.Mac-donald et al.[7]also introduced normalisation 2F in the Divergence From Randomness (DFR)framework [1]for performing independent term frequency normalisation and weighting of fields.In both cas of BM25F and the DFR models that employ normali-sation 2F,there is the assumption that the occurrences of terms in the fields follow the same distribution,becau the combination of fields takes place in the term frequency normalisation component,and not in the probabilistic weighting model.
In this work,we introduce weighting models,where the combination of evidence from the different fields does not take place in the term frequency normalisation part of the model,but instead,it constitutes an integral part of the probabilistic randomness model.We propo two DFR weighting models that combine the evidence from the different fields using a multinomial distribution,and its information theoretic approx-imation.We evaluate the performance of the introduced weighting models using the standard .Gov TREC Web test collection.We show that the models perform as well as the state-of-the-art model field-bad PL2F,while at the same time,they employ a theoretically fou
nded and more extensible combination of evidence from fields.
The remainder of this paper is structured as follows.Section 2provides a description of the DFR framework,as well as the related field-bad weighting models.Section 3introduces the propod multinomial DFR weighting models.Section 4prents the evaluation of the propod weighting models with a standard Web test collection.Sec-tions 5and 6clo the paper with a discussion related to the propod models and the obtained results,and some concluding remarks drawn from this work,respectively.2Divergence from Randomness Framework and Document Fields The Divergence From Randomness (DFR)framework [1]generates a family of prob-abilistic weighting models for IR.It provides a great extent of flexibility in the n that the generated models are modular,allowing for the evaluation of new assumptions in a principled way.The remainder of this ction provides a description of the DFR framework (Section 2.1),as well as a brief description of the combination of evidence from different document fields in the context of the DFR framework (Section 2.2).
2.1DFR Models
The weighting models of the Divergence From Randomness framework are bad on combinations
of three components:a randomness model RM ;an information gain model GM ;and a term frequency normalisation model.
Given a collection D of documents,the randomness model RM estimates the probability P RM (t ∈d |D )of having tf occurrences of a term t in a document d ,and the importance of t in d corresponds to the informative content −log 2(P RM (t ∈d |D )).Assuming that the sampling of terms corresponds to a quence of independent Bernoulli trials,the randomness model RM is the binomial distribution:P B (t ∈d |D )= T F tf
p tf (1−p )T F −tf (1)
30V .Plachouras and I.Ounis
where TF is the frequency of t in the collection D ,p =1N is a uniform prior probability
that the term t appears in the document d ,and N is the number of documents in the collection D .A limiting form of the binomial distribution is the Poisson distribution P :
P B (t ∈d |D )≈P P (t ∈d |D )=λtf tf !e −λwhere λ=T F ·p =T F
N (2)
The information gain model GM estimates the informative content 1−P risk of the probability P risk that a term t is a good descriptor for a document.When a term t appears many times in a document,then there is very low risk in assuming that t describes the document.The information gain,however,from any future occurrences of t in d is lower.For example,the term ‘evaluation’is likely to have a high frequency in a document about the evaluation of IR systems.After the first few occurrences of the term,however,each additional occurrence of the term ‘evaluation’provides a diminishing additional amount of information.One model to compute the probability P risk is the Laplace after-effect model:
P risk =tf tf +1
(3)P risk estimates the probability of having one more occurrence of a term in a document,after having en tf occurrences already.
The third component of the DFR framework is the term frequency normalisation model,which adjusts the frequency tf of the term t in d ,given the length l of d and the average document length l in D .Normalisation 2assumes a decreasing density function of the normalid term frequency with respect to the document length l .The normalid term frequency tfn is given as follows:
tfn =tf ·log 2(1+c ·l l )(4)
where c is a a tunable parameter.Normalisation 2is employed in the framework by replacing tf in Equations (2)and (3)with tfn .
The relevance score w d,q of a document d for a query q is given by:
w d,q =
t ∈q
qtw ·w d,t where w d,t =(1−P risk )·(−log 2P RM )(5)
where w d,t is the weight of the term t in document d ,qtw =qtf qtf max ,qtf is the frequency of t in the query q ,and qtf max is the maximum qtf in q .If P RM is estimated
using the Poisson randomness model,P risk is estimated using the Laplace after-effect model,and tfn is computed according to normalisation 2,then the resulting weight-ing model is denoted
by PL2.The factorial is approximated using Stirling’s formula:tf !=√2π·tf
tf +0.5e −tf .The DFR framework generates a wide range of weighting models by using different randomness models,information gain models,or term frequency normalisation models.For example,the next ction describes how normalisation 2is extended to handle the normalisation and weighting of term frequencies for different document fields.
Multinomial Randomness Models for Retrieval with Document Fields31 2.2DFR Models for Document Fields
The DFR framework has been extended to handle multiple documentfields,and to apply per-field term frequency normalisation and weighting.This is achieved by ex-tending normalisation2,and introducing normalisation2F[7],which is explained below.
Suppo that a document has kfields.Each occurrence of a term can be assigned to exactly onefield.The frequency tf i of term t in the i-thfield is normalid and weighted independently of the otherfields.Then,the normalid and weighted term frequencies are combined into one pudo-frequency tfn2F:
tfn2F=
k
i=1
w i·tf i log2
1+c i·
l i
l i
(6)
where w i is the relative importance or weight of the i-thfield,tf i is the frequency of t in the i-thfield of document d,l i is the length of the i-thfield in d,l i is the average length of the i-thfield in the collection D,and c i is a hyperparameter for the i-thfield.The above formula corresponds to normalisation2F.The weighting model PL2F corresponds to PL2using tfn2F as given in Equation(6).The well-known BM25 weighting model has also been extended in a similar way to BM25F[17].
3Multinomial Randomness Models
This ction introduces DFR models which,instead of extending the term frequency normalisation component,as described in the previous ction,u documentfields as part of the randomness model.While the weighting model PL2F has been shown to perform particularly well[7,8],the documentfields are not an integral part of the ran-domness weighting model.Indeed,the combination of evidence from the differentfields takes place as a linear combination of normalid frequencies in the term frequency nor-malisation component.This implies that the term frequencies are drawn from the same distribution,even though the nature of eachfield may be different.
We propo two weighting models,which,instead of assuming that term frequen-cies infields are drawn from the same distribution,u multinomial distributions to incorporate documentfields in a theoretically driven way.Thefirst one is bad on the multinomial distribution(Section3.1),and the cond one is bad on an information theoretic approximation of the multinomial distribution(Section3.2).
3.1Multinomial Distribution
We employ the multinomial distribution to compute the probability that a term appears a given number of times in each of thefields of a document.The formula of the weighting model is derived as
follows.Suppo that a document d has kfields.The probability that a term occurs tf i times in the i-thfield f i,is given as follows:
P M(t∈d|D)=
T F
p p tf k
k
p tf (7)
32V .Plachouras and I.Ounis
In the above equation,T F is the frequency of term t in the collection,p i =1k ·N is the prior probability that a term occurs in a particular field of document d ,and N is the number of documents in the collection D .The frequency tf  =T F − k
i =1tf i cor-
responds to the number of occurrences of t in other documents than d .The probability p  =1−k 1k ·N =N −1N corresponds to the probability that t does not appear in any of the fields of d .
The DFR weighting model is generated using the multinomial distribution from Equation (7)as a randomness model,the Laplace after-effect from Equation (3),and replacing tf i with the normalid term frequency tfn i ,obtained by applying normal-isation 2from Equation (4).The relevance score of a document d for a query q is computed as follows:w d,q = t ∈q qtw ·w d,t = t ∈q
qtw ·(1−P risk )· −log 2(P M (t ∈d |D )
=
t ∈q qtw  k i =1tfn i +1· −log 2(T F !)+k  i =1 log 2(tfn i !)−tfn i log 2(p i ) +log 2(tfn  !)−tfn  log 2(p  ) (8)
where qtw is the weight of a term t in query q ,tfn  =T F − k i =1tfn i ,tfn i =
tf i ·log 2(1+c i ·l
i l i )for the i -th field,and c i is the hyperparameter of normalisation 2
for the i -th field.The weighting model introduced in the above equation is denoted by ML2,where M stands for the multinomial randomness model,L stands for the Laplace after-effect model,and 2stands for normalisation 2.
Before continuing,it is interesting to note two issues related to the introduced weight-ing model ML2,namely tting the relative importance,or weight,of fields in the do-cument reprentation,and the computation of factorials.
Weights of fields.In Equation (8),there are two different ways to incorporate weights for the fields of documents.The first one is to multiply each of the normalid term frequencies tfn i with a constant w i ,in a similar way to normalisation 2F (e Equa-tion (6)):tfn i :=w i ·tfn i .The cond way is to adjust the prior probabilities p i of fields,in order to increa the scores assigned to terms occurring in fields with low prior probabilities:p i :=p i w i .Indeed,the assigned score to a query term occurring in a field with low probability is high,due to the factor −tfn i log 2(p i )in Equation (8).
Computing factorials.As mentioned in Section 2.1,the factorial in the weighting model PL2is approximated using Stirling’s formula.A different method to approximate the factorial is to u the approximation of Lanczos to the Γfunction [12,p.213],which has a lower approximation error than Stir
ling’s formula.Indeed,preliminary experi-mentation with ML2has shown that using Stirling’s formula affects the performance of the weighting model,due to the accumulation of the approximation error from com-puting the factorial k +2times (k is the number of fields).This is not the ca for the Poisson-bad weighting models PL2and PL2F,where there is only one factorial com-putation for each query term (e Equation (2)).Hence,the computation of factorials in Equation (8)is performed using the approximation of Lanczos to the Γfunction.
Multinomial Randomness Models for Retrieval with Document Fields33 3.2Approximation to the Multinomial Distribution
The DFR framework generates different models by replacing the binomial randomness model with its limiting forms,such as the Poisson randomness model.In this ction, we introduce a new weighting model by replacing the multinomial randomness model in ML2with the following information theoretic approximation[13]:
T F!
tf1!tf2!···tf k!tf !p1tf1p2tf2···p k tf k p tf ≈
1
2πT F k
2−T F·D
tf i
T F
,p i
p t1p t2···p tk p t
(9)
D
tf i
T F
,
p i
corresponds to the information theoretic divergence of the probability p ti=
tf i
T F
that a term occurs in afield,from the prior probability p i of thefield:
D  tf
i
T F
,p i
=
k
i=1
tf
i
T F
log2
tf i
T F·p i
+
tf
T F
log2
tf
T F·p
(10)
where tf =T F− k
i=1
tf i.Hence,the multinomial randomness model M in the
weighting model ML2can be replaced by its approximation from Equation(9):
w d,q=
t∈q qtw·
k
2
log2(2πT F)
k
i=1
tfn i+1
·
k
i=1
tfn i log2
tfn i/T F
p i
+
1
2
log2
tfn i
T F
+tfn log2
tfn /T F
p
+
1
2
log2
tfn
T F
(11)
The above model is denoted by M D L2.The definitions of the variables involved in the
above equation have been introduced in Section3.1.
It should be noted that the information theoretic divergence D
tf i
T F
,p i
is defined
only when tf i>0for1≤i≤k.In other words,D
tf i
T F
,p i
is defined only when
there is at least one occurrence of a query term in all thefields.This is not always the ca,becau a Web document may contain all the query terms in its body,but it may contain only some of the query terms in its title.To overcome this issue,the weight of a query term t in a document is computed by considering only thefields in which the term t appears.
The weights of differentfields can be defined in the same way as in the ca of the weighting model ML2,as described in Section3.1.In more detail,the weighting of fields can be achieved by either multiplying the frequency of a term in afield by a constant,or by adjusting the prior probability of the correspondingfield.
An advantage of the weighting model M D L2is that,becau it approximates the multinomial distribution,there is no need to compute factorials.Hence,it is likely to provide a sufficiently accurate approximation to the multinomial distribution,and it may lead to improved retrieval effectiveness compared to ML2,due to the lower accu-mulated numerical errors.The experimental results in Section4.2will indeed confirm this advantage of M D L2.

本文发布于:2023-05-07 09:04:35,感谢您对本站的认可!

本文链接:https://www.wtabcd.cn/fanwen/fan/82/547012.html

版权声明:本站内容均来自互联网,仅供演示用,请勿用于商业和其他非法用途。如果侵犯了您的权益请与我们联系,我们将在24小时内删除。

标签:
相关文章
留言与评论(共有 0 条评论)
   
验证码:
推荐文章
排行榜
Copyright ©2019-2022 Comsenz Inc.Powered by © 专利检索| 网站地图