首页 > 美文阅读

Longman Dictionary of Contemporary English

更新时间:2023-05-07 09:04:35 阅读：评论：0

Multinomial Randomness Models for Retrieval with

Document Fields

Vassilis Plachouras1and Iadh Ounis2

1Yahoo!Rearch,Barcelona,Spain

2University of Glasgow,Glasgow,UK

,ounis@dcs.gla.ac.uk Abstract.Documentﬁelds,such as the title or the headings of a document,

offer a way to consider the structure of documents for retrieval.Most of the pro-

pod approaches in the literature employ either a linear combination of scores

assigned to differentﬁelds,or a linear combination of frequencies in the term

frequency normalisation component.In the context of the Divergence From Ran-

domness framework,we have a sound opportunity to integrate documentﬁelds

in the probabilistic randomness model.This paper introduces novel probabilis-

tic models for incorporatingﬁelds in the retrieval process using a multinomial

randomness model and its information theoretic approximation.The evaluation

results from experiments conducted with a standard TREC Web test collection

show that the propod models perform as well as a state-of-the-artﬁeld-bad

weighting model,while at the same time,they are theoretically founded and more

extensible than currentﬁeld-bad models.

1Introduction

Documentﬁelds provide a way to incorporate the structure of a document in Information Retrieval(IR)models.In the context of HTML documents,the documentﬁelds may correspond to the contents of particular HTML tags,such as the title,or the heading tags.The anchor text of the incoming hyperlinks can also be en as a documentﬁeld. In the ca of email documents,theﬁelds

may correspond to the contents of the email’s subject,date,or to the email address of the nder[9].It has been shown that using documentﬁelds for Web retrieval improves the retrieval effectiveness[17,7].

The text and the distribution of terms in a particularﬁeld depend on the function of thatﬁeld.For example,the titleﬁeld provides a conci and short description for the whole document,and terms are likely to appear once or twice in a given title[6].The anchor textﬁeld also provides a conci description of the document,but the number of terms depends on the number of incoming hyperlinks of the document.In addition, anchor texts are not always written by the author of a document,and hence,they may enrich the document reprentation with alternative terms.

The combination of evidence from the differentﬁelds in a retrieval model requires special attention.Robertson et al.[14]pointed out that the linear combination of scores, which has been the approach mostly ud for the combination ofﬁelds,is difﬁcult to interpret due to the non-linear relation between the assigned scores and the term frequencies in each of theﬁelds.Hawking et al.[5]showed that the term frequency G.Amati,C.Carpineto,and G.Romano(Eds.):ECIR2007,LNCS4425,pp.28–39,2007.

c Springer-Verlag Berlin Heidelberg2007

Multinomial Randomness Models for Retrieval with Document Fields 29

normalisation applied to each ﬁeld depends on the nature of the corresponding ﬁeld.Zaragoza et al.[17]introduced a ﬁeld-bad version of BM25,called BM25F,which applies term frequency normalisation and weighting of the ﬁelds independently.Mac-donald et al.[7]also introduced normalisation 2F in the Divergence From Randomness (DFR)framework [1]for performing independent term frequency normalisation and weighting of ﬁelds.In both cas of BM25F and the DFR models that employ normali-sation 2F,there is the assumption that the occurrences of terms in the ﬁelds follow the same distribution,becau the combination of ﬁelds takes place in the term frequency normalisation component,and not in the probabilistic weighting model.

In this work,we introduce weighting models,where the combination of evidence from the different ﬁelds does not take place in the term frequency normalisation part of the model,but instead,it constitutes an integral part of the probabilistic randomness model.We propo two DFR weighting models that combine the evidence from the different ﬁelds using a multinomial distribution,and its information theoretic approx-imation.We evaluate the performance of the introduced weighting models using the standard .Gov TREC Web test collection.We show that the models perform as well as the state-of-the-art model ﬁeld-bad PL2F,while at the same time,they employ a theoretically fou

nded and more extensible combination of evidence from ﬁelds.

The remainder of this paper is structured as follows.Section 2provides a description of the DFR framework,as well as the related ﬁeld-bad weighting models.Section 3introduces the propod multinomial DFR weighting models.Section 4prents the evaluation of the propod weighting models with a standard Web test collection.Sec-tions 5and 6clo the paper with a discussion related to the propod models and the obtained results,and some concluding remarks drawn from this work,respectively.2Divergence from Randomness Framework and Document Fields The Divergence From Randomness (DFR)framework [1]generates a family of prob-abilistic weighting models for IR.It provides a great extent of ﬂexibility in the n that the generated models are modular,allowing for the evaluation of new assumptions in a principled way.The remainder of this ction provides a description of the DFR framework (Section 2.1),as well as a brief description of the combination of evidence from different document ﬁelds in the context of the DFR framework (Section 2.2).

2.1DFR Models

The weighting models of the Divergence From Randomness framework are bad on combinations

of three components:a randomness model RM ;an information gain model GM ;and a term frequency normalisation model.

Given a collection D of documents,the randomness model RM estimates the probability P RM (t ∈d |D )of having tf occurrences of a term t in a document d ,and the importance of t in d corresponds to the informative content −log 2(P RM (t ∈d |D )).Assuming that the sampling of terms corresponds to a quence of independent Bernoulli trials,the randomness model RM is the binomial distribution:P B (t ∈d |D )= T F tf

p tf (1−p )T F −tf (1)

30V .Plachouras and I.Ounis

where TF is the frequency of t in the collection D ,p =1N is a uniform prior probability

that the term t appears in the document d ,and N is the number of documents in the collection D .A limiting form of the binomial distribution is the Poisson distribution P :

P B (t ∈d |D )≈P P (t ∈d |D )=λtf tf !e −λwhere λ=T F ·p =T F

N (2)

The information gain model GM estimates the informative content 1−P risk of the probability P risk that a term t is a good descriptor for a document.When a term t appears many times in a document,then there is very low risk in assuming that t describes the document.The information gain,however,from any future occurrences of t in d is lower.For example,the term ‘evaluation’is likely to have a high frequency in a document about the evaluation of IR systems.After the ﬁrst few occurrences of the term,however,each additional occurrence of the term ‘evaluation’provides a diminishing additional amount of information.One model to compute the probability P risk is the Laplace after-effect model:

P risk =tf tf +1

(3)P risk estimates the probability of having one more occurrence of a term in a document,after having en tf occurrences already.

The third component of the DFR framework is the term frequency normalisation model,which adjusts the frequency tf of the term t in d ,given the length l of d and the average document length l in D .Normalisation 2assumes a decreasing density function of the normalid term frequency with respect to the document length l .The normalid term frequency tfn is given as follows:

tfn =tf ·log 2(1+c ·l l )(4)

where c is a a tunable parameter.Normalisation 2is employed in the framework by replacing tf in Equations (2)and (3)with tfn .

The relevance score w d,q of a document d for a query q is given by:

w d,q =

t ∈q

qtw ·w d,t where w d,t =(1−P risk )·(−log 2P RM )(5)

where w d,t is the weight of the term t in document d ,qtw =qtf qtf max ,qtf is the frequency of t in the query q ,and qtf max is the maximum qtf in q .If P RM is estimated

using the Poisson randomness model,P risk is estimated using the Laplace after-effect model,and tfn is computed according to normalisation 2,then the resulting weight-ing model is denoted

by PL2.The factorial is approximated using Stirling’s formula:tf !=√2π·tf

tf +0.5e −tf .The DFR framework generates a wide range of weighting models by using different randomness models,information gain models,or term frequency normalisation models.For example,the next ction describes how normalisation 2is extended to handle the normalisation and weighting of term frequencies for different document ﬁelds.

Multinomial Randomness Models for Retrieval with Document Fields31 2.2DFR Models for Document Fields

The DFR framework has been extended to handle multiple documentﬁelds,and to apply per-ﬁeld term frequency normalisation and weighting.This is achieved by ex-tending normalisation2,and introducing normalisation2F[7],which is explained below.

Suppo that a document has kﬁelds.Each occurrence of a term can be assigned to exactly oneﬁeld.The frequency tf i of term t in the i-thﬁeld is normalid and weighted independently of the otherﬁelds.Then,the normalid and weighted term frequencies are combined into one pudo-frequency tfn2F:

tfn2F=

i=1

w i·tf i log2

1+c i·

l i

(6)

where w i is the relative importance or weight of the i-thﬁeld,tf i is the frequency of t in the i-thﬁeld of document d,l i is the length of the i-thﬁeld in d,l i is the average length of the i-thﬁeld in the collection D,and c i is a hyperparameter for the i-thﬁeld.The above formula corresponds to normalisation2F.The weighting model PL2F corresponds to PL2using tfn2F as given in Equation(6).The well-known BM25 weighting model has also been extended in a similar way to BM25F[17].

3Multinomial Randomness Models

This ction introduces DFR models which,instead of extending the term frequency normalisation component,as described in the previous ction,u documentﬁelds as part of the randomness model.While the weighting model PL2F has been shown to perform particularly well[7,8],the documentﬁelds are not an integral part of the ran-domness weighting model.Indeed,the combination of evidence from the differentﬁelds takes place as a linear combination of normalid frequencies in the term frequency nor-malisation component.This implies that the term frequencies are drawn from the same distribution,even though the nature of eachﬁeld may be different.

We propo two weighting models,which,instead of assuming that term frequen-cies inﬁelds are drawn from the same distribution,u multinomial distributions to incorporate documentﬁelds in a theoretically driven way.Theﬁrst one is bad on the multinomial distribution(Section3.1),and the cond one is bad on an information theoretic approximation of the multinomial distribution(Section3.2).

3.1Multinomial Distribution

We employ the multinomial distribution to compute the probability that a term appears a given number of times in each of theﬁelds of a document.The formula of the weighting model is derived as

follows.Suppo that a document d has kﬁelds.The probability that a term occurs tf i times in the i-thﬁeld f i,is given as follows:

P M(t∈d|D)=

T F

p p tf k

p tf (7)

32V .Plachouras and I.Ounis

In the above equation,T F is the frequency of term t in the collection,p i =1k ·N is the prior probability that a term occurs in a particular ﬁeld of document d ,and N is the number of documents in the collection D .The frequency tf =T F − k

i =1tf i cor-

responds to the number of occurrences of t in other documents than d .The probability p =1−k 1k ·N =N −1N corresponds to the probability that t does not appear in any of the ﬁelds of d .

The DFR weighting model is generated using the multinomial distribution from Equation (7)as a randomness model,the Laplace after-effect from Equation (3),and replacing tf i with the normalid term frequency tfn i ,obtained by applying normal-isation 2from Equation (4).The relevance score of a document d for a query q is computed as follows:w d,q = t ∈q qtw ·w d,t = t ∈q

qtw ·(1−P risk )· −log 2(P M (t ∈d |D )

t ∈q qtw k i =1tfn i +1· −log 2(T F !)+k i =1 log 2(tfn i !)−tfn i log 2(p i ) +log 2(tfn !)−tfn log 2(p ) (8)

where qtw is the weight of a term t in query q ,tfn =T F − k i =1tfn i ,tfn i =

tf i ·log 2(1+c i ·l

i l i )for the i -th ﬁeld,and c i is the hyperparameter of normalisation 2

for the i -th ﬁeld.The weighting model introduced in the above equation is denoted by ML2,where M stands for the multinomial randomness model,L stands for the Laplace after-effect model,and 2stands for normalisation 2.

Before continuing,it is interesting to note two issues related to the introduced weight-ing model ML2,namely tting the relative importance,or weight,of ﬁelds in the do-cument reprentation,and the computation of factorials.

Weights of ﬁelds.In Equation (8),there are two different ways to incorporate weights for the ﬁelds of documents.The ﬁrst one is to multiply each of the normalid term frequencies tfn i with a constant w i ,in a similar way to normalisation 2F (e Equa-tion (6)):tfn i :=w i ·tfn i .The cond way is to adjust the prior probabilities p i of ﬁelds,in order to increa the scores assigned to terms occurring in ﬁelds with low prior probabilities:p i :=p i w i .Indeed,the assigned score to a query term occurring in a ﬁeld with low probability is high,due to the factor −tfn i log 2(p i )in Equation (8).

Computing factorials.As mentioned in Section 2.1,the factorial in the weighting model PL2is approximated using Stirling’s formula.A different method to approximate the factorial is to u the approximation of Lanczos to the Γfunction [12,p.213],which has a lower approximation error than Stir

ling’s formula.Indeed,preliminary experi-mentation with ML2has shown that using Stirling’s formula affects the performance of the weighting model,due to the accumulation of the approximation error from com-puting the factorial k +2times (k is the number of ﬁelds).This is not the ca for the Poisson-bad weighting models PL2and PL2F,where there is only one factorial com-putation for each query term (e Equation (2)).Hence,the computation of factorials in Equation (8)is performed using the approximation of Lanczos to the Γfunction.

Multinomial Randomness Models for Retrieval with Document Fields33 3.2Approximation to the Multinomial Distribution

The DFR framework generates different models by replacing the binomial randomness model with its limiting forms,such as the Poisson randomness model.In this ction, we introduce a new weighting model by replacing the multinomial randomness model in ML2with the following information theoretic approximation[13]:

T F!

tf1!tf2!···tf k!tf !p1tf1p2tf2···p k tf k p tf ≈

√

2πT F k

2−T F·D

tf i

T F

,p i

p t1p t2···p tk p t

(9)

tf i

T F

p i

corresponds to the information theoretic divergence of the probability p ti=

tf i

T F

that a term occurs in aﬁeld,from the prior probability p i of theﬁeld:

D tf

T F

,p i

i=1

T F

log2

tf i

T F·p i

T F

log2

T F·p

(10)

where tf =T F− k

i=1

tf i.Hence,the multinomial randomness model M in the

weighting model ML2can be replaced by its approximation from Equation(9):

w d,q=

t∈q qtw·

log2(2πT F)

i=1

tfn i+1

i=1

tfn i log2

tfn i/T F

p i

log2

tfn i

T F

+tfn log2

tfn /T F

log2

tfn

T F

(11)

The above model is denoted by M D L2.The deﬁnitions of the variables involved in the

above equation have been introduced in Section3.1.

It should be noted that the information theoretic divergence D

tf i

T F

,p i

is deﬁned

only when tf i>0for1≤i≤k.In other words,D

tf i

T F

,p i

is deﬁned only when

there is at least one occurrence of a query term in all theﬁelds.This is not always the ca,becau a Web document may contain all the query terms in its body,but it may contain only some of the query terms in its title.To overcome this issue,the weight of a query term t in a document is computed by considering only theﬁelds in which the term t appears.

The weights of differentﬁelds can be deﬁned in the same way as in the ca of the weighting model ML2,as described in Section3.1.In more detail,the weighting of ﬁelds can be achieved by either multiplying the frequency of a term in aﬁeld by a constant,or by adjusting the prior probability of the correspondingﬁeld.

An advantage of the weighting model M D L2is that,becau it approximates the multinomial distribution,there is no need to compute factorials.Hence,it is likely to provide a sufﬁciently accurate approximation to the multinomial distribution,and it may lead to improved retrieval effectiveness compared to ML2,due to the lower accu-mulated numerical errors.The experimental results in Section4.2will indeed conﬁrm this advantage of M D L2.

本文发布于:2023-05-07 09:04:35，感谢您对本站的认可！

本文链接：https://www.wtabcd.cn/fanwen/fan/82/547012.html

上一篇：大象版小学四年级科学上册教学计划(5篇)

下一篇：2023年仿真实验心得与体会(4篇)

标签：

留言与评论（共有 0 条评论）