Sentence Extraction for Legal Text Summarisation
Ben Hachey and Claire Grover
University of Edinburgh
School of Informatics
2 Buccleuch Place
Edinburgh EH8 9LW, UK
{bhachey,grover}@inf.ed.ac.uk
Abstract
We describe a system for generating extractive summaries of texts in the legal domain, focusing on the relevance classifier, which determines which sentences are abstract-worthy. We experiment with naïve Bayes and maximum entropy estimation toolkits and explore methods for selecting abstract-worthy sentences in rank order. Evaluation using standard accuracy measures and using correlation confirms the utility of our approach, but suggests different optimal configurations.
1 Introduction
In the SUM project we are developing a system for summarising legal judgments that is generic and portable and which maintains a mechanism to account for the rhetorical structure of the argumentation of a case. Following Teufel and Moens [2002], we are developing a text extraction system that retains a flavour of the fact extraction approach. This is achieved by combining sentence selection with information about why a certain sentence is extracted, e.g. is it part of a judge's argumentation, or does it contain a decision regarding the disposal of the case? In this way we are able to produce flexible summaries of varying length and for various audiences. Sentences can be reordered, since they have rhetorical roles associated with them, or they can be suppressed if a user is not interested in certain types of rhetorical roles.
We have prepared a new corpus of UK House of Lords judgments (HOLJ) for this work which contains two layers of manual annotation: rhetorical role and relevance. The rhetorical roles represent the sentence's contribution to the overall communicative goal of the document. In the case of HOLJ texts, the communicative goal for each lord is to convince their peers of the soundness of their argument. In the current version of the corpus there are 69 judgments which have been annotated for rhetorical role. The second manual layer is annotation of sentences for 'relevance' as measured by whether they match sentences in hand-written summaries. In the current version of the corpus, 47 of the 69 judgments which have been annotated for rhetorical role have also been annotated for relevance. A third layer of annotation is automatic linguistic annotation, which provides the features which are used by the rhetorical role and relevance classifiers.

2 Classification and Relevance
Following from [Kupiec et al., 1995], machine learning has been the standard approach to text extraction summarisation as it provides an empirical method for combining different information sources about the textual unit under consideration. For relevance prediction, we performed experiments with publicly available naïve Bayes (NB) and maximum entropy (ME) estimation toolkits. The NB implementation, found in the Weka toolkit, is based on John and Langley's [1995] algorithm incorporating statistical methods for nonparametric density estimation of continuous variables. The ME estimation toolkit, written by Zhang Le, contains a C++ implementation of the LMVM [Malouf, 2002] estimation algorithm. For ME, we use the Weka implementation of Fayyad and Irani's [1993] MDL algorithm to discretise numeric features.

The features that we have been experimenting with for the HOLJ corpus are broadly similar to those used by Teufel and Moens [2002]. They consist of location features encoding the position of the sentence in document, speech and paragraph; a thematic words feature encoding the average tf*idf weight of the sentence terms; a sentence length feature encoding the number of tokens in the sentence; quotation features encoding the percentage of sentence tokens inside an in-line quote and whether or not the sentence is inside a block quote; entity features encoding the presence or absence of named entities in the sentence; and cue phrase features.
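To make the thematic words feature concrete, the sketch below computes the average tf*idf weight of each sentence's terms. The function name, the tokenised input format and the choice to compute idf over the sentences of the document are illustrative assumptions; the paper does not spell out the exact weighting scheme.

    import math
    from collections import Counter

    def thematic_words(sentences):
        """Average tf*idf weight of the terms in each sentence.

        `sentences` is a document given as a list of token lists.
        Here idf is computed over sentences, one plausible choice
        among several."""
        n = len(sentences)
        df = Counter()                     # per-term sentence frequency
        for sent in sentences:
            df.update(set(sent))

        scores = []
        for sent in sentences:
            tf = Counter(sent)             # within-sentence term frequency
            weights = [tf[t] * math.log(n / df[t]) for t in sent]
            scores.append(sum(weights) / len(weights) if weights else 0.0)
        return scores

    doc = [["the", "court", "allowed", "the", "appeal"],
           ["it", "seems", "to", "me", "that", "the", "appeal", "fails"]]
    print(thematic_words(doc))             # one feature value per sentence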
The term 'cue phrase' covers the kinds of stock phrases which are frequently good indicators of rhetorical status (e.g. phrases such as The aim of this study in the scientific article domain and It seems to me that in the HOLJ domain). Teufel and Moens invested a considerable amount of effort in building hand-crafted lexicons where these cue phrases are assigned to one of a number of fixed categories. A primary aim of the current research is to investigate whether this information can be encoded using automatically computable linguistic features. If it can, then this helps to relieve the burden involved in porting systems such as these to new domains. Our preliminary cue phrase feature set includes syntactic features of the main verb (voice, tense, aspect, modality, negation). We also use sentence initial part-of-speech and sentence initial word features to roughly approximate formulaic expressions which are sentence-level adverbial or prepositional phrases. Subject features include the head lemma, entity type, and entity subtype. These features approximate the hand-coded agent features of Teufel and Moens. A main verb lemma feature simulates Teufel and Moens's type of action and a feature encoding the part-of-speech after the main verb is meant to capture basic subcategorisation information.
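As a rough illustration of how some of these features might be read off a POS-tagged sentence, consider the sketch below. The Penn Treebank tag conventions, the auxiliary list and the naive main-verb and voice heuristics are assumptions made for exposition; the SUM system derives these features from fuller automatic linguistic annotation.

    def cue_phrase_features(tagged):
        """Sketch of cue phrase features over a POS-tagged sentence,
        given as [(word, tag), ...] with Penn Treebank tags."""
        words = [w.lower() for w, _ in tagged]
        tags = [t for _, t in tagged]

        feats = {
            "initial_word": words[0],      # approximates formulaic openers
            "initial_pos": tags[0],
            "modal": "MD" in tags,         # modality of the clause
            "negated": any(w in ("not", "n't") for w in words),
        }

        # Naive main verb: the first verb token that is not an auxiliary.
        aux = {"be", "is", "are", "was", "were", "been", "being",
               "have", "has", "had"}
        for i, (w, t) in enumerate(tagged):
            if t.startswith("VB") and w.lower() not in aux:
                feats["main_verb_pos"] = t  # crude tense/aspect signal
                # Passive voice: past participle preceded by a form of "be".
                feats["passive"] = t == "VBN" and i > 0 and words[i - 1] in aux
                # POS after the main verb: crude subcategorisation signal.
                feats["pos_after_verb"] = tags[i + 1] if i + 1 < len(tags) else None
                break
        return feats

    sent = [("It", "PRP"), ("seems", "VBZ"), ("to", "TO"), ("me", "PRP"),
            ("that", "IN"), ("the", "DT"), ("appeal", "NN"), ("fails", "VBZ")]
    print(cue_phrase_features(sent))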
                  NB                  ME
              P     R     F       P     R     F
Cue          34.9  21.5  26.6    66.6  15.2  24.8
Entities     30.7  26.4  28.4    66.8  15.4  25.1
Them. Words  32.2  26.9  29.3    68.6  15.7  25.5
Location     31.6  27.2  29.2    73.4  16.4  26.9
Quotations   31.2  27.7  29.4    71.7  17.4  28.0
Sent. Length 31.7  29.4  29.8    71.4  16.9  27.3

Table 1: Accuracy measures for yes predictions.

3 Experimental Results
Table 1 contains cumulative precision (P), recall (R) and f-scores (F) for the naïve Bayes (NB) and maximum entropy (ME) classifiers on the relevance classification task.[1] Though only the cue phrase feature set performs well individually, all feature sets contribute positively to the cumulative scores, with the exception of sentence length for ME and quotation for NB. Both classifiers perform significantly better than a baseline created by selecting sentences from the end of the document, which obtains P, R and F scores of 46.7, 16.0 and 23.8. F-scores for the best feature combinations are similar to the partial results reported in Teufel and Moens [2002]. Taking the f-score as the best metric to optimise would lead us to choose NB.

[1] Note that this is a strict evaluation that counts only yes predictions. Micro- and macro-averaging over yes and no predictions give f-scores of 87.6 and 67.3 respectively for ME.
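For concreteness, the strict yes-only scoring used in Table 1 amounts to the following computation; a minimal sketch, assuming gold and predicted labels are given as parallel lists of 'yes'/'no' strings.

    def yes_prf(gold, pred):
        """Precision, recall and f-score counting only yes predictions."""
        tp = sum(1 for g, p in zip(gold, pred) if g == "yes" and p == "yes")
        fp = sum(1 for g, p in zip(gold, pred) if g == "no" and p == "yes")
        fn = sum(1 for g, p in zip(gold, pred) if g == "yes" and p == "no")
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        return prec, rec, f

    print(yes_prf(["yes", "no", "yes", "no"],
                  ["yes", "yes", "no", "no"]))   # (0.5, 0.5, 0.5)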
However, a basic aspect of summarisation system design, especially for a system that needs to be flexible enough to suit various user types, is that the size of the summary will be variable. For instance, students may need a 20 sentence summary containing, for example, quite detailed background information, to get the same information a judge would get from a 10 sentence summary. Furthermore, any given user might want to request a longer summary for a certain document. So, what we actually want to do is rate how relevant/extract-worthy a sentence is in such a way that will allow us to select sentences in rank order. Bearing this in mind, precision is probably the more important metric, given that recall will be controlled by the size of the summary. So, ME with all but sentence length features actually appears to be the better approach for sentence extraction.
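With a probabilistic classifier, a variable-length summary then amounts to a cut-off on the ranked sentence list, along these lines. This is a sketch: the probability values would come from the NB or ME classifier, and restoring document order is the simplest of the reordering options discussed in the introduction.

    def extract_summary(sentences, p_yes, k):
        """Return the k highest-scoring sentences, restored to
        document order so the extract reads coherently."""
        ranked = sorted(range(len(sentences)),
                        key=lambda i: p_yes[i], reverse=True)
        chosen = sorted(ranked[:k])        # back to document order
        return [sentences[i] for i in chosen]

    sents = ["s1", "s2", "s3", "s4"]
    probs = [0.2, 0.9, 0.4, 0.7]           # p(y=yes|x) per sentence
    print(extract_summary(sents, probs, 2))    # ['s2', 's4']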
Since we need a ranking rather than a yes/no classification, this might actually be considered a regression task. However, due to the way the corpus was annotated, the target attribute is in fact binary. As both of our classifiers are probabilistic, we use p(y = yes|x) as a way to rank sentences. To evaluate the ranking methods with respect to our binary gold standard, we use the point-biserial correlation coefficient (r_pb). Table 2 contains correlation coefficients between the gold standard yes/no classification and p(y = yes|x) for naïve Bayes (NB) and maximum entropy (ME).[2]

[2] It has been argued that this is actually a better evaluation than standard accuracy measures, which do not account for degree of agreement [Wolf and Gibson, 2004].
                  NB                   ME
               I      C           I       C
Cue          0.187  0.187       0.208   0.208
Entities     0.103  0.211       0.056   0.219
Them. Words  0.016  0.211       0.000   0.227
Location     0.104  0.229      -0.031   0.166
Quotations   0.092  0.233       0.093   0.187
Sent. Length 0.069  0.235       0.000   0.175

Table 2: Point-biserial correlation coefficients.
The I column has scores for the individual feature sets and the C column has cumulative scores. The correlation results are strikingly different for NB and ME. While NB successfully incorporates all features (r_pb = 0.235), ME performs best using only cue phrase, entity and thematic word features (r_pb = 0.227). For ME, the location feature set actually gives a negative correlation. Judging by these results, we would again be likely to choose NB.
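For reference, the point-biserial coefficient can be computed directly from the binary gold labels and the ranked probabilities. A minimal sketch, assuming both classes are present and using the population standard deviation convention:

    import math

    def point_biserial(gold, scores):
        """r_pb between a binary gold standard (True for yes) and
        continuous scores such as p(y=yes|x)."""
        n = len(scores)
        s1 = [s for g, s in zip(gold, scores) if g]
        s0 = [s for g, s in zip(gold, scores) if not g]
        n1, n0 = len(s1), len(s0)
        mean = sum(scores) / n
        sd = math.sqrt(sum((s - mean) ** 2 for s in scores) / n)
        m1, m0 = sum(s1) / n1, sum(s0) / n0
        return (m1 - m0) / sd * math.sqrt(n1 * n0 / (n * n))

    gold = [True, False, True, False, False]
    scores = [0.9, 0.3, 0.7, 0.4, 0.2]
    print(point_biserial(gold, scores))    # roughly 0.94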
4 Conclusions and Future Work
In this paper, we have presented work on the automatic summarisation of legal texts, for which we have compiled a new corpus with annotation of rhetorical status, relevance and linguistic markup. We presented sentence extraction results in classification and ranking frameworks. Naïve Bayes and maximum entropy classifiers achieve significant improvements over the baseline according to standard accuracy measures. We have also used the point-biserial correlation coefficient for quantitative evaluation of our extraction system, the results of which suggest different optimal configurations. In current work, we are developing a user study that will help determine empirically whether correlation coefficients are a better evaluation metric than precision and recall accuracy measures.
References
[Fayyad and Irani, 1993] U. Fayyad and K. Irani. Multi-interval discretization of continuous-valued attributes for classification learning. In IJCAI, 1993.

[John and Langley, 1995] G. H. John and P. Langley. Estimating continuous distributions in Bayesian classifiers. In UAI, 1995.

[Kupiec et al., 1995] J. Kupiec, J. Pedersen, and F. Chen. A trainable document summarizer. In SIGIR, pages 68-73, 1995.

[Malouf, 2002] R. Malouf. A comparison of algorithms for maximum entropy parameter estimation. In CoNLL, 2002.

[Teufel and Moens, 2002] S. Teufel and M. Moens. Summarising scientific articles: experiments with relevance and rhetorical status. Computational Linguistics, 28(4):409-445, 2002.

[Wolf and Gibson, 2004] F. Wolf and E. Gibson. Paragraph-, word-, and coherence-based approaches to sentence ranking: A comparison of algorithm and human performance. In ACL, 2004.