An Entity-Driven Framework for Abstractive Summarization
Eva Sharma1*, Luyang Huang2*, Zhe Hu1*, and Lu Wang1
1Khoury College of Computer Sciences, Northeastern University, Boston, MA 02115
2Department of Electrical and Computer Engineering, Boston University, Boston, MA 02215
u.edu, hu.u.edu, u.edu
lyhuang@bu.edu
Abstract
Abstractive summarization systems aim to produce more coherent and concise summaries than their extractive counterparts. Popular neural models have achieved impressive results for single-document summarization, yet their outputs are often incoherent and unfaithful to the input. In this paper, we introduce SENECA, a novel System for ENtity-drivEn Coherent Abstractive summarization framework that leverages entity information to generate informative and coherent abstracts. Our framework takes a two-step approach: (1) an entity-aware content selection module first identifies salient sentences from the input, then (2) an abstract generation module conducts cross-sentence information compression and abstraction to generate the final summary, which is trained with rewards to promote coherence, conciseness, and clarity. The two components are further connected using reinforcement learning. Automatic evaluation shows that our model significantly outperforms previous state-of-the-art on ROUGE and our proposed coherence measures on New York Times and CNN/Daily Mail datasets. Human judges further rate our system summaries as more informative and coherent than those by popular summarization models.
1 Introduction
Automatic abstractive summarization carries strong promise for producing concise and coherent summaries to facilitate quick information consumption (Luhn, 1958). Recent progress in neural abstractive summarization has shown end-to-end trained models (Nallapati et al., 2016; Tan et al., 2017a; Celikyilmaz et al., 2018; Kryściński et al., 2018) excelling at producing fluent summaries. Though encouraging, their outputs are
*The authors contributed equally. Work done while LH was at Northeastern University.
Input Article:
... Prime Minister Bertie Ahern of Ireland called Sunday for a general election on May 24.
[Mr. Ahern] and his centrist party have governed in a coalition government ...
Under Irish law, which requires legislative elections every five years, Mr. Ahern had to call elections by midsummer. On Sunday, {he} said he would base his campaign for re-election on his work to strengthen the economy and efforts to revive Northern Ireland's stalled peace process this year.
Political analysts said they expected Mr. Ahern's work in Northern Ireland to be an asset ...
Human Summary:
1. Prime Min Bertie Ahern of Ireland calls for general election on May 24.
2. [He] is required by law to call elections by midsummer.
3. Opinion polls suggest his centrist government is in danger of losing its majority in Parliament because of public disgruntlement about overburdened public services.
4. {Ahern} says he would base his campaign for re-election on his work to strengthen economy and his efforts to revive Northern Ireland's stalled peace process.
5. Analysts expect his work in Northern Ireland to be an asset.
Figure 1: Sample summary of an article from the New York Times corpus (Sandhaus, 2008). Mentions of the same entity are colored. The underlined sentence in the article occurs at a relatively earlier position in the summary (sentence 2) to improve topical coherence. Mentions in brackets ("[ ]", "{ }") show different ways in which the same entity is referred to in the article and the summary. A detailed explanation is given in §1.
frequently found to be unfaithful to the input and to lack inter-sentence coherence (Cao et al., 2018; See et al., 2017; Wiseman et al., 2017). These observations suggest that existing methods have difficulty in identifying salient entities and related events in the article (Fan et al., 2018), and that existing model training objectives fail to guide the generation of coherent summaries.
In this paper, we present SENECA, a System for ENtity-drivEn Coherent Abstractive summarization.1 We argue that entity-based modeling
1Our code is available at evasharma.github.io/SENECA.
arXiv:1909.02059v1 [cs.CL] 4 Sep 2019
Figure 2: Our proposed entity-driven abstractive summarization framework. The entity-aware content selector extracts salient sentences and the abstract generator produces informative and coherent summaries. Both components are connected using reinforcement learning.
enables enhanced input text interpretation, salient content selection, and coherent summary generation, three major challenges that need to be addressed by single-document summarization systems (Jones et al., 1999). We use a sample summary in Fig. 1 to show entity usage in summarization. Firstly, frequently mentioned entities from the input, along with their contextual information, underscore the salient content of the article (Nenkova, 2008). Secondly, as also discussed in prior work (Barzilay and Lapata, 2008; Siddharthan et al., 2011), patterns of entity distributions and how they are referred to contribute to the coherence and conciseness of the text. For instance, a human writer places the underlined sentence in the input article next to the first sentence in the summary to improve topical coherence, as they are about the same topic ("elections"). Moreover, the human often optimizes for conciseness by referring to entities with pronouns (e.g., "he") or last names (e.g., "Ahern") without losing clarity.
We therefore propose a two-step neural abstractive summarization framework to emulate the way humans construct summaries, with the goal of improving both informativeness and coherence of the generated abstracts. As shown in Fig. 2, an entity-aware content selection component first selects important sentences from the input that include references to salient entities. An abstract generation component then produces coherent summaries by conducting cross-sentence information ordering, compression, and revision. Our abstract generator is trained using reinforcement learning with rewards that promote informativeness and optionally boost coherence, conciseness, and clarity of the summary. To the best of our knowledge, we are the first to study coherent abstractive summarization with the inclusion of linguistically-informed rewards.
Figure 3: Our proposed entity-aware content selector. Arrows denote attention, with darker color representing higher weights.
We conduct both automatic and human evaluation on popular news summarization datasets. Experimental results show that our model yields significantly better ROUGE scores than previous state-of-the-art (Gehrmann et al., 2018; Celikyilmaz et al., 2018), as well as higher coherence scores on the New York Times and CNN/Daily Mail datasets. Human subjects also rate our system-generated summaries as more informative and coherent than those of other popular summarization models.
2 Summarization Framework
In this section, we describe our entity-driven abstractive summarization framework, which follows a two-step approach as shown in Fig. 2. It comprises (1) an entity-aware content selection component that leverages entity guidance to select salient sentences (§2.1), and (2) an abstract generation component (§2.2) that is trained with reinforcement learning to generate coherent and concise summaries (§2.3). Finally, we describe how the two components are connected to further improve the generated summaries (§2.4).
2.1 Entity-Aware Content Selection
We design our content selection component to capture the interaction between entity mentions and the input article. Our model learns to identify salient content by aligning entity mentions and their contexts with human summaries. Concretely, we employ two encoders: one learns entity representations by encoding their mention clusters, and the other learns sentence representations. A pointer-network-based decoder (Vinyals et al., 2015b) selects a sequence of important sentences by jointly attending to the entities and the input, as depicted in Fig. 3.
Entity Encoder. We run the off-the-shelf coreference resolution system from Stanford CoreNLP (Manning et al., 2014) on the input articles to extract entities, each represented as a cluster of mentions. Specifically, from each input article, we extract the coreferenced entities and construct the mention clusters for all the mentions of each entity in that article. We also consider non-coreferenced entity mentions as singleton entity mention clusters. Among all the mention clusters, for our experiments, we only consider salient entity mention clusters. We label clusters as "salient" based on two rules: (1) mention clusters with entities appearing in the first three sentences of the article, and (2) the top k clusters containing the most mentions. We experimented with different values of k and found that k=6 gives us the best set of salient mention clusters, with an optimal overlap with entity mentions in the ground-truth summary. For each mention cluster, we concatenate mentions of the same entity as they occur in the input into one sequence, segmented with special tokens (<MENT>). Finally, we get the entity representation e_i for the i-th entity by encoding each cluster via a temporal convolutional model (Kim, 2014).
Input Article Encoder. For article encoding, we first learn sentence representations r_j by encoding words in the j-th sentence with another temporal convolutional model. Then, we utilize a bidirectional LSTM (biLSTM) to aggregate sentences into a sequence of hidden states h_j. Both encoders use a shared word embedding matrix to allow better alignment.
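The two salience rules and the <MENT>-segmented cluster sequences can be sketched as follows. This is a minimal illustration, not the authors' code; the data layout (each cluster as a list of (sentence_index, mention_text) pairs) is an assumption.

```python
# Sketch of the salient-cluster filtering rules described above.
# Assumed layout: each cluster is a list of (sentence_index, mention_text) pairs.
MENT = "<MENT>"

def salient_clusters(clusters, k=6):
    """Keep clusters mentioned in the first 3 sentences, plus the top-k
    clusters by mention count."""
    early = [c for c in clusters if any(s < 3 for s, _ in c)]
    top_k = sorted(clusters, key=len, reverse=True)[:k]
    # Preserve order and drop duplicates.
    seen, out = set(), []
    for c in early + top_k:
        if id(c) not in seen:
            seen.add(id(c))
            out.append(c)
    return out

def cluster_sequence(cluster):
    """Concatenate mentions of one entity, segmented by <MENT> tokens,
    forming the input sequence to the entity encoder."""
    return f" {MENT} ".join(text for _, text in cluster)
```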
Sentence Selection Decoder. We employ a single-layer unidirectional LSTM with hidden states s_t to recurrently extract salient sentences. At each time step t, we first compute an entity context vector c_t^e based on the attention mechanism (Bahdanau et al., 2014):

c_t^e = \sum_i a_{it}^e e_i    (1)
a_t^e = \mathrm{softmax}(v_e \tanh(W_1^e s_t + W_2^e e_i))    (2)

where a_t^e are attention weights; v_* and W_*^* denote trainable parameters throughout the paper. Bias terms are omitted for simplicity. We further use a glimpse operation (Vinyals et al., 2015a) to compute a sentence context vector c_t as follows:

c_t = \sum_j a_{jt}^h W_2^h h_j    (3)
a_t^h = \mathrm{softmax}(v_h \tanh(W_1^h s_t + W_2^h h_j))    (4)

where a_t^h are attention weights. Finally, sentence extraction probabilities that consider both entity and input context are calculated as:

p(y_t^l | y_{1:t-1}^l) = \mathrm{softmax}(v_q \tanh(W_1^p s_t + W_2^p c_t + W_3^p c_t^e))    (5)

where the sentence y_t^l with the highest probability is selected. The process stops when the model picks the end-of-selection token.
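One decoding step of Eqs. (1)–(5) can be sketched in NumPy as below. All parameter names and shapes are assumptions; the last step scores each sentence state h_j against the combined context, a common pointer-network realization that the printed equation leaves implicit.

```python
import numpy as np

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def select_step(s_t, E, H, params):
    """One sentence-selection decoding step (Eqs. 1-5, NumPy sketch).
    s_t: decoder state (d,); E: entity vectors (n_e, d); H: sentence
    hidden states (n_s, d); `params` holds assumed trainable matrices."""
    # Entity context vector (Eqs. 1-2): attend over entity representations.
    a_e = softmax(np.tanh(params["We1"] @ s_t + E @ params["We2"].T) @ params["ve"])
    c_e = a_e @ E

    # Glimpse over sentences (Eqs. 3-4): weighted sum of transformed states.
    a_h = softmax(np.tanh(params["Wh1"] @ s_t + H @ params["Wh2"].T) @ params["vh"])
    c = a_h @ (H @ params["Wh2"].T)

    # Extraction distribution (Eq. 5): score every sentence state against
    # the combined decoder/entity/sentence context (assumed scoring form).
    scores = np.tanh(H @ params["Wp1"].T + c @ params["Wp2"].T
                     + c_e @ params["Wp3"].T) @ params["vq"]
    return softmax(scores)
```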
Selection Label Construction. We train our content selection component with a cross-entropy loss: -\sum_{(y^l, x) \in D} \log p(y^l | x; \theta), where y^l are the ground-truth sentence selection labels, x is the input article, and \theta denotes all model parameters.
To acquire training labels for sentence selection, we collect positive sentences in the following way. First, we employ greedy search to select the best combination of sentences that maximizes ROUGE-2 F1 (Lin and Hovy, 2003) with reference to the human summary, as described by Zhou et al. (2018). We further include sentences whose ROUGE-L recall is above 0.5 when each is compared with its best-aligned summary sentence. In cases where no sentence is selected, we label the first two sentences from the article as positive. Our combined construction strategy selects an average of 2.96 and 3.18 sentences from New York Times and CNN/Daily Mail articles, respectively.
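The greedy label-construction step can be illustrated as below; this is a sketch, with `rouge2_f1` a bare bigram-overlap F1 stand-in for a full ROUGE-2 implementation (no stemming or stopword handling).

```python
from collections import Counter

def bigrams(tokens):
    return Counter(zip(tokens, tokens[1:]))

def rouge2_f1(candidate, reference):
    """Bigram-overlap F1: the core of ROUGE-2 F1, as a stand-in."""
    c, r = bigrams(candidate), bigrams(reference)
    overlap = sum((c & r).values())
    if overlap == 0:
        return 0.0
    p = overlap / max(sum(c.values()), 1)
    rec = overlap / max(sum(r.values()), 1)
    return 2 * p * rec / (p + rec)

def greedy_labels(sentences, summary):
    """Greedily add the sentence that most improves ROUGE-2 F1 against the
    reference summary; stop when no sentence helps (after Zhou et al., 2018).
    The paper's fallback (first two sentences when none selected) is omitted."""
    selected, best = [], 0.0
    while True:
        gains = []
        for i in range(len(sentences)):
            if i in selected:
                continue
            cand = [t for j in selected + [i] for t in sentences[j]]
            gains.append((rouge2_f1(cand, summary), i))
        if not gains:
            break
        score, i = max(gains)
        if score <= best:
            break
        selected.append(i)
        best = score
    return sorted(selected)
```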
2.2 Abstract Generation with Reinforcement Learning
Our abstract generation component takes the selected sentences as input and produces the final summary. This abstract generator is a sequence-to-sequence network with attention over the input (Bahdanau et al., 2014). The copying mechanism from See et al. (2017) is adopted to allow out-of-vocabulary words to appear in the output. The abstract generator is first trained with a maximum likelihood (ML) loss, followed by additional training with policy-based reinforcement learning (RL). For ML training, we use the teacher forcing algorithm (Williams and Zipser, 1995) to minimize the following loss:

L_{ml} = -\sum_{(y, x_{ext}) \in D} \log p(y | x_{ext}; \theta)    (6)

where D is the training set and x_{ext} are extracted sentences from our label construction.
Self-Critical Learning. Following Paulus et al. (2017), we use the self-critical training algorithm based on policy gradients to use discrete metrics as RL rewards. At each training step, we generate two summaries: a sampled summary y^s, obtained by sampling words from the probability distribution p(y^s | x_{ext}; \theta) at each decoding step, and a self-critical baseline summary \hat{y}, yielded by greedily selecting words that maximize the output probability at each time step (Rennie et al., 2017). We then calculate rewards based on the average of ROUGE-L F1 and ROUGE-2 F1 of the two summaries against the ground-truth summary, and define the following loss function:
L_{rl} = -\frac{1}{N} \sum_{(y^s, x_{ext}) \in D} (R(y^s) - R(\hat{y})) \log p(y^s | x_{ext}; \theta)    (7)

where D represents the set of sampled summaries paired with extracted input sentences, and N represents the total number of sampled summaries.
R(y) = R_{Rouge}(y) = \frac{1}{2}(R_{Rouge\text{-}L}(y) + R_{Rouge\text{-}2}(y)) is the overall ROUGE reward for a summary y.

2.3 Rewards with Coherence and Linguistic Quality
So far, we have described the two basic components of our SENECA framework. As noted in prior work (Liu et al., 2016), optimizing for an n-gram-based metric like ROUGE does not guarantee improvement in the readability of the generations. We thus augment our framework with additional rewards based on coherence and linguistic quality, as described below.
Entity-Based Coherence Reward (R_Coh). We use a separately trained coherence model to score summaries and guide our abstract generator to produce more coherent outputs by adding a reward R_Coh to the aforementioned RL training process. The new reward takes the following form:

R(y) = R_{Rouge}(y) + \lambda_{Coh} R_{Coh}(y)    (8)
Here we show how to calculate R_Coh to capture both entity distribution patterns and topical continuity. Since summaries are short (e.g., 2.0 sentences on average per summary in the New York Times data), we decide to build our coherence model on top of local coherence estimation for pairwise sentences. We adopt the architecture of the neural coherence model developed by Wu and Hu (2018), but train it with samples that enable coherence modeling based on entity presence and their context. Here we briefly describe the model, and refer the readers to the original paper for details.
Given a pair of sentences S_A and S_B, convolution layers first transform them into hidden representations, from which a multi-layer perceptron is utilized to yield a coherence score Coh(S_A, S_B) ∈ [-1, 1]. We train the model with a hinge loss by leveraging both coherent positive samples and incoherent negative samples:
L(S_A, S_B^+, S_B^-) = \max\{0, 1 + Coh(S_A, S_B^-) - Coh(S_A, S_B^+)\}    (9)
where S_A is a target sentence, (S_A, S_B^+) is a positive pair, and (S_A, S_B^-) is a negative pair. Note that Wu and Hu (2018) only consider position information for training data construction, i.e., S_A and S_B^+ must be adjacent, and S_B^- is randomly picked from sentences at most 9 sentences away. We instead introduce two notable features to construct our training data. In addition to being adjacent, we further constrain S_A and S_B^+ to have at least one coreferred entity, and require that S_A and S_B^- do not. Since our initial experiments show that a coherence model trained in this manner cannot discern pure repetition of sentences, i.e., simply duplicating words leads to higher coherence, we reuse the target sentences themselves as the negative pairs. Finally, since this model outputs pairwise coherence scores, for a summary containing more than two sentences, we use the average of all adjacent sentence pairs' scores as the final summary coherence score. Summaries containing only one sentence get a coherence score of 0. We also conduct a correlation study to show that average aggregation works reasonably well (details in the Supplementary).
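The aggregation scheme above can be sketched as follows, with `pair_scorer` standing in for the trained neural model Coh(S_A, S_B):

```python
def summary_coherence(sentences, pair_scorer):
    """Average pairwise coherence over adjacent sentence pairs; a summary
    with fewer than two sentences gets 0, per the scheme above.
    `pair_scorer` stands in for the trained model Coh(S_A, S_B) in [-1, 1]."""
    if len(sentences) < 2:
        return 0.0
    scores = [pair_scorer(a, b) for a, b in zip(sentences, sentences[1:])]
    return sum(scores) / len(scores)
```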
Linguistic Quality Rewards (R_Ref & R_App). We further consider two linguistically-informed rewards to improve summary clarity and conciseness by penalizing (1) improper usage of referential pronouns, and (2) redundancy introduced by non-restrictive appositives and relative clauses.
Pronominal Referential Clarity. Referential pronouns occurring without their antecedents in a summary decrease its readability. For instance, a text in which the pronoun "they" occurs before the required referred entity is introduced would be less comprehensible. Therefore, at the RL step, we either penalize a summary with a reward of -1 for such improper usage, or give 0 otherwise. In our implementation, we define improper usage as the presence of a third-person pronoun or a possessive pronoun before any noun phrase occurs. The new reward is written as R(y) = R_{Rouge}(y) + \lambda_{Ref} R_{Ref}(y).
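The referential-clarity check can be sketched as below. This assumes pre-tagged tokens with a hypothetical "NP" tag marking noun phrases; the pronoun lists are illustrative, not the paper's exact inventory.

```python
# Illustrative pronoun inventories (assumptions, not the paper's exact lists).
THIRD_PERSON = {"he", "she", "they", "him", "her", "them"}
POSSESSIVE = {"his", "her", "their", "its"}

def improper_pronoun_reward(tagged_tokens):
    """R_Ref sketch: -1 if a third-person or possessive pronoun appears
    before any noun phrase, else 0. `tagged_tokens` is an assumed list of
    (token, tag) pairs, with "NP" marking noun-phrase heads."""
    for token, tag in tagged_tokens:
        if tag == "NP":
            return 0   # a noun phrase (antecedent candidate) appeared first
        if token.lower() in THIRD_PERSON | POSSESSIVE:
            return -1  # pronoun precedes any noun phrase: penalize
    return 0
```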
Apposition. Next, we consider a reward to teach the model to use apposition and relative clauses minimally, which improves summary conciseness. For this, we focus on non-restrictive appositives and relative clauses, which often carry non-critical information (Conroy et al., 2006; Wang et al., 2013) and can be automatically detected based on comma usage patterns. Specifically, a sentence contains a non-restrictive appositive if (i) it contains two commas, and (ii) the word after the first comma is a possessive pronoun or a determiner (Geva et al., 2019). We penalize a summary with -1 for using non-restrictive appositives and relative clauses, henceforth referred to as apposition, or give 0 otherwise. Similarly, we have the total reward R(y) = R_{Rouge}(y) + \lambda_{App} R_{App}(y).
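The comma-pattern heuristic can be sketched as follows, assuming pre-tokenized sentences; reading the "two commas" condition as at-least-two is an assumption, and the word lists are illustrative.

```python
# Illustrative word lists (assumptions, not the paper's exact inventories).
DETERMINERS = {"a", "an", "the", "this", "that", "these", "those"}
POSSESSIVES = {"his", "her", "their", "its", "my", "our", "your"}

def has_nonrestrictive_appositive(tokens):
    """Comma-pattern heuristic described above: flag a sentence that has
    two commas with a possessive pronoun or determiner right after the
    first comma (after Geva et al., 2019)."""
    if tokens.count(",") < 2:
        return False
    i = tokens.index(",")
    return i + 1 < len(tokens) and tokens[i + 1].lower() in DETERMINERS | POSSESSIVES

def apposition_reward(sentences):
    """R_App: -1 if any summary sentence uses such an appositive, else 0."""
    return -1 if any(has_nonrestrictive_appositive(s) for s in sentences) else 0
```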
2.4 Connecting Selection and Abstraction
Our entity-aware content selection component extracts salient sentences, whereas our abstract generation component compresses and paraphrases them. Until this point, they are trained separately without any form of parameter sharing. We add an additional step to connect the two networks by training them together via the self-critical learning algorithm based on policy gradients (the same methodology as in §2.2).
Following the Markov Decision Process formulation, at each time step t, our content selector generates a set of extracted sentences (x_ext) from an input article. Our abstract generator uses x_ext to generate a summary. This summary, evaluated against the respective human summary, receives ROUGE-1 as reward (see Eq. (7)). Note that the abstract generator, which has previously been trained with the average of ROUGE-L and ROUGE-2 as reward to promote fluency, is not updated during this step. In this extra stage, if our content selector accurately selects salient sentences, the abstract generator is more likely to produce a high-quality summary, and such an action will be encouraged, whereas actions resulting in inferior selections will be discouraged.
3 Experimental Setups
Datasets and Preprocessing. We evaluated our models on two popular summarization datasets: New York Times (NYT) (Sandhaus, 2008) and CNN/Daily Mail (CNN/DM) (Hermann et al., 2015). For NYT, we followed the preprocessing steps by Paulus et al. (2017) to obtain similar training (588,909), validation (32,716), and test (32,703) samples. Here, we additionally replaced author names with a special token. For CNN/DM, we followed the preprocessing steps in See et al. (2017), obtaining 287,188 training, 13,367 validation, and 11,490 testing samples.
For training our coherence model for CNN/DM, we used 890,419 triples constructed from summaries and input articles sampled from the CNN/DM training set. Similarly, for NYT, we sampled 884,494 triples from the NYT training set. For the validation and test sets of the two models, we sampled roughly 10% from the validation and test sets of the respective datasets. Our coherence model achieves 86% accuracy for CNN/DM and 84% for NYT. Additional evaluation of this model is reported in §4.1.
Training Details and Parameters. We used a vocabulary of the 50K most common words in the training set (See et al., 2017), with 128-dimensional word embeddings randomly initialized and updated during training. In the content selection component, for both entity and sentence encoders, we implemented a one-layer convolutional network with 100 dimensions and used a shared embedding matrix between the two. We employed LSTM models with 256-dimensional hidden states for the input article encoder (per direction) and the content selection decoder (Chen and Bansal, 2018). We used a similar setup for the abstract generator encoder and decoder. During ML training of both components, Adam (Kingma and Ba, 2015) is applied with a learning rate of 0.001, gradient clipping at 2.0, and batch size 32. During the RL stage, we reduced the learning rate to 0.0001 (Paulus et al., 2017) and set the batch size to 50. For our abstract generator, to reduce variance during RL training,