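Text Summarization with PyTorch in 5 Simple Steps
Introduction
Text summarization is a natural language processing (NLP) task whose goal is to generate a concise summary of a source text. Unlike extractive summarization, abstractive summarization does not simply copy important phrases from the source text; it can also produce new, relevant phrases, which can be seen as paraphrasing. Summarization has many applications across domains, from books and literature to science and R&D, financial research, and legal document analysis.
To date, the most effective approach to abstractive summarization is to use transformer models that have been fine-tuned on summarization datasets. In this article, we demonstrate how to summarize text with a powerful model in a few simple steps. The models we will use are already pre-trained, so no additional training is required :)
Let's get started!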
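Step 1: Install the Transformers library
The library we will use is Transformers, implemented by Huggingface. If you are not familiar with Transformers, you can read my previous article.
To install Transformers, simply run: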
pip install transformers
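Note that PyTorch must be installed beforehand. If you have not installed PyTorch yet, visit the official PyTorch website and follow the installation instructions there.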
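Before moving on, it may be worth confirming that both packages can be found by the interpreter you plan to use. A minimal sketch using only the standard library; the helper name missing_packages is illustrative and not part of either library:

```python
import importlib.util


def missing_packages(names=("torch", "transformers")):
    """Return the names that cannot be located in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_packages()
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("PyTorch and Transformers are both importable.")
```

This only checks that the packages can be located; it does not import them or verify that their versions are compatible.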
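Step 2: Import the libraries
After successfully installing Transformers, you can import it into your Python script. We can also import os to set the environment variable that selects the GPU in the next step. Note that this is entirely optional, but if you have multiple GPUs (for example, when working in a Jupyter notebook), it is good practice because it prevents the script from accidentally using the other GPUs.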
from transformers import pipeline
import os
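Step 3: Set the GPU and the model to use
If you decide to pin a GPU (for example, GPU 0), you can do so by setting an environment variable through os before the model is loaded.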
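A minimal sketch of pinning GPU 0, assuming the standard CUDA_VISIBLE_DEVICES environment variable is the one used; it should be set before PyTorch initializes CUDA:

```python
import os

# Assumption: CUDA_VISIBLE_DEVICES is the variable that controls device visibility.
# "0" exposes only the first GPU to this process; other GPUs stay hidden.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

print("Visible GPU(s):", os.environ["CUDA_VISIBLE_DEVICES"])
```

Setting the variable inside the script affects only the current process, so other notebooks or jobs on the same machine are not changed.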
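Now we are ready to choose the summarization model. Huggingface provides two powerful summarization models: BART (bart-large-cnn) and T5 (t5-small, t5-base, t5-large, t5-3b, t5-11b). You can learn more about them in their official papers (BART paper, T5 paper).
To use the BART model trained on the CNN/Daily Mail news dataset, you can use Huggingface's built-in pipeline module directly with its default parameters: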
summarizer = pipeline("summarization")
If you want to use a T5 model (for example, t5-base), which is pre-trained on the C4 Common Crawl web corpus, you can do the following:
summarizer = pipeline("summarization", model="t5-base", tokenizer="t5-base", framework="pt")
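The two calls above differ only in whether a model name is passed. A small sketch of that choice, assuming hypothetical helper names summarizer_kwargs and load_summarizer (neither is part of Transformers) and assuming the PyTorch backend; pipeline is imported lazily so the argument-building part can be inspected without loading a model:

```python
def summarizer_kwargs(model_name=None, framework="pt"):
    """Build keyword arguments for transformers.pipeline (hypothetical helper)."""
    kwargs = {"task": "summarization", "framework": framework}
    if model_name is not None:
        # Use the same checkpoint name for the model and its tokenizer.
        kwargs["model"] = model_name
        kwargs["tokenizer"] = model_name
    return kwargs


def load_summarizer(model_name=None):
    """Create the pipeline; this downloads the weights on first use."""
    from transformers import pipeline  # lazy import: only needed when loading

    return pipeline(**summarizer_kwargs(model_name))


print(summarizer_kwargs())           # library default model
print(summarizer_kwargs("t5-base"))  # explicit T5 checkpoint
```

Calling load_summarizer("t5-base") would then return an object used the same way as summarizer in the rest of the article.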
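Step 4: Input the text to summarize
Now that the model is ready, we can provide the text we want to summarize. Suppose we want to summarize the following passage about COVID-19 vaccines from a MedicineNet article: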
One month after the United States began what has become a troubled rollout of a national COVID vaccination campaign, the effort is finally gathering real steam.
Close to a million doses -- over 951,000, to be more exact -- made their way into the arms of Americans in the past 24 hours, the U.S. Centers for Disease Control and Prevention reported Wednesday. That's the largest number of shots given in one day since the rollout began and a big jump from the previous day, when just under 340,000 doses were given, CBS News reported.
That number is likely to jump quickly after the federal government on Tuesday gave states the OK to vaccinate anyone over 65 and said it would release all the doses of vaccine it has available for distribution. Meanwhile, a number of states have now opened mass vaccination sites in an effort to get larger numbers of people inoculated, CBS News reported.
We define the variable:
text = """One month after the United States began what has become a troubled rollout of a national COVID vaccination campaign, the effort is finally gathering real steam.
Close to a million doses -- over 951,000, to be more exact -- made their way into the arms of Americans in the past 24 hours, the U.S. Centers for Disease Control and Prevention reported Wednesday. That's the largest number of shots given in one day since the rollout began and a big jump from the previous day, when just under 340,000 doses were given, CBS News reported.
That number is likely to jump quickly after the federal government on Tuesday gave states the OK to vaccinate anyone over 65 and said it would release all the doses of vaccine it has available for distribution. Meanwhile, a number of states have now opened mass vaccination sites in an effort to get larger numbers of people inoculated, CBS News reported."""
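Before summarizing, it can help to check roughly how long the input is, since the length limits passed to the summarizer bound the size of the output. A minimal sketch, assuming a whitespace word count is an acceptable rough proxy (the model itself counts tokens, not words); the helper name word_count and the shortened sample string are illustrative only:

```python
def word_count(text):
    """Rough length estimate: number of whitespace-separated words."""
    return len(text.split())


# Shortened sample; in the article this would be the full `text` variable.
sample = """One month after the United States began what has become a troubled
rollout of a national COVID vaccination campaign, the effort is finally
gathering real steam."""

print("Approximate input length in words:", word_count(sample))
```

If the input is already shorter than the requested maximum summary length, there is little left for the model to condense.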
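Step 5: Summarize
Finally, we can summarize the input text. Here we specify the min_length and max_length we want for the summary output, and we turn off sampling so that the generated summary is deterministic. We can do this by running the following: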
summary_text = summarizer(text, max_length=100, min_length=5, do_sample=False)[0]['summary_text']
print(summary_text)
We get the summary text:
Over 951,000 doses of vaccine given in one day in the past 24 hours, CDC says . That's the largest number of shots given in a month since the rollout began . The federal government gave states the OK to vaccinate anyone over 65 on Tuesday . A number of states have now opened mass vaccination sites in an effort to get more people inoculated, CBS News reports .
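As the summary shows, the model knows that 24 hours is equivalent to one day and cleverly abbreviates the U.S. Centers for Disease Control and Prevention as CDC. In addition, the model links information from the first and second paragraphs, indicating that this is the largest number of shots given since the rollout began last month. We can see that the summarization model performs quite well.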
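One rough way to see this kind of paraphrasing is to list the words in a summary that never occur in the source. A minimal sketch, assuming simple lowercase word matching; novel_words is an illustrative helper, and the two short strings are simplified stand-ins rather than the article's actual input and output:

```python
import re


def novel_words(source, summary):
    """Return the sorted words that appear in the summary but not in the source."""
    def words(s):
        return set(re.findall(r"[a-z0-9']+", s.lower()))

    return sorted(words(summary) - words(source))


# Simplified stand-ins for the source passage and the generated summary.
source = "The U.S. Centers for Disease Control and Prevention reported the figure Wednesday."
summary = "CDC reported the figure."

print(novel_words(source, summary))
```

Words that show up in this list, such as an abbreviation the source never spells that way, are ones the model introduced itself rather than copied.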
Finally, putting it all together, here is the complete code as a Jupyter notebook:
/itsuncheng/f3c4dde81ac4651383c4480958da4f8e#file-summarization-ipynb
Lewis, Mike, et al. "BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension." arXiv preprint arXiv:1910.13461 (2019).
Raffel, Colin, et al. "Exploring the limits of transfer learning with a unified text-to-text transformer." arXiv preprint arXiv:1910.10683 (2019).