10 Common Natural Language Processing Techniques

Updated: 2023-06-01 18:02:01

Introduction

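Natural language processing (NLP) combines art and science to extract information from text data. With it, we distill text into information that computer algorithms can work with. From machine translation and text classification to sentiment analysis, NLP has become an essential skill for data scientists.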

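The 10 common NLP tasks are:

1. Stemming
2. Lemmatization
3. Word vectorization
4. Part-of-speech tagging
5. Named entity disambiguation
6. Named entity recognition
7. Sentiment analysis
8. Semantic text similarity
9. Language identification
10. Text summarization

Each is discussed in detail below: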

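1. Stemming

What is stemming? Stemming is the process of reducing inflected or derived words to their stem or root form. The goal is to map related words to the same stem, even if that stem is not itself a dictionary entry. For example, in English:

1. beautiful and beautifully both have the stem beauti.
2. good, better and best have the stems good, better and best, respectively (a stemmer does not connect irregular forms).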
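The two examples above can be reproduced with a very small suffix-stripping routine. The sketch below is an illustration only: the suffix list and the `toy_stem` name are assumptions made for this example, and it is far cruder than an established stemmer such as the Porter algorithm.

```python
# Toy suffix-stripping stemmer. This is an illustrative assumption,
# NOT the Porter algorithm or any library's implementation.
SUFFIXES = ("fully", "ful", "ly", "ing", "ed", "s")  # tried in this order


def toy_stem(word, min_stem_len=3):
    """Strip the first matching suffix, keeping at least min_stem_len letters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word  # no suffix matched: the word is its own stem


for w in ["beautiful", "beautifully", "good", "better", "best"]:
    print(w, "->", toy_stem(w))
# beautiful -> beauti
# beautifully -> beauti
# good -> good
# better -> better
# best -> best
```

Because the routine only looks at surface suffixes, it maps beautiful and beautifully to the same stem but leaves good, better and best unrelated, which matches the behaviour described above.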

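Pretrained word vectors:

#!pip install gensim
from gensim.models.keyedvectors import KeyedVectors

# Load the pretrained Google News word2vec vectors (binary format).
word_vectors = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)

# Look up the 300-dimensional vector for "human".
word_vectors['human']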

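Implementation: the following code trains your own word vectors with gensim.

import gensim

sentences = [['first', 'sentence'], ['second', 'sentence']]
# vector_size is the gensim 4.x name of this parameter; gensim 3.x calls it size.
model = gensim.models.Word2Vec(sentences, min_count=1, vector_size=300, workers=4)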

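4. Part-of-speech tagging

What is part-of-speech tagging? Put simply, part-of-speech (POS) tagging labels each word in a sentence as a noun, verb, adjective, adverb, and so on. For example, for the sentence "Ashok killed the snake with a stick", a POS tagger identifies:

Ashok proper noun
killed verb
the determiner
snake noun
with preposition
a determiner
stick noun
. punctuation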

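Implementation: the following code performs POS tagging with spaCy.

#!pip install spacy
#!python -m spacy download en_core_web_sm
import spacy

# spaCy 3.x loads models by full package name; the old 'en' shortcut no longer works.
nlp = spacy.load('en_core_web_sm')
sentence = "Ashok killed the snake with a stick"
for token in nlp(sentence):
    print(token, token.pos_)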

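5. Named entity disambiguation

What is named entity disambiguation? Named entity disambiguation is the process of determining which real-world entity a mention in a sentence refers to. For example, given the sentence "Apple earned a revenue of 200 Billion USD in 2016", it infers that Apple refers to the company Apple Inc. rather than to the fruit. In general, named entity disambiguation requires a knowledge base of entities so that mentions in a sentence can be linked to entries in it.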
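The linking step can be sketched as choosing the knowledge-base entry whose typical context words overlap most with the sentence. Everything in this sketch is hypothetical: the tiny `KNOWLEDGE_BASE`, its keyword sets and the `disambiguate` function are made up for illustration, and real systems rely on large knowledge bases and learned models rather than keyword overlap.

```python
import re

# Hypothetical mini knowledge base: mention -> candidate entities -> context words.
# The entries and keywords are invented for this illustration.
KNOWLEDGE_BASE = {
    "Apple": {
        "Apple Inc. (company)": {"revenue", "earned", "billion", "usd", "iphone", "company"},
        "apple (fruit)": {"fruit", "tree", "eat", "juice", "pie", "sweet"},
    }
}


def disambiguate(mention, sentence):
    """Return the candidate entity sharing the most context words with the sentence."""
    context = set(re.findall(r"[a-z]+", sentence.lower()))
    best_entity, best_overlap = None, 0
    for entity, keywords in KNOWLEDGE_BASE.get(mention, {}).items():
        overlap = len(context & keywords)
        if overlap > best_overlap:
            best_entity, best_overlap = entity, overlap
    return best_entity  # None if the mention is unknown or nothing overlaps


print(disambiguate("Apple", "Apple earned a revenue of 200 Billion USD in 2016"))
# Apple Inc. (company)
```

The example sentence shares earned, revenue, billion and usd with the company entry and nothing with the fruit entry, so the company is selected.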

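6. Named entity recognition

What is named entity recognition? Named entity recognition (NER) is the task of identifying entities with a specific meaning in a sentence and classifying them into categories such as person, organization, date, location and time. For example, given the sentence:

"Ram of Apple Inc. travelled to Sydney on 5th October 2017"

an NER system returns the following result: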

Ram
Apple ORG
Inc. ORG
travelled
Sydney GPE
5th DATE
October DATE
2017 DATE

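Here, ORG denotes an organization and GPE denotes a geopolitical entity, such as a city or country.

However, even state-of-the-art NER systems tend to perform poorly when applied to a domain different from the one they were trained on.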

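Implementation: the following code performs named entity recognition with spaCy.

import spacy

nlp = spacy.load('en_core_web_sm')
sentence = "Ram of Apple Inc. travelled to Sydney on 5th October 2017"
for token in nlp(sentence):
    # ent_type_ is the token's entity label, or '' if it is not part of an entity.
    print(token, token.ent_type_)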

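7. Sentiment analysis

What is sentiment analysis? Sentiment analysis is a broad form of subjectivity analysis that uses NLP techniques to identify, for example, the sentiment of customer reviews, whether a statement is positive or negative, or the emotion expressed in speech or written text. For example:

"I do not like chocolate ice cream" is a negative assessment of the ice cream.

"I do not hate chocolate ice cream" can be considered a neutral assessment.

Many methods can be used for sentiment analysis, ranging from counting the positive and negative words in a sentence to using LSTMs with word embeddings.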
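The simplest end of that range, counting positive and negative words, can be sketched as follows. The word lists and the negation rule are assumptions chosen so that the two example sentences come out as described; they are not taken from any published lexicon.

```python
# Tiny hand-made lexicons: illustrative assumptions, not a real sentiment lexicon.
POSITIVE = {"like", "love", "good", "great"}
NEGATIVE = {"hate", "dislike", "bad", "awful"}
NEGATORS = {"not", "never", "no"}


def sentiment_score(sentence):
    """Count +1 per positive word and -1 per negative word, with crude negation.

    Assumed rule: a negated positive word counts as negative (-1),
    while a negated negative word counts as neutral (0).
    """
    score = 0
    negated = False
    for raw in sentence.lower().replace("n't", " not").split():
        word = raw.strip(".,!?\"'")
        if word in NEGATORS:
            negated = True
        elif word in POSITIVE:
            score += -1 if negated else 1
            negated = False
        elif word in NEGATIVE:
            score += 0 if negated else -1
            negated = False
    return score


def sentiment_label(sentence):
    score = sentiment_score(sentence)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"


print(sentiment_label("I do not like chocolate ice cream"))  # negative
print(sentiment_label("I do not hate chocolate ice cream"))  # neutral
```

A word-counting approach like this ignores word order beyond the single negation flag, which is one reason sequence models such as LSTMs are used for harder cases.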

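8. Semantic text similarity

What is semantic text similarity? Semantic text similarity analysis measures how close two pieces of text are in meaning. Note that similarity is not the same as relatedness.

For example:

A car and a bus are similar, whereas a car and fuel are related.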
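As a rough baseline, the closeness of two texts is often scored with cosine similarity between vector representations. The sketch below uses plain word-count vectors, which is an assumption made to keep the example self-contained: it only captures shared vocabulary, so it is a lexical proxy rather than a true measure of meaning (it would give car and bus a score of zero). Semantic systems typically replace the count vectors with learned embeddings.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lower-case word counts for a text."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine_similarity(text_a, text_b):
    """Cosine of the angle between the two word-count vectors (0.0 to 1.0)."""
    va, vb = bag_of_words(text_a), bag_of_words(text_b)
    dot = sum(va[w] * vb[w] for w in va.keys() & vb.keys())
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0


print(cosine_similarity("the car is fast", "the bus is fast"))  # 0.75
print(cosine_similarity("car", "fuel"))                         # 0.0
```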

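9. Language identification

What is language identification? Language identification is the task of determining which language a text is written in. It relies on the statistical and syntactic properties of languages, and can be viewed as a special case of text classification.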
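One simple statistical property is how often a text uses the very frequent function words of each language. The sketch below counts such words for four languages; the word lists are small hand-picked assumptions, and practical identifiers usually use character n-gram statistics over many more languages.

```python
import re

# Small hand-picked function-word lists: illustrative assumptions only.
COMMON_WORDS = {
    "en": {"the", "and", "is", "of", "to", "in", "it", "on", "that", "a"},
    "fr": {"le", "la", "et", "est", "de", "les", "un", "une", "il", "sur"},
    "de": {"der", "die", "das", "und", "ist", "nicht", "ein", "eine", "zu", "auf"},
    "es": {"el", "la", "y", "es", "de", "los", "un", "una", "en", "que"},
}


def identify_language(text):
    """Return the language code whose common words occur most often in the text."""
    tokens = re.findall(r"[^\W\d_]+", text.lower())  # runs of letters, Unicode-aware
    counts = {
        lang: sum(1 for t in tokens if t in words)
        for lang, words in COMMON_WORDS.items()
    }
    best = max(counts, key=counts.get)
    return best if counts[best] > 0 else "unknown"


print(identify_language("the cat is on the table and it is happy"))     # en
print(identify_language("le chat est sur la table et il est content"))  # fr
```

Viewed this way the task is a text classifier with one class per language, which is the special case mentioned above.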

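10. Text summarization

What is text summarization? Text summarization shortens a text by identifying its key points and building a summary from them. The goal is to shorten the text as much as possible without changing its meaning.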

Implementation: the following shows how to summarize text quickly with the gensim package.

# gensim.summarization was removed in gensim 4.0, so this requires gensim 3.x.
from gensim.summarization import summarize

sentence = "Automatic summarization is the process of shortening a text document with software, in order to create a summary with the major points of the original document. Technologies that can make a coherent summary take into account variables such as length, writing style and syntax. Automatic data summarization is part of machine learning and data mining. The main idea of summarization is to find a subset of data which contains the information of the entire set. Such techniques are widely used in industry today. Search engines are an example; others include summarization of documents, image collections and videos. Document summarization tries to create a representative summary or abstract of the entire document, by finding the most informative sentences, while in image summarization the system finds the most representative and important (i.e. salient) images. For surveillance videos, one might want to extract the important events from the uneventful context. There are two general approaches to automatic summarization: extraction and abstraction. Extractive methods work by selecting a subset of existing words, phrases, or sentences in the original text to form the summary. In contrast, abstractive methods build an internal semantic representation and then use natural language generation techniques to create a summary that is closer to what a human might express. Such a summary might include verbal innovations. Research to date has focused primarily on extractive methods, which are appropriate for image collection summarization and video summarization."

print(summarize(sentence))
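For readers who want to see the extractive idea without any library, the sketch below scores each sentence by the frequency of its words and keeps the highest-scoring ones. It is a simple word-frequency heuristic written for illustration, not the algorithm gensim uses; the stop-word list and the `summarize_extractive` name are assumptions.

```python
import re
from collections import Counter

# Minimal stop-word list: an illustrative assumption.
STOPWORDS = {"a", "an", "the", "of", "to", "in", "and", "is", "are", "that", "with", "by", "or"}


def split_sentences(text):
    """Naive sentence splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def summarize_extractive(text, n_sentences=2):
    """Return the n highest-scoring sentences, in their original order."""
    sentences = split_sentences(text)
    tokenized = [
        [w for w in re.findall(r"[a-z']+", s.lower()) if w not in STOPWORDS]
        for s in sentences
    ]
    freq = Counter(w for words in tokenized for w in words)
    # A sentence's score is the summed document frequency of its content words.
    scores = [sum(freq[w] for w in words) for words in tokenized]
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:n_sentences]
    return [sentences[i] for i in sorted(top)]


print(summarize_extractive("Cats chase mice. Cats sleep a lot. Dogs bark.", n_sentences=2))
# ['Cats chase mice.', 'Cats sleep a lot.']
```

Summing frequencies favours long sentences; dividing each score by the sentence length is a common adjustment when that bias matters.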

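Conclusion

These are the most popular NLP tasks, along with related resources such as blog posts, research papers, repositories and applications.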

Article link: https://www.wtabcd.cn/fanwen/fan/82/828024.html
