NLP: Lemmatisation and Stemming, NLTK pos_...

Updated: 2023-05-06 01:41:05

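Lemmatization reduces a word to its base (dictionary) form, which is itself a meaningful word; the method is relatively complex. Stemming extracts the stem or root of a word, which is not necessarily a meaningful word; the method is relatively simple.
Stemming:
Stemming is based on the rules of the language, for example the rules for forming noun plurals in English. Because it is rule-based, words that fall outside the rules can be handled incorrectly.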
# PorterStemmer implements the Porter stemming algorithm
from nltk.stem.porter import PorterStemmer
porter_stemmer = PorterStemmer()
porter_stemmer.stem('leaves')
# Output: 'leav'
# The expected base form is the noun 'leaf'
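To make "rule-based" concrete, here is a minimal sketch of a suffix-stripping stemmer in plain Python. It is only an illustration and not the Porter algorithm: the rule list `SUFFIX_RULES` and the function `toy_stem` are made up for this example. It shows how a word the rules do not cover properly ("leaves", "feet") ends up with a stem that is not the expected base form.

```python
# Toy suffix-stripping stemmer: an illustration only, NOT the Porter algorithm.
# The rule list below is an assumption made up for this example.
SUFFIX_RULES = [
    ("sses", "ss"),  # classes -> class
    ("ies", "i"),    # ponies  -> poni
    ("ing", ""),     # walking -> walk
    ("ed", ""),      # walked  -> walk
    ("s", ""),       # cats    -> cat
]


def toy_stem(word):
    """Apply the first matching suffix rule; return the word unchanged otherwise."""
    word = word.lower()
    for suffix, replacement in SUFFIX_RULES:
        # Keep at least two characters of stem so short words such as "is" survive.
        if word.endswith(suffix) and len(word) - len(suffix) >= 2:
            return word[: len(word) - len(suffix)] + replacement
    return word


for w in ["cats", "walking", "ponies", "leaves", "feet"]:
    print(w, "->", toy_stem(w))
```

With these rules "leaves" only loses its final "s" and "feet" is left untouched, because no rule knows about irregular plurals.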
The main stemmers in nltk are:
# Porter stemming algorithm
from nltk.stem.porter import PorterStemmer
porter_stemmer = PorterStemmer()
porter_stemmer.stem('maximum')
# Lancaster stemming algorithm
from nltk.stem.lancaster import LancasterStemmer
lancaster_stemmer = LancasterStemmer()
lancaster_stemmer.stem('maximum')
# Snowball stemming algorithm
from nltk.stem import SnowballStemmer
snowball_stemmer = SnowballStemmer("english")
snowball_stemmer.stem('maximum')
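Lemmatisation:
Lemmatisation is based on a dictionary mapping. In nltk the part of speech has to be given explicitly (WordNetLemmatizer.lemmatize defaults to pos='n'), otherwise the result may be wrong. The usual pipeline is therefore: tokenize, tag parts of speech, then lemmatize.
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
lemmatizer.lemmatize('leaves')
# Output: 'leaf'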
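The following sketch shows why the part of speech matters for a dictionary-based approach. It uses a tiny hand-written dictionary, not WordNet: `TOY_LEMMAS` and `toy_lemmatize` are hypothetical names introduced only for this illustration, and a real lemmatizer relies on a far larger lexicon.

```python
# Hypothetical miniature lemma dictionary keyed by (word, pos).
# This is an assumption for illustration; it is not how WordNet stores its data.
TOY_LEMMAS = {
    ("leaves", "n"): "leaf",
    ("leaves", "v"): "leave",
    ("loves", "n"): "love",
    ("loves", "v"): "love",
    ("better", "a"): "good",
}


def toy_lemmatize(word, pos="n"):
    """Look up (word, pos); fall back to the word itself when there is no entry."""
    return TOY_LEMMAS.get((word.lower(), pos), word)


print(toy_lemmatize("leaves"))           # noun reading (the default)
print(toy_lemmatize("leaves", pos="v"))  # verb reading
print(toy_lemmatize("better"))           # no noun entry, so returned unchanged
print(toy_lemmatize("better", pos="a"))  # adjective reading
```

The same surface form maps to different lemmas depending on the tag, and a lookup with the wrong tag silently returns the word unchanged.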
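Full process:
from nltk import word_tokenize, pos_tag
from nltk.stem import WordNetLemmatizer
# Tokenize, tag parts of speech, then lemmatize with an explicit part of speech
word_tokenize("apples % , I've loves green")
pos_tag(word_tokenize("apples % , I've loves green"))
wnl = WordNetLemmatizer()
wnl.lemmatize('apples', pos='n')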
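def lemmatize_all(sentence):
    """Yield the lemma of each token, choosing the WordNet part of speech from its tag."""
    wnl = WordNetLemmatizer()
    for word, tag in pos_tag(word_tokenize(sentence)):
        if tag.startswith('NN'):    # nouns
            yield wnl.lemmatize(word, pos='n')
        elif tag.startswith('VB'):  # verbs
            yield wnl.lemmatize(word, pos='v')
        elif tag.startswith('JJ'):  # adjectives
            yield wnl.lemmatize(word, pos='a')
        elif tag.startswith('RB'):  # adverbs ('RB' rather than 'R', so the particle tag RP is not matched)
            yield wnl.lemmatize(word, pos='r')
        else:
            yield word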
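# train_feature and test_feature are assumed to be lists of raw sentences defined elsewhere
train_f = []
test_f = []
for i in range(0, len(train_feature)):
    train_f.append(' '.join(lemmatize_all(train_feature[i])))
for i in range(0, len(test_feature)):
    test_f.append(' '.join(lemmatize_all(test_feature[i])))
NLTK part-of-speech tags: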
CC    coordinating conjunction: and, or, but
CD    cardinal number: twenty-four, fourth, 1991, 14:24
DT    determiner: the, a, some, most, every, no
EX    existential there: there, there's
FW    foreign word: dolce, ersatz, esprit, quo, maitre
IN    preposition or subordinating conjunction: on, of, at, with, by, into, under, if, while, although
JJ    adjective: new, good, high, special, big, local
JJR   adjective, comparative: bleaker braver breezier briefer brighter brisker
JJS   adjective, superlative: calmest cheapest choicest classiest cleanest clearest
LS    list item marker: A A. B B. C C. D E F First G H I J K
MD    modal verb: can cannot could couldn't
NN    noun: year, home, costs, time, education
NNS   noun, plural: undergraduates scotches
NNP   proper noun: Alison, Africa, April, Washington
NNPS  proper noun, plural: Americans Americas Amharas Amityvilles
PDT   predeterminer: all both half many
POS   possessive marker: ' 's
PRP   personal pronoun: hers herself him himself hisself
PRP$  possessive pronoun: her his mine my our ours
RB    adverb: occasionally unabatingly maddeningly
RBR   adverb, comparative: further gloomier grander
RBS   adverb, superlative: best biggest bluntest earliest
RP    particle: aboard about across along apart
SYM   symbol: % & ' '' ''. ) )
TO    the word "to": to
UH    interjection: Goodbye Goody Gosh Wow
VB    verb, base form: ask assemble assess
VBD   verb, past tense: dipped pleaded swiped
VBG   verb, present participle: telegraphing stirring focusing
VBN   verb, past participle: multihulled dilapidated aerosolized
VBP   verb, present tense, not third person singular: predominate wrap resort sue
VBZ   verb, present tense, third person singular: bases reconstructs marks
WDT   WH-determiner: which, what, whatever, that
WP    WH-pronoun: who, what, whatever, that
WP$   possessive WH-pronoun: whose
WRB   WH-adverb: when, where, how
# Show the documentation for a tag
import nltk
nltk.help.upenn_tagset('JJ')
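The tag prefixes in the table can also be mapped to the single-letter codes passed as `pos=` in the `lemmatize` calls above with a small lookup. The sketch below is one possible way to write that mapping; `PREFIX_TO_POS` and `penn_to_wordnet` are hypothetical names, and the code does not call nltk at all.

```python
# Map a Penn Treebank tag to the single-letter part-of-speech codes
# 'n', 'v', 'a', 'r' used in lemmatize(word, pos=...).
# The names below are assumptions introduced for this illustration.
PREFIX_TO_POS = (
    ("NN", "n"),  # NN, NNS, NNP, NNPS
    ("VB", "v"),  # VB, VBD, VBG, VBN, VBP, VBZ
    ("JJ", "a"),  # JJ, JJR, JJS
    ("RB", "r"),  # RB, RBR, RBS (RP, the particle tag, is deliberately excluded)
)


def penn_to_wordnet(tag):
    """Return 'n', 'v', 'a' or 'r' for a Penn Treebank tag, or None if none applies."""
    for prefix, pos in PREFIX_TO_POS:
        if tag.startswith(prefix):
            return pos
    return None


for tag in ["NNS", "VBD", "JJR", "RBS", "RP", "DT"]:
    print(tag, "->", penn_to_wordnet(tag))
```

Returning None for the remaining tags lets the caller decide whether to skip lemmatization for that token or fall back to a default.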

Published: 2023-05-06 01:41:05

Article link: https://www.wtabcd.cn/fanwen/fan/82/536334.html
