Using Stanford CoreNLP for Chinese Named Entity Recognition
I needed this for work, so I looked into the named entity recognition (NER) capabilities of Stanford CoreNLP.
Stanford CoreNLP is a powerful natural language processing toolkit; many of its models are trained with deep learning methods.
Useful links first:
stanfordnlp.github.io/CoreNLP/index.html
nlp.stanford.edu/nlp/javadoc/javanlp/
github.com/stanfordnlp/CoreNLP
This article explains how to use Stanford CoreNLP in a Java project.
1. Environment Setup
Versions after 3.5 all require Java 8 or newer to run. Chinese processing is memory-hungry: expect around 3 GB of memory consumption.
I use Maven to bring in the dependencies, with version 3.9.1.
Add the following dependencies directly to your pom file:
<dependency>
<groupId>edu.stanford.nlp</groupId>
<artifactId>stanford-corenlp</artifactId>
<version>3.9.1</version>
</dependency>
<dependency>
<groupId>edu.stanford.nlp</groupId>
<artifactId>stanford-corenlp</artifactId>
<version>3.9.1</version>
<classifier>models</classifier>
</dependency>
<dependency>
<groupId>edu.stanford.nlp</groupId>
<artifactId>stanford-corenlp</artifactId>
<version>3.9.1</version>
<classifier>models-chinese</classifier>
</dependency>
These three artifacts are the CoreNLP algorithms, the English models, and the Chinese models, respectively; together they total about 1.43 GB. Maven's default mirrors are overseas and these artifacts are very large, so try a domestic mirror that hosts all three dependencies. I used my company's internal Maven repository.
2. Invoking from Code
Note that since I need Chinese named entity recognition, I have to use the Chinese word segmenter and Chinese dictionaries. If you open the structure of the imported models jar,
you will find a StanfordCoreNLP-chinese.properties file that configures Chinese language processing: it specifies the pipeline's annotator steps and the locations of the corresponding model files. In practice you may not need every step, or you may want to use different models; in that case you can write a custom properties file and load that instead. In my project, I simply read this properties file directly.
Caveat: I only want the ner functionality here and would like to drop the other annotations. However, Stanford CoreNLP has a limitation: before ner can run, it requires
tokenize, ssplit, pos, lemma
to be included, which of course adds considerable processing time.
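As a sketch of how to trim the pipeline down to that required minimum (the class name NerOnlyProps and the inlined property lines are illustrative, not from the original project), one can load the defaults and then override the annotators property before constructing the pipeline:

```java
import java.io.StringReader;
import java.util.Properties;

public class NerOnlyProps {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // In a real project, load /StanfordCoreNLP-chinese.properties from the
        // models jar; two of its entries are inlined here for illustration.
        props.load(new StringReader(
                "annotators = tokenize, ssplit, pos, lemma, ner, parse, coref\n"
                + "tokenize.language = zh\n"));
        // ner will not run without tokenize, ssplit, pos and lemma, but parse
        // and coref can be dropped to cut startup time and memory usage.
        props.setProperty("annotators", "tokenize, ssplit, pos, lemma, ner");
        System.out.println(props.getProperty("annotators"));
    }
}
```

The trimmed Properties object is then passed to the StanfordCoreNLP constructor as in the full example below.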
Let's first walk through this properties file:
# Pipeline options - lemma is no-op for Chinese but currently needed because coref demands it (bad old requirements system)
annotators = tokenize, ssplit, pos, lemma, ner, parse, coref
# segment
tokenize.language = zh
segment.sighanCorporaDict = edu/stanford/nlp/models/segmenter/chinese
segment.serDictionary = edu/stanford/nlp/models/segmenter/chinese/dict-chris6.ser.gz
segment.sighanPostProcessing = true
# sentence split
ssplit.boundaryTokenRegex = [.。]|[!?!?]+
# pos
# ner: this sets the language and the (CRF) model used for NER; SUTime currently supports only English, not Chinese, so it is set to false.
ner.language = chinese
ner.applyNumericClassifiers = true
ner.useSUTime = false
# regexner
regexner.mapping = edu/stanford/nlp/models/kbp/chinese/cn_regexner_mapping.tab
regexner.noDefaultOverwriteLabels = CITY,COUNTRY,STATE_OR_PROVINCE
# parse
# depparse
depparse.language = chinese
# coref
coref.sieves = ChineseHeadMatch, ExactStringMatch, PreciseConstructs, StrictHeadMatch1, StrictHeadMatch2, StrictHeadMatch3, StrictHeadMatch4, PronounMatch
coref.input.type = raw
coref.postprocessing = true
coref.calculateFeatureImportance = false
coref.useConstituencyTree = true
coref.useSemantics = false
coref.algorithm = hybrid
coref.path.word2vec =
coref.language = zh
coref.defaultPronounAgreement = true
coref.zh.dict = edu/stanford/nlp/models/dcoref/zh-attributes.txt.gz
coref.print.md.log = false
coref.md.type = RULE
coref.md.liberalChineseMD = false
# kbp
kbp.semgrex = edu/stanford/nlp/models/kbp/chinese/semgrex
kbp.language = zh
# entitylink
entitylink.wikidict = edu/stanford/nlp/models/kbp/chinese/wikidict_chinese.tsv.gz
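The ssplit.boundaryTokenRegex above treats an ASCII or full-width period, or a run of exclamation/question marks (ASCII or full-width), as a sentence boundary. The regex can be checked in isolation with the standard library (the class name here is illustrative):

```java
import java.util.regex.Pattern;

public class BoundaryRegexCheck {
    public static void main(String[] args) {
        // Same pattern as ssplit.boundaryTokenRegex in the properties file.
        Pattern boundary = Pattern.compile("[.。]|[!?!?]+");
        System.out.println(boundary.matcher("。").matches());  // full-width period is a boundary
        System.out.println(boundary.matcher("!?").matches()); // a run of !/? is a boundary
        System.out.println(boundary.matcher(",").matches());  // a comma is not
    }
}
```

This is why the example text below, which uses the full-width 。 as its only sentence-final punctuation, still splits into sentences correctly.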
We then load this properties file directly in code; reference code follows:
package lp;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import edu.stanford.nlp.coref.CorefCoreAnnotations;
import edu.stanford.nlp.coref.data.CorefChain;
import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.semgraph.SemanticGraph;
import edu.stanford.nlp.semgraph.SemanticGraphCoreAnnotations;
import edu.stanford.nlp.trees.Tree;
import edu.stanford.nlp.trees.TreeCoreAnnotations;
import edu.stanford.nlp.util.CoreMap;
/**
 * Created by sonofelice on 2018/3/27.
 */
public class TestNLP {
    public void test() throws Exception {
        // Build a StanfordCoreNLP object and configure its annotators
        // (e.g. lemma for lemmatization, ner for named entity recognition).
        Properties props = new Properties();
        props.load(this.getClass().getResourceAsStream("/StanfordCoreNLP-chinese.properties"));
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
        String text = "袁隆平是中国科学院的院士,他于2009年10月到中国山东省东营市东营区永乐机场附近承包了一千亩盐碱地,"
                + "开始种植棉花, 年产量达到一万吨, 哈哈, 反正棣琦说的是假的,逗你玩儿,明天下午2点来我家吃饭吧。"
                + "棣琦是山东大学毕业的,目前在百度做java开发,位置是东北旺东路102号院,手机号14366778890";
        long startTime = System.currentTimeMillis();
        // Create an empty Annotation with just the given text
        Annotation document = new Annotation(text);
        // Run all annotators on this text
        pipeline.annotate(document);
        // Retrieve the processed results
        List<CoreMap> sentences = document.get(CoreAnnotations.SentencesAnnotation.class);
        for (CoreMap sentence : sentences) {
            // Traverse the words in the current sentence;
            // a CoreLabel is a CoreMap with additional token-specific methods
            for (CoreLabel token : sentence.get(CoreAnnotations.TokensAnnotation.class)) {
                // The token's text (a word, after segmentation)
                String word = token.get(CoreAnnotations.TextAnnotation.class);
                System.out.println(word);
                // Part-of-speech tag
                String pos = token.get(CoreAnnotations.PartOfSpeechAnnotation.class);
                System.out.println(pos);
                // Named entity recognition
                String ne = token.get(CoreAnnotations.NormalizedNamedEntityTagAnnotation.class);
                String ner = token.get(CoreAnnotations.NamedEntityTagAnnotation.class);
                System.out.println(word + " | analysis : { original : " + ner + "," + " normalized : "
                        + ne + "}");
                // Lemmatization
                String lema = token.get(CoreAnnotations.LemmaAnnotation.class);
                System.out.println(lema);
            }
            // Parse tree of the sentence
            Tree tree = sentence.get(TreeCoreAnnotations.TreeAnnotation.class);
            System.out.println("Parse tree of the sentence:");
            tree.pennPrint();
            // Dependency graph of the sentence
            SemanticGraph graph =
                    sentence.get(SemanticGraphCoreAnnotations.CollapsedCCProcessedDependenciesAnnotation.class);
            System.out.println("Dependency graph of the sentence:");
            System.out.println(graph.toString(SemanticGraph.OutputFormat.LIST));
        }
        long endTime = System.currentTimeMillis();
        long time = endTime - startTime;
        System.out.println("The analysis lasts " + time + " seconds * 1000");
        // Coreference chains: each chain stores a set of mentions that corefer;
        // both sentence and token offsets count from 1.
        Map<Integer, CorefChain> corefChains = document.get(CorefCoreAnnotations.CorefChainAnnotation.class);
        if (corefChains == null) {
            return;
        }
        for (Map.Entry<Integer, CorefChain> entry : corefChains.entrySet()) {
            System.out.println("Chain " + entry.getKey());
            for (CorefChain.CorefMention m : entry.getValue().getMentionsInTextualOrder()) {
                // We need to subtract one since the indices count from 1 but the Lists start from 0
                List<CoreLabel> tokens = sentences.get(m.sentNum - 1).get(CoreAnnotations.TokensAnnotation.class);
                // We subtract two for end: one for 0-based indexing, and one because we want
                // the last token of the mention, not the one following it.
                System.out.println(
                        "  " + m + ", i.e., 0-based character offsets [" + tokens.get(m.startIndex - 1).beginPosition()
                                + ", " + tokens.get(m.endIndex - 2).endPosition() + ")");
            }
        }
    }

    public static void main(String[] args) throws Exception {
        TestNLP nlp = new TestNLP();
        nlp.test();
    }
}
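The index arithmetic in the coref loop (subtracting 1 from startIndex but 2 from endIndex) is easy to get wrong, so here is a minimal stand-alone illustration using plain lists; the tokens and indices are made up, and the class name is illustrative:

```java
import java.util.Arrays;
import java.util.List;

public class CorefOffsetDemo {
    public static void main(String[] args) {
        // Hypothetical tokens of one sentence; CorefMention indices are 1-based.
        List<String> tokens = Arrays.asList("袁隆平", "是", "院士");
        int startIndex = 1; // 1-based index of the first token of the mention
        int endIndex = 2;   // 1-based index of the token *after* the mention
        String first = tokens.get(startIndex - 1); // subtract 1: Java lists are 0-based
        String last = tokens.get(endIndex - 2);    // subtract 2: also step back from the
                                                   // one-past-the-end position
        System.out.println(first + ".." + last);   // the mention spans just "袁隆平"
    }
}
```

With these conventions, a single-token mention has endIndex = startIndex + 1, and the printed character offsets form a half-open interval [begin, end).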
Of course, when I ran it I kept only the NER-related analysis and commented out the other functionality. The output is as follows:
19:46:16.000 [main] INFO e.s.nlp.pipeline.StanfordCoreNLP - Adding annotator pos
19:46:19.387 [main] INFO e.s.nlp.tagger.maxent.MaxentTagger - Loading POS tagger from edu/stanford/nlp/models/pos-tagger/chinese-distsim/chinese-distsim.tagger ... done [3.4 sec].
19:46:19.388 [main] INFO e.s.nlp.pipeline.StanfordCoreNLP - Adding annotator lemma
19:46:19.389 [main] INFO e.s.nlp.pipeline.StanfordCoreNLP - Adding annotator ner
19:46:21.938 [main] INFO ie.AbstractSequenceClassifier - Loading classifier from edu/stanford/nlp/models/ner/chinese.misc.distsim.crf.ser.gz ... done [2.5 sec].
19:46:22.099 [main] WARN p.TokensRegexNERAnnotator - TokensRegexNERAnnotator regexner: Entry has multiple types for ner: 巴伐利亚 STATE_OR_PROVINCE MISC,GPE,LOCATION 1. Taking type to be MISC
19:46:22.100 [main] WARN p.TokensRegexNERAnnotator - TokensRegexNERAnnotator regexner: Entry has multiple types for ner: 巴伐利亚州 STATE_OR_PROVINCE MISC,GPE,LOCATION 1. Taking type to be MISC
19:46:22.100 [main] INFO p.TokensRegexNERAnnotator - TokensRegexNERAnnotator regexner: Read 21238 unique entries out of 21249 from edu/stanford/nlp/models/kbp/chinese/cn_regexner_mapping.tab, 0 TokensRegex patterns.
19:46:22.532 [main] INFO e.s.nlp.pipeline.StanfordCoreNLP - Adding annotator parse
19:46:35.855 [main] INFO e.s.ParserGrammar - Loading parser from serialized file edu/stanford/nlp/models/srparser/chineseSR.ser.gz ... done [13.3 sec].
19:46:35.859 [main] INFO e.s.nlp.pipeline.StanfordCoreNLP - Adding annotator coref
19:46:43.139 [main] INFO pipeline.CorefMentionAnnotator - Using mention detector type: rule
19:46:43.148 [main] INFO e.s.nlp.wordseg.ChineseDictionary - Loading Chinese dictionaries from 1 file:
19:46:43.148 [main] INFO e.s.nlp.wordseg.ChineseDictionary - edu/stanford/nlp/models/segmenter/chinese/dict-chris6.ser.gz
19:46:43.329 [main] INFO e.s.nlp.wordseg.ChineseDictionary - Done. Unique words in ChineseDictionary is: 423200.
19:46:43.379 [main] INFO edu.stanford.nlp.wordseg.CorpusChar - Loading character dictionary file from edu/stanford/nlp/models/segmenter/chinese/dict/character_list [done].
19:46:43.380 [main] INFO e.s.nlp.wordseg.AffixDictionary - Loading affix dictionary from edu/stanford/nlp/models/segmenter/chinese/dict/in.ctb [done].
袁隆平 | analysis : { original : PERSON, normalized : null}
是 | analysis : { original : O, normalized : null}
中国 | analysis : { original : ORGANIZATION, normalized : null}
科学院 | analysis : { original : ORGANIZATION, normalized : null}
的 | analysis : { original : O, normalized : null}
院⼠ | analysis : { original : TITLE, normalized : null}
, | analysis : { original : O, normalized : null}
他 | analysis : { original : O, normalized : null}
于 | analysis : { original : O, normalized : null}
2009年 | analysis : { original : DATE, normalized : 2009-10-XX}
10⽉ | analysis : { original : DATE, normalized : 2009-10-XX}
到 | analysis : { original : O, normalized : null}
中国 | analysis : { original : COUNTRY, normalized : null}
⼭东省 | analysis : { original : STATE_OR_PROVINCE, normalized : null}
东营市 | analysis : { original : CITY, normalized : null}
东营区 | analysis : { original : FACILITY, normalized : null}
永乐 | analysis : { original : FACILITY, normalized : null}
机场 | analysis : { original : FACILITY, normalized : null}
附近 | analysis : { original : O, normalized : null}
承包 | analysis : { original : O, normalized : null}
了 | analysis : { original : O, normalized : null}
⼀千 | analysis : { original : NUMBER, normalized : 1000}
亩 | analysis : { original : O, normalized : null}
盐 | analysis : { original : O, normalized : null}
碱地 | analysis : { original : O, normalized : null}
, | analysis : { original : O, normalized : null}
开始 | analysis : { original : O, normalized : null}
种植 | analysis : { original : O, normalized : null}
棉花 | analysis : { original : O, normalized : null}
, | analysis : { original : O, normalized : null}
年产量 | analysis : { original : O, normalized : null}
达到 | analysis : { original : O, normalized : null}
⼀万 | analysis : { original : NUMBER, normalized : 10000}
吨 | analysis : { original : O, normalized : null}
, | analysis : { original : O, normalized : null}
哈哈 | analysis : { original : O, normalized : null}
, | analysis : { original : O, normalized : null}
反正 | analysis : { original : O, normalized : null}
棣琦 | analysis : { original : PERSON, normalized : null}
说 | analysis : { original : O, normalized : null}
的 | analysis : { original : O, normalized : null}
是 | analysis : { original : O, normalized : null}
假 | analysis : { original : O, normalized : null}
的 | analysis : { original : O, normalized : null}
, | analysis : { original : O, normalized : null}
逗 | analysis : { original : O, normalized : null}
你 | analysis : { original : O, normalized : null}
玩⼉ | analysis : { original : O, normalized : null}
, | analysis : { original : O, normalized : null}
明天 | analysis : { original : DATE, normalized : XXXX-XX-XX}
下午 | analysis : { original : TIME, normalized : null}
2点 | analysis : { original : TIME, normalized : null}
来 | analysis : { original : O, normalized : null}
我 | analysis : { original : O, normalized : null}
家 | analysis : { original : O, normalized : null}
吃饭 | analysis : { original : O, normalized : null}
吧 | analysis : { original : O, normalized : null}
。 | analysis : { original : O, normalized : null}
棣琦 | analysis : { original : PERSON, normalized : null}
是 | analysis : { original : O, normalized : null}
⼭东 | analysis : { original : ORGANIZATION, normalized : null}
⼤学 | analysis : { original : ORGANIZATION, normalized : null}
毕业 | analysis : { original : O, normalized : null}
的 | analysis : { original : O, normalized : null}
, | analysis : { original : O, normalized : null}
⽬前 | analysis : { original : DATE, normalized : null}
在 | analysis : { original : O, normalized : null}
百度 | analysis : { original : ORGANIZATION, normalized : null}
做 | analysis : { original : O, normalized : null}
java | analysis : { original : O, normalized : null}
开发 | analysis : { original : O, normalized : null}
, | analysis : { original : O, normalized : null}
位置 | analysis : { original : O, normalized : null}
是 | analysis : { original : O, normalized : null}
东北 | analysis : { original : LOCATION, normalized : null}
旺 | analysis : { original : O, normalized : null}
东路 | analysis : { original : O, normalized : null}
102 | analysis : { original : NUMBER, normalized : 102}
号院 | analysis : { original : O, normalized : null}
, | analysis : { original : O, normalized : null}
⼿机号 | analysis : { original : O, normalized : null}
143667788 | analysis : { original : NUMBER, normalized : 14366778890}
90 | analysis : { original : NUMBER, normalized : 14366778890}
The analysis lasts 819 seconds * 1000
Process finished with exit code 0
As you can see, the whole pipeline takes quite a while to start up, and the analysis itself is also fairly slow at 819 milliseconds.
The results are not entirely accurate either, and differ somewhat from what I get from the official online demo.