Using Stanford CoreNLP for Chinese Named Entity Recognition
I needed this for work, so I looked into the named entity recognition (NER) capabilities of Stanford CoreNLP.
Stanford CoreNLP is a powerful natural language processing toolkit; many of its models are trained with deep learning methods.
Useful links first:
stanfordnlp.github.io/CoreNLP/index.html
nlp.stanford.edu/nlp/javadoc/javanlp/
github.com/stanfordnlp/CoreNLP
This article explains how to use Stanford CoreNLP in a Java project.
1. Environment Setup
Versions after 3.5 all require Java 8 or newer to run. Chinese processing is memory-hungry: expect around 3 GB of memory consumption.
I use Maven to bring in the dependencies, with version 3.9.1.
Add the following dependencies directly to your pom file:
<dependency>
<groupId>edu.stanford.nlp</groupId>
<artifactId>stanford-corenlp</artifactId>
<version>3.9.1</version>
</dependency>
<dependency>
<groupId>edu.stanford.nlp</groupId>
<artifactId>stanford-corenlp</artifactId>
<version>3.9.1</version>
<classifier>models</classifier>
</dependency>
<dependency>
<groupId>edu.stanford.nlp</groupId>
<artifactId>stanford-corenlp</artifactId>
<version>3.9.1</version>
<classifier>models-chinese</classifier>
</dependency>
These three artifacts are the CoreNLP algorithms, the English models, and the Chinese models, respectively; together they total about 1.43 GB. Maven's default mirrors are overseas and these artifacts are very large, so try a domestic mirror that hosts all three dependencies. I used my company's internal Maven repository.
2. Invoking from Code
Note that since I need Chinese named entity recognition, I have to use the Chinese word segmenter and Chinese dictionaries. If you open the structure of the imported models jar,
you will find a StanfordCoreNLP-chinese.properties file that configures Chinese language processing: it specifies the pipeline's annotator steps and the locations of the corresponding model files. In practice you may not need every step, or you may want to use different models; in that case you can write a custom properties file and load that instead. In my project, I simply read this properties file directly.
Caveat: I only want the ner functionality here and would like to drop the other annotations. However, Stanford CoreNLP has a limitation: before ner can run, it requires
tokenize, ssplit, pos, lemma
to be included, which of course adds considerable processing time.
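As a sketch of how to trim the pipeline down to that required minimum (the class name NerOnlyProps and the inlined property lines are illustrative, not from the original project), one can load the defaults and then override the annotators property before constructing the pipeline:

```java
import java.io.StringReader;
import java.util.Properties;

public class NerOnlyProps {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // In a real project, load /StanfordCoreNLP-chinese.properties from the
        // models jar; two of its entries are inlined here for illustration.
        props.load(new StringReader(
                "annotators = tokenize, ssplit, pos, lemma, ner, parse, coref\n"
                + "tokenize.language = zh\n"));
        // ner will not run without tokenize, ssplit, pos and lemma, but parse
        // and coref can be dropped to cut startup time and memory usage.
        props.setProperty("annotators", "tokenize, ssplit, pos, lemma, ner");
        System.out.println(props.getProperty("annotators"));
    }
}
```

The trimmed Properties object is then passed to the StanfordCoreNLP constructor as in the full example below.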
Let's first walk through this properties file:
# Pipeline options - lemma is no-op for Chinese but currently needed because coref demands it (bad old requirements system)
annotators = tokenize, ssplit, pos, lemma, ner, parse, coref
# segment
tokenize.language = zh
segment.sighanCorporaDict = edu/stanford/nlp/models/segmenter/chinese
segment.serDictionary = edu/stanford/nlp/models/segmenter/chinese/dict-chris6.ser.gz
segment.sighanPostProcessing = true
# sentence split
ssplit.boundaryTokenRegex = [.。]|[!?!?]+
# pos
# ner: this sets the language and the (CRF) model used for NER; SUTime currently supports only English, not Chinese, so it is set to false.
ner.language = chinese
ner.applyNumericClassifiers = true
ner.useSUTime = false
# regexner
regexner.mapping = edu/stanford/nlp/models/kbp/chinese/cn_regexner_mapping.tab
regexner.noDefaultOverwriteLabels = CITY,COUNTRY,STATE_OR_PROVINCE
# parse
# depparse
depparse.language = chinese
# coref
coref.sieves = ChineseHeadMatch, ExactStringMatch, PreciseConstructs, StrictHeadMatch1, StrictHeadMatch2, StrictHeadMatch3, StrictHeadMatch4, PronounMatch
coref.input.type = raw
coref.postprocessing = true
coref.calculateFeatureImportance = false
coref.useConstituencyTree = true
coref.useSemantics = false
coref.algorithm = hybrid
coref.path.word2vec =
coref.language = zh
coref.defaultPronounAgreement = true
coref.zh.dict = edu/stanford/nlp/models/dcoref/zh-attributes.txt.gz
coref.print.md.log = false
coref.md.type = RULE
coref.md.liberalChineseMD = false
# kbp
kbp.semgrex = edu/stanford/nlp/models/kbp/chinese/semgrex
kbp.language = zh
# entitylink
entitylink.wikidict = edu/stanford/nlp/models/kbp/chinese/wikidict_chinese.tsv.gz
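The ssplit.boundaryTokenRegex above treats an ASCII or full-width period, or a run of exclamation/question marks (ASCII or full-width), as a sentence boundary. The regex can be checked in isolation with the standard library (the class name here is illustrative):

```java
import java.util.regex.Pattern;

public class BoundaryRegexCheck {
    public static void main(String[] args) {
        // Same pattern as ssplit.boundaryTokenRegex in the properties file.
        Pattern boundary = Pattern.compile("[.。]|[!?!?]+");
        System.out.println(boundary.matcher("。").matches());  // full-width period is a boundary
        System.out.println(boundary.matcher("!?").matches()); // a run of !/? is a boundary
        System.out.println(boundary.matcher(",").matches());  // a comma is not
    }
}
```

This is why the example text below, which uses the full-width 。 as its only sentence-final punctuation, still splits into sentences correctly.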
We then load this properties file directly in code; reference code follows:
package lp;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import edu.stanford.nlp.coref.CorefCoreAnnotations;
import edu.stanford.nlp.coref.data.CorefChain;
import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.semgraph.SemanticGraph;
import edu.stanford.nlp.semgraph.SemanticGraphCoreAnnotations;
import edu.stanford.nlp.trees.Tree;
import edu.stanford.nlp.trees.TreeCoreAnnotations;
import edu.stanford.nlp.util.CoreMap;
/**
 * Created by sonofelice on 2018/3/27.
 */
public class TestNLP {
    public void test() throws Exception {
        // Build a StanfordCoreNLP object and configure its annotators
        // (e.g. lemma for lemmatization, ner for named entity recognition).
        Properties props = new Properties();
        props.load(this.getClass().getResourceAsStream("/StanfordCoreNLP-chinese.properties"));
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
        String text = "袁隆平是中国科学院的院士,他于2009年10月到中国山东省东营市东营区永乐机场附近承包了一千亩盐碱地,"
                + "开始种植棉花, 年产量达到一万吨, 哈哈, 反正棣琦说的是假的,逗你玩儿,明天下午2点来我家吃饭吧。"
                + "棣琦是山东大学毕业的,目前在百度做java开发,位置是东北旺东路102号院,手机号14366778890";
        long startTime = System.currentTimeMillis();
        // Create an empty Annotation with just the given text
        Annotation document = new Annotation(text);
        // Run all annotators on this text
        pipeline.annotate(document);
        // Retrieve the processed results
        List<CoreMap> sentences = document.get(CoreAnnotations.SentencesAnnotation.class);
        for (CoreMap sentence : sentences) {
            // Traverse the words in the current sentence;
            // a CoreLabel is a CoreMap with additional token-specific methods
            for (CoreLabel token : sentence.get(CoreAnnotations.TokensAnnotation.class)) {
                // The token's text (a word, after segmentation)
                String word = token.get(CoreAnnotations.TextAnnotation.class);
                System.out.println(word);
                // Part-of-speech tag
                String pos = token.get(CoreAnnotations.PartOfSpeechAnnotation.class);
                System.out.println(pos);
                // Named entity recognition
                String ne = token.get(CoreAnnotations.NormalizedNamedEntityTagAnnotation.class);
                String ner = token.get(CoreAnnotations.NamedEntityTagAnnotation.class);
                System.out.println(word + " | analysis : { original : " + ner + "," + " normalized : "
                        + ne + "}");
                // Lemmatization
                String lema = token.get(CoreAnnotations.LemmaAnnotation.class);
                System.out.println(lema);
            }
            // Parse tree of the sentence
            Tree tree = sentence.get(TreeCoreAnnotations.TreeAnnotation.class);
            System.out.println("Parse tree of the sentence:");
            tree.pennPrint();
            // Dependency graph of the sentence
            SemanticGraph graph =
                    sentence.get(SemanticGraphCoreAnnotations.CollapsedCCProcessedDependenciesAnnotation.class);
            System.out.println("Dependency graph of the sentence:");
            System.out.println(graph.toString(SemanticGraph.OutputFormat.LIST));
        }
        long endTime = System.currentTimeMillis();
        long time = endTime - startTime;
        System.out.println("The analysis lasts " + time + " seconds * 1000");
        // Coreference chains: each chain stores a set of mentions that corefer;
        // both sentence and token offsets count from 1.
        Map<Integer, CorefChain> corefChains = document.get(CorefCoreAnnotations.CorefChainAnnotation.class);
        if (corefChains == null) {
            return;
        }
        for (Map.Entry<Integer, CorefChain> entry : corefChains.entrySet()) {
            System.out.println("Chain " + entry.getKey());
            for (CorefChain.CorefMention m : entry.getValue().getMentionsInTextualOrder()) {
                // We need to subtract one since the indices count from 1 but the Lists start from 0
                List<CoreLabel> tokens = sentences.get(m.sentNum - 1).get(CoreAnnotations.TokensAnnotation.class);
                // We subtract two for end: one for 0-based indexing, and one because we want
                // the last token of the mention, not the one following it.
                System.out.println(
                        "  " + m + ", i.e., 0-based character offsets [" + tokens.get(m.startIndex - 1).beginPosition()
                                + ", " + tokens.get(m.endIndex - 2).endPosition() + ")");
            }
        }
    }

    public static void main(String[] args) throws Exception {
        TestNLP nlp = new TestNLP();
        nlp.test();
    }
}
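The index arithmetic in the coref loop (subtracting 1 from startIndex but 2 from endIndex) is easy to get wrong, so here is a minimal stand-alone illustration using plain lists; the tokens and indices are made up, and the class name is illustrative:

```java
import java.util.Arrays;
import java.util.List;

public class CorefOffsetDemo {
    public static void main(String[] args) {
        // Hypothetical tokens of one sentence; CorefMention indices are 1-based.
        List<String> tokens = Arrays.asList("袁隆平", "是", "院士");
        int startIndex = 1; // 1-based index of the first token of the mention
        int endIndex = 2;   // 1-based index of the token *after* the mention
        String first = tokens.get(startIndex - 1); // subtract 1: Java lists are 0-based
        String last = tokens.get(endIndex - 2);    // subtract 2: also step back from the
                                                   // one-past-the-end position
        System.out.println(first + ".." + last);   // the mention spans just "袁隆平"
    }
}
```

With these conventions, a single-token mention has endIndex = startIndex + 1, and the printed character offsets form a half-open interval [begin, end).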
Of course, when I ran it I kept only the NER-related analysis and commented out the other functionality. The output is as follows:
19:46:16.000 [main] INFO e.s.nlp.pipeline.StanfordCoreNLP - Adding annotator pos
19:46:19.387 [main] INFO e.s.nlp.tagger.maxent.MaxentTagger - Loading POS tagger from edu/stanford/nlp/models/pos-tagger/chinese-distsim/chinese-distsim.tagger ... done [3.4 sec].
19:46:19.388 [main] INFO e.s.nlp.pipeline.StanfordCoreNLP - Adding annotator lemma
19:46:19.389 [main] INFO e.s.nlp.pipeline.StanfordCoreNLP - Adding annotator ner
19:46:21.938 [main] INFO ie.AbstractSequenceClassifier - Loading classifier from edu/stanford/nlp/models/ner/chinese.misc.distsim.crf.ser.gz ... done [2.5 sec].
19:46:22.099 [main] WARN p.TokensRegexNERAnnotator - TokensRegexNERAnnotator regexner: Entry has multiple types for ner: 巴伐利亚 STATE_OR_PROVINCE MISC,GPE,LOCATION 1. Taking type to be MISC
19:46:22.100 [main] WARN p.TokensRegexNERAnnotator - TokensRegexNERAnnotator regexner: Entry has multiple types for ner: 巴伐利亚州 STATE_OR_PROVINCE MISC,GPE,LOCATION 1. Taking type to be MISC
19:46:22.100 [main] INFO p.TokensRegexNERAnnotator - TokensRegexNERAnnotator regexner: Read 21238 unique entries out of 21249 from edu/stanford/nlp/models/kbp/chinese/cn_regexner_mapping.tab, 0 TokensRegex patterns.
19:46:22.532 [main] INFO e.s.nlp.pipeline.StanfordCoreNLP - Adding annotator parse
19:46:35.855 [main] INFO e.s.ParserGrammar - Loading parser from serialized file edu/stanford/nlp/models/srparser/chineseSR.ser.gz ... done [13.3 sec].
19:46:35.859 [main] INFO e.s.nlp.pipeline.StanfordCoreNLP - Adding annotator coref
19:46:43.139 [main] INFO pipeline.CorefMentionAnnotator - Using mention detector type: rule
19:46:43.148 [main] INFO e.s.nlp.wordseg.ChineseDictionary - Loading Chinese dictionaries from 1 file:
19:46:43.148 [main] INFO e.s.nlp.wordseg.ChineseDictionary - edu/stanford/nlp/models/segmenter/chinese/dict-chris6.ser.gz
19:46:43.329 [main] INFO e.s.nlp.wordseg.ChineseDictionary - Done. Unique words in ChineseDictionary is: 423200.
19:46:43.379 [main] INFO edu.stanford.nlp.wordseg.CorpusChar - Loading character dictionary file from edu/stanford/nlp/models/segmenter/chinese/dict/character_list [done].
19:46:43.380 [main] INFO e.s.nlp.wordseg.AffixDictionary - Loading affix dictionary from edu/stanford/nlp/models/segmenter/chinese/dict/in.ctb [done].
袁隆平 | analysis : { original : PERSON, normalized : null}
是 | analysis : { original : O, normalized : null}
中国 | analysis : { original : ORGANIZATION, normalized : null}
科学院 | analysis : { original : ORGANIZATION, normalized : null}
的 | analysis : { original : O, normalized : null}
院⼠ | analysis : { original : TITLE, normalized : null}
, | analysis : { original : O, normalized : null}
他 | analysis : { original : O, normalized : null}
于 | analysis : { original : O, normalized : null}
2009年 | analysis : { original : DATE, normalized : 2009-10-XX}
10⽉ | analysis : { original : DATE, normalized : 2009-10-XX}
到 | analysis : { original : O, normalized : null}
中国 | analysis : { original : COUNTRY, normalized : null}
⼭东省 | analysis : { original : STATE_OR_PROVINCE, normalized : null}
东营市 | analysis : { original : CITY, normalized : null}
东营区 | analysis : { original : FACILITY, normalized : null}
永乐 | analysis : { original : FACILITY, normalized : null}
机场 | analysis : { original : FACILITY, normalized : null}
附近 | analysis : { original : O, normalized : null}
承包 | analysis : { original : O, normalized : null}
了 | analysis : { original : O, normalized : null}
⼀千 | analysis : { original : NUMBER, normalized : 1000}
亩 | analysis : { original : O, normalized : null}
盐 | analysis : { original : O, normalized : null}
碱地 | analysis : { original : O, normalized : null}
, | analysis : { original : O, normalized : null}
开始 | analysis : { original : O, normalized : null}
种植 | analysis : { original : O, normalized : null}
棉花 | analysis : { original : O, normalized : null}
, | analysis : { original : O, normalized : null}
年产量 | analysis : { original : O, normalized : null}
达到 | analysis : { original : O, normalized : null}
⼀万 | analysis : { original : NUMBER, normalized : 10000}
吨 | analysis : { original : O, normalized : null}
, | analysis : { original : O, normalized : null}
哈哈 | analysis : { original : O, normalized : null}
, | analysis : { original : O, normalized : null}
反正 | analysis : { original : O, normalized : null}
棣琦 | analysis : { original : PERSON, normalized : null}
说 | analysis : { original : O, normalized : null}
的 | analysis : { original : O, normalized : null}
是 | analysis : { original : O, normalized : null}
假 | analysis : { original : O, normalized : null}
的 | analysis : { original : O, normalized : null}
, | analysis : { original : O, normalized : null}
逗 | analysis : { original : O, normalized : null}
你 | analysis : { original : O, normalized : null}
玩⼉ | analysis : { original : O, normalized : null}
, | analysis : { original : O, normalized : null}
明天 | analysis : { original : DATE, normalized : XXXX-XX-XX}
下午 | analysis : { original : TIME, normalized : null}
2点 | analysis : { original : TIME, normalized : null}
来 | analysis : { original : O, normalized : null}
我 | analysis : { original : O, normalized : null}
家 | analysis : { original : O, normalized : null}
吃饭 | analysis : { original : O, normalized : null}
吧 | analysis : { original : O, normalized : null}
。 | analysis : { original : O, normalized : null}
棣琦 | analysis : { original : PERSON, normalized : null}
是 | analysis : { original : O, normalized : null}
⼭东 | analysis : { original : ORGANIZATION, normalized : null}
⼤学 | analysis : { original : ORGANIZATION, normalized : null}
毕业 | analysis : { original : O, normalized : null}
的 | analysis : { original : O, normalized : null}
, | analysis : { original : O, normalized : null}
⽬前 | analysis : { original : DATE, normalized : null}
在 | analysis : { original : O, normalized : null}
百度 | analysis : { original : ORGANIZATION, normalized : null}
做 | analysis : { original : O, normalized : null}
java | analysis : { original : O, normalized : null}
开发 | analysis : { original : O, normalized : null}
, | analysis : { original : O, normalized : null}
位置 | analysis : { original : O, normalized : null}
是 | analysis : { original : O, normalized : null}
东北 | analysis : { original : LOCATION, normalized : null}
旺 | analysis : { original : O, normalized : null}
东路 | analysis : { original : O, normalized : null}
102 | analysis : { original : NUMBER, normalized : 102}
号院 | analysis : { original : O, normalized : null}
, | analysis : { original : O, normalized : null}
⼿机号 | analysis : { original : O, normalized : null}
143667788 | analysis : { original : NUMBER, normalized : 14366778890}
90 | analysis : { original : NUMBER, normalized : 14366778890}
The analysis lasts 819 seconds * 1000
Process finished with exit code 0
As you can see, the whole pipeline takes quite a while to start up, and the analysis itself is also fairly slow at 819 milliseconds.
The results are not entirely accurate either, and differ somewhat from what I get from the official online demo.