Chinese Spelling Checking (CSC): Methods, Evaluation Tasks, and Leaderboards


  Chinese Spelling Checking (CSC) is a niche task that has attracted considerable attention over the past two years and is developing rapidly at top venues including ACL and EMNLP. This article briefly introduces the CSC task, along with related methods, evaluation tasks, and leaderboards.
1. Chinese Spelling Checking
  Chinese Spelling Checking (CSC), also known as Chinese Spelling Correction (CSC), aims to identify and correct misspelled characters based on context. It originated from spelling checking and grammatical error detection for English. With the recent acceleration of Chinese NLP, including Chinese text mining and Chinese pretrained language models, spelling errors appear in many general and domain-specific Chinese corpora, so improving corpus quality is very important.
  CSC is currently applied mainly in the following three scenarios:
OCR recognition: converting text in images into UTF-8 characters via computer-vision algorithms. Because OCR recognizes each character independently, image blurring, occlusion, and similar problems can cause recognition errors, so OCR output may contain misspellings. Since OCR is a visual text-recognition task, its spelling errors usually come from confusions between visually similar glyphs.
For example, "金属材料" (metallic material) may be misrecognized as "金属材科", because "科" and "料" are very similar in shape.
ASR recognition: converting speech into text, i.e., speech recognition. Noise, dialects, and similar problems often make some syllables acoustically confusable and lead to recognition errors.
For example, consider "星星产业" versus "新兴产业" (emerging industry): if the speaker does not articulate clearly, "星星" and "新兴" are very hard to tell apart.
Accidental errors: for example, a typist may hit the wrong key and enter an incorrect character.
For example, when typing "伤感" (shanggan), one may mistakenly type "伤寒" (shanghan), because "g" and "h" sit right next to each other on the keyboard.
  Ultimately, however, we expect the recognized text to be semantically coherent in context, so even with some misspellings we can still infer the original correct characters. For example, even if OCR misrecognizes the text as "金属材科", we can still infer from context and prior knowledge that it should be "金属材料". That said, domain differences mean the correction task cannot rely entirely on context: the common collocation is "新兴产业", but we cannot rule out that "星星产业" is a trademark or a domain-specific term.
  The research community therefore introduced Chinese Spelling Checking (CSC) to study exactly how to detect and correct such errors. For data construction, erroneous characters can be generated directly from a confusion set, while building the confusion set itself requires dedicated processing; for instance, character images can be blurred to generate visually confusable characters, as shown in the figure below. A minimal sketch of confusion-set-based error injection follows.
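The following is a minimal sketch of generating misspelled training text from a confusion set; the toy entries and the substitution rate are illustrative assumptions only, not the construction used by any particular paper.

import random

# Toy confusion set: each character maps to visually or phonetically
# similar characters (illustrative entries only).
CONFUSION_SET = {
    "料": ["科"],        # shape-similar
    "兴": ["星", "欣"],  # sound-similar
    "感": ["寒", "敢"],  # keyboard/shape confusions
}

def inject_errors(sentence, error_rate=0.1):
    """Randomly replace characters with entries from the confusion set."""
    chars = list(sentence)
    for i, ch in enumerate(chars):
        if ch in CONFUSION_SET and random.random() < error_rate:
            chars[i] = random.choice(CONFUSION_SET[ch])
    return "".join(chars)

print(inject_errors("金属材料属于新兴产业", error_rate=1.0))
# e.g. -> 金属材科属于新星产业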
  Basic CSC concepts are given below:
Confusion Set: a set of characters that are similar in pronunciation or shape; for example, "自" is visually confusable with "白" and "曰". At prediction time, candidate characters are usually recalled from the confusion set, and the correct character is then predicted from context.
Glyphic Feature: usually describes a character by its radicals and components (structural features) and its stroke sequence (sequential features). For example, the structural feature of "争" can be written as "⿱⿰⿻⿻⿱", and its stroke sequence as "⼃ ⼀⼀⼅".
Radicals and strokes can also be represented as a tree structure, as shown in the figure.
Phonetic Feature: usually the pinyin of a character. For example, the pinyin of "天" is "tian1", while the pinyin of "盛" may be "sheng4" or "cheng2" (the digit denotes the tone). For feature extraction, pinyin can either be used as a single feature or processed as a sequence; a small extraction sketch follows this list.
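A minimal sketch of extracting tone-numbered pinyin, assuming the third-party pypinyin package (the package choice is an assumption; any pinyin toolkit would work):

from pypinyin import Style, pinyin

# Style.TONE3 appends the tone digit to each syllable ("tian1");
# heteronym=True returns all candidate readings of a polyphonic character.
print(pinyin("天", style=Style.TONE3))                  # [['tian1']]
print(pinyin("盛", style=Style.TONE3, heteronym=True))  # e.g. [['sheng4', 'cheng2']]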
2. Related Methods
  This section briefly lists recent related papers (updated from time to time; if you have newer papers, feel free to share them in the comments):
【1】DCSpell: A Detector-Corrector Framework for Chinese Spelling Error Correction (SIGIR2021)
【2】Tail-to-Tail Non-Autoregressive Sequence Prediction for Chinese Grammatical Error Correction (ACL2021)
【3】Correcting Chinese Spelling Errors with Phonetic Pre-training (ACL2021)
【4】PLOME: Pre-training with Misspelled Knowledge for Chinese Spelling Correction (ACL2021)
【5】PHMOSpell: Phonological and Morphological Knowledge Guided Chinese Spelling Check (ACL2021)
【6】Exploration and Exploitation: Two Ways to Improve Chinese Spelling Correction Models (ACL2021)
【7】Dynamic Connected Networks for Chinese Spelling Check (ACL2021)
【8】Global Attention Decoder for Chinese Spelling Error Correction (ACL2021)
【9】Read, Listen, and See: Leveraging Multimodal Information Helps Chinese Spell Checking (ACL2021)
【10】SpellBERT: A Lightweight Pretrained Model for Chinese Spelling Check (EMNLP2021)
【11】A Hybrid Approach to Automatic Corpus Generation for Chinese Spelling Check (EMNLP2018)
【12】Adversarial Semantic Decoupling for Recognizing Open-Vocabulary Slots (EMNLP2020)
【13】Chunk-based Chinese Spelling Check with Global Optimization (EMNLP2020)
【14】Confusionset-guided Pointer Networks for Chinese Spell Check (ACL2019)
【15】Context-Sensitive Malicious Spelling Error Correction (WWW2019)
【16】FASPell: A Fast, Adaptable, Simple, Powerful Chinese Spell Checker Based On DAE-Decoder Paradigm (ACL2019)
【17】SpellGCN: Incorporating Phonological and Visual Similarities into Language Models for Chinese Spelling Check (ACL2020)
【18】Spelling Error Correction with Soft-Masked BERT (ACL2020)
Related submissions to ARR2022 on OpenReview include:
【1】Exploring and Adapting Chinese GPT to Pinyin Input Method
【2】The Past Mistake is the Future Wisdom: Error-driven Contrastive Probability Optimization for Chinese Spell Checking
【3】Sparsity Regularization for Chinese Spelling Check
【4】Pre-Training with Syntactic Structure Prediction for Chinese Semantic Error Recognition
【5】ECSpellUD: Zero-shot Domain Adaptive Chinese Spelling Check with User Dictionary
【6】SpelLM: Augmenting Chinese Spell Check Using Input Salience
【7】Pinyin-bert: A new solution to Chinese pinyin to character conversion task
A brief summary of current CSC methods:
BERT-based: Because CSC is a token (character)-level prediction task in which the input and output sequences have the same length, it closely resembles the Masked Language Modeling (MLM) objective of pretrained language models, so the vast majority of current methods are built on MLM (see the sketch after this list). Before BERT, CSC was dominated by RNN+Decoder and CRF approaches.
Multimodal fusion: As noted above, CSC involves both phonetic and glyphic information, so several methods explore how to combine word embeddings with glyphic and phonetic embeddings, giving rise to a line of multimodal approaches.
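As a minimal illustration of the MLM-style, same-length formulation, here is a sketch assuming the HuggingFace transformers package and the generic bert-base-chinese checkpoint; a real CSC model would be fine-tuned so that each position predicts the corrected character, whereas this vanilla checkpoint mostly copies its input:

import torch
from transformers import BertForMaskedLM, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
model = BertForMaskedLM.from_pretrained("bert-base-chinese")

text = "金属材科"  # "科" is a spelling error for "料"
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits  # shape: (1, seq_len, vocab_size)

# Non-autoregressive correction: take the argmax character at every
# position; input and output lengths stay identical, exactly as in MLM.
pred_ids = logits.argmax(dim=-1)[0]
tokens = tokenizer.convert_ids_to_tokens(pred_ids.tolist())
print(tokens[1:-1])  # drop the [CLS]/[SEP] positions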
3. Evaluation Tasks
  The three evaluation datasets commonly used for CSC are:
SIGHAN Bake-off 2013
SIGHAN Bake-off 2014
SIGHAN Bake-off 2015
  Most current CSC methods also involve a pretraining stage; the commonly used pretraining corpus is:
Wang271K
  The data distribution of the evaluation sets and training corpora is shown in the figure.
  The experimental details can be summarized as follows:
(1) Pre-training corpus
Corpus: randomly select 1M sentences for training, with max_len=128, batch_size=1024, lr=5e-5, and 10K training steps.
Alternatively, obtain the 271K corpus constructed by "A Hybrid Approach to Automatic Corpus Generation for Chinese Spelling Check".
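These hyperparameters might be expressed with HuggingFace TrainingArguments roughly as follows; this is a sketch under stated assumptions (the papers above do not necessarily use this trainer, and the output path is hypothetical):

from transformers import TrainingArguments

# max_len=128 is applied at tokenization time rather than here.
args = TrainingArguments(
    output_dir="csc_pretrain",       # hypothetical output path
    max_steps=10_000,                # step=10K
    learning_rate=5e-5,              # lr=5e-5
    per_device_train_batch_size=64,  # 64 x 16 accumulation = 1024 effective
    gradient_accumulation_steps=16,
)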
(2) Fine-tuning corpora
SIGHAN13, SIGHAN14, and SIGHAN15 use the officially provided data, where:
● merged: the mixed training set of SIGHAN13, SIGHAN14, and SIGHAN15 (10K sentences);
● SIGHAN13, SIGHAN14, SIGHAN15: the respective individual datasets.
Test sets
OCR: the test set constructed by "FASPell: A Fast, Adaptable, Simple, Powerful Chinese Spell Checker Based On DAE-Decoder Paradigm", 4,575 sentences in total.
(3) Evaluation script: see the end of this article.
4. Leaderboards
  I have compiled leaderboards for the commonly used evaluation tasks (updated continuously), as shown in the table below:
5. Evaluation Script
  CSC is usually evaluated with precision (P), recall (R), and F1, at both the detection level and the correction level; see the following script for details:

import os
import sys


def convert_from_myformat_to_sighan(input_path, output_path, pred_path,
                                    orig_path=None, spellgcn=False):
    # Convert model output into the SIGHAN answer format:
    # "sid, pos1, char1, pos2, char2, ..." or "sid, 0" if no error.
    with open(pred_path, "w") as labels_writer:
        with open(input_path, "r") as org_file, open(orig_path, "r") as id_f:
            with open(output_path, "r") as test_file:
                test_file = test_file.readlines()
                org_file = org_file.readlines()
                print(len(test_file), len(org_file))
                assert len(test_file) == len(org_file)
                for k, (pred, inp, sid) in enumerate(zip(test_file, org_file, id_f)):
                    if spellgcn:
                        _, atl = inp.strip().split("\t")
                        atl = atl.split(" ")[1:]
                        pred = pred.split(" ")[1:len(atl) + 1]
                    else:
                        atl, _, _ = inp.strip().split("\t")[:3]
                        atl = atl.split(" ")
                        pred = pred.split(" ")[:len(atl)]
                    output_list = [sid.split()[0]]
                    for i, (pt, at) in enumerate(zip(pred[:], atl[:])):
                        if at == "[SEP]" or at == "[PAD]":
                            break
                        # Post-process for unsupervised methods, because
                        # unsupervised BERT always predicts punctuation at the 1st position.
                        if i == 0:
                            if pt == "。" or pt == ",":
                                continue
                        if pt.startswith("##"):
                            pt = pt.lstrip("##")
                        if at.startswith("##"):
                            at = at.lstrip("##")
                        if pt != at:
                            output_list.append(str(i + 1))
                            output_list.append(pt)
                    if len(output_list) == 1:
                        output_list.append("0")
                    labels_writer.write(", ".join(output_list) + "\n")


def eval_spell(truth_path, pred_path, with_error=True):
    # Compute P/R/F1 at the detection and correction levels.
    detect_TP, detect_FP, detect_FN = 0, 0, 0
    correct_TP, correct_FP, correct_FN = 0, 0, 0
    detect_nt_TP, nt_P, nt_N, correct_nt_TP = 0, 0, 0, 0
    dc_TP, dc_FP, dc_FN = 0, 0, 0
    for idx, (pred, actual) in enumerate(zip(
            open(pred_path, "r", encoding="utf-8"),
            open(truth_path, "r", encoding="utf-8") if with_error
            else open(truth_path, "r", encoding="utf-8"))):
        pred_tokens = pred.strip().split(" ")
        actual_tokens = actual.strip().split(" ")
        # assert pred_tokens[0] == actual_tokens[0]
        pred_tokens = pred_tokens[1:]
        actual_tokens = actual_tokens[1:]
        # Each line alternates "position, character" pairs after the sentence id.
        detect_actual_tokens = [int(actual_token.strip(","))
                                for i, actual_token in enumerate(actual_tokens) if i % 2 == 0]
        correct_actual_tokens = [actual_token.strip(",")
                                 for i, actual_token in enumerate(actual_tokens) if i % 2 == 1]
        detect_pred_tokens = [int(pred_token.strip(","))
                              for i, pred_token in enumerate(pred_tokens) if i % 2 == 0]
        _correct_pred_tokens = [pred_token.strip(",")
                                for i, pred_token in enumerate(pred_tokens) if i % 2 == 1]
        # Post-process for the ACL2019 CSC paper, which only deals with the last
        # detected positions in the test data; to follow that paper, restrict
        # detect_pred_tokens here.
        max_detect_pred_tokens = detect_pred_tokens
        # Materialize as lists: zip iterators are single-use in Python 3.
        correct_pred_zip = list(zip(detect_pred_tokens, _correct_pred_tokens))
        correct_actual_zip = list(zip(detect_actual_tokens, correct_actual_tokens))
        if detect_pred_tokens[0] != 0:
            nt_P += 1
            if sorted(correct_pred_zip) == sorted(correct_actual_zip):
                correct_nt_TP += 1
        if detect_actual_tokens[0] != 0:
            if sorted(detect_actual_tokens) == sorted(detect_pred_tokens):
                detect_nt_TP += 1
            nt_N += 1
        if detect_actual_tokens[0] != 0:
            detect_TP += len(set(max_detect_pred_tokens) & set(detect_actual_tokens))
            detect_FN += len(set(detect_actual_tokens) - set(max_detect_pred_tokens))
        detect_FP += len(set(max_detect_pred_tokens) - set(detect_actual_tokens))
        correct_pred_tokens = []
        # Only check the tokens at correctly detected positions.
        for dpt, cpt in zip(detect_pred_tokens, _correct_pred_tokens):
            if dpt in detect_actual_tokens:
                correct_pred_tokens.append((dpt, cpt))
        correct_TP += len(set(correct_pred_tokens) & set(correct_actual_zip))
        correct_FP += len(set(correct_pred_tokens) - set(correct_actual_zip))
        correct_FN += len(set(correct_actual_zip) - set(correct_pred_tokens))
        # Calculate the correction level, which depends on the predicted detections.
        dc_pred_tokens = correct_pred_zip
        dc_actual_tokens = correct_actual_zip
        dc_TP += len(set(dc_pred_tokens) & set(dc_actual_tokens))
        dc_FP += len(set(dc_pred_tokens) - set(dc_actual_tokens))
        dc_FN += len(set(dc_actual_tokens) - set(dc_pred_tokens))

    detect_precision = detect_TP * 1.0 / (detect_TP + detect_FP)
    detect_recall = detect_TP * 1.0 / (detect_TP + detect_FN)
    detect_F1 = 2. * detect_precision * detect_recall / ((detect_precision + detect_recall) + 1e-8)
    correct_precision = correct_TP * 1.0 / (correct_TP + correct_FP)
    correct_recall = correct_TP * 1.0 / (correct_TP + correct_FN)
    correct_F1 = 2. * correct_precision * correct_recall / ((correct_precision + correct_recall) + 1e-8)
    dc_precision = dc_TP * 1.0 / (dc_TP + dc_FP + 1e-8)
    dc_recall = dc_TP * 1.0 / (dc_TP + dc_FN + 1e-8)
    dc_F1 = 2. * dc_precision * dc_recall / (dc_precision + dc_recall + 1e-8)
    if with_error:
        # Token-level metrics
        print("detect_precision=%f, detect_recall=%f, detect_Fscore=%f"
              % (detect_precision, detect_recall, detect_F1))
        print("correct_precision=%f, correct_recall=%f, correct_Fscore=%f"
              % (correct_precision, correct_recall, correct_F1))
        print("dc_joint_precision=%f, dc_joint_recall=%f, dc_joint_Fscore=%f"
              % (dc_precision, dc_recall, dc_F1))
    detect_nt_precision = detect_nt_TP * 1.0 / nt_P
    detect_nt_recall = detect_nt_TP * 1.0 / nt_N
    detect_nt_F1 = 2. * detect_nt_precision * detect_nt_recall / ((detect_nt_precision + detect_nt_recall) + 1e-8)
    correct_nt_precision = correct_nt_TP * 1.0 / nt_P
    correct_nt_recall = correct_nt_TP * 1.0 / nt_N
    correct_nt_F1 = 2. * correct_nt_precision * correct_nt_recall / ((correct_nt_precision + correct_nt_recall) + 1e-8)
    if not with_error:
        # Sentence-level metrics
        print("detect_nt_precision=%f, detect_nt_recall=%f, detect_Fscore=%f"
              % (detect_nt_precision, detect_nt_recall, detect_nt_F1))
        print("correct_nt_precision=%f, correct_nt_recall=%f, correct_Fscore=%f"
              % (correct_nt_precision, correct_nt_recall, correct_nt_F1))


if __name__ == '__main__':
    output_path = sys.argv[1]
    data_path = sys.argv[2]
    # File names are left blank here; fill them in to match your data layout.
    input_path = os.path.join(data_path, "")
    pred_path = os.path.join(os.path.dirname(output_path), '')
    orig_input_path = os.path.join(data_path, "")
    orig_truth_path = os.path.join(data_path, "")
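Assuming the script above is saved as eval.py (a hypothetical file name), the __main__ block takes the model output file and the data directory as positional arguments, e.g.:

python eval.py ./output/predictions.txt ./data/sighan15/

Both paths here are placeholders. convert_from_myformat_to_sighan rewrites model output into the SIGHAN answer format, and eval_spell prints token-level metrics when called with with_error=True, or sentence-level metrics with with_error=False.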
