Chinese Spelling Checking (CSC): Methods, Evaluation Tasks, and Leaderboards
Chinese Spelling Checking (CSC) is a niche task that has attracted considerable attention over the past two years and is developing rapidly at top venues such as ACL and EMNLP. This article briefly introduces the CSC task, related methods, evaluation tasks, and leaderboards.
1. Chinese Spelling Checking
Chinese Spelling Checking (CSC), also known as Chinese Spelling Correction (CSC), aims to identify and correct misspelled characters based on their context, and originated from English spelling checking and grammatical error detection. With the recent acceleration of Chinese NLP, including Chinese text mining and Chinese pre-trained language models, spelling errors appear in many general and domain-specific Chinese corpora, so improving corpus quality is important.
CSC is currently applied mainly in the following three scenarios:
OCR: converts text in images into UTF-8 characters using computer-vision algorithms. Because OCR recognizes each character independently, image blur or occlusion can cause recognition errors, so OCR output may contain misspellings. Since OCR is a visual recognition task, its spelling errors usually come from confusions between visually similar glyphs.
For example, "金属材料" (metal material) may be misrecognized as "金属材科", because "科" looks very similar to "料".
ASR: converts speech into text, i.e., speech recognition. Noise, dialects, and other factors often make some syllables acoustically confusable and lead to recognition errors.
For example, consider "星星产业" versus "新兴产业" (emerging industry): "星星" and "新兴" are hard to tell apart when the speaker does not enunciate clearly.
Accidental errors: for instance, when typing, a careless keystroke can produce a wrong character.
For example, when typing "伤感" (shanggan), one may enter "伤寒" (shanghan) instead, because "g" and "h" sit right next to each other on the keyboard.
Ultimately, though, the text we want should be semantically coherent in context, so despite spelling errors we can usually infer the original, correct characters: even if OCR outputs "金属材科", context and prior knowledge tell us it should be "金属材料". That said, domain effects mean correction cannot rely on context alone; the usual collocation is "新兴产业", but we cannot rule out that "星星产业" is a trademark or a domain-specific term.
The research community therefore introduced Chinese Spelling Checking (CSC) to study how to detect and correct such errors. For data construction, erroneous characters can be generated directly from a confusion set, while building the confusion set itself requires dedicated processing; for example, character images can be blurred to generate visually plausible error characters, as illustrated in the figure.
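As a concrete illustration, below is a minimal Python sketch of confusion-set-based error generation; the toy confusion set here is hypothetical and stands in for one mined from real OCR/ASR confusions:

import random

# Hypothetical toy confusion set: each character maps to visually or
# phonetically similar characters (a real set would be mined from OCR/ASR data).
CONFUSION_SET = {
    "料": ["科"],
    "兴": ["星"],
    "感": ["寒"],
}

def corrupt(sentence, error_rate=0.1):
    # Randomly replace characters with entries from the confusion set.
    chars = list(sentence)
    for i, ch in enumerate(chars):
        if ch in CONFUSION_SET and random.random() < error_rate:
            chars[i] = random.choice(CONFUSION_SET[ch])
    return "".join(chars)

print(corrupt("金属材料是新兴产业", error_rate=1.0))  # -> 金属材科是新星产业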
The basic concepts of CSC are given below:
Confusion Set: a set of characters that are similar in pronunciation or shape; e.g., "自" is visually confusable with "白" and "曰". At prediction time, candidate characters are typically recalled from the confusion set, and the correct character is then predicted from context;
Glyph Feature: typically describes a character's radicals/components (structural features) and stroke sequence (sequential features); e.g., the structural feature of "争" can be written as "⿱⿰⿻⿻⿱" and its stroke sequence as "⼃⼀⼀⼅".
Radicals and strokes can also be organized as a tree structure, as shown in the figure.
Phonetic Feature: typically the character's pinyin; e.g., the pinyin of "天" is "tian1", while "盛" may be "sheng4" or "cheng2" (the digit denotes the tone). For feature extraction, pinyin can be used either as a single feature or processed as a character sequence; a small extraction sketch follows this list.
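Below is a minimal sketch of extracting these features, assuming the third-party pypinyin package (pip install pypinyin) for the phonetic side; Style.TONE3 appends the tone digit, matching the notation above, and the glyph table at the end is purely hypothetical:

from pypinyin import Style, pinyin

# Phonetic feature: tone digit appended to the syllable.
print(pinyin("天", style=Style.TONE3))                  # [['tian1']]
# heteronym=True returns all candidate pronunciations of a polyphone.
print(pinyin("盛", style=Style.TONE3, heteronym=True))  # e.g. [['sheng4', 'cheng2']]
# As a sequence feature, the pinyin string can be split into characters.
print(list(pinyin("天", style=Style.TONE3)[0][0]))      # ['t', 'i', 'a', 'n', '1']

# Glyph features have no standard library; a real system would load a
# radical/stroke table (e.g. IDS decompositions). This dict is hypothetical.
STROKES = {"争": "⼃⼀⼀⼅", "天": "一一丿㇏"}
print(STROKES["争"])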
2. Related Methods
This section briefly lists recent related papers (updated from time to time; feel free to share newer work in the comments):
【1】DCSpell: A Detector-Corrector Framework for Chinese Spelling Error Correction (SIGIR2021)
【2】Tail-to-Tail Non-Autoregressive Sequence Prediction for Chinese Grammatical Error Correction (ACL2021)
【3】Correcting Chinese Spelling Errors with Phonetic Pre-training (ACL2021)
【4】PLOME: Pre-trained with Misspelled Knowledge for Chinese Spelling Correction (ACL2021)
【5】PHMOSpell: Phonological and Morphological Knowledge Guided Chinese Spelling Check (ACL2021)
【6】Exploration and Exploitation: Two Ways to Improve Chinese Spelling Correction Models (ACL2021)
【7】Dynamic Connected Networks for Chinese Spelling Check (ACL2021)
【8】Global Attention Decoder for Chinese Spelling Error Correction (ACL2021)
【9】Read, Listen, and See: Leveraging Multimodal Information Helps Chinese Spell Checking (ACL2021)
【10】SpellBERT: A Lightweight Pretrained Model for Chinese Spelling Check (EMNLP2021)
【11】A Hybrid Approach to Automatic Corpus Generation for Chinese Spelling Check (EMNLP2018)
【12】Adversarial Semantic Decoupling for Recognizing Open-Vocabulary Slots (EMNLP2020)
【13】Chunk-based Chinese Spelling Check with Global Optimization (EMNLP2020)
【14】Confusionset-guided Pointer Networks for Chinese Spelling Check (ACL2019)
【15】Context-Sensitive Malicious Spelling Error Correction (WWW2019)
【16】FASPell: A Fast, Adaptable, Simple, Powerful Chinese Spell Checker Based On DAE-Decoder Paradigm (ACL2019)
【17】SpellGCN: Incorporating Phonological and Visual Similarities into Language Models for Chinese Spelling Check (ACL2020)
【18】Spelling Error Correction with Soft-Masked BERT (ACL2020)
Related submissions to ARR2022 on OpenReview include:
【1】Exploring and Adapting Chinese GPT to Pinyin Input Method
【2】The Past Mistake is the Future Wisdom: Error-driven Contrastive Probability Optimization for Chinese Spell Checking
【3】Sparsity Regularization for Chinese Spelling Check
【4】Pre-Training with Syntactic Structure Prediction for Chinese Semantic Error Recognition
【5】ECSpellUD: Zero-shot Domain Adaptive Chinese Spelling Check with User Dictionary
【6】SpelLM: Augmenting Chinese Spell Check Using Input Salience
【7】Pinyin-bert: A new solution to Chinese pinyin to character conversion task
A brief summary of current CSC methods:
BERT-based: because CSC is a token (character)-level prediction task whose input and output sequences have the same length, it closely resembles the Masked Language Modeling (MLM) objective of pre-trained language models, so the vast majority of current methods are built on MLM (a minimal sketch follows this list); before BERT, CSC was dominated by RNN+Decoder and CRF approaches;
Multimodal fusion: as noted above, CSC involves both pronunciation and glyphs, so some methods explore how to combine word embeddings, glyph embeddings, and phonetic embeddings, giving rise to a number of multimodal approaches.
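To make the BERT-as-MLM formulation concrete, here is a minimal sketch using the HuggingFace transformers library and the stock bert-base-chinese checkpoint. A real CSC model would be fine-tuned on spelling-error data and typically constrained by a confusion set; an off-the-shelf MLM will not reliably correct errors, so this only illustrates the shape of the task:

import torch
from transformers import BertForMaskedLM, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
model = BertForMaskedLM.from_pretrained("bert-base-chinese")

text = "金属材科"  # contains the OCR-style error 科 (should be 料)
inputs = tokenizer(text, return_tensors="pt")

# CSC as token-level prediction: score every position without masking;
# the output sequence has exactly the same length as the input.
with torch.no_grad():
    logits = model(**inputs).logits

pred_ids = logits.argmax(dim=-1)[0]
# Drop [CLS]/[SEP] before decoding the corrected sequence.
print("".join(tokenizer.convert_ids_to_tokens(pred_ids[1:-1].tolist())))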
3. Evaluation Tasks
The three commonly used CSC evaluation datasets are:
SIGHAN Bake-off 2013
SIGHAN Bake-off 2014
SIGHAN Bake-off 2015
Most current CSC methods also involve a pre-training stage, for which the commonly used corpus is:
Wang271K
The data distributions of the evaluation datasets and the training corpus are shown in the figure.
The experimental details can be summarized as follows:
(1) Pre-training corpus
Corpus: randomly sample 1M sentences for training, with max_len=128, batch_size=1024, lr=5e-5, and 10K steps.
Obtain the 271K corpus constructed by "A Hybrid Approach to Automatic Corpus Generation for Chinese Spelling Check".
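As a rough sketch of what this pre-training configuration could look like with the HuggingFace Trainer (only the max length, batch size, learning rate, and step count come from the post; everything else below is an assumption):

from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="csc_pretrain",        # assumption: any local directory
    max_steps=10_000,                 # 10K steps, as in the post
    learning_rate=5e-5,               # lr from the post
    per_device_train_batch_size=128,  # assumption: 128 x 8 accumulation
    gradient_accumulation_steps=8,    #   = global batch of 1024, as in the post
)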
(2) Fine-tuning corpus
SIGHAN13, SIGHAN14, and SIGHAN15: use the provided data directly, where:
● merged: denotes the combined SIGHAN13/14/15 training set (10K sentences);
● SIGHAN13, SIGHAN14, SIGHAN15: denote each training set used separately.
Test sets:
OCR: use the dataset constructed in "FASPell: A Fast, Adaptable, Simple, Powerful Chinese Spell Checker Based On DAE-Decoder Paradigm", 4,575 sentences in total.
(3) Evaluation script: see the end of this article.
4. Leaderboards
I briefly list the leaderboards on the common evaluation tasks (updated in real time) in the table below:
5. Evaluation Script
CSC is usually evaluated with precision (P), recall (R), and F1 at two levels, detection and correction; see the code below:

import os
import sys


def convert_from_myformat_to_sighan(input_path, output_path, pred_path, orig_path=None, spellgcn=False):
    # Convert model predictions into SIGHAN-style label lines:
    # "<sentence_id>, <pos_1>, <char_1>, ..." or "<sentence_id>, 0" if no error.
    with open(pred_path, "w") as labels_writer:
        with open(input_path, "r") as org_file, open(orig_path, "r") as id_f:
            with open(output_path, "r") as test_file:
                test_file = test_file.readlines()
                org_file = org_file.readlines()
                print(len(test_file), len(org_file))
                assert len(test_file) == len(org_file)
                for k, (pred, inp, sid) in enumerate(zip(test_file, org_file, id_f)):
                    if spellgcn:
                        _, atl = inp.strip().split("\t")
                        atl = atl.split(" ")[1:]
                        pred = pred.split(" ")[1:len(atl) + 1]
                    else:
                        atl, _, _ = inp.strip().split("\t")[:3]
                        atl = atl.split(" ")
                        pred = pred.split(" ")[:len(atl)]
                    output_list = [sid.split()[0]]
                    for i, (pt, at) in enumerate(zip(pred[:], atl[:])):
                        if at == "[SEP]" or at == "[PAD]":
                            break
                        # Post-process for unsupervised methods, because an
                        # unsupervised BERT always predicts punctuation at the 1st position.
                        if i == 0:
                            if pt == "。" or pt == ",":
                                continue
                        if pt.startswith("##"):
                            pt = pt.lstrip("##")
                        if at.startswith("##"):
                            at = at.lstrip("##")
                        if pt != at:
                            output_list.append(str(i + 1))
                            output_list.append(pt)
                    if len(output_list) == 1:
                        output_list.append("0")
                    labels_writer.write(", ".join(output_list) + "\n")


def eval_spell(truth_path, pred_path, with_error=True):
    # Compute P/R/F1 at the detection and correction levels.
    detect_TP, detect_FP, detect_FN = 0, 0, 0
    correct_TP, correct_FP, correct_FN = 0, 0, 0
    detect_sent_TP, sent_P, sent_N, correct_sent_TP = 0, 0, 0, 0
    dc_TP, dc_FP, dc_FN = 0, 0, 0
    for idx, (pred, actual) in enumerate(zip(open(pred_path, "r", encoding="utf-8"),
                                             open(truth_path, "r", encoding="utf-8"))):
        pred_tokens = pred.strip().split(" ")
        actual_tokens = actual.strip().split(" ")
        # assert pred_tokens[0] == actual_tokens[0]
        pred_tokens = pred_tokens[1:]
        actual_tokens = actual_tokens[1:]
        # Label lines alternate between positions (even indices) and characters (odd indices).
        detect_actual_tokens = [int(actual_token.strip(","))
                                for i, actual_token in enumerate(actual_tokens) if i % 2 == 0]
        correct_actual_tokens = [actual_token.strip(",")
                                 for i, actual_token in enumerate(actual_tokens) if i % 2 == 1]
        detect_pred_tokens = [int(pred_token.strip(","))
                              for i, pred_token in enumerate(pred_tokens) if i % 2 == 0]
        _correct_pred_tokens = [pred_token.strip(",")
                                for i, pred_token in enumerate(pred_tokens) if i % 2 == 1]

        # Post-process for the ACL2019 CSC paper, which only deals with the last
        # detected positions in the test data; to follow it, set detect_pred_tokens to:
        max_detect_pred_tokens = detect_pred_tokens

        correct_pred_zip = list(zip(detect_pred_tokens, _correct_pred_tokens))
        correct_actual_zip = list(zip(detect_actual_tokens, correct_actual_tokens))

        # Sentence-level counts: position 0 means "no error in this sentence".
        if detect_pred_tokens[0] != 0:
            sent_P += 1
            if sorted(correct_pred_zip) == sorted(correct_actual_zip):
                correct_sent_TP += 1
        if detect_actual_tokens[0] != 0:
            if sorted(detect_actual_tokens) == sorted(detect_pred_tokens):
                detect_sent_TP += 1
            sent_N += 1

        if detect_actual_tokens[0] != 0:
            detect_TP += len(set(max_detect_pred_tokens) & set(detect_actual_tokens))
            detect_FN += len(set(detect_actual_tokens) - set(max_detect_pred_tokens))
            detect_FP += len(set(max_detect_pred_tokens) - set(detect_actual_tokens))

        correct_pred_tokens = []
        # Only check the tokens at correctly detected positions.
        for dpt, cpt in zip(detect_pred_tokens, _correct_pred_tokens):
            if dpt in detect_actual_tokens:
                correct_pred_tokens.append((dpt, cpt))

        correct_TP += len(set(correct_pred_tokens) & set(correct_actual_zip))
        correct_FP += len(set(correct_pred_tokens) - set(correct_actual_zip))
        correct_FN += len(set(correct_actual_zip) - set(correct_pred_tokens))

        # Correction level that depends on the model's predicted detections;
        # materialize as lists so the pairs can be reused across set operations.
        dc_pred_tokens = list(zip(detect_pred_tokens, _correct_pred_tokens))
        dc_actual_tokens = list(zip(detect_actual_tokens, correct_actual_tokens))
        dc_TP += len(set(dc_pred_tokens) & set(dc_actual_tokens))
        dc_FP += len(set(dc_pred_tokens) - set(dc_actual_tokens))
        dc_FN += len(set(dc_actual_tokens) - set(dc_pred_tokens))

    detect_precision = detect_TP * 1.0 / (detect_TP + detect_FP)
    detect_recall = detect_TP * 1.0 / (detect_TP + detect_FN)
    detect_F1 = 2. * detect_precision * detect_recall / ((detect_precision + detect_recall) + 1e-8)

    correct_precision = correct_TP * 1.0 / (correct_TP + correct_FP)
    correct_recall = correct_TP * 1.0 / (correct_TP + correct_FN)
    correct_F1 = 2. * correct_precision * correct_recall / ((correct_precision + correct_recall) + 1e-8)

    dc_precision = dc_TP * 1.0 / (dc_TP + dc_FP + 1e-8)
    dc_recall = dc_TP * 1.0 / (dc_TP + dc_FN + 1e-8)
    dc_F1 = 2. * dc_precision * dc_recall / (dc_precision + dc_recall + 1e-8)

    if with_error:
        # Token-level metrics
        print("detect_precision=%f, detect_recall=%f, detect_Fscore=%f"
              % (detect_precision, detect_recall, detect_F1))
        print("correct_precision=%f, correct_recall=%f, correct_Fscore=%f"
              % (correct_precision, correct_recall, correct_F1))
        print("dc_joint_precision=%f, dc_joint_recall=%f, dc_joint_Fscore=%f"
              % (dc_precision, dc_recall, dc_F1))

    detect_sent_precision = detect_sent_TP * 1.0 / sent_P
    detect_sent_recall = detect_sent_TP * 1.0 / sent_N
    detect_sent_F1 = 2. * detect_sent_precision * detect_sent_recall \
        / ((detect_sent_precision + detect_sent_recall) + 1e-8)

    correct_sent_precision = correct_sent_TP * 1.0 / sent_P
    correct_sent_recall = correct_sent_TP * 1.0 / sent_N
    correct_sent_F1 = 2. * correct_sent_precision * correct_sent_recall \
        / ((correct_sent_precision + correct_sent_recall) + 1e-8)

    if not with_error:
        # Sentence-level metrics
        print("detect_sent_precision=%f, detect_sent_recall=%f, detect_Fscore=%f"
              % (detect_sent_precision, detect_sent_recall, detect_sent_F1))
        print("correct_sent_precision=%f, correct_sent_recall=%f, correct_Fscore=%f"
              % (correct_sent_precision, correct_sent_recall, correct_sent_F1))


if __name__ == '__main__':
    output_path = sys.argv[1]
    data_path = sys.argv[2]
    # The concrete file names were lost from the original post.
    input_path = os.path.join(data_path, "")
    pred_path = os.path.join(os.path.dirname(output_path), "")
    orig_input_path = os.path.join(data_path, "")
    orig_truth_path = os.path.join(data_path, "")
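For reference, the label lines this script reads and writes follow the SIGHAN convention that convert_from_myformat_to_sighan produces: a sentence ID followed by (position, correction) pairs, or a single 0 when the sentence contains no error. The IDs below are made up for illustration:

A2-0045-1, 3, 的
B2-1670-2, 0

Assuming the script is saved as eval.py, a run would look like python eval.py <model_output_file> <data_dir>, after filling in the file names left blank in the os.path.join calls above and adding the calls to convert_from_myformat_to_sighan and eval_spell, which did not survive in the original post.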