Chinese Spelling Checking (CSC): Related Methods, Evaluation Tasks, and Leaderboards
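Chinese Spelling Checking (CSC) has been a niche but popular task over the past two years and has developed rapidly at top venues including ACL and EMNLP. This article briefly introduces the CSC task, related methods, evaluation tasks, and leaderboards.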
I. Chinese Spelling Checking
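Chinese Spelling Checking (CSC), also called Chinese Spelling Correction (CSC), aims to identify and correct spelling errors based on context. It originates from spelling checking and grammatical error detection for English. Chinese NLP has accelerated in recent years, including Chinese text mining and Chinese pre-trained language models, and many Chinese corpora and vertical-domain corpora contain spelling errors, so improving corpus quality is very important.
CSC is currently used mainly in the following three scenarios: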
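OCR: converting text in images into UTF-8 characters with CV algorithms. Because OCR recognizes each character independently, blurred or occluded images can cause recognition mistakes, so OCR output may contain spelling errors. OCR is a text recognition task driven by visual features, so its spelling errors usually come from confusion between visually similar characters.
For example, “金属材料” (metal materials) may be misrecognized as “金属材科”, because “科” and “料” look very similar.
ASR: converting speech into text, i.e. speech recognition. Noise, dialects, and similar factors can likewise make some syllables confusable and lead to recognition errors.
For example, with “星星产业” and “新兴产业” (emerging industries), “星星” and “新兴” are hard to tell apart when the speaker does not articulate clearly.
Accidental errors: for example, a staff member typing information may hit the wrong key through carelessness and enter a wrong character.
For example, when typing “伤感” (shanggan), one may mistakenly enter “伤寒” (shanghan), because “g” and “h” sit right next to each other on the keyboard.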
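The keyboard-slip case can be made concrete with a small check. The sketch below is only an illustration under stated assumptions: the QWERTY row layout and the helper names `adjacent_keys` and `is_single_adjacent_slip` are introduced here for the example and do not come from any CSC toolkit.

```python
# Illustrative sketch: does a typed pinyin string differ from the intended one
# by exactly one slip onto a horizontally adjacent key?
# Assumption: a plain QWERTY layout, modelled as three rows of letters.
QWERTY_ROWS = ["qwertyuiop", "asdfghjkl", "zxcvbnm"]


def adjacent_keys(key):
    """Return the keys directly left and right of `key` on its row."""
    for row in QWERTY_ROWS:
        idx = row.find(key)
        if idx != -1:
            return {row[j] for j in (idx - 1, idx + 1) if 0 <= j < len(row)}
    return set()


def is_single_adjacent_slip(intended, typed):
    """True if the strings differ at exactly one position and the wrong key
    sits next to the intended key."""
    if len(intended) != len(typed):
        return False
    diffs = [(a, b) for a, b in zip(intended, typed) if a != b]
    return len(diffs) == 1 and diffs[0][1] in adjacent_keys(diffs[0][0])


slip = is_single_adjacent_slip("shanggan", "shanghan")
print(slip)
```

Here the only differing position holds “g” in the intended string and “h” in the typed one, and the two keys are neighbours on the middle row, so the check succeeds.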
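However, the text we ultimately want to recognize is meaningful in its context, so even with a few misspellings we can still infer the original correct characters. For example, even if OCR outputs “金属材科”, we can still infer from context and prior knowledge that it should be “金属材料”. That said, domain differences mean that correction cannot always rely on context alone. For example, “新兴产业” is the common collocation, but we cannot rule out that “星星产业” is a trademark or a domain-specific term.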
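Therefore, academia introduced Chinese Spelling Checking (CSC) to study specifically how to detect and correct such errors. For data construction, erroneous characters can be generated directly from a confusion set, while building the confusion set itself requires dedicated processing. As shown in the figure below, character images can be blurred to produce erroneous characters:
The basic concepts of CSC are as follows: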
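Confusion Set: a collection of characters that are similar in pronunciation or glyph shape. For example, “自”, “白”, and “曰” are confusable in shape. At prediction time, candidate characters are usually recalled from the confusion set, and the correct character is then predicted from context;
Glyphic Feature: usually represents a Chinese character's radicals and components (structural feature) and its stroke sequence (sequential feature). For example, the structural feature of “争” can be described as “⿱⿰⿻⿻⿱” and its sequential feature as “⼃ ⼀⼀⼅”
Radicals and strokes can also be described as a tree structure, as shown in the figure:
Phonetic Feature: usually represents a character's pinyin. For example, the pinyin sequence of “天” is “tian1”, and the pinyin sequence of “盛” may be “sheng4” or “cheng2” (the digit denotes the tone). During feature extraction, the pinyin can be treated either as a single feature or as a sequence.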
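The confusion-set and pinyin concepts above can be sketched in a few lines. This is a toy illustration, not a real resource: the two-entry `CONFUSION_SET` and the helper names `recall_candidates`, `corrupt`, and `split_pinyin` are assumptions made for this example, and real confusion sets are far larger.

```python
import random

# Toy confusion set (assumption): each key maps to characters it is easily
# confused with, taken from the examples in this article.
CONFUSION_SET = {
    "自": ["白", "曰"],
    "料": ["科"],
}


def recall_candidates(char):
    """Candidates for one position: the character itself plus its confusables."""
    return [char] + CONFUSION_SET.get(char, [])


def corrupt(sentence, rate, rng):
    """Replace confusable characters with probability `rate`.
    The output always has the same length as the input."""
    out = []
    for ch in sentence:
        if ch in CONFUSION_SET and rng.random() < rate:
            out.append(rng.choice(CONFUSION_SET[ch]))
        else:
            out.append(ch)
    return "".join(out)


def split_pinyin(pinyin):
    """Split a tone-numbered pinyin such as 'tian1' into (letters, tone)."""
    return pinyin[:-1], int(pinyin[-1])


noisy = corrupt("金属材料", rate=1.0, rng=random.Random(0))
print(noisy)
print(recall_candidates("自"))
print(split_pinyin("tian1"))
```

With `rate=1.0` every confusable character is replaced, so the clean sentence “金属材料” becomes “金属材科”, which gives one (noisy, clean) training pair. Treating the pinyin as a sequence would mean feeding the letters one by one instead of the whole syllable.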
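II. Related Methods
This section briefly lists recent related papers (updated from time to time; if you know of newer work, feel free to share it in the comments):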
【1】DCSpell: A Detector-Corrector Framework for Chinese Spelling Error Correction (SIGIR2021)
【2】Tail-to-Tail Non-Autoregressive Sequence Prediction for Chinese Grammatical Error Correction (ACL2021)
【3】Correcting Chinese Spelling Errors with Phonetic Pre-training (ACL2021)
【4】PLOME: Pre-training with Misspelled Knowledge for Chinese Spelling Correction (ACL2021)
【5】PHMOSpell: Phonological and Morphological Knowledge Guided Chinese Spelling Check (ACL2021)
【6】Exploration and Exploitation: Two Ways to Improve Chinese Spelling Correction Models (ACL2021)
【7】Dynamic Connected Networks for Chinese Spelling Check (ACL2021)
【8】Global Attention Decoder for Chinese Spelling Error Correction (ACL2021)
【9】Read, Listen, and See: Leveraging Multimodal Information Helps Chinese Spell Checking (ACL2021)
【10】SpellBERT: A Lightweight Pretrained Model for Chinese Spelling Check (EMNLP2021)
【11】A Hybrid Approach to Automatic Corpus Generation for Chinese Spelling Check (EMNLP2018)
【12】Adversarial Semantic Decoupling for Recognizing Open-Vocabulary Slots (EMNLP2020)
【13】Chunk-based Chinese Spelling Check with Global Optimization (EMNLP2020)
【14】Confusionset-guided Pointer Networks for Chinese Spelling Check (ACL2019)
【15】Context-Sensitive Malicious Spelling Error Correction (WWW2019)
【16】FASPell: A Fast, Adaptable, Simple, Powerful Chinese Spell Checker Based On DAE-Decoder Paradigm (ACL2019)
【17】SpellGCN: Incorporating Phonological and Visual Similarities into Language Models for Chinese Spelling Check (ACL2020)
【18】Spelling Error Correction with Soft-Masked BERT (ACL2020)
Related submissions to ARR 2022 on OpenReview include:
【1】Exploring and Adapting Chinese GPT to Pinyin Input Method
【2】The Past Mistake is the Future Wisdom: Error-driven Contrastive Probability Optimization for Chinese Spell Checking
【3】Sparsity Regularization for Chinese Spelling Check
【4】Pre-Training with Syntactic Structure Prediction for Chinese Semantic Error Recognition
【5】ECSpellUD: Zero-shot Domain Adaptive Chinese Spelling Check with User Dictionary
【6】SpelLM: Augmenting Chinese Spell Check Using Input Salience
【7】Pinyin-bert: A new solution to Chinese pinyin to character conversion task
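A brief summary of current CSC methods:
BERT-based: CSC is a token-level (character-level) prediction task whose input and output sequences have the same length, so it closely resembles the Masked Language Modeling (MLM) objective of pre-trained language models. For this reason the vast majority of current methods are built on MLM. Before BERT appeared, CSC relied mainly on RNN+Decoder and CRF models;
Multimodal fusion: as noted above, CSC involves both pronunciation and glyph shape, so some methods study how to combine Word Embedding, Glyphic Embedding, and Phonetic Embedding, which has given rise to a number of multimodal approaches.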
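The equal-length, character-level framing behind the BERT-based methods can be sketched without any model. The example below is a minimal illustration under assumptions: `make_labels` and `apply_predictions` are hypothetical helper names, and a real system would obtain the predicted characters from an MLM head rather than from a hand-written list.

```python
def make_labels(source, target):
    """Build per-character training labels from an aligned sentence pair.
    Returns detection labels (1 = misspelled) and correction targets."""
    if len(source) != len(target):
        raise ValueError("CSC assumes source and target have equal length")
    detection = [int(s != t) for s, t in zip(source, target)]
    return detection, list(target)


def apply_predictions(source, predicted_chars):
    """Rebuild the corrected sentence from one predicted character per
    position and list the edits as 1-based (position, character) pairs."""
    if len(source) != len(predicted_chars):
        raise ValueError("one prediction per input character is required")
    edits = [(i + 1, p)
             for i, (s, p) in enumerate(zip(source, predicted_chars))
             if s != p]
    return "".join(predicted_chars), edits


detection, targets = make_labels("金属材科", "金属材料")
corrected, edits = apply_predictions("金属材科", ["金", "属", "材", "料"])
print(detection, targets)
print(corrected, edits)
```

Because every input position gets exactly one output character, detection reduces to comparing the prediction with the input at each position, and correction is the predicted character itself.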
III. Evaluation Tasks
The three commonly used CSC evaluation datasets are:
SIGHAN Bake-off 2013:
SIGHAN Bake-off 2014:
SIGHAN Bake-off 2015:
Most current CSC methods involve a pre-training stage; a commonly used pre-training corpus is
Wang271K:
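The data distribution of the evaluation sets and the training corpus is shown in the figure:
The experimental details can be summarized as follows:
(1) Pre-training corpus
Corpus: , from which 1M samples are randomly selected for training, with max_len=128, batch_size=1024, lr=5e-5, step=10K.
Obtain the 271K corpus constructed in “A Hybrid Approach to Automatic Corpus Generation for Chinese Spelling Check”
(2) Fine-tuning corpus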
SIGHAN13, SIGHAN14, and SIGHAN15: the provided data is used directly, where:
● merged: the combined training set of SIGHAN13, SIGHAN14, and SIGHAN15 (10K);
● SIGHAN13, SIGHAN14, SIGHAN15: the respective test sets.
OCR: the dataset constructed in “FASPell: A Fast, Adaptable, Simple, Powerful Chinese Spell Checker Based On DAE-Decoder Paradigm” is used: , 4575 sentences in total
(3) Evaluation script: see the end of this article
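IV. Leaderboards
The author briefly lists the leaderboards on the commonly used evaluation tasks (updated in real time) in the table below: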
V. Evaluation Script
CSC is usually evaluated with precision (P), recall (R), and F1. Evaluation covers two levels, detection and correction; see the code for details:

    import os
    import sys


    def convert_from_myformat_to_sighan(input_path, output_path, pred_path,
                                        orig_path=None, spellgcn=False):
        """Write predictions in SIGHAN format: 'ID, pos, char, ...' or 'ID, 0'."""
        if orig_path is None:
            raise ValueError("orig_path is required: it supplies the sentence IDs")
        with open(input_path, "r", encoding="utf-8") as org_file, \
                open(orig_path, "r", encoding="utf-8") as id_f, \
                open(output_path, "r", encoding="utf-8") as test_file, \
                open(pred_path, "w", encoding="utf-8") as labels_writer:
            test_lines = test_file.readlines()
            org_lines = org_file.readlines()
            print(len(test_lines), len(org_lines))
            assert len(test_lines) == len(org_lines)
            for pred, inp, sid in zip(test_lines, org_lines, id_f):
                pred = pred.rstrip("\n")
                if spellgcn:
                    _, atl = inp.strip().split("\t")
                    atl = atl.split(" ")[1:]
                    pred = pred.split(" ")[1:len(atl) + 1]
                else:
                    atl, _, _ = inp.strip().split("\t")[:3]
                    atl = atl.split(" ")
                    pred = pred.split(" ")[:len(atl)]
                output_list = [sid.split()[0]]
                for i, (pt, at) in enumerate(zip(pred, atl)):
                    if at == "[SEP]" or at == "[PAD]":
                        break
                    # Post-processing for unsupervised methods, because
                    # unsupervised BERT always predicts punctuation at the
                    # first position.
                    if i == 0 and pt in ("。", ","):
                        continue
                    # Strip the WordPiece continuation prefix.
                    if pt.startswith("##"):
                        pt = pt[2:]
                    if at.startswith("##"):
                        at = at[2:]
                    if pt != at:
                        output_list.append(str(i + 1))
                        output_list.append(pt)
                if len(output_list) == 1:
                    output_list.append("0")
                labels_writer.write(", ".join(output_list) + "\n")


    def _safe_div(num, den):
        """Division that returns 0.0 instead of failing on an empty denominator."""
        return num / den if den else 0.0


    def _f1(precision, recall):
        return 2.0 * precision * recall / (precision + recall + 1e-8)


    def eval_spell(truth_path, pred_path, with_error=True):
        """Print token-level scores if with_error is True, else sentence-level scores."""
        detect_TP, detect_FP, detect_FN = 0, 0, 0
        correct_TP, correct_FP, correct_FN = 0, 0, 0
        detect_sent_TP, sent_P, sent_N, correct_sent_TP = 0, 0, 0, 0
        dc_TP, dc_FP, dc_FN = 0, 0, 0
        with open(pred_path, "r", encoding="utf-8") as pred_f, \
                open(truth_path, "r", encoding="utf-8") as truth_f:
            for pred, actual in zip(pred_f, truth_f):
                # Drop the sentence ID; the rest alternates position, character.
                pred_tokens = pred.strip().split(" ")[1:]
                actual_tokens = actual.strip().split(" ")[1:]
                detect_actual_tokens = [int(t.strip(",")) for t in actual_tokens[0::2]]
                correct_actual_tokens = [t.strip(",") for t in actual_tokens[1::2]]
                detect_pred_tokens = [int(t.strip(",")) for t in pred_tokens[0::2]]
                _correct_pred_tokens = [t.strip(",") for t in pred_tokens[1::2]]

                # Post-processing for the ACL2019 CSC paper, which only deals
                # with the last detected position in the test data. To follow
                # that paper, change detect_pred_tokens here.
                max_detect_pred_tokens = detect_pred_tokens

                # Materialize the (position, character) pairs as lists: a zip
                # object is exhausted after one pass and cannot be reused.
                pred_pairs = list(zip(detect_pred_tokens, _correct_pred_tokens))
                actual_pairs = list(zip(detect_actual_tokens, correct_actual_tokens))

                if detect_pred_tokens[0] != 0:
                    sent_P += 1
                    if sorted(pred_pairs) == sorted(actual_pairs):
                        correct_sent_TP += 1
                if detect_actual_tokens[0] != 0:
                    if sorted(detect_actual_tokens) == sorted(detect_pred_tokens):
                        detect_sent_TP += 1
                    sent_N += 1

                if detect_actual_tokens[0] != 0:
                    detect_TP += len(set(max_detect_pred_tokens) & set(detect_actual_tokens))
                    detect_FN += len(set(detect_actual_tokens) - set(max_detect_pred_tokens))
                detect_FP += len(set(max_detect_pred_tokens) - set(detect_actual_tokens))

                # Only check the tokens at correctly detected positions.
                correct_pred_tokens = [(dpt, cpt) for dpt, cpt in pred_pairs
                                       if dpt in detect_actual_tokens]
                correct_TP += len(set(correct_pred_tokens) & set(actual_pairs))
                correct_FP += len(set(correct_pred_tokens) - set(actual_pairs))
                correct_FN += len(set(actual_pairs) - set(correct_pred_tokens))

                # Correction level that depends on the predicted detection.
                dc_TP += len(set(pred_pairs) & set(actual_pairs))
                dc_FP += len(set(pred_pairs) - set(actual_pairs))
                dc_FN += len(set(actual_pairs) - set(pred_pairs))

        if with_error:
            # Token-level metrics
            detect_precision = _safe_div(detect_TP, detect_TP + detect_FP)
            detect_recall = _safe_div(detect_TP, detect_TP + detect_FN)
            correct_precision = _safe_div(correct_TP, correct_TP + correct_FP)
            correct_recall = _safe_div(correct_TP, correct_TP + correct_FN)
            dc_precision = _safe_div(dc_TP, dc_TP + dc_FP)
            dc_recall = _safe_div(dc_TP, dc_TP + dc_FN)
            print("detect_precision=%f, detect_recall=%f, detect_Fscore=%f"
                  % (detect_precision, detect_recall, _f1(detect_precision, detect_recall)))
            print("correct_precision=%f, correct_recall=%f, correct_Fscore=%f"
                  % (correct_precision, correct_recall, _f1(correct_precision, correct_recall)))
            print("dc_joint_precision=%f, dc_joint_recall=%f, dc_joint_Fscore=%f"
                  % (dc_precision, dc_recall, _f1(dc_precision, dc_recall)))
        else:
            # Sentence-level metrics
            detect_sent_precision = _safe_div(detect_sent_TP, sent_P)
            detect_sent_recall = _safe_div(detect_sent_TP, sent_N)
            correct_sent_precision = _safe_div(correct_sent_TP, sent_P)
            correct_sent_recall = _safe_div(correct_sent_TP, sent_N)
            print("detect_sent_precision=%f, detect_sent_recall=%f, detect_Fscore=%f"
                  % (detect_sent_precision, detect_sent_recall,
                     _f1(detect_sent_precision, detect_sent_recall)))
            print("correct_sent_precision=%f, correct_sent_recall=%f, correct_Fscore=%f"
                  % (correct_sent_precision, correct_sent_recall,
                     _f1(correct_sent_precision, correct_sent_recall)))


    if __name__ == '__main__':
        output_path = sys.argv[1]
        data_path = sys.argv[2]
        # The file names below are empty strings in the source and are kept as such.
        input_path = os.path.join(data_path, "")
        pred_path = os.path.join(os.path.dirname(output_path), '')
        orig_input_path = os.path.join(data_path, "")
        orig_truth_path = os.path.join(data_path, "")
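To make the character-level P, R, and F1 concrete, here is a minimal sketch on toy data. It is a simplified variant and not the script above: it assumes SIGHAN-style lines of the form `ID, pos, char, ...` (or `ID, 0` when a sentence has no error), and the sentence IDs, the toy predictions, and the helper names `parse_sighan_line`, `prf`, and `joint_scores` are all hypothetical.

```python
def parse_sighan_line(line):
    """Parse 'ID, pos, char, pos, char, ...' into (id, {(pos, char), ...}).
    A line of the form 'ID, 0' means no error and yields an empty set."""
    fields = [f.strip() for f in line.strip().split(",")]
    sid, rest = fields[0], fields[1:]
    if rest == ["0"]:
        return sid, set()
    return sid, {(int(rest[i]), rest[i + 1]) for i in range(0, len(rest) - 1, 2)}


def prf(tp, fp, fn):
    """Precision, recall, and F1 from counts, with 0.0 for empty denominators."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def joint_scores(truth_lines, pred_lines):
    """Character-level detection scores (positions only) and correction
    scores (position and character must both match)."""
    d_tp = d_fp = d_fn = c_tp = c_fp = c_fn = 0
    for t_line, p_line in zip(truth_lines, pred_lines):
        _, gold = parse_sighan_line(t_line)
        _, pred = parse_sighan_line(p_line)
        gold_pos = {pos for pos, _ in gold}
        pred_pos = {pos for pos, _ in pred}
        d_tp += len(pred_pos & gold_pos)
        d_fp += len(pred_pos - gold_pos)
        d_fn += len(gold_pos - pred_pos)
        c_tp += len(pred & gold)
        c_fp += len(pred - gold)
        c_fn += len(gold - pred)
    return {"detection": prf(d_tp, d_fp, d_fn),
            "correction": prf(c_tp, c_fp, c_fn)}


truth = ["A1, 4, 料", "A2, 0", "A3, 2, 兴"]
preds = ["A1, 4, 料", "A2, 1, 白", "A3, 2, 星"]
scores = joint_scores(truth, preds)
print(scores["detection"])
print(scores["correction"])
```

In this toy run the third sentence is detected at the right position but corrected to the wrong character, so it counts as a detection hit and a correction miss, which is why the correction F1 comes out lower than the detection F1.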