Chinese Spelling Checking (CSC): Methods, Evaluation Tasks, and Leaderboards

Updated: 2023-07-04 17:18:43

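Chinese Spelling Checking (CSC) is a niche task that has drawn considerable attention over the past two years and has developed rapidly at top venues such as ACL and EMNLP. This article briefly introduces the CSC task together with related methods, evaluation tasks, and leaderboards.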

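1. Chinese Spelling Checking

Chinese Spelling Checking (CSC), also called Chinese Spelling Correction (CSC), aims to detect and correct spelling errors based on context. It originates from spelling checking and grammatical error detection for English. With the rapid progress of Chinese NLP in recent years, including Chinese text mining and Chinese pre-trained language models, many Chinese corpora, both general and domain-specific, turn out to contain spelling errors, so improving corpus quality matters.

CSC is currently used mainly in the following three scenarios: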

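OCR: converting text in images into UTF-8 characters with computer-vision algorithms. Because OCR recognizes each character independently, blurred or occluded images can cause recognition errors, so OCR output may contain spelling errors. Since OCR works from visual features, these errors usually come from confusing characters with similar shapes.

For example, "金属材料" (metal material) may be misrecognized as "金属材科", because "科" and "料" look very similar.

ASR: converting speech into text, i.e. speech recognition. Noise, dialects and similar factors make some syllables easy to confuse, which leads to recognition errors.

For example, take "星星产业" and "新兴产业" (emerging industries): "星星" (xingxing) and "新兴" (xinxing) are hard to tell apart when the speaker does not articulate clearly.

Accidental errors: for example, staff typing in information may hit the wrong key through carelessness and enter the wrong character.

For example, when typing "伤感" (shanggan), one may end up with "伤寒" (shanghan), because "g" and "h" sit right next to each other on the keyboard.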
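The keyboard-slip case can be made concrete with a small sketch. It assumes a plain QWERTY layout and only treats keys on the same row as neighbours; the names `QWERTY_ROWS`, `adjacent_keys` and `is_adjacent_key_typo` are illustrative and do not come from any library.

```python
# Assumption: a standard QWERTY layout; only same-row neighbours are considered.
QWERTY_ROWS = ["qwertyuiop", "asdfghjkl", "zxcvbnm"]


def adjacent_keys(key):
    """Return the keys directly left and right of `key` on its row."""
    for row in QWERTY_ROWS:
        idx = row.find(key)
        if idx != -1:
            return {row[j] for j in (idx - 1, idx + 1) if 0 <= j < len(row)}
    return set()


def is_adjacent_key_typo(intended, typed):
    """True if `typed` differs from `intended` by exactly one substitution
    and the substituted key is a same-row neighbour of the intended key."""
    if len(intended) != len(typed):
        return False
    diffs = [(a, b) for a, b in zip(intended, typed) if a != b]
    return len(diffs) == 1 and diffs[0][1] in adjacent_keys(diffs[0][0])


print(is_adjacent_key_typo("shanggan", "shanghan"))  # True: "g" and "h" are neighbours
```

A check like this could be one signal for proposing pinyin-level candidates; a real system would presumably also consider insertions, deletions and neighbouring rows.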

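Still, the text we ultimately want to recover carries meaning in its context, so even when some characters are wrong we can often infer the original correct ones. For instance, even if OCR outputs "金属材科", context and prior knowledge tell us it should be "金属材料". That said, domain matters, and correction cannot always rely on context alone: the usual collocation is "新兴产业", but we cannot rule out that "星星产业" is a trademark or a domain-specific term.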

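For this reason, the research community introduced Chinese Spelling Checking (CSC) to study specifically how such errors can be detected and corrected. For data construction, erroneous characters can be generated directly from a confusion set, whereas building the confusion set itself takes dedicated processing. For example, images of characters can be blurred to produce erroneous characters:

[Figure: generating erroneous characters by blurring character images]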
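Generating training pairs from a confusion set can be sketched as follows. This is a minimal illustration, not the procedure of any specific paper: the toy `CONFUSION_SET` only contains the examples from this article, and the function name `corrupt` and its `rate` parameter are assumptions made for the example.

```python
import random

# Toy confusion set built from the examples in this article; real confusion
# sets cover thousands of characters.
CONFUSION_SET = {"料": ["科"], "自": ["白", "曰"]}


def corrupt(sentence, confusion_set, rate, rng):
    """Replace each confusable character with probability `rate`.

    Returns the noisy sentence plus (position, correct_char) labels,
    with 1-based positions.
    """
    noisy, labels = [], []
    for pos, ch in enumerate(sentence, start=1):
        candidates = confusion_set.get(ch)
        if candidates and rng.random() < rate:
            noisy.append(rng.choice(candidates))
            labels.append((pos, ch))
        else:
            noisy.append(ch)
    return "".join(noisy), labels


noisy, labels = corrupt("金属材料", CONFUSION_SET, rate=1.0, rng=random.Random(0))
print(noisy, labels)  # 金属材科 [(4, '料')]
```

Passing the random generator in explicitly keeps the corruption reproducible, which is convenient when the same synthetic corpus has to be regenerated.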

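The basic concepts of CSC are as follows:

Confusion set: a collection of characters that are phonetically or visually similar; for example, "自", "白" and "曰" are visually confusable. At prediction time, candidate characters are typically recalled from the confusion set, and the correct one is then chosen according to context.

Glyph feature (Glyphic Feature): usually the radicals of a Chinese character (structural feature) and its stroke sequence (sequential feature). For example, the structural feature of "争" can be described as "⿱⿰⿻⿻⿱" and its sequential feature as "丿一一亅".

Radicals and strokes can also be described as a tree structure:

[Figure: tree structure of radicals and strokes]

Phonetic feature: usually the pinyin of a character. For example, the pinyin of "天" is "tian1", and the pinyin of "盛" may be "sheng4" or "cheng2" (the digit marks the tone). For feature extraction, the pinyin can be treated as a single feature or processed as a sequence.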
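The two ways of using pinyin mentioned above (one feature per reading versus a sequence) can be sketched like this. The `PINYIN` table is a toy stand-in holding only the two examples from the text, and the helper names are hypothetical.

```python
# Toy pronunciation table taken from the examples above; a real system would
# load a full pinyin dictionary instead.
PINYIN = {"天": ["tian1"], "盛": ["sheng4", "cheng2"]}


def split_tone(pinyin):
    """Split 'tian1' into ('tian', 1); the tone is None if no digit is present."""
    if pinyin and pinyin[-1].isdigit():
        return pinyin[:-1], int(pinyin[-1])
    return pinyin, None


def as_single_feature(char):
    """Option 1: treat each whole pinyin string as one categorical feature."""
    return PINYIN.get(char, [])


def as_sequence(char):
    """Option 2: expand each reading into a sequence of letters plus tone."""
    sequences = []
    for reading in PINYIN.get(char, []):
        syllable, tone = split_tone(reading)
        sequences.append(list(syllable) + ([str(tone)] if tone is not None else []))
    return sequences


print(as_single_feature("盛"))  # ['sheng4', 'cheng2']
print(as_sequence("天"))        # [['t', 'i', 'a', 'n', '1']]
```

The single-feature form suits an embedding lookup per syllable, while the sequence form can be fed to a small sequence encoder so that similar-sounding syllables share sub-units.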

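2. Related Methods

This section lists recent related papers (updated from time to time; if you know of newer work, feel free to share it in the comments):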

【1】DCSpell: A Detector-Corrector Framework for Chinese Spelling Error Correction (SIGIR 2021)
【2】Tail-to-Tail Non-Autoregressive Sequence Prediction for Chinese Grammatical Error Correction (ACL 2021)
【3】Correcting Chinese Spelling Errors with Phonetic Pre-training (ACL 2021)
【4】PLOME: Pre-trained with Misspelled Knowledge for Chinese Spelling Correction (ACL 2021)
【5】PHMOSpell: Phonological and Morphological Knowledge Guided Chinese Spelling Check (ACL 2021)
【6】Exploration and Exploitation: Two Ways to Improve Chinese Spelling Correction Models (ACL 2021)
【7】Dynamic Connected Networks for Chinese Spelling Check (ACL 2021)
【8】Global Attention Decoder for Chinese Spelling Error Correction (ACL 2021)
【9】Read, Listen, and See: Leveraging Multimodal Information Helps Chinese Spell Checking (ACL 2021)
【10】SpellBERT: A Lightweight Pretrained Model for Chinese Spelling Check (EMNLP 2021)
【11】A Hybrid Approach to Automatic Corpus Generation for Chinese Spelling Check (EMNLP 2018)
【12】Adversarial Semantic Decoupling for Recognizing Open-Vocabulary Slots (EMNLP 2020)
【13】Chunk-based Chinese Spelling Check with Global Optimization (EMNLP 2020)
【14】Confusionset-guided Pointer Networks for Chinese Spelling Check (ACL 2019)
【15】Context-Sensitive Malicious Spelling Error Correction (WWW 2019)
【16】FASPell: A Fast, Adaptable, Simple, Powerful Chinese Spell Checker Based On DAE-Decoder Paradigm (ACL 2019)
【17】SpellGCN: Incorporating Phonological and Visual Similarities into Language Models for Chinese Spelling Check (ACL 2020)
【18】Spelling Error Correction with Soft-Masked BERT (ACL 2020)

Related submissions to ARR 2022 on OpenReview include:

【1】Exploring and Adapting Chinese GPT to Pinyin Input Method
【2】The Past Mistake is the Future Wisdom: Error-driven Contrastive Probability Optimization for Chinese Spell Checking
【3】Sparsity Regularization for Chinese Spelling Check
【4】Pre-Training with Syntactic Structure Prediction for Chinese Semantic Error Recognition
【5】ECSpellUD: Zero-shot Domain Adaptive Chinese Spelling Check with User Dictionary
【6】SpelLM: Augmenting Chinese Spell Check Using Input Salience
【7】Pinyin-bert: A new solution to Chinese pinyin to character conversion task

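A brief summary of current CSC methods:

BERT-based: CSC is a token-level (character-level) prediction task whose input and output sequences have the same length, which makes it very similar to the Masked Language Modeling (MLM) objective of pre-trained language models. Most current methods are therefore built on MLM. Before BERT, CSC mainly relied on RNN+Decoder and CRF models.

Multimodal fusion: as mentioned above, CSC involves character pronunciation and shape, so some methods study how to combine Word Embedding, Glyphic Embedding and Phonetic Embedding, giving rise to a number of multimodal approaches.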
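Because input and output have the same length, training targets for a token-level model can be derived by comparing the two sentences position by position. The sketch below shows this alignment only, not a model; `char_level_labels` is a hypothetical helper name.

```python
def char_level_labels(source, target):
    """Return (detection_labels, edits) for an equal-length sentence pair.

    detection_labels[i] is 1 where source and target disagree, else 0;
    edits lists (1-based position, wrong_char, correct_char).
    """
    if len(source) != len(target):
        raise ValueError("CSC assumes source and target have the same length")
    detection = [int(s != t) for s, t in zip(source, target)]
    edits = [(i, s, t) for i, (s, t) in enumerate(zip(source, target), start=1) if s != t]
    return detection, edits


detection, edits = char_level_labels("金属材科", "金属材料")
print(detection)  # [0, 0, 0, 1]
print(edits)      # [(4, '科', '料')]
```

The detection labels correspond to a binary tagging view of the task, while the edits pair each wrong character with the character an MLM-style head would be asked to predict at that position.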

3. Evaluation Tasks

The three evaluation datasets commonly used for CSC are:

SIGHAN Bake-off 2013:

SIGHAN Bake-off 2014:

SIGHAN Bake-off 2015:

Most current CSC methods involve a pre-training stage; the commonly used pre-training corpus is

Wang271K:

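The data distribution of the evaluation data and the training corpus:

[Figure: data distribution of the evaluation data and the training corpus]

The experimental details can be summarized as follows:

(1) Pre-training corpus

Corpus: […], from which 1M sentences are randomly sampled for training; max_len=128, batch_size=1024, lr=5e-5, step=10K.

Obtain the 271K corpus constructed in "A Hybrid Approach to Automatic Corpus Generation for Chinese Spelling Check".

(2) Fine-tuning corpus

SIGHAN13, SIGHAN14 and SIGHAN15 are used directly as provided, where:
● merged: the combined training set of SIGHAN13, SIGHAN14 and SIGHAN15 (10K);
● SIGHAN13, SIGHAN14, SIGHAN15: the respective test sets.

OCR: the dataset constructed in "FASPell: A Fast, Adaptable, Simple, Powerful Chinese Spell Checker Based On DAE-Decoder Paradigm" is used, 4,575 sentences in total.

(3) Evaluation script: see the end of this article.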

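4. Leaderboards

The author lists the leaderboards on the commonly used evaluation tasks (updated continuously):

[Table: leaderboards on the commonly used evaluation tasks]

5. Evaluation Script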

CSC is usually evaluated with precision (P), recall (R) and F1, at both the detection level and the correction level. The code is as follows:

import os
import sys


def convert_from_myformat_to_sighan(input_path, output_path, pred_path, orig_path=None, spellgcn=False):
    """Convert token-level model output into SIGHAN-style lines: "id, pos, char, ..."."""
    with open(pred_path, "w", encoding="utf-8") as labels_writer, \
            open(input_path, "r", encoding="utf-8") as org_file, \
            open(orig_path, "r", encoding="utf-8") as id_f, \
            open(output_path, "r", encoding="utf-8") as test_file:
        test_lines = test_file.readlines()
        org_lines = org_file.readlines()
        print(len(test_lines), len(org_lines))
        assert len(test_lines) == len(org_lines)
        for pred, inp, sid in zip(test_lines, org_lines, id_f):
            if spellgcn:
                _, atl = inp.strip().split("\t")
                atl = atl.split(" ")[1:]
                pred = pred.split(" ")[1:len(atl) + 1]
            else:
                atl, _, _ = inp.strip().split("\t")[:3]
                atl = atl.split(" ")
                pred = pred.split(" ")[:len(atl)]
            output_list = [sid.split()[0]]
            for i, (pt, at) in enumerate(zip(pred, atl)):
                if at == "[SEP]" or at == "[PAD]":
                    break
                # Post-processing for unsupervised methods, because
                # unsupervised BERT always predicts punctuation at the 1st position
                if i == 0 and pt in ("。", "，"):
                    continue
                # Drop the WordPiece continuation prefix
                if pt.startswith("##"):
                    pt = pt[2:]
                if at.startswith("##"):
                    at = at[2:]
                if pt != at:
                    output_list.append(str(i + 1))
                    output_list.append(pt)
            if len(output_list) == 1:
                output_list.append("0")  # "0" marks a sentence without errors
            labels_writer.write(", ".join(output_list) + "\n")


def eval_spell(truth_path, pred_path, with_error=True):
    """Compute precision, recall and F1 at the detection and correction level."""
    eps = 1e-8  # avoids division by zero
    detect_TP, detect_FP, detect_FN = 0, 0, 0
    correct_TP, correct_FP, correct_FN = 0, 0, 0
    detect_sent_TP, sent_P, sent_N, correct_sent_TP = 0, 0, 0, 0
    dc_TP, dc_FP, dc_FN = 0, 0, 0
    with open(pred_path, "r", encoding="utf-8") as pred_file, \
            open(truth_path, "r", encoding="utf-8") as truth_file:
        for pred, actual in zip(pred_file, truth_file):
            pred_tokens = pred.strip().split(" ")
            actual_tokens = actual.strip().split(" ")
            # assert pred_tokens[0] == actual_tokens[0]
            pred_tokens = pred_tokens[1:]
            actual_tokens = actual_tokens[1:]
            # Even indices hold positions, odd indices hold characters
            detect_actual_tokens = [int(t.strip(",")) for i, t in enumerate(actual_tokens) if i % 2 == 0]
            correct_actual_tokens = [t.strip(",") for i, t in enumerate(actual_tokens) if i % 2 == 1]
            detect_pred_tokens = [int(t.strip(",")) for i, t in enumerate(pred_tokens) if i % 2 == 0]
            _correct_pred_tokens = [t.strip(",") for i, t in enumerate(pred_tokens) if i % 2 == 1]

            # Post-processing for the ACL2019 CSC paper, which only deals with the
            # last detected positions in the test data. To follow that paper,
            # detect_pred_tokens should be reduced here:
            max_detect_pred_tokens = detect_pred_tokens

            # Materialize the pairs as sets: a zip object can only be consumed once
            pred_pairs = set(zip(detect_pred_tokens, _correct_pred_tokens))
            actual_pairs = set(zip(detect_actual_tokens, correct_actual_tokens))

            if detect_pred_tokens[0] != 0:
                sent_P += 1
                if sorted(pred_pairs) == sorted(actual_pairs):
                    correct_sent_TP += 1
            if detect_actual_tokens[0] != 0:
                if sorted(detect_actual_tokens) == sorted(detect_pred_tokens):
                    detect_sent_TP += 1
                sent_N += 1

            if detect_actual_tokens[0] != 0:
                detect_TP += len(set(max_detect_pred_tokens) & set(detect_actual_tokens))
                detect_FN += len(set(detect_actual_tokens) - set(max_detect_pred_tokens))
            detect_FP += len(set(max_detect_pred_tokens) - set(detect_actual_tokens))

            # Only check the tokens at correctly detected positions
            correct_pred_tokens = {(dpt, cpt) for dpt, cpt in pred_pairs if dpt in detect_actual_tokens}
            correct_TP += len(correct_pred_tokens & actual_pairs)
            correct_FP += len(correct_pred_tokens - actual_pairs)
            correct_FN += len(actual_pairs - correct_pred_tokens)

            # Correction level that depends on the model's own detection
            dc_TP += len(pred_pairs & actual_pairs)
            dc_FP += len(pred_pairs - actual_pairs)
            dc_FN += len(actual_pairs - pred_pairs)

    detect_precision = detect_TP * 1.0 / (detect_TP + detect_FP + eps)
    detect_recall = detect_TP * 1.0 / (detect_TP + detect_FN + eps)
    detect_F1 = 2. * detect_precision * detect_recall / (detect_precision + detect_recall + eps)
    correct_precision = correct_TP * 1.0 / (correct_TP + correct_FP + eps)
    correct_recall = correct_TP * 1.0 / (correct_TP + correct_FN + eps)
    correct_F1 = 2. * correct_precision * correct_recall / (correct_precision + correct_recall + eps)
    dc_precision = dc_TP * 1.0 / (dc_TP + dc_FP + eps)
    dc_recall = dc_TP * 1.0 / (dc_TP + dc_FN + eps)
    dc_F1 = 2. * dc_precision * dc_recall / (dc_precision + dc_recall + eps)
    if with_error:
        # Token-level metrics
        print("detect_precision=%f, detect_recall=%f, detect_Fscore=%f" % (detect_precision, detect_recall, detect_F1))
        print("correct_precision=%f, correct_recall=%f, correct_Fscore=%f" % (correct_precision, correct_recall, correct_F1))
        print("dc_joint_precision=%f, dc_joint_recall=%f, dc_joint_Fscore=%f" % (dc_precision, dc_recall, dc_F1))

    detect_sent_precision = detect_sent_TP * 1.0 / (sent_P + eps)
    detect_sent_recall = detect_sent_TP * 1.0 / (sent_N + eps)
    detect_sent_F1 = 2. * detect_sent_precision * detect_sent_recall / (detect_sent_precision + detect_sent_recall + eps)
    correct_sent_precision = correct_sent_TP * 1.0 / (sent_P + eps)
    correct_sent_recall = correct_sent_TP * 1.0 / (sent_N + eps)
    correct_sent_F1 = 2. * correct_sent_precision * correct_sent_recall / (correct_sent_precision + correct_sent_recall + eps)
    if not with_error:
        # Sentence-level metrics
        print("detect_sent_precision=%f, detect_sent_recall=%f, detect_Fscore=%f" % (detect_sent_precision, detect_sent_recall, detect_sent_F1))
        print("correct_sent_precision=%f, correct_sent_recall=%f, correct_Fscore=%f" % (correct_sent_precision, correct_sent_recall, correct_sent_F1))


if __name__ == '__main__':
    output_path = sys.argv[1]
    data_path = sys.argv[2]
    # The file names are empty in the source, and the script breaks off after these assignments.
    input_path = os.path.join(data_path, "")
    pred_path = os.path.join(os.path.dirname(output_path), '')
    orig_input_path = os.path.join(data_path, "")
    orig_truth_path = os.path.join(data_path, "")
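The detection-level and correction-level metrics described in this section can also be illustrated on in-memory data. The following is a simplified sketch, not a replacement for the script above: it counts character-level edits only, scores correction over all predicted edits, and uses a hypothetical input format where each sentence id maps to a `{position: corrected_char}` dict.

```python
def prf(tp, fp, fn):
    """Precision, recall and F1 from raw counts (0.0 when undefined)."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1


def evaluate(gold, pred):
    """gold / pred map a sentence id to a {position: corrected_char} dict."""
    gold_det = {(sid, pos) for sid, edits in gold.items() for pos in edits}
    pred_det = {(sid, pos) for sid, edits in pred.items() for pos in edits}
    gold_cor = {(sid, pos, ch) for sid, edits in gold.items() for pos, ch in edits.items()}
    pred_cor = {(sid, pos, ch) for sid, edits in pred.items() for pos, ch in edits.items()}
    # Detection: is the position right? Correction: are position and character right?
    detection = prf(len(gold_det & pred_det), len(pred_det - gold_det), len(gold_det - pred_det))
    correction = prf(len(gold_cor & pred_cor), len(pred_cor - gold_cor), len(gold_cor - pred_cor))
    return detection, correction


# Toy data: s1 is fixed correctly, s2 is a false alarm on a clean sentence,
# s3 is detected at the right position but corrected to the wrong character.
gold = {"s1": {4: "料"}, "s2": {}, "s3": {2: "兴"}}
pred = {"s1": {4: "料"}, "s2": {1: "新"}, "s3": {2: "星"}}
detection, correction = evaluate(gold, pred)
print("detection  P/R/F1:", detection)
print("correction P/R/F1:", correction)
```

In this toy case detection has two true positives and one false positive, while correction has one true positive, two false positives and one false negative, which shows why correction scores are never higher than detection scores under this counting.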
