IME-Spell: Chinese Spelling Check based on Input Method

Proceedings of the 4th International Conference on Natural Language Processing and Information Retrieval. Pub Date: 2020-12-18. DOI: 10.1145/3443279.3443297

Qingbiao Zhao, Xingfa Shen, Jian Yao


Citations: 1

Abstract

Chinese Spelling Check (CSC) has been studied extensively in natural language processing as a way to reduce manual inspection costs and semantic misunderstandings. However, little work has been done on input-method-based CSC, in which Pinyin information is used to improve the efficiency of spelling correction. This paper proposes IME-Spell, a novel CSC architecture for input methods based on pre-trained context vectors. The architecture consists of two parts. The spelling detection part fuses character-based pre-trained context vectors with Pinyin vectors and uses sequence labeling to detect erroneous characters. The spelling correction part uses a Masked Language Model (MLM) to generate a candidate set for each erroneous character, then uses XGBoost and Pinyin-to-Character conversion models to filter the candidates and correct the erroneous characters for users. IME-Spell improves significantly on the benchmark models on the SIGHAN dataset: the maximum F1 differences on the spelling detection and correction subtasks reach 48.9% and 27.8%, respectively.
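The two stages described in the abstract can be illustrated with a small toy sketch. It is not the authors' implementation, and every name and value in it is hypothetical. The lookup tables stand in for a pre-trained encoder and a learned Pinyin embedding. Fusion is assumed to be plain concatenation, since the abstract does not say how the vectors are combined. A simple Pinyin match stands in for the Pinyin-to-Character conversion model. The XGBoost filtering step is omitted.

```python
from typing import Dict, List, Tuple

# Hypothetical toy tables; a real system would take character vectors from a
# pre-trained encoder and Pinyin vectors from a learned embedding table.
CHAR_VECS: Dict[str, List[float]] = {
    "天": [0.1, 0.9],
    "田": [0.2, 0.7],
    "气": [0.8, 0.3],
}
PINYIN_VECS: Dict[str, List[float]] = {"tian": [1.0, 0.0], "qi": [0.0, 1.0]}
CHAR_TO_PINYIN: Dict[str, str] = {"天": "tian", "田": "tian", "气": "qi", "汽": "qi"}


def fuse(char: str) -> List[float]:
    """Detection-side input: character vector concatenated with its Pinyin vector.

    Concatenation is an assumption made for illustration only.
    """
    return CHAR_VECS[char] + PINYIN_VECS[CHAR_TO_PINYIN[char]]


def filter_candidates(
    candidates: List[Tuple[str, float]], typed_pinyin: str
) -> List[Tuple[str, float]]:
    """Correction-side filter: keep candidates (e.g. from an MLM) whose Pinyin
    matches the syllable the user typed, ordered by descending score."""
    kept = [(c, s) for c, s in candidates if CHAR_TO_PINYIN.get(c) == typed_pinyin]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)
```

For example, `filter_candidates([("汽", 0.5), ("天", 0.3), ("气", 0.2)], "qi")` drops "天" because its Pinyin does not match the typed syllable. The point of the sketch is that Pinyin typed through an input method gives the corrector a strong constraint on which candidates are plausible.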
