Count, decompose and correct: A new approach to handwritten Chinese character error correction

Impact factor: 7.5 · Zone 1 (Computer Science) · JCR Q1 (Computer Science, Artificial Intelligence)

Pattern Recognition · Pub Date: 2024-11-19 · DOI: 10.1016/j.patcog.2024.111110

Pengfei Hu, Jiefeng Ma, Zhenrong Zhang, Jun Du, Jianshu Zhang

Volume 160, Article 111110 · https://www.sciencedirect.com/science/article/pii/S0031320324008616

Citations: 0

Abstract


Count, decompose and correct: A new approach to handwritten Chinese character error correction

Recently, handwritten Chinese character error correction has been greatly improved by employing encoder–decoder methods to decompose a Chinese character into an ideographic description sequence (IDS). However, existing methods implicitly capture and encode linguistic information inherent in IDS sequences, leading to a tendency to generate IDS sequences that match seen characters. This poses a challenge when dealing with an unseen misspelled character, as the decoder may generate an IDS sequence that matches a seen character instead. Therefore, we introduce Count, Decompose and Correct (CDC), a novel approach that exhibits better generalization towards unseen misspelled characters. CDC is mainly composed of three parts: the Counter, the Decomposer, and the Corrector. In the first stage, the Counter predicts the number of each radical class without the symbol-level position annotations. In the second stage, the Decomposer employs the counting information and generates the IDS sequence step by step. Moreover, by updating the counting information at each time step, the Decomposer becomes aware of the existence of each radical. With the decomposed IDS sequence, we can determine whether the given character is misspelled. If it is misspelled, the Corrector under the transductive transfer learning strategy predicts the ideal character that the user originally intended to write. We integrate our method into existing encoder–decoder models and significantly enhance their performance.
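The abstract describes the pipeline only at a high level, so the following is a toy, symbolic sketch rather than the authors' model. It assumes the IDS sequence has already been decoded and shows three ideas: counting radicals in an IDS sequence, flagging a character as misspelled when its IDS matches no known character, and picking a correction. The lexicon, the function names, and the edit-distance lookup are all assumptions made for illustration; in particular, the paper's Corrector is trained with a transductive transfer learning strategy, and the nearest-neighbour lookup here is only a stand-in for it.

```python
from collections import Counter

# Ideographic description characters (structure symbols), U+2FF0..U+2FFB.
IDC = {chr(cp) for cp in range(0x2FF0, 0x2FFC)}

# Hypothetical toy lexicon: character -> IDS sequence (not from the paper).
LEXICON = {
    "好": ("⿰", "女", "子"),
    "明": ("⿰", "日", "月"),
    "林": ("⿰", "木", "木"),
    "杏": ("⿱", "木", "口"),
}


def count_radicals(ids):
    """Count each radical in an IDS sequence, ignoring structure symbols."""
    return Counter(sym for sym in ids if sym not in IDC)


def is_misspelled(ids, lexicon=LEXICON):
    """A character is treated as misspelled if its IDS matches no lexicon entry."""
    return tuple(ids) not in set(lexicon.values())


def edit_distance(a, b):
    """Levenshtein distance between two symbol sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # delete x
                           cur[j - 1] + 1,           # insert y
                           prev[j - 1] + (x != y)))  # substitute or match
        prev = cur
    return prev[-1]


def correct(ids, lexicon=LEXICON):
    """Stand-in corrector (assumption): the lexicon character with the closest IDS."""
    return min(lexicon, key=lambda ch: edit_distance(ids, lexicon[ch]))


# A miswritten 明 in which 日 was written as 目.
written = ("⿰", "目", "月")
if is_misspelled(written):
    print(count_radicals(written), correct(written))
```

The edit-distance lookup is chosen only because it is easy to verify by hand; it says nothing about how the learned Corrector ranks candidates.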


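Source journal: Pattern Recognition (Engineering and Technology; Engineering, Electrical & Electronic)

CiteScore: 14.40
Self-citation rate: 16.20%
Articles published: 683
Review time: 5.6 months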

About the journal: The field of Pattern Recognition is both mature and rapidly evolving, playing a crucial role in various related fields such as computer vision, image processing, text analysis, and neural networks. It closely intersects with machine learning and is being applied in emerging areas like biometrics, bioinformatics, multimedia data analysis, and data science. The journal Pattern Recognition, established half a century ago during the early days of computer science, has since grown significantly in scope and influence.