Incorporating Confused Phraseological Knowledge Based on Pinyin Input Method for Chinese Spelling Correction
Authors: Weidong Zhao; Xiaoyu Wang; Liqing Qiu
Journal: IEEE Transactions on Big Data, vol. 11, no. 5, pp. 2724-2735
Published: 2025-03-26
DOI: 10.1109/TBDATA.2025.3552344
URL: https://ieeexplore.ieee.org/document/10942550/
Impact factor: 5.7; JCR: Q1 (Computer Science, Information Systems)
Citations: 0
Abstract
Chinese Spelling Correction (CSC) aims to detect and correct spelling errors in Chinese text. In practice, most keyboard input relies on the pinyin input method, so studying the spelling errors it produces is both practical and valuable. However, no existing work has yet proposed a model designed specifically for this scenario. To address this gap, this paper proposes IPCK-IME, a model that incorporates confused phraseological knowledge based on the pinyin input method. The model integrates its own phonetic features with external similarity knowledge to guide it toward outputting more correct characters. Furthermore, to mitigate the influence of spelling errors on sentence semantics, a Gaussian bias is introduced into the model's self-attention network. This approach aims to reduce the focus on typos and improve attention to local context. Empirical evidence indicates that our method surpasses existing models in correcting spelling errors generated by the pinyin input method and is more appropriate for correcting Chinese spelling errors in real input scenarios.
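The abstract does not give the form of the Gaussian bias, so the following is only a minimal sketch of one common way such a bias can be added to self-attention, not the IPCK-IME implementation. It assumes the bias is G[i][j] = -(i - j)^2 / (2 * sigma^2), centered on the query position and added to the scaled dot-product logits before the softmax. The function names and the `sigma` parameter are hypothetical, and the paper may center or scale the bias differently.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def gaussian_bias(length, sigma):
    """Build a length x length bias matrix (assumed form, not from the paper).

    G[i][j] = -(i - j)^2 / (2 * sigma^2): zero on the diagonal and more
    negative with distance, so nearby positions are favored.
    """
    return [[-((i - j) ** 2) / (2.0 * sigma ** 2) for j in range(length)]
            for i in range(length)]


def attention_weights(queries, keys, sigma=None):
    """Scaled dot-product self-attention weights.

    When sigma is given, the Gaussian bias is added to the logits before
    the softmax. Queries and keys must have the same number of rows.
    """
    dim = len(queries[0])
    bias = gaussian_bias(len(keys), sigma) if sigma is not None else None
    weights = []
    for i, q in enumerate(queries):
        logits = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim)
                  for k in keys]
        if bias is not None:
            logits = [x + g for x, g in zip(logits, bias[i])]
        weights.append(softmax(logits))
    return weights


# Toy input: five identical token vectors, so unbiased attention is uniform
# and any locality in the result comes from the bias alone.
vectors = [[1.0, 0.0] for _ in range(5)]
plain = attention_weights(vectors, vectors)
local = attention_weights(vectors, vectors, sigma=1.0)
```

With identical token vectors, `plain` spreads attention evenly across all five positions, while `local` concentrates each row's weight around the query position. A smaller `sigma` narrows that window and a larger one approaches the unbiased case. This illustrates only the local-context effect; how the model lowers the weight of the misspelled token itself is not specified in the abstract.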
About the journal:
The IEEE Transactions on Big Data publishes peer-reviewed articles focusing on big data. These articles present innovative research ideas and application results across disciplines, including novel theories, algorithms, and applications. Research areas cover a wide range, such as big data analytics, visualization, curation, management, semantics, infrastructure, standards, performance analysis, intelligence extraction, scientific discovery, security, privacy, and legal issues specific to big data. The journal also prioritizes applications of big data in fields generating massive datasets.