A transformer-based spelling error correction framework for Bangla and resource scarce Indic languages

IF 3.1 | Zone 3, Computer Science | Q2, COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Computer Speech and Language Pub Date : 2024-08-07 DOI:10.1016/j.csl.2024.101703

Mehedi Hasan Bijoy, Nahid Hossain, Salekul Islam, Swakkhar Shatabda

Computer Speech and Language, volume 89, Article 101703. Full text: https://www.sciencedirect.com/science/article/pii/S088523082400086X

Citations: 0

Abstract

Spelling error correction is the task of identifying and rectifying misspelled words in text. It is an active research topic in Natural Language Processing because of its numerous applications in human language understanding. Characters that are phonetically or visually similar yet semantically distinct make the task arduous in any language. Earlier efforts on spelling error correction in Bangla and resource-scarce Indic languages focused on rule-based, statistical, and machine learning-based methods, which we found rather inefficient. In particular, machine learning-based approaches, although they outperform rule-based and statistical methods, are ineffective because they correct each character regardless of its appropriateness. In this paper, we address these issues by proposing DPCSpell, a novel detector-purificator-corrector framework based on denoising transformers. In addition, we present a method for creating a large-scale corpus from scratch, which resolves the resource limitation problem for any left-to-right scripted language. The empirical results demonstrate the effectiveness of our approach, which outperforms previous state-of-the-art methods for Bangla spelling error correction, attaining an exact match (EM) score of 94.78%, a precision of 0.9487, a recall of 0.9478, an F1 score of 0.948, an F0.5 score of 0.9483, and a modified accuracy (MA) score of 95.16%. The models and corpus are publicly available at https://tinyurl.com/DPCSpell.
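For readers unfamiliar with the reported metrics, the following is a minimal sketch of the textbook definitions of exact match and the F-beta score. It is not the authors' evaluation code: the function names `exact_match` and `f_beta` and the toy word lists are hypothetical, and the abstract does not say how the paper aggregates precision and recall (for example, per-class averaging), so the published figures need not follow from these formulas directly.

```python
from typing import Sequence


def exact_match(predictions: Sequence[str], references: Sequence[str]) -> float:
    """Percentage of predictions that are identical to their reference."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    hits = sum(p == r for p, r in zip(predictions, references))
    return 100.0 * hits / len(references)


def f_beta(precision: float, recall: float, beta: float = 1.0) -> float:
    """Standard F-beta score: (1 + b^2) * P * R / (b^2 * P + R).

    beta < 1 weights precision more heavily; beta = 1 gives F1.
    """
    b2 = beta * beta
    denominator = b2 * precision + recall
    if denominator == 0:
        return 0.0
    return (1 + b2) * precision * recall / denominator


if __name__ == "__main__":
    # Hypothetical toy data: corrected outputs versus gold spellings.
    predicted = ["receive", "separate", "definately", "occurred"]
    gold = ["receive", "separate", "definitely", "occurred"]
    print(f"EM   = {exact_match(predicted, gold):.2f}%")
    print(f"F1   = {f_beta(0.9, 0.8):.4f}")
    print(f"F0.5 = {f_beta(0.9, 0.8, beta=0.5):.4f}")
```

With beta set to 0.5, the score leans toward precision, which suits spelling correction when a wrong "correction" is considered costlier than a missed one.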


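Source journal

Computer Speech and Language (Engineering and Technology: Computer Science, Artificial Intelligence)

CiteScore: 11.30

Self-citation rate: 4.70%

Articles published: (not shown)

Review time: 22.9 weeks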

About the journal: Computer Speech & Language publishes reports of original research related to the recognition, understanding, production, coding and mining of speech and language. The speech and language sciences have a long history, but it is only relatively recently that large-scale implementation of and experimentation with complex models of speech and language processing has become feasible. Such research is often carried out somewhat separately by practitioners of artificial intelligence, computer science, electronic engineering, information retrieval, linguistics, phonetics, or psychology.