Kannada Textual Error Correction Using T5 Model

2023 IEEE 8th International Conference for Convergence in Technology (I2CT). Pub Date: 2023-04-07. DOI: 10.1109/I2CT57861.2023.10126228

Sushmitha Ramaneedi, P. Pati

Abstract

Errors creep into text in various ways. Typing errors arise from either mistyping or limited language proficiency. Similarly, recognition technologies that convert text images and speech into text may introduce errors because of their own limitations. Irrespective of how errors are introduced, their presence poses a major challenge for downstream consumption of the text. Additionally, errors in Indian-language documents come with their own set of issues. This necessitates a focused study of textual errors in Indian-language documents and of the technologies that may be employed to eliminate them.

This work proposes to employ mT5, a popular deep-learning-based multilingual language model, to eliminate errors in text written in Kannada, an Indian language. A pretrained mT5 model is enhanced through transfer learning on a Kannada dataset. The ability of the enhanced mT5 model to reduce errors is studied at various noise levels, with Character Error Rate (CER) as the metric. The enhanced mT5 model is observed to reduce noise by 12% for input text with 25% CER.
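The abstract names Character Error Rate (CER) as the evaluation metric but does not spell out how it is computed. A common definition, assumed here, is the Levenshtein edit distance between the reference and the model output divided by the reference length. The sketch below follows that definition; the function names `edit_distance` and `cer`, the NFC normalization step, and the choice to count Unicode code points rather than grapheme clusters are illustrative assumptions, not details taken from the paper.

```python
import unicodedata


def edit_distance(ref, hyp):
    """Levenshtein distance: insertions, deletions and substitutions each cost 1."""
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to the empty string
        for j, h in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if r == h else 1)
            deletion = prev[j] + 1       # drop r from the reference
            insertion = curr[j - 1] + 1  # add h to the reference
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character Error Rate = edit distance / reference length.

    Assumption: characters are Unicode code points after NFC normalization.
    Kannada conjuncts span several code points, so a grapheme-level CER
    could give different values.
    """
    ref = unicodedata.normalize("NFC", reference)
    hyp = unicodedata.normalize("NFC", hypothesis)
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


if __name__ == "__main__":
    clean = "ಕನ್ನಡ"   # 5 code points
    noisy = "ಕನ್ನದ"   # last consonant substituted
    print(f"CER = {cer(clean, noisy):.2f}")
```

Under this definition, one substituted code point in a five-code-point reference yields a CER of 0.20. Evaluating a correction model would mean comparing `cer(clean, noisy)` against `cer(clean, corrected)` on the same reference text.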
