Kannada Textual Error Correction Using T5 Model

Sushmitha Ramaneedi, P. Pati
Published in: 2023 IEEE 8th International Conference for Convergence in Technology (I2CT)
Publication date: 2023-04-07
DOI: 10.1109/I2CT57861.2023.10126228 (https://doi.org/10.1109/I2CT57861.2023.10126228)
Source platform: Semantic Scholar
Citations: 0

Abstract

Errors creep into text in various ways. Typing errors may arise from mistyping or from limited language proficiency. Similarly, recognition technologies that convert text images and speech into text may introduce errors because of their own limitations. Irrespective of how the errors are introduced, their presence poses a major challenge for downstream consumption of the text. Errors in Indian-language documents additionally come with their own set of issues. This calls for a focused study of textual errors in Indian-language documents and of the technologies that may be employed to eliminate them.

This work proposes to employ mT5, a popular deep-learning-based multilingual language model, to eliminate errors in Kannada (an Indian language) text. A pretrained mT5 model is enhanced through transfer learning on a Kannada dataset. The ability of the enhanced mT5 model to reduce errors is studied at various noise levels, with Character Error Rate (CER) as the metric. The enhanced mT5 model is observed to reduce noise by 12% for input text with 25% CER.
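For readers unfamiliar with the metric, the following is a minimal sketch of how Character Error Rate is commonly computed: the Levenshtein edit distance between a reference and a hypothesis, divided by the reference length. It is not taken from the paper. The function names `levenshtein` and `cer` are illustrative, and the sketch assumes that a "character" is a Unicode code point. The abstract does not say whether the authors count code points or Kannada grapheme clusters, nor whether the reported 12% reduction is absolute or relative.

```python
def levenshtein(reference: str, hypothesis: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn `reference` into `hypothesis`."""
    # prev[j] holds the distance between the reference prefix processed
    # so far and the first j characters of the hypothesis.
    prev = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            cost = 0 if ref_char == hyp_char else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate: edit distance divided by reference length.
    Assumption: characters are Unicode code points, not grapheme clusters."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return levenshtein(reference, hypothesis) / len(reference)


if __name__ == "__main__":
    # The word "Kannada" in Kannada script, written as 5 code points.
    reference = "\u0c95\u0ca8\u0ccd\u0ca8\u0ca1"
    # Same word with the final letter DDA (U+0CA1) replaced by DA (U+0CA6).
    noisy = "\u0c95\u0ca8\u0ccd\u0ca8\u0ca6"
    print(cer(reference, noisy))  # one substitution over five code points: 0.2
```

Under this definition, the CER of a corrector's output can be compared with the CER of its noisy input, both measured against the same clean reference, to quantify how much error the model removes.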