Neural Machine Translation for Kashmiri to English and Hindi using Pre-trained Embeddings

2022 OITS International Conference on Information Technology (OCIT). Pub Date: 2022-12-01. DOI: 10.1109/OCIT56763.2022.00053

Shailashree K. Sheshadri, Deepa Gupta, M. Costa-jussà


Citations: 1

Abstract



Neural Machine Translation (NMT) is one of the advanced approaches to Machine Translation (MT) and has recently gained popularity. Building a sound translation system requires a large parallel corpus, but most of the world's languages lack one. Many state-of-the-art (SoTA) NMT systems for low-resource languages are developed using transfer learning, knowledge transfer, and zero-shot learning mechanisms. Most Indic languages fall into the low-resource or zero-resource category because rich parallel and monolingual corpora are not available. Though many Indian border languages have social and economic significance, they lack resources and automated machine translation systems. Kashmiri, one such Indian border language, belongs to the zero-resource category, with limited corpora and no significant translation system. This paper uses pre-trained word embeddings to create the first NMT system specifically for Kashmiri-English and Kashmiri-Hindi translation. mBPE pre-trained word embeddings for the Kashmiri language are used to develop the NMT system. The model with pre-trained word embeddings shows a +2.58 BLEU improvement over vanilla NMT.
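
The abstract does not describe how the pre-trained embeddings enter the model, so the following is only a minimal sketch of one common approach, not the authors' implementation. It assumes the pre-trained subword vectors are available as a token-to-vector mapping and that tokens without a pre-trained vector fall back to small random values, as every row would in a vanilla model. The function name `build_embedding_matrix`, the toy vocabulary, and the 3-dimensional vectors are all hypothetical.

```python
import random


def build_embedding_matrix(vocab, pretrained, dim, seed=0):
    """Return (matrix, hits): one embedding row per vocabulary token.

    Tokens found in `pretrained` copy their vector. All other tokens get
    small random values, which is how a randomly initialised (vanilla)
    embedding table would start for every row.
    """
    rng = random.Random(seed)
    matrix, hits = [], 0
    for token in vocab:
        vector = pretrained.get(token)
        if vector is not None and len(vector) == dim:
            matrix.append(list(vector))  # copy the pre-trained row
            hits += 1
        else:
            matrix.append([rng.uniform(-0.1, 0.1) for _ in range(dim)])
    return matrix, hits


# Toy subword vocabulary and made-up vectors (illustrative only, not real data).
vocab = ["<unk>", "▁ka", "shmir", "▁zaban"]
pretrained = {"▁ka": [0.1, 0.2, 0.3], "shmir": [0.4, 0.5, 0.6]}

matrix, hits = build_embedding_matrix(vocab, pretrained, dim=3)
coverage = hits / len(vocab)  # share of the vocabulary with a pre-trained row
```

In a real system the resulting matrix would be loaded into the encoder's embedding layer before training. The coverage figure is a quick check of how much of the vocabulary actually benefits from the pre-trained vectors.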

