Shailashree K. Sheshadri, Deepa Gupta, M. Costa-jussà
2022 OITS International Conference on Information Technology (OCIT), December 2022. DOI: https://doi.org/10.1109/OCIT56763.2022.00053
Neural Machine Translation for Kashmiri to English and Hindi using Pre-trained Embeddings
Neural Machine Translation (NMT) is an advanced approach to Machine Translation (MT) that has recently gained popularity. A sound translation system requires a large parallel corpus, but most of the world's languages lack one. Many state-of-the-art (SoTA) NMT systems for low-resource languages have been developed using transfer learning, knowledge transfer, and zero-shot learning mechanisms. Most Indic languages fall into the low-resource or zero-resource category because rich parallel and monolingual corpora are unavailable. Though many Indian border languages have social and economic significance, they lack resources and automated machine translation systems. Kashmiri, one such Indian border language, belongs to the zero-resource category, with limited corpora and no significant translation system. This paper uses pre-trained word embeddings to create the first NMT system specifically for Kashmiri-English and Kashmiri-Hindi translation. mBPE pre-trained word embeddings for the Kashmiri language are used to develop the NMT system. The model with pre-trained word embeddings shows a +2.58 BLEU improvement over vanilla NMT.
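The abstract does not describe how the pre-trained embeddings enter the model. A common pattern, shown here only as a minimal sketch and not as the authors' implementation, is to initialise the encoder's embedding table from the pre-trained subword vectors and fall back to small random values for tokens that have no pre-trained vector. The function name `build_embedding_matrix`, the placeholder tokens, the vector values, and the dimension are all hypothetical.

```python
import random


def build_embedding_matrix(vocab, pretrained, dim, seed=0):
    """Build one embedding row per vocabulary token.

    Rows are copied from `pretrained` when a vector of the right size
    exists; otherwise they are drawn uniformly from [-0.1, 0.1].
    Returns the matrix and the number of rows that were copied.
    """
    rng = random.Random(seed)
    matrix = []
    hits = 0
    for token in vocab:
        vec = pretrained.get(token)
        if vec is not None and len(vec) == dim:
            matrix.append(list(vec))  # copy the pre-trained vector
            hits += 1
        else:
            # No usable pre-trained vector: small random initialisation.
            matrix.append([rng.uniform(-0.1, 0.1) for _ in range(dim)])
    return matrix, hits


# Toy data: placeholder subword tokens and made-up vectors (assumptions).
vocab = ["<unk>", "tok_a", "tok_b", "tok_c"]
pretrained = {"tok_a": [0.1, 0.2, 0.3], "tok_b": [0.4, 0.5, 0.6]}

matrix, hits = build_embedding_matrix(vocab, pretrained, dim=3)
print(f"{hits}/{len(vocab)} rows initialised from pre-trained vectors")
```

In a real system the resulting matrix would be loaded into the NMT model's embedding layer, and whether those rows are then frozen or fine-tuned is a design choice the abstract does not specify.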