Neural Machine Transliteration Of Indian Languages
Authors: Aryan Singh, Jhalak Bansal
Venue: 2021 4th International Conference on Computing and Communications Technologies (ICCCT)
Published: 2021-12-16
DOI: 10.1109/ICCCT53315.2021.9711806
Citations: 1
Abstract
Transliteration is the task of converting text in a language written in a foreign script into its written form in the native script. It requires understanding not only the written form of a language but also the sounds associated with its written words. Hindi and Punjabi are two of the most widely spoken languages in the world, with a combined base of around 500 million speakers. Although English is now widely understood, regional languages remain the mainstay of spoken and written conversation. Most modern devices still ship with English keyboards, which makes it difficult to write in regional languages. This research aims to develop a scalable and universal architecture that gives state-of-the-art results for the transliteration of Hindi and Punjabi. It explores different heuristics in sequence-to-sequence modelling, attention, and transformer networks to determine the best-suited architecture for transliterating Indian languages. Of these variants, a character/grapheme-level model with a bidirectional encoder and an autoregressive decoder proved to be the best-performing architecture, giving state-of-the-art results on both transliteration and back-transliteration, with BLEU scores of 0.88 on Punjabi and 0.97 on Hindi.
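To make the "character/grapheme level" distinction concrete, the sketch below splits a Devanagari word into tokens in two ways. It is only an illustration under stated assumptions: the abstract does not describe the paper's tokenizer, the function names `to_characters` and `to_graphemes` are hypothetical, and the grapheme rule (attach combining marks to the preceding base character) is a simplification of full Unicode grapheme segmentation.

```python
import unicodedata
from typing import List


def to_characters(word: str) -> List[str]:
    """Character level: one token per Unicode code point."""
    return list(word)


def to_graphemes(word: str) -> List[str]:
    """Approximate grapheme level (an assumption, not the paper's method):
    attach every combining mark to the base character before it."""
    clusters: List[str] = []
    for ch in word:
        # General categories Mn, Mc and Me are combining marks, which
        # cover vowel signs and nasalisation marks in Indic scripts.
        if clusters and unicodedata.category(ch).startswith("M"):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


# The word "Hindi" in Devanagari, written with escapes so that the
# code points are unambiguous: HA, VOWEL SIGN I, ANUSVARA, DA, VOWEL SIGN II.
word = "\u0939\u093f\u0902\u0926\u0940"

print(len(to_characters(word)), to_characters(word))
print(len(to_graphemes(word)), to_graphemes(word))
```

Under this approximation the grapheme sequence is shorter than the code-point sequence, so a sequence model sees fewer, more syllable-like units per word. Which unit works better is an empirical question that the abstract does not settle in detail.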