CMRUTU: Code Mixed Roman Urdu (Roman Urdu and English) to Urdu Translator
Muhammad Wisal, A. Mustafa, Umair Arshad
2022 24th International Multitopic Conference (INMIC), published 2022-10-21
DOI: https://doi.org/10.1109/INMIC56986.2022.9972972
Abstract
Urdu is the official language of Pakistan and is widely understood across South Asia. It is spoken as a first language by nearly 70 million people and as a second language by more than 100 million, mainly in Pakistan and India. Most textual communication is not pure Roman Urdu: English words are interspersed within Roman Urdu sentences. Because the purpose of any language is to communicate, a translator that can convert these code-mixed sentences into Urdu is needed. A machine can struggle to follow the shift between languages within a sentence. Previous research has addressed Urdu transliteration and rule-based translation. However, direct translation of code-mixed Roman Urdu into Urdu at this level of accuracy is novel. In this research, we introduce a mixed-language (Roman Urdu and English) to Urdu translator. A pre-trained deep learning model, “g2p_multilingual_byT5_small”, is fine-tuned on a newly created corpus of mixed Roman Urdu sentences paired with their translations in pure Urdu. With a BLEU score of 66.73, it can translate text messages, paragraphs, or any descriptions from Roman Urdu to Urdu. The research was carried out in Python, with model training on Google Colab.
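For readers unfamiliar with the metric, the following is a minimal sketch of how a corpus-level BLEU score like the one reported above is commonly computed: the geometric mean of clipped 1- to 4-gram precisions, multiplied by a brevity penalty. The abstract does not state which BLEU implementation, tokenization, or smoothing the authors used, so whitespace tokenization, a single reference per sentence, no smoothing, and the function names `ngrams` and `corpus_bleu` are all assumptions made for illustration. The Urdu sentence in the usage lines is a made-up example, not taken from the paper's corpus.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Corpus-level BLEU on a 0-100 scale, one reference per hypothesis.

    Assumption for this sketch: tokens are obtained by whitespace splitting.
    """
    matches = [0] * max_n  # clipped n-gram matches, per n-gram order
    totals = [0] * max_n   # number of hypothesis n-grams, per n-gram order
    hyp_len = ref_len = 0

    for hyp, ref in zip(hypotheses, references):
        hyp_tok, ref_tok = hyp.split(), ref.split()
        hyp_len += len(hyp_tok)
        ref_len += len(ref_tok)
        for n in range(1, max_n + 1):
            hyp_counts = ngrams(hyp_tok, n)
            ref_counts = ngrams(ref_tok, n)
            # Clip each hypothesis n-gram count by its count in the reference.
            matches[n - 1] += sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
            totals[n - 1] += sum(hyp_counts.values())

    # Without smoothing, a single n-gram order with no matches gives BLEU 0.
    if hyp_len == 0 or min(matches) == 0:
        return 0.0

    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # The brevity penalty punishes output shorter than the reference.
    brevity_penalty = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * brevity_penalty * math.exp(log_precision)


# Illustrative usage with a made-up Urdu reference sentence.
reference = ["میں کل دفتر جاؤں گا"]
print(corpus_bleu(reference, reference))  # identical output scores 100.0
print(corpus_bleu(["میں کل اسکول جاؤں گا"], reference))  # one wrong word lowers the score
```

Because no smoothing is applied, a short sentence that differs from its reference by a single word can score zero, which is why BLEU is normally reported over a whole test set rather than per sentence.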