CMRUTU: Code Mixed Roman Urdu (Roman Urdu and English) to Urdu Translator

2022 24th International Multitopic Conference (INMIC). Pub Date: 2022-10-21. DOI: 10.1109/INMIC56986.2022.9972972

Muhammad Wisal, A. Mustafa, Umair Arshad


Abstract

Urdu is the official language of Pakistan and is widely understood across South Asia. It is spoken as a first language by nearly 70 million people and as a second language by more than 100 million people, mainly in Pakistan and India. Most textual communication is not pure Roman Urdu: English words are interspersed within Roman Urdu sentences. Because the purpose of any language is to communicate, a translator is needed that can convert these code-mixed sentences into Urdu. It can be difficult for a machine to follow the shift between languages within a sentence. In the past, researchers have worked on Urdu transliteration and rule-based translation. However, a direct translation of mixed Roman Urdu to Urdu with such accuracy is novel. In this research, we introduce a mixed-language (Roman Urdu and English) to Urdu translator. A pre-trained deep learning model, “g2p_multilingual_byT5_small”, is fine-tuned on a newly created corpus of mixed Roman Urdu sentences and their translations in pure Urdu. With a BLEU score of 66.73, it can translate text messages, paragraphs, or any descriptions from Roman Urdu to Urdu. We carried out this research using the Python programming language and trained the model on Google Colab.
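
To make the reported metric concrete, the following is a minimal sketch of how a corpus-level BLEU score on a 0-100 scale can be computed. It is not the authors' evaluation code: the abstract does not state which BLEU implementation, tokenization, or smoothing was used, so whitespace tokenization, a single reference per sentence, no smoothing, and the function names `ngrams` and `corpus_bleu` are all assumptions made for illustration.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Corpus-level BLEU (0-100) with one reference per hypothesis.

    Assumption: whitespace tokenization and no smoothing. The paper does
    not specify its BLEU variant, so scores may differ from its setup.
    """
    matches = [0] * max_n  # clipped n-gram matches, one slot per order
    totals = [0] * max_n   # hypothesis n-gram counts, one slot per order
    hyp_len = ref_len = 0

    for hyp, ref in zip(hypotheses, references):
        hyp_tokens, ref_tokens = hyp.split(), ref.split()
        hyp_len += len(hyp_tokens)
        ref_len += len(ref_tokens)
        for n in range(1, max_n + 1):
            hyp_counts = ngrams(hyp_tokens, n)
            ref_counts = ngrams(ref_tokens, n)
            # Clip each hypothesis n-gram count by its count in the reference.
            matches[n - 1] += sum(
                min(count, ref_counts[gram]) for gram, count in hyp_counts.items()
            )
            totals[n - 1] += sum(hyp_counts.values())

    if hyp_len == 0 or min(matches) == 0:
        return 0.0  # a zero precision at any order zeroes the geometric mean

    # Geometric mean of the n-gram precisions, computed in log space.
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # The brevity penalty discourages outputs shorter than the reference.
    brevity_penalty = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100.0 * brevity_penalty * math.exp(log_precision)
```

Under these assumptions, an output identical to its reference scores 100.0, and an output sharing no words with its reference scores 0.0. A score such as 66.73 therefore indicates substantial, though not exact, n-gram overlap with the reference Urdu translations.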
