{"title":"Transliteration Characteristics in Romanized Assamese Language Social Media Text and Machine Transliteration","authors":"Hemanta Baruah, Sanasam Ranbir Singh, Priyankoo Sarmah","doi":"10.1145/3639565","DOIUrl":null,"url":null,"abstract":"<p>This article aims to understand different transliteration behaviors of Romanized Assamese text on social media. Assamese, a language that belongs to the Indo-Aryan language family, is also among the 22 scheduled languages in India. With the increasing popularity of social media in India and also the common use of the English Qwerty keyboard, Indian users on social media express themselves in their native languages, but using the Roman/Latin script. Unlike some other popular South Asian languages (say <b>Pinyin</b> for Chinese), Indian languages do not have a common standard romanization convention for writing on social media platforms. Assamese and English are two very different orthographical languages. Thus, considering both orthographic and phonemic characteristics of the language, this study tries to explain how Assamese vowels, vowel diacritics, and consonants are represented in Roman transliterated form. From a dataset of romanized Assamese social media texts collected from three popular social media sites: (Facebook, YouTube and Twitter), we have manually labeled them with their native Assamese script. A comparison analysis is also carried out between the transliterated Assamese social media texts with six different Assamese romanization schemes that reflect how Assamese users on social media do not adhere to any fixed romanization scheme. We have built three separate character-level transliteration models from our dataset. One using a traditional phrase-based statistical machine transliteration model, (1). PBSMT model and two separate neural transliteration models: (2). BiLSTM neural seq2seq model with attention, and (3). Neural transformer model. A thorough error analysis has been performed on the transliteration result obtained from the three state-of-the-art models mentioned above. This may help to build a more robust machine transliteration system for the Assamese social media domain in the future. Finally, an attention analysis experiment is also carried out with the help of attention weight scores taken from the character-level BiLSTM neural seq2seq transliteration model built from our dataset.</p>","PeriodicalId":54312,"journal":{"name":"ACM Transactions on Asian and Low-Resource Language Information Processing","volume":"54 1","pages":""},"PeriodicalIF":1.8000,"publicationDate":"2024-01-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"ACM Transactions on Asian and Low-Resource Language Information Processing","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.1145/3639565","RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
引用次数: 0
Abstract
This article aims to understand different transliteration behaviors of Romanized Assamese text on social media. Assamese, a language that belongs to the Indo-Aryan language family, is also among the 22 scheduled languages in India. With the increasing popularity of social media in India and also the common use of the English Qwerty keyboard, Indian users on social media express themselves in their native languages, but using the Roman/Latin script. Unlike some other popular South Asian languages (say Pinyin for Chinese), Indian languages do not have a common standard romanization convention for writing on social media platforms. Assamese and English are two very different orthographical languages. Thus, considering both orthographic and phonemic characteristics of the language, this study tries to explain how Assamese vowels, vowel diacritics, and consonants are represented in Roman transliterated form. From a dataset of romanized Assamese social media texts collected from three popular social media sites: (Facebook, YouTube and Twitter), we have manually labeled them with their native Assamese script. A comparison analysis is also carried out between the transliterated Assamese social media texts with six different Assamese romanization schemes that reflect how Assamese users on social media do not adhere to any fixed romanization scheme. We have built three separate character-level transliteration models from our dataset. One using a traditional phrase-based statistical machine transliteration model, (1). PBSMT model and two separate neural transliteration models: (2). BiLSTM neural seq2seq model with attention, and (3). Neural transformer model. A thorough error analysis has been performed on the transliteration result obtained from the three state-of-the-art models mentioned above. This may help to build a more robust machine transliteration system for the Assamese social media domain in the future. Finally, an attention analysis experiment is also carried out with the help of attention weight scores taken from the character-level BiLSTM neural seq2seq transliteration model built from our dataset.
期刊介绍:
The ACM Transactions on Asian and Low-Resource Language Information Processing (TALLIP) publishes high quality original archival papers and technical notes in the areas of computation and processing of information in Asian languages, low-resource languages of Africa, Australasia, Oceania and the Americas, as well as related disciplines. The subject areas covered by TALLIP include, but are not limited to:
-Computational Linguistics: including computational phonology, computational morphology, computational syntax (e.g. parsing), computational semantics, computational pragmatics, etc.
-Linguistic Resources: including computational lexicography, terminology, electronic dictionaries, cross-lingual dictionaries, electronic thesauri, etc.
-Hardware and software algorithms and tools for Asian or low-resource language processing, e.g., handwritten character recognition.
-Information Understanding: including text understanding, speech understanding, character recognition, discourse processing, dialogue systems, etc.
-Machine Translation involving Asian or low-resource languages.
-Information Retrieval: including natural language processing (NLP) for concept-based indexing, natural language query interfaces, semantic relevance judgments, etc.
-Information Extraction and Filtering: including automatic abstraction, user profiling, etc.
-Speech processing: including text-to-speech synthesis and automatic speech recognition.
-Multimedia Asian Information Processing: including speech, image, video, image/text translation, etc.
-Cross-lingual information processing involving Asian or low-resource languages.
-Papers that deal in theory, systems design, evaluation and applications in the aforesaid subjects are appropriate for TALLIP. Emphasis will be placed on the originality and the practical significance of the reported research.