Joint Approach to Deromanization of Code-mixed Texts

Proceedings of the Sixth Workshop on

Pub Date: 2019-06-01

DOI: 10.18653/v1/W19-1403

Rashed Rubby Riyadh, Grzegorz Kondrak

Citations: 6

Abstract

The conversion of romanized texts back to the native scripts is a challenging task because of the inconsistent romanization conventions and non-standard language use. This problem is compounded by code-mixing, i.e., using words from more than one language within the same discourse. In this paper, we propose a novel approach for handling these two problems together in a single system. Our approach combines three components: language identification, back-transliteration, and sequence prediction. The results of our experiments on Bengali and Hindi datasets establish the state of the art for the task of deromanization of code-mixed texts.
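To make the three-component design concrete, the following is a minimal toy sketch of how language identification, back-transliteration, and sequence prediction could be chained. It is not the system described in the paper: the lexicon `ENGLISH_WORDS`, the candidate table `CANDIDATES`, the bigram table `BIGRAM_SCORE`, and all function names are hypothetical, and leaving English tokens in Latin script is an assumption made only for this illustration.

```python
from itertools import product

# Hypothetical toy resources; none of these come from the paper.
ENGLISH_WORDS = {"school", "bus"}
# Romanized token -> candidate native-script forms (Hindi, illustrative only).
CANDIDATES = {
    "main": ["मैं", "में"],
    "ja": ["जा"],
    "raha": ["रहा"],
    "hoon": ["हूँ"],
}
# Toy bigram scores over output tokens; unseen pairs score 0.
BIGRAM_SCORE = {("मैं", "school"): 1.0}


def identify_language(token):
    """Step 1: label a token as English ('en') or Hindi ('hi') by lexicon lookup."""
    return "en" if token.lower() in ENGLISH_WORDS else "hi"


def back_transliterate(token):
    """Step 2: propose native-script candidates; fall back to the token itself."""
    return CANDIDATES.get(token.lower(), [token])


def sequence_score(seq):
    """Step 3: score an output sequence by summing its bigram scores."""
    return sum(BIGRAM_SCORE.get(pair, 0.0) for pair in zip(seq, seq[1:]))


def deromanize(tokens):
    """Pick the highest-scoring combination of per-token candidates."""
    options = [
        # Assumption: English tokens are kept as-is rather than transliterated.
        [tok] if identify_language(tok) == "en" else back_transliterate(tok)
        for tok in tokens
    ]
    # Exhaustive search is only workable for a toy input of a few tokens.
    return list(max(product(*options), key=sequence_score))


result = deromanize(["main", "school", "ja", "raha", "hoon"])
```

In this sketch the ambiguous romanized token `main` has two candidates, and the sequence score is what selects between them, which illustrates why the components benefit from being combined rather than applied in isolation.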
