Expand-Extract: A Parallel Corpus Mining Framework from Comparable Corpora for English-Myanmar Machine Translation

2022 IEEE 34th International Conference on Tools with Artificial Intelligence (ICTAI) Pub Date : 2022-10-01 DOI:10.1109/ICTAI56018.2022.00045

May Myo Zin, Teeradaj Racharak, Minh Le Nguyen



Abstract

High-quality neural machine translation (NMT) systems rely on the availability of large-scale, reliable parallel data. Because Myanmar is a low-resource language, parallel data for the English-Myanmar language pair is scarce. In this paper, we present a simple yet effective framework for creating a parallel corpus from available comparable corpora. The proposed system first uses self-training and back-translation, together with a denoising-based automatic post-editing (DbAPE) system, to produce synthetic datasets that expand the existing comparable corpora. LaBSE-based sentence embeddings and the proposed scoring function are then applied to extract parallel sentences from the expanded comparable corpora. The extracted parallel sentences can supplement the parallel corpus used to train low-resource English-Myanmar NMT systems. We investigate the effectiveness of our methods by evaluating NMT systems trained on the concatenation of the parallel data created by our framework and an existing dataset. We show that the proposed framework is capable of creating a reliable parallel corpus, and that the created corpus substantially increases the translation quality of MT systems trained on the existing parallel data, as measured by automatic evaluation metrics.
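
The extraction step can be illustrated with a minimal sketch. The abstract does not define the proposed scoring function, so the code below substitutes a commonly used ratio-margin score over cosine similarities, followed by a mutual-best-match filter. This is an assumption for illustration, not the authors' method. Small hand-made vectors stand in for LaBSE sentence embeddings, and the names `margin_scores`, `mine_pairs`, `k`, and `threshold` are hypothetical.

```python
import numpy as np


def normalize(x):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)


def margin_scores(src, tgt, k=2):
    """Ratio-margin score: cosine similarity divided by the average
    similarity of each side to its k nearest neighbours on the other side.
    Assumed stand-in for the paper's scoring function, which the abstract
    does not specify."""
    sim = normalize(src) @ normalize(tgt).T            # shape (n_src, n_tgt)
    k_row = min(k, sim.shape[1])
    k_col = min(k, sim.shape[0])
    # Mean similarity of each source sentence to its k closest targets.
    src_knn = np.sort(sim, axis=1)[:, -k_row:].mean(axis=1)
    # Mean similarity of each target sentence to its k closest sources.
    tgt_knn = np.sort(sim, axis=0)[-k_col:, :].mean(axis=0)
    return sim / ((src_knn[:, None] + tgt_knn[None, :]) / 2.0)


def mine_pairs(src, tgt, threshold=1.2, k=2):
    """Keep (source, target) index pairs that are each other's best match
    and whose score reaches the (hypothetical) threshold."""
    scores = margin_scores(src, tgt, k)
    best_tgt = scores.argmax(axis=1)   # best target for each source
    best_src = scores.argmax(axis=0)   # best source for each target
    return [
        (i, int(j), float(scores[i, j]))
        for i, j in enumerate(best_tgt)
        if best_src[j] == i and scores[i, j] >= threshold
    ]


# Toy vectors standing in for LaBSE embeddings of English (src) and
# Myanmar (tgt) sentences; the last target has no clear counterpart.
src_emb = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
tgt_emb = np.array(
    [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0], [0.0, 0.1, 0.9], [0.5, 0.5, 0.5]]
)

for i, j, score in mine_pairs(src_emb, tgt_emb):
    print(f"source {i} <-> target {j} (score {score:.3f})")
```

In a real pipeline the toy arrays would be replaced by embeddings of the sentences in the expanded comparable corpora, and the threshold would need tuning on held-out data.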
