Expand-Extract: A Parallel Corpus Mining Framework from Comparable Corpora for English-Myanmar Machine Translation
May Myo Zin, Teeradaj Racharak, Minh Le Nguyen
2022 IEEE 34th International Conference on Tools with Artificial Intelligence (ICTAI), published 2022-10-01
DOI: 10.1109/ICTAI56018.2022.00045 (https://doi.org/10.1109/ICTAI56018.2022.00045)
Citations: 0
Abstract
High-quality neural machine translation (NMT) systems depend on large-scale, reliable parallel data. Myanmar is a low-resource language, so parallel data for the English-Myanmar pair is scarce. In this paper, we present a simple yet effective framework for creating a parallel corpus from available comparable corpora. The framework first combines self-training and back-translation with a denoising-based automatic post-editing (DbAPE) system to generate synthetic data, which is used to expand the existing comparable corpora. LaBSE-based sentence embeddings and our proposed scoring function are then applied to extract parallel sentences from the expanded comparable corpora. The extracted sentences can supplement the parallel corpus used to train low-resource English-Myanmar NMT systems. We evaluate the approach by training NMT systems on the concatenation of the parallel data created by our framework and an existing dataset. The results show that the framework produces a reliable parallel corpus and that adding it substantially improves the translation quality of MT systems over training on the existing parallel data alone, as measured by automatic evaluation metrics.
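The extraction step can be illustrated with a minimal sketch. The abstract does not define the proposed scoring function, so the code below substitutes a commonly used ratio-margin score over cosine similarities combined with a mutual-best-match filter; this is an assumption for illustration, not the authors' method. The function names (`margin_scores`, `mine_pairs`), the threshold, the neighbourhood size `k`, and the toy vectors are all hypothetical. In practice the embeddings would come from a LaBSE encoder (for example, the `sentence-transformers/LaBSE` checkpoint) rather than hand-written arrays.

```python
import numpy as np


def normalize(x):
    """Scale each row to unit length so dot products equal cosine similarity."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)


def margin_scores(src_emb, tgt_emb, k=2):
    """Ratio-margin score: cosine similarity divided by the average
    similarity of each side to its k nearest neighbours.
    NOTE: a stand-in for the paper's (unspecified) scoring function.
    Assumes the neighbourhood averages are positive, as in the toy data."""
    src = normalize(np.asarray(src_emb, dtype=float))
    tgt = normalize(np.asarray(tgt_emb, dtype=float))
    sim = src @ tgt.T  # shape: (n_src, n_tgt)
    k_src = min(k, sim.shape[1])
    k_tgt = min(k, sim.shape[0])
    # Mean similarity of each source sentence to its k closest targets.
    src_nn = np.sort(sim, axis=1)[:, -k_src:].mean(axis=1)
    # Mean similarity of each target sentence to its k closest sources.
    tgt_nn = np.sort(sim, axis=0)[-k_tgt:, :].mean(axis=0)
    denom = (src_nn[:, None] + tgt_nn[None, :]) / 2.0
    return sim / denom


def mine_pairs(src_emb, tgt_emb, threshold=1.0, k=2):
    """Keep (source, target) index pairs that are each other's best match
    and whose margin score reaches the (hypothetical) threshold."""
    scores = margin_scores(src_emb, tgt_emb, k)
    best_tgt = scores.argmax(axis=1)  # best target for every source
    best_src = scores.argmax(axis=0)  # best source for every target
    pairs = []
    for i, j in enumerate(best_tgt):
        if best_src[j] == i and scores[i, j] >= threshold:
            pairs.append((i, int(j), float(scores[i, j])))
    return pairs


# Toy 3-dimensional "embeddings" standing in for LaBSE vectors.
english = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
myanmar = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0], [0.5, 0.5, 0.1]]

pairs = mine_pairs(english, myanmar)
for i, j, score in pairs:
    print(f"source {i} <-> target {j}: score {score:.3f}")
```

In this toy run the first two sentences on each side are paired, while the third source sentence has no sufficiently close target and is dropped. Dividing by the neighbourhood average penalises sentences that are moderately similar to everything, which is one reason margin-style scores are often preferred over a raw cosine threshold when mining noisy comparable corpora.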