Authors: May Myo Zin, Teeradaj Racharak, Minh Le Nguyen
Venue: 2022 IEEE 34th International Conference on Tools with Artificial Intelligence (ICTAI)
Published: 2022-10-01
DOI: https://doi.org/10.1109/ICTAI56018.2022.00045
Expand-Extract: A Parallel Corpus Mining Framework from Comparable Corpora for English-Myanmar Machine Translation
High-quality neural machine translation (NMT) systems rely on large-scale, reliable parallel data. Because Myanmar is a low-resource language, English-Myanmar parallel corpora are scarce. In this paper, we present a simple yet effective framework for creating a parallel corpus from available comparable corpora. The proposed system first uses self-training and back-translation, together with a denoising-based automatic post-editing (DbAPE) system, to generate synthetic data that expands the existing comparable corpora. Then, LaBSE-based sentence embeddings and the proposed scoring function are applied to extract parallel sentences from the expanded comparable corpora. The extracted parallel sentences can supplement the parallel corpus used to train low-resource English-Myanmar NMT systems. We evaluate the framework by training NMT systems on the concatenation of the parallel data it creates and an existing dataset. We show that the framework creates a reliable parallel corpus and that adding this corpus substantially improves translation quality over MT systems trained on the existing parallel data alone, as measured by automatic evaluation metrics.
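
To make the extraction step concrete, here is a minimal sketch of embedding-based parallel sentence mining. The abstract does not define the proposed scoring function, so this sketch substitutes a common ratio-margin score (cosine similarity divided by the average similarity to each side's nearest neighbours); it should not be read as the authors' method. The toy vectors stand in for LaBSE sentence embeddings, and the names `mine_pairs`, `k`, and `threshold` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mine_pairs(src_vecs, tgt_vecs, k=2, threshold=1.2):
    """Return (src_index, tgt_index, score) for candidate parallel pairs.

    ASSUMPTION: ratio-margin scoring is a stand-in for the paper's
    scoring function, which the abstract does not specify.
    """
    # Full similarity matrix: sim[i][j] = cos(src_i, tgt_j).
    sim = [[cosine(s, t) for t in tgt_vecs] for s in src_vecs]

    def avg_top_k(values):
        top = sorted(values, reverse=True)[:k]
        return sum(top) / len(top)

    # Average similarity of each sentence to its k nearest neighbours
    # on the other side.
    src_avg = [avg_top_k(row) for row in sim]
    tgt_avg = [avg_top_k([row[j] for row in sim]) for j in range(len(tgt_vecs))]

    pairs = []
    for i, row in enumerate(sim):
        # Best target candidate for this source sentence.
        j = max(range(len(row)), key=row.__getitem__)
        denom = (src_avg[i] + tgt_avg[j]) / 2
        score = row[j] / denom if denom > 0 else 0.0
        if score >= threshold:
            pairs.append((i, j, score))
    return pairs


# Toy embeddings (hypothetical): two clean translation pairs plus one
# source sentence that is equally close to several targets.
src = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.6, 0.6, 0.5]]
tgt = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0], [0.0, 0.0, 1.0]]
pairs = mine_pairs(src, tgt)
for i, j, score in pairs:
    print(f"src {i} <-> tgt {j}: margin score {score:.2f}")
```

The margin normalisation penalises sentences that are moderately similar to many candidates, which is why the third source sentence is filtered out even though its raw cosine similarity is not small. In practice the threshold would need tuning on held-out data.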