{"title":"中越语低资源机器翻译数据增强技术的可扩展性研究","authors":"Huan Vu, Ngoc-Dung Bui","doi":"10.1080/24751839.2023.2186625","DOIUrl":null,"url":null,"abstract":"ABSTRACT Neural Machine Translation (NMT) has constantly been shown to be a standard choice to build a translation system, in both academia and industry. For low-resource language pairs, data augmentation techniques have been widely used to tackle the data shortage problem in NMT. In this paper, we investigate the scaling behaviour of transformer-based NMT model to the increasing amount of synthetic data. Through the experiments, conducted in the Chinese-to-Vietnamese translation task, we aim to provide a guideline to the application of several methods such as back-translation, tagged back-translation, self-training and sentence concatenation in a low-resource, less-related language pair. Our results suggest that choosing the appropriate amount of synthetic data is a crucial task when building NMT systems. In addition, when combining methods, it is recommended to tag the data sources before training.","PeriodicalId":32180,"journal":{"name":"Journal of Information and Telecommunication","volume":"7 1","pages":"241 - 253"},"PeriodicalIF":2.7000,"publicationDate":"2023-03-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"On the scalability of data augmentation techniques for low-resource machine translation between Chinese and Vietnamese\",\"authors\":\"Huan Vu, Ngoc-Dung Bui\",\"doi\":\"10.1080/24751839.2023.2186625\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"ABSTRACT Neural Machine Translation (NMT) has constantly been shown to be a standard choice to build a translation system, in both academia and industry. For low-resource language pairs, data augmentation techniques have been widely used to tackle the data shortage problem in NMT. In this paper, we investigate the scaling behaviour of transformer-based NMT model to the increasing amount of synthetic data. Through the experiments, conducted in the Chinese-to-Vietnamese translation task, we aim to provide a guideline to the application of several methods such as back-translation, tagged back-translation, self-training and sentence concatenation in a low-resource, less-related language pair. Our results suggest that choosing the appropriate amount of synthetic data is a crucial task when building NMT systems. In addition, when combining methods, it is recommended to tag the data sources before training.\",\"PeriodicalId\":32180,\"journal\":{\"name\":\"Journal of Information and Telecommunication\",\"volume\":\"7 1\",\"pages\":\"241 - 253\"},\"PeriodicalIF\":2.7000,\"publicationDate\":\"2023-03-19\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Journal of Information and Telecommunication\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1080/24751839.2023.2186625\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q2\",\"JCRName\":\"COMPUTER SCIENCE, INFORMATION SYSTEMS\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of Information and Telecommunication","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1080/24751839.2023.2186625","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, INFORMATION SYSTEMS","Score":null,"Total":0}
On the scalability of data augmentation techniques for low-resource machine translation between Chinese and Vietnamese
ABSTRACT Neural Machine Translation (NMT) has constantly been shown to be a standard choice to build a translation system, in both academia and industry. For low-resource language pairs, data augmentation techniques have been widely used to tackle the data shortage problem in NMT. In this paper, we investigate the scaling behaviour of transformer-based NMT model to the increasing amount of synthetic data. Through the experiments, conducted in the Chinese-to-Vietnamese translation task, we aim to provide a guideline to the application of several methods such as back-translation, tagged back-translation, self-training and sentence concatenation in a low-resource, less-related language pair. Our results suggest that choosing the appropriate amount of synthetic data is a crucial task when building NMT systems. In addition, when combining methods, it is recommended to tag the data sources before training.