Yasmin Moslem, Rejwanul Haque, John D. Kelleher, Andy Way
{"title":"Domain-Specific Text Generation for Machine Translation","authors":"Yasmin Moslem, Rejwanul Haque, John D. Kelleher, Andy Way","doi":"10.48550/arXiv.2208.05909","DOIUrl":"https://doi.org/10.48550/arXiv.2208.05909","url":null,"abstract":"Preservation of domain knowledge from the source to target is crucial in any translation workflow. It is common in the translation industry to receive highly-specialized projects, where there is hardly any parallel in-domain data. In such scenarios where there is insufficient in-domain data to fine-tune Machine Translation (MT) models, producing translations that are consistent with the relevant context is challenging. In this work, we propose leveraging state-of-the-art pretrained language models (LMs) for domain-specific data augmentation for MT, simulating the domain characteristics of either (a) a small bilingual dataset, or (b) the monolingual source text to be translated. Combining this idea with back-translation, we can generate huge amounts of synthetic bilingual in-domain data for both use cases. For our investigation, we used the state-of-the-art MT architecture, Transformer. We employed mixed fine-tuning to train models that significantly improve translation of in-domain texts. More specifically, our proposed methods achieved improvements of approximately 5-6 BLEU and 2-3 BLEU, respectively, on Arabic-to-English and English-to-Arabic language pairs. Furthermore, the outcome of human evaluation corroborates the automatic evaluation results.","PeriodicalId":201231,"journal":{"name":"Conference of the Association for Machine Translation in the Americas","volume":"17 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-08-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114757328","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"How Effective is Byte Pair Encoding for Out-Of-Vocabulary Words in Neural Machine Translation?","authors":"Ali Araabi, Christof Monz, Vlad Niculae","doi":"10.48550/arXiv.2208.05225","DOIUrl":"https://doi.org/10.48550/arXiv.2208.05225","url":null,"abstract":"Neural Machine Translation (NMT) is an open vocabulary problem. As a result, dealing with the words not occurring during training (a.k.a. out-of-vocabulary (OOV) words) have long been a fundamental challenge for NMT systems. The predominant method to tackle this problem is Byte Pair Encoding (BPE) which splits words, including OOV words, into sub-word segments. BPE has achieved impressive results for a wide range of translation tasks in terms of automatic evaluation metrics. While it is often assumed that by using BPE, NMT systems are capable of handling OOV words, the effectiveness of BPE in translating OOV words has not been explicitly measured. In this paper, we study to what extent BPE is successful in translating OOV words at the word-level. We analyze the translation quality of OOV words based on word type, number of segments, cross-attention weights, and the frequency of segment n-grams in the training data. Our experiments show that while careful BPE settings seem to be fairly useful in translating OOV words across datasets, a considerable percentage of OOV words are translated incorrectly. Furthermore, we highlight the slightly higher effectiveness of BPE in translating OOV words for special cases, such as named-entities and when the languages involved are linguistically close to each other.","PeriodicalId":201231,"journal":{"name":"Conference of the Association for Machine Translation in the Americas","volume":"48 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-08-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131492892","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Daniel Licht, Cynthia Gao, Janice Lam, Francisco Guzmán, Mona T. Diab, Philipp Koehn
{"title":"Consistent Human Evaluation of Machine Translation across Language Pairs","authors":"Daniel Licht, Cynthia Gao, Janice Lam, Francisco Guzmán, Mona T. Diab, Philipp Koehn","doi":"10.48550/arXiv.2205.08533","DOIUrl":"https://doi.org/10.48550/arXiv.2205.08533","url":null,"abstract":"Obtaining meaningful quality scores for machine translation systems through human evaluation remains a challenge given the high variability between human evaluators, partly due to subjective expectations for translation quality for different language pairs. We propose a new metric called XSTS that is more focused on semantic equivalence and a cross-lingual calibration method that enables more consistent assessment. We demonstrate the effectiveness of these novel contributions in large scale evaluation studies across up to 14 language pairs, with translation both into and out of English.","PeriodicalId":201231,"journal":{"name":"Conference of the Association for Machine Translation in the Americas","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-05-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129088555","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Shiyue Zhang, Vishrav Chaudhary, Naman Goyal, James Cross, Guillaume Wenzek, Mohit Bansal, Francisco Guzmán
{"title":"How Robust is Neural Machine Translation to Language Imbalance in Multilingual Tokenizer Training?","authors":"Shiyue Zhang, Vishrav Chaudhary, Naman Goyal, James Cross, Guillaume Wenzek, Mohit Bansal, Francisco Guzmán","doi":"10.48550/arXiv.2204.14268","DOIUrl":"https://doi.org/10.48550/arXiv.2204.14268","url":null,"abstract":"A multilingual tokenizer is a fundamental component of multilingual neural machine translation. It is trained from a multilingual corpus. Since a skewed data distribution is considered to be harmful, a sampling strategy is usually used to balance languages in the corpus. However, few works have systematically answered how language imbalance in tokenizer training affects downstream performance. In this work, we analyze how translation performance changes as the data ratios among languages vary in the tokenizer training corpus. We find that while relatively better performance is often observed when languages are more equally sampled, the downstream performance is more robust to language imbalance than we usually expected. Two features, UNK rate and closeness to the character level, can warn of poor downstream performance before performing the task. We also distinguish language sampling for tokenizer training from sampling for model training and show that the model is more sensitive to the latter.","PeriodicalId":201231,"journal":{"name":"Conference of the Association for Machine Translation in the Americas","volume":"17 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-04-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114325994","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Felix Stahlberg, Danielle Saunders, Gonzalo Iglesias, B. Byrne
{"title":"Why not be Versatile? Applications of the SGNMT Decoder for Machine Translation","authors":"Felix Stahlberg, Danielle Saunders, Gonzalo Iglesias, B. Byrne","doi":"10.17863/CAM.25872","DOIUrl":"https://doi.org/10.17863/CAM.25872","url":null,"abstract":"SGNMT is a decoding platform for machine translation which allows paring various modern neural models of translation with different kinds of constraints and symbolic models. In this paper, we describe three use cases in which SGNMT is currently playing an active role: (1) teaching as SGNMT is being used for course work and student theses in the MPhil in Machine Learning, Speech and Language Technology at the University of Cambridge, (2) research as most of the research work of the Cambridge MT group is based on SGNMT, and (3) technology transfer as we show how SGNMT is helping to transfer research findings from the laboratory to the industry, eg. into a product of SDL plc.","PeriodicalId":201231,"journal":{"name":"Conference of the Association for Machine Translation in the Americas","volume":"64 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-03-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126403205","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Francis Bond, Seiji Okura, Yu Yamamoto, Toshiki Murata, Kiyotaka Uchimoto, Michael Kato, Miwako Shimazu, Tsugiyoshi Suzuki
{"title":"Sharing User Dictionaries Across Multiple Systems with UTX-S","authors":"Francis Bond, Seiji Okura, Yu Yamamoto, Toshiki Murata, Kiyotaka Uchimoto, Michael Kato, Miwako Shimazu, Tsugiyoshi Suzuki","doi":"10.1145/1499224.1499247","DOIUrl":"https://doi.org/10.1145/1499224.1499247","url":null,"abstract":"Careful tuning of user-created dictionaries is indispensable when using a machine translation system for computer aided translation. However, there is no widely used standard for user dictionaries in the Japanese/English machine translation market. To address this issue, AAMT (the Asia-Pacific Association for Machine Translation) has established a specification of sharable dictionaries (UTX-S: Universal Terminology eXchange -- Simple), which can be used across different machine translation systems, thus increasing the interoperability of language resources. UTX-S is simpler than existing specifications such as UPF and OLIF. It was explicitly designed to make it easy to (a) add new user dictionaries and (b) share existing user dictionaries. This facilitates rapid user dictionary production and avoids vendor tie in. In this study we describe the UTX-Simple (UTX-S) format, and show that it can be converted to the user dictionary formats for five commercial English-Japanese MT systems. We then present a case study where we (a) convert an on-line glossary to UTX-S, and (b) produce user dictionaries for five different systems, and then exchange them. The results show that the simplified format of UTX-S can be used to rapidly build dictionaries. Further, we confirm that customized user dictionaries are effective across systems, although with a slight loss in quality: on average, user dictionaries improved the translations for 44.8% of translations with the systems they were built for and 37.3% of translations for different systems. In ongoing work, AAMT is using UTX-S as the format in building up a user community for producing, sharing, and accumulating user dictionaries in a sustainable way.","PeriodicalId":201231,"journal":{"name":"Conference of the Association for Machine Translation in the Americas","volume":"102 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-02-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117132045","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A super-function based Japanese-Chinese machine translation system for business users","authors":"Xin Zhao, F. Ren, S. Voß","doi":"10.1007/978-3-540-30194-3_30","DOIUrl":"https://doi.org/10.1007/978-3-540-30194-3_30","url":null,"abstract":"","PeriodicalId":201231,"journal":{"name":"Conference of the Association for Machine Translation in the Americas","volume":"27 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-09-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117162629","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A structurally diverse minimal corpus for eliciting structural mappings between languages","authors":"Katharina Probst, A. Lavie","doi":"10.1007/978-3-540-30194-3_24","DOIUrl":"https://doi.org/10.1007/978-3-540-30194-3_24","url":null,"abstract":"","PeriodicalId":201231,"journal":{"name":"Conference of the Association for Machine Translation in the Americas","volume":"112 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-09-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125559133","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Machine translation of online product support articles using data-driven MT system","authors":"Stephen D. Richardson","doi":"10.1007/978-3-540-30194-3_27","DOIUrl":"https://doi.org/10.1007/978-3-540-30194-3_27","url":null,"abstract":"","PeriodicalId":201231,"journal":{"name":"Conference of the Association for Machine Translation in the Americas","volume":"20 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-09-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115936116","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Investigation of intelligibility judgments","authors":"F. Reeder","doi":"10.1007/978-3-540-30194-3_25","DOIUrl":"https://doi.org/10.1007/978-3-540-30194-3_25","url":null,"abstract":"","PeriodicalId":201231,"journal":{"name":"Conference of the Association for Machine Translation in the Americas","volume":"31 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-09-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134628549","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}