How Effective is Byte Pair Encoding for Out-Of-Vocabulary Words in Neural Machine Translation?

Ali Araabi, Christof Monz, Vlad Niculae
Conference of the Association for Machine Translation in the Americas. Published 2022-08-10. DOI: 10.48550/arXiv.2208.05225 (https://doi.org/10.48550/arXiv.2208.05225)
Citations: 3

Abstract

Neural Machine Translation (NMT) is an open-vocabulary problem. As a result, dealing with words that do not occur during training, known as out-of-vocabulary (OOV) words, has long been a fundamental challenge for NMT systems. The predominant method for tackling this problem is Byte Pair Encoding (BPE), which splits words, including OOV words, into sub-word segments. BPE has achieved impressive results on a wide range of translation tasks in terms of automatic evaluation metrics. While it is often assumed that NMT systems using BPE are capable of handling OOV words, the effectiveness of BPE in translating OOV words has not been explicitly measured. In this paper, we study to what extent BPE is successful in translating OOV words at the word level. We analyze the translation quality of OOV words based on word type, number of segments, cross-attention weights, and the frequency of segment n-grams in the training data. Our experiments show that while careful BPE settings seem to be fairly useful in translating OOV words across datasets, a considerable percentage of OOV words are translated incorrectly. Furthermore, we highlight the slightly higher effectiveness of BPE in translating OOV words in special cases, such as named entities and language pairs that are linguistically close to each other.
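
To make the splitting of OOV words into sub-word segments concrete, the following is a minimal sketch of BPE merge learning and segmentation. It is not the authors' experimental setup: the toy corpus, the function names (`merge_pair`, `learn_bpe`, `segment`), the `</w>` end-of-word marker, and the choice of 10 merges are all assumptions made for illustration.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)


def learn_bpe(word_freqs, num_merges):
    """Learn an ordered list of merge operations from a {word: count} dictionary."""
    # Start from characters, with an end-of-word marker (an illustrative convention).
    vocab = {tuple(word) + ("</w>",): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        # Ties are broken by first occurrence, which is an arbitrary choice here.
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        vocab = {merge_pair(symbols, best): freq for symbols, freq in vocab.items()}
    return merges


def segment(word, merges):
    """Segment a word, seen or unseen, by replaying the learned merges in order."""
    symbols = tuple(word) + ("</w>",)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


# Hypothetical toy training corpus; "lowest" never occurs in it, so it is OOV.
corpus = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges = learn_bpe(corpus, num_merges=10)
print(segment("lowest", merges))  # ['low', 'est</w>']
```

Here the unseen word "lowest" is covered by two segments that were both learned from other training words. Whether an NMT model then translates such a segment sequence correctly is the question the paper examines; the sketch only shows the segmentation step.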