Sub-word Embedding Auxiliary Encoding in Mongolian-Chinese Neural Machine Translation

Tiangang Bai, H. Hou, Yatu Ji
Published in: Proceedings of the 2020 9th International Conference on Software and Computer Applications
Publication date: 2020-02-18
DOI: 10.1145/3384544.3384565 (https://doi.org/10.1145/3384544.3384565)
Citations: 0

Abstract

For low-resource Mongolian-Chinese neural machine translation (NMT), common pre-processing methods such as byte pair encoding (BPE) and tokenization are unable to recognize Mongolian special characters, which leads to the loss of complete sentence information. In addition, the translation quality of low-frequency words is poor because of data sparsity. In this paper, we first propose a processing method for Mongolian special characters that transforms them into an explicit form to reduce pre-processing errors. Second, drawing on the morphological knowledge of Mongolian, we generate sub-word embeddings from a large-scale monolingual corpus to enrich the contextual information in the representations of low-frequency words. The experiments show that (1) Mongolian special character processing can minimize semantic loss, (2) systems with sub-word embeddings from a large-scale monolingual corpus can effectively capture the semantic information of low-frequency words, and (3) the proposed approaches improve on the baselines by 1-2 BLEU points.
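
The abstract does not say which characters are treated as special or what the explicit form looks like. As a rough illustration of the general idea only, the sketch below assumes the special characters are the invisible Unicode format characters common in Mongolian script (free variation selectors, the Mongolian vowel separator, the narrow no-break space, and the zero-width joiners) and maps each one to a visible placeholder token before tokenization or BPE. The placeholder names and the functions `make_explicit` and `restore` are hypothetical and are not taken from the paper.

```python
import re

# Assumed set of "special" characters: invisible Unicode format characters
# used in Mongolian script. The placeholder tokens are hypothetical; the
# paper's actual mapping is not given in the abstract.
EXPLICIT_FORMS = {
    "\u180b": "<FVS1>",   # Mongolian free variation selector one
    "\u180c": "<FVS2>",   # Mongolian free variation selector two
    "\u180d": "<FVS3>",   # Mongolian free variation selector three
    "\u180e": "<MVS>",    # Mongolian vowel separator
    "\u202f": "<NNBSP>",  # narrow no-break space
    "\u200c": "<ZWNJ>",   # zero width non-joiner
    "\u200d": "<ZWJ>",    # zero width joiner
}

# Translation table for the forward direction (character -> placeholder).
_TO_EXPLICIT = str.maketrans(EXPLICIT_FORMS)

# Reverse lookup and a pattern matching any placeholder token.
_FROM_EXPLICIT = {token: char for char, token in EXPLICIT_FORMS.items()}
_TOKEN_PATTERN = re.compile(
    "|".join(re.escape(token) for token in _FROM_EXPLICIT)
)


def make_explicit(text: str) -> str:
    """Replace each invisible special character with a visible placeholder."""
    return text.translate(_TO_EXPLICIT)


def restore(text: str) -> str:
    """Map placeholder tokens back to the original characters."""
    return _TOKEN_PATTERN.sub(lambda m: _FROM_EXPLICIT[m.group(0)], text)


if __name__ == "__main__":
    # Two Mongolian letters separated by a variation selector and an NNBSP.
    sample = "\u1820\u180b\u202f\u1821"
    explicit = make_explicit(sample)
    print(explicit)
    print(restore(explicit) == sample)
```

Making the characters visible means a downstream tokenizer that strips or ignores format characters no longer silently drops them, and the mapping is reversible as long as the placeholder strings do not otherwise occur in the corpus.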