蒙汉神经机器翻译中的子词嵌入辅助编码

Proceedings of the 2020 9th International Conference on Software and Computer Applications Pub Date : 2020-02-18 DOI:10.1145/3384544.3384565

Tiangang Bai, H. Hou, Yatu Ji

{"title":"蒙汉神经机器翻译中的子词嵌入辅助编码","authors":"Tiangang Bai, H. Hou, Yatu Ji","doi":"10.1145/3384544.3384565","DOIUrl":null,"url":null,"abstract":"For low-resource Mongolian-Chinese neural machine translation (NMT), the common pre-processing methods such as byte pair encoding (BPE) and tokenization, are unable to recognize Mongolian special character, which leads to the loss of complete sentence information. The translation quality of low-frequency words is undesirable due to the problem of data sparsity. In this paper, we firstly propose a process method for Mongolian special character, which can transform the Mongolian special characters into explicit form to decrease the pre-processing error. Secondly, according to the morphological knowledge of Mongolian, we generate the sub-word embedding with large scale monolingual corpus to enhance the contextual information of the representation of low-frequency words. The experiments show that 1) Mongolian special character processing can minimize the semantic loss, 2) systems with sub-word embedding from large scale monolingual corpus can capture the semantic information of low-frequency words effectively 3) the proposed approaches can improve 1-2 BLEU points above the baselines.","PeriodicalId":200246,"journal":{"name":"Proceedings of the 2020 9th International Conference on Software and Computer Applications","volume":"2 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2020-02-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Sub-word Embedding Auxiliary Encoding in Mongolian-Chinese Neural Machine Translation\",\"authors\":\"Tiangang Bai, H. Hou, Yatu Ji\",\"doi\":\"10.1145/3384544.3384565\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"For low-resource Mongolian-Chinese neural machine translation (NMT), the common pre-processing methods such as byte pair encoding (BPE) and tokenization, are unable to recognize Mongolian special character, which leads to the loss of complete sentence information. The translation quality of low-frequency words is undesirable due to the problem of data sparsity. In this paper, we firstly propose a process method for Mongolian special character, which can transform the Mongolian special characters into explicit form to decrease the pre-processing error. Secondly, according to the morphological knowledge of Mongolian, we generate the sub-word embedding with large scale monolingual corpus to enhance the contextual information of the representation of low-frequency words. The experiments show that 1) Mongolian special character processing can minimize the semantic loss, 2) systems with sub-word embedding from large scale monolingual corpus can capture the semantic information of low-frequency words effectively 3) the proposed approaches can improve 1-2 BLEU points above the baselines.\",\"PeriodicalId\":200246,\"journal\":{\"name\":\"Proceedings of the 2020 9th International Conference on Software and Computer Applications\",\"volume\":\"2 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2020-02-18\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the 2020 9th International Conference on Software and Computer Applications\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/3384544.3384565\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 2020 9th International Conference on Software and Computer Applications","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3384544.3384565","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 0

摘要

对于低资源的蒙汉神经机器翻译，常用的预处理方法如字节对编码(BPE)和标记化无法识别蒙文特殊字符，导致完整句子信息的丢失。由于数据稀疏性问题，低频词的翻译质量不理想。本文首先提出了一种蒙古语特殊字符的处理方法，将蒙古语特殊字符转换为显式形式，减少了预处理误差。其次，根据蒙古语的形态学知识，生成大规模单语语料库的子词嵌入，增强低频词表示的语境信息;实验表明:(1)蒙古语特殊字符处理可以最大限度地减少语义损失;(2)从大规模单语语料库中嵌入子词的系统可以有效地捕获低频词的语义信息;(3)所提出的方法可以在基线上提高1-2个BLEU点。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

Sub-word Embedding Auxiliary Encoding in Mongolian-Chinese Neural Machine Translation

For low-resource Mongolian-Chinese neural machine translation (NMT), the common pre-processing methods such as byte pair encoding (BPE) and tokenization, are unable to recognize Mongolian special character, which leads to the loss of complete sentence information. The translation quality of low-frequency words is undesirable due to the problem of data sparsity. In this paper, we firstly propose a process method for Mongolian special character, which can transform the Mongolian special characters into explicit form to decrease the pre-processing error. Secondly, according to the morphological knowledge of Mongolian, we generate the sub-word embedding with large scale monolingual corpus to enhance the contextual information of the representation of low-frequency words. The experiments show that 1) Mongolian special character processing can minimize the semantic loss, 2) systems with sub-word embedding from large scale monolingual corpus can capture the semantic information of low-frequency words effectively 3) the proposed approaches can improve 1-2 BLEU points above the baselines.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

Proceedings of the 2020 9th International Conference on Software and Computer Applications

自引率

0.00%

发文量