Authors: Tiangang Bai, H. Hou, Yatu Ji
DOI: 10.1145/3384544.3384565
Published: 2020-02-18, Proceedings of the 2020 9th International Conference on Software and Computer Applications
Sub-word Embedding Auxiliary Encoding in Mongolian-Chinese Neural Machine Translation
In low-resource Mongolian-Chinese neural machine translation (NMT), common pre-processing methods such as byte pair encoding (BPE) and tokenization are unable to recognize Mongolian special characters, which leads to a loss of sentence information. The translation quality of low-frequency words is also poor due to data sparsity. In this paper, we first propose a processing method for Mongolian special characters that transforms them into an explicit form to reduce pre-processing errors. Second, drawing on the morphological knowledge of Mongolian, we generate sub-word embeddings from a large-scale monolingual corpus to enrich the contextual information in the representations of low-frequency words. The experiments show that 1) Mongolian special character processing minimizes semantic loss, 2) systems with sub-word embeddings trained on a large-scale monolingual corpus capture the semantic information of low-frequency words effectively, and 3) the proposed approaches improve on the baselines by 1-2 BLEU points.
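The pre-processing pipeline described above builds on byte pair encoding. As a rough illustration of how BPE derives sub-word units from corpus frequency statistics, here is a minimal sketch (toy English word counts, not the authors' Mongolian pipeline or any particular BPE implementation):

```python
from collections import Counter

def learn_bpe_merges(word_freqs, num_merges):
    """Learn BPE merge rules from a {word: frequency} dict.

    Each word is represented as a tuple of symbols, starting from
    single characters; the most frequent adjacent pair is merged
    into one symbol per iteration.
    """
    vocab = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Re-segment every word, fusing occurrences of the best pair.
        new_vocab = {}
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] = freq
        vocab = new_vocab
    return merges, vocab

merges, vocab = learn_bpe_merges({"low": 5, "lower": 2, "lowest": 3}, num_merges=2)
```

Because plain BPE operates on raw character sequences like this, Mongolian control characters (which have no visible form) are either dropped or merged arbitrarily, motivating the paper's explicit-form transformation before segmentation.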