日中字符级神经机器翻译的字符分解

2019 International Conference on Asian Language Processing (IALP) Pub Date : 2019-11-01 DOI:10.1109/IALP48816.2019.9037677

Jinyi Zhang, Tadahiro Matsumoto

{"title":"日中字符级神经机器翻译的字符分解","authors":"Jinyi Zhang, Tadahiro Matsumoto","doi":"10.1109/IALP48816.2019.9037677","DOIUrl":null,"url":null,"abstract":"After years of development, Neural Machine Translation (NMT) has produced richer translation results than ever over various language pairs, becoming a new machine translation model with great potential. For the NMT model, it can only translate words/characters contained in the training data. One problem on NMT is handling of the low-frequency words/characters in the training data. In this paper, we propose a method for removing characters whose frequencies of appearance are less than a given minimum threshold by decomposing such characters into their components and/or pseudo-characters, using the Chinese character decomposition table we made. Experiments of Japanese-to-Chinese and Chinese-to-Japanese NMT with ASPEC-JC (Asian Scientific Paper Excerpt Corpus, Japanese-Chinese) corpus show that the BLEU scores, the training time and the number of parameters are varied with the number of the given minimum thresholds of decomposed characters.","PeriodicalId":208066,"journal":{"name":"2019 International Conference on Asian Language Processing (IALP)","volume":"98 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2019-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"2","resultStr":"{\"title\":\"Character Decomposition for Japanese-Chinese Character-Level Neural Machine Translation\",\"authors\":\"Jinyi Zhang, Tadahiro Matsumoto\",\"doi\":\"10.1109/IALP48816.2019.9037677\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"After years of development, Neural Machine Translation (NMT) has produced richer translation results than ever over various language pairs, becoming a new machine translation model with great potential. For the NMT model, it can only translate words/characters contained in the training data. One problem on NMT is handling of the low-frequency words/characters in the training data. In this paper, we propose a method for removing characters whose frequencies of appearance are less than a given minimum threshold by decomposing such characters into their components and/or pseudo-characters, using the Chinese character decomposition table we made. Experiments of Japanese-to-Chinese and Chinese-to-Japanese NMT with ASPEC-JC (Asian Scientific Paper Excerpt Corpus, Japanese-Chinese) corpus show that the BLEU scores, the training time and the number of parameters are varied with the number of the given minimum thresholds of decomposed characters.\",\"PeriodicalId\":208066,\"journal\":{\"name\":\"2019 International Conference on Asian Language Processing (IALP)\",\"volume\":\"98 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2019-11-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"2\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2019 International Conference on Asian Language Processing (IALP)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/IALP48816.2019.9037677\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2019 International Conference on Asian Language Processing (IALP)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/IALP48816.2019.9037677","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 2

摘要

经过多年的发展，神经机器翻译(NMT)在各种语言对上的翻译结果比以往更加丰富，成为一种具有巨大潜力的新型机器翻译模型。对于NMT模型，它只能翻译训练数据中包含的单词/字符。NMT中的一个问题是训练数据中低频词/字符的处理。本文提出了一种去除出现频率小于给定最小阈值的汉字的方法，该方法使用我们制作的汉字分解表，将这些汉字分解为其组成和/或伪字符。用ASPEC-JC (Asian Scientific Paper摘录Corpus, Japanese-Chinese)语料库对日文-汉文和中文-日文NMT进行的实验表明，BLEU分数、训练时间和参数数量随给定的分解字符最小阈值的个数而变化。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

Character Decomposition for Japanese-Chinese Character-Level Neural Machine Translation

After years of development, Neural Machine Translation (NMT) has produced richer translation results than ever over various language pairs, becoming a new machine translation model with great potential. For the NMT model, it can only translate words/characters contained in the training data. One problem on NMT is handling of the low-frequency words/characters in the training data. In this paper, we propose a method for removing characters whose frequencies of appearance are less than a given minimum threshold by decomposing such characters into their components and/or pseudo-characters, using the Chinese character decomposition table we made. Experiments of Japanese-to-Chinese and Chinese-to-Japanese NMT with ASPEC-JC (Asian Scientific Paper Excerpt Corpus, Japanese-Chinese) corpus show that the BLEU scores, the training time and the number of parameters are varied with the number of the given minimum thresholds of decomposed characters.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

2019 International Conference on Asian Language Processing (IALP)

自引率

0.00%

发文量