统计机器翻译的语言独立分词

Michael Paul, A. Finch, E. Sumita
{"title":"统计机器翻译的语言独立分词","authors":"Michael Paul, A. Finch, E. Sumita","doi":"10.1145/1667780.1667788","DOIUrl":null,"url":null,"abstract":"This paper proposes an unsupervised word segmentation algorithm that identifies word boundaries in continuous text in order to optimize the translation quality of statistical machine translation (SMT) approaches. The proposed method is language-independent and uses a parallel corpus to align source language characters to the corresponding word units separated by whitespace in the target language. Successive characters aligned to the same target words are merged to a larger source language unit and a Maximum Entropy (ME) algorithm is applied to learn the word segmentation that optimizes the translation quality of an SMT system trained on the re-segmented bitext. Experimental results translating five Asian languages into English revealed that the proposed method outperforms a baseline system that translates unigram segmented source language sentences.","PeriodicalId":103128,"journal":{"name":"Proceedings of the 3rd International Universal Communication Symposium","volume":"35 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2009-12-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"2","resultStr":"{\"title\":\"Language independent word segmentation for statistical machine translation\",\"authors\":\"Michael Paul, A. Finch, E. Sumita\",\"doi\":\"10.1145/1667780.1667788\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"This paper proposes an unsupervised word segmentation algorithm that identifies word boundaries in continuous text in order to optimize the translation quality of statistical machine translation (SMT) approaches. The proposed method is language-independent and uses a parallel corpus to align source language characters to the corresponding word units separated by whitespace in the target language. Successive characters aligned to the same target words are merged to a larger source language unit and a Maximum Entropy (ME) algorithm is applied to learn the word segmentation that optimizes the translation quality of an SMT system trained on the re-segmented bitext. Experimental results translating five Asian languages into English revealed that the proposed method outperforms a baseline system that translates unigram segmented source language sentences.\",\"PeriodicalId\":103128,\"journal\":{\"name\":\"Proceedings of the 3rd International Universal Communication Symposium\",\"volume\":\"35 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2009-12-03\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"2\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the 3rd International Universal Communication Symposium\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/1667780.1667788\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 3rd International Universal Communication Symposium","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/1667780.1667788","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 2

摘要

为了优化统计机器翻译(SMT)方法的翻译质量,提出了一种识别连续文本中的词边界的无监督分词算法。该方法与语言无关,使用平行语料库将源语言字符与目标语言中由空格分隔的相应单词单位对齐。将与相同目标词对齐的连续字符合并到更大的源语言单元中,并应用最大熵(Maximum Entropy, ME)算法学习分词,以优化在重新分割的文本上训练的SMT系统的翻译质量。将五种亚洲语言翻译成英语的实验结果表明,该方法优于翻译单图分割源语言句子的基线系统。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
Language independent word segmentation for statistical machine translation
This paper proposes an unsupervised word segmentation algorithm that identifies word boundaries in continuous text in order to optimize the translation quality of statistical machine translation (SMT) approaches. The proposed method is language-independent and uses a parallel corpus to align source language characters to the corresponding word units separated by whitespace in the target language. Successive characters aligned to the same target words are merged to a larger source language unit and a Maximum Entropy (ME) algorithm is applied to learn the word segmentation that optimizes the translation quality of an SMT system trained on the re-segmented bitext. Experimental results translating five Asian languages into English revealed that the proposed method outperforms a baseline system that translates unigram segmented source language sentences.
求助全文
通过发布文献求助,成功后即可免费获取论文全文。 去求助
来源期刊
自引率
0.00%
发文量
0
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:604180095
Book学术官方微信