Learning better transliterations

Jeff Pasternack, D. Roth
{"title":"Learning better transliterations","authors":"Jeff Pasternack, D. Roth","doi":"10.1145/1645953.1645978","DOIUrl":null,"url":null,"abstract":"We introduce a new probabilistic model for transliteration that performs significantly better than previous approaches, is language-agnostic, requiring no knowledge of the source or target languages, and is capable of both generation (creating the most likely transliteration of a source word) and discovery (selecting the most likely transliteration from a list of candidate words). Our experimental results demonstrate improved accuracy over the existing state-of-the-art by more than 10% in Chinese, Hebrew and Russian. While past work has commonly made use of fixed-size n-gram features along with more traditional models such as HMM or Perceptron, we utilize an intuitive notion of \"productions\", where each source word can be segmented into a series of contiguous, non-overlapping substrings of any size, each of which independently transliterates to a substring in the target language with a given probability. To learn these parameters, we employ Expectation-Maximization (EM), with the alignment between substrings in the source and target word training pairs as our latent data. Despite the size of the parameter space and the 2(|w|-1) possible segmentations to consider for each word, by using dynamic programming each iteration of EM takes O(m^6 * n) time, where m is the length of the longest word in the data and n is the number of word pairs, and is very fast in practice. Furthermore, discovering transliterations takes only O(m^4 * w) time, where w is the number of candidate words to choose from, and generating a transliteration takes O(m2 * k2) time, where k is a pruning constant (we used a value of 100). Additionally, we are able to obtain training examples in an unsupervised fashion from Wikipedia by using a relatively simple algorithm to filter potential word pairs.","PeriodicalId":286251,"journal":{"name":"Proceedings of the 18th ACM conference on Information and knowledge management","volume":"29 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2009-11-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"16","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 18th ACM conference on Information and knowledge management","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/1645953.1645978","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 16

Abstract

We introduce a new probabilistic model for transliteration that performs significantly better than previous approaches, is language-agnostic, requiring no knowledge of the source or target languages, and is capable of both generation (creating the most likely transliteration of a source word) and discovery (selecting the most likely transliteration from a list of candidate words). Our experimental results demonstrate improved accuracy over the existing state-of-the-art by more than 10% in Chinese, Hebrew and Russian. While past work has commonly made use of fixed-size n-gram features along with more traditional models such as HMM or Perceptron, we utilize an intuitive notion of "productions", where each source word can be segmented into a series of contiguous, non-overlapping substrings of any size, each of which independently transliterates to a substring in the target language with a given probability. To learn these parameters, we employ Expectation-Maximization (EM), with the alignment between substrings in the source and target word training pairs as our latent data. Despite the size of the parameter space and the 2^(|w|-1) possible segmentations to consider for each word, by using dynamic programming each iteration of EM takes O(m^6 * n) time, where m is the length of the longest word in the data and n is the number of word pairs, and is very fast in practice. Furthermore, discovering transliterations takes only O(m^4 * w) time, where w is the number of candidate words to choose from, and generating a transliteration takes O(m^2 * k^2) time, where k is a pruning constant (we used a value of 100). Additionally, we are able to obtain training examples in an unsupervised fashion from Wikipedia by using a relatively simple algorithm to filter potential word pairs.
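To make the production model concrete, the sketch below computes the probability that a source word transliterates to a given target word by summing, over all joint segmentations into contiguous, non-overlapping substrings, the product of independent production probabilities. This marginal is the kind of quantity the EM procedure described in the abstract needs during its E-step; the production table `prod_prob`, the substring-length cap `max_len`, and the toy probabilities are hypothetical illustrations, not values or code from the paper.

```python
# Illustrative sketch only (not the authors' implementation): a dynamic program that
# sums, over all joint segmentations of a (source, target) pair, the product of
# independent production probabilities P(target_substring | source_substring).
from functools import lru_cache

def pair_probability(source: str, target: str, prod_prob: dict, max_len: int = 3) -> float:
    """Marginal probability of the pair under the production model.

    prod_prob maps (source_substring, target_substring) -> probability;
    unseen pairs count as probability 0. A target substring may be empty,
    which lets a source substring produce nothing in the target.
    """

    @lru_cache(maxsize=None)
    def dp(i: int, j: int) -> float:
        # Probability of generating target[j:] from source[i:].
        if i == len(source):
            return 1.0 if j == len(target) else 0.0
        total = 0.0
        for si in range(i + 1, min(i + max_len, len(source)) + 1):   # next source substring
            for tj in range(j, min(j + max_len, len(target)) + 1):   # next target substring (may be empty)
                p = prod_prob.get((source[i:si], target[j:tj]), 0.0)
                if p > 0.0:
                    total += p * dp(si, tj)
        return total

    return dp(0, 0)

# Toy example with a made-up production table: "phi" -> "fi" can be generated
# either as "ph"->"f" then "i"->"i", or in one step as "phi"->"fi", so the
# marginal is 0.9 * 0.8 + 0.1 = 0.82.
probs = {("ph", "f"): 0.9, ("i", "i"): 0.8, ("phi", "fi"): 0.1}
print(pair_probability("phi", "fi", probs))  # ~0.82 (up to floating-point rounding)
```

With memoization over the O(m^2) states (i, j) and at most max_len^2 productions examined per state, one such evaluation is polynomial in the word length, which is in the same spirit as the per-iteration dynamic-programming costs quoted in the abstract.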