{"title":"Learning better transliterations","authors":"Jeff Pasternack, D. Roth","doi":"10.1145/1645953.1645978","DOIUrl":null,"url":null,"abstract":"We introduce a new probabilistic model for transliteration that performs significantly better than previous approaches, is language-agnostic, requiring no knowledge of the source or target languages, and is capable of both generation (creating the most likely transliteration of a source word) and discovery (selecting the most likely transliteration from a list of candidate words). Our experimental results demonstrate improved accuracy over the existing state-of-the-art by more than 10% in Chinese, Hebrew and Russian. While past work has commonly made use of fixed-size n-gram features along with more traditional models such as HMM or Perceptron, we utilize an intuitive notion of \"productions\", where each source word can be segmented into a series of contiguous, non-overlapping substrings of any size, each of which independently transliterates to a substring in the target language with a given probability. To learn these parameters, we employ Expectation-Maximization (EM), with the alignment between substrings in the source and target word training pairs as our latent data. Despite the size of the parameter space and the 2(|w|-1) possible segmentations to consider for each word, by using dynamic programming each iteration of EM takes O(m^6 * n) time, where m is the length of the longest word in the data and n is the number of word pairs, and is very fast in practice. Furthermore, discovering transliterations takes only O(m^4 * w) time, where w is the number of candidate words to choose from, and generating a transliteration takes O(m2 * k2) time, where k is a pruning constant (we used a value of 100). Additionally, we are able to obtain training examples in an unsupervised fashion from Wikipedia by using a relatively simple algorithm to filter potential word pairs.","PeriodicalId":286251,"journal":{"name":"Proceedings of the 18th ACM conference on Information and knowledge management","volume":"29 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2009-11-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"16","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 18th ACM conference on Information and knowledge management","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/1645953.1645978","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 16
Abstract
We introduce a new probabilistic model for transliteration that performs significantly better than previous approaches, is language-agnostic (requiring no knowledge of the source or target languages), and is capable of both generation (creating the most likely transliteration of a source word) and discovery (selecting the most likely transliteration from a list of candidate words). Our experimental results demonstrate improved accuracy over the existing state-of-the-art by more than 10% in Chinese, Hebrew and Russian. While past work has commonly made use of fixed-size n-gram features along with more traditional models such as HMM or Perceptron, we utilize an intuitive notion of "productions", where each source word can be segmented into a series of contiguous, non-overlapping substrings of any size, each of which independently transliterates to a substring in the target language with a given probability. To learn these parameters, we employ Expectation-Maximization (EM), with the alignment between substrings in the source and target word training pairs as our latent data. Despite the size of the parameter space and the 2^(|w|-1) possible segmentations to consider for each word, by using dynamic programming each iteration of EM takes O(m^6 * n) time, where m is the length of the longest word in the data and n is the number of word pairs, and is very fast in practice. Furthermore, discovering transliterations takes only O(m^4 * w) time, where w is the number of candidate words to choose from, and generating a transliteration takes O(m^2 * k^2) time, where k is a pruning constant (we used a value of 100). Additionally, we are able to obtain training examples in an unsupervised fashion from Wikipedia by using a relatively simple algorithm to filter potential word pairs.
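To make the production-based scoring concrete, here is a minimal Python sketch of the kind of dynamic program the abstract describes: given a table of production probabilities (assumed here to have already been estimated via EM), it scores a candidate (source, target) pair by summing, over all joint segmentations into aligned contiguous substrings, the product of the production probabilities used. The names (`pair_score`, `prod_prob`, `MAX_SUB`) and the toy probability values are illustrative assumptions, not the authors' implementation.

```python
from functools import lru_cache

MAX_SUB = 15  # cap on substring length; an assumed bound for tractability


def pair_score(source: str, target: str, prod_prob: dict) -> float:
    """Sum over joint segmentations of the product of production probabilities.

    prod_prob maps (source_substring, target_substring) -> probability.
    """

    @lru_cache(maxsize=None)
    def score(i: int, j: int) -> float:
        # Both words fully consumed: one complete segmentation found.
        if i == len(source) and j == len(target):
            return 1.0
        total = 0.0
        # Try every source substring starting at i paired with every target
        # substring starting at j (empty-substring productions omitted here).
        for di in range(1, min(MAX_SUB, len(source) - i) + 1):
            for dj in range(1, min(MAX_SUB, len(target) - j) + 1):
                p = prod_prob.get((source[i:i + di], target[j:j + dj]), 0.0)
                if p > 0.0:
                    total += p * score(i + di, j + dj)
        return total

    return score(0, 0)


if __name__ == "__main__":
    # Toy production table (hypothetical values) for an English -> Russian pair.
    probs = {
        ("pa", "па"): 0.5,
        ("ster", "стер"): 0.4,
        ("nack", "нак"): 0.6,
        ("paster", "пастер"): 0.1,
    }
    # Sums over both segmentations: pa|ster|nack and paster|nack.
    print(pair_score("pasternack", "пастернак", probs))
```

Under this sketch, discovery amounts to computing such a score for each of the w candidate target words and selecting the argmax, which is consistent with the O(m^4 * w) discovery cost quoted in the abstract.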