Computational Phraseology light: automatic translation of multiword expressions without translation resources
R. Mitkov
DOI: 10.1515/phras-2016-0008 (https://doi.org/10.1515/phras-2016-0008)
Published: 2016-10-01 (Journal Article)
Citations: 6
Abstract
This paper describes the first phase of a project whose ultimate goal is a practical tool that supports language learners and translators by automatically identifying multiword expressions (MWEs) and retrieving their translations for any pair of languages. Translating MWEs is viewed as a two-stage process: the first stage extracts MWEs in each of the languages; the second stage matches the extracted MWEs across the languages and proposes translation equivalents.

The project pursues a knowledge-poor approach, applicable to any pair of languages, that does not depend on translation resources such as dictionaries, translation memories or parallel corpora, which can be time-consuming to develop or difficult to acquire because they are expensive or proprietary. In line with this philosophy, the methodology does not rely on any dictionaries or parallel corpora, nor does it use any (bilingual) grammars. The only information comes from inexpensively compiled comparable corpora.

The first, proof-of-concept stage of the project covers English and Spanish and focuses on a particular subclass of MWEs: verb-noun expressions (collocations) such as take advantage, make sense, prestar atención and tener derecho. Newswire was chosen as the genre because it is widespread and available in different languages. A further motivation was that the methodology was designed to be language-independent, with the objective of applying it to and testing it on different languages. The ACCURAT toolkit (Pinnis et al. 2012; Skadina et al. 2012; Su and Babych 2012a) was used to compile the comparable corpora automatically, and only documents above a specific comparability threshold were considered for inclusion. More specifically, only pairs of English and Spanish documents with a comparability score (cosine similarity) higher than 0.45 were extracted. Section 6, however, discusses experiments with different comparability scores.

Statistical association measures were used to quantify the strength of the relationship between two words, and a verb-noun combination scoring above a specific threshold was proposed as a candidate multiword expression. The study compared four popular and established measures, along with frequency: Log-likelihood ratio, T-Score, Log Dice and Salience.

The project follows the distributional similarity premise, which stipulates that translation equivalents share common words in their contexts, and that this also applies to multiword expressions. The Vector Space Model is traditionally used to represent words by their co-occurrences and to measure similarity: the vector for a word is constructed from the statistics of that word's occurrences with other specific (context) words in a corpus of texts. In this study, the word2vec method (Mikolov et al. 2013) was employed. Mikolov et al.'s method uses patterns of word co-occurrence within a small window to predict similarities among words.

Evaluation results are reported for both MWE extraction and automatic translation. A notable finding of the evaluation is that the size of the comparable corpora matters more for the performance of automatic MWE translation than the similarity between them, provided the corpora used meet a minimal level of similarity.
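Three of the association measures named in the abstract have widely used textbook definitions, sketched below for verb-noun pairs. This is an illustration under stated assumptions, not the paper's implementation: the function names, the toy counts and the threshold value are hypothetical, counts are taken over extracted verb-noun pairs, and Salience is omitted because its exact formulation is not given here.

```python
import math
from collections import Counter


def t_score(f_vn, f_v, f_n, total):
    """T-score: (observed - expected) / sqrt(observed)."""
    expected = f_v * f_n / total
    return (f_vn - expected) / math.sqrt(f_vn)


def log_dice(f_vn, f_v, f_n):
    """logDice: 14 + log2 of the Dice coefficient 2*f(v,n) / (f(v) + f(n))."""
    return 14 + math.log2(2 * f_vn / (f_v + f_n))


def log_likelihood(f_vn, f_v, f_n, total):
    """Log-likelihood ratio (G^2) over the 2x2 contingency table of the pair."""
    observed = [
        f_vn,                       # verb with noun
        f_v - f_vn,                 # verb without noun
        f_n - f_vn,                 # noun without verb
        total - f_v - f_n + f_vn,   # neither
    ]
    rows = [f_v, total - f_v]
    cols = [f_n, total - f_n]
    expected = [rows[i] * cols[j] / total for i in range(2) for j in range(2)]
    # Cells with zero observations contribute nothing to the sum.
    return 2 * sum(o * math.log(o / e) for o, e in zip(observed, expected) if o > 0)


def rank_candidates(pair_counts, threshold):
    """Score each (verb, noun) pair with logDice; keep pairs above the threshold.

    `threshold` is a hypothetical cut-off; no value is stated in the abstract.
    """
    verb_counts, noun_counts = Counter(), Counter()
    for (verb, noun), count in pair_counts.items():
        verb_counts[verb] += count
        noun_counts[noun] += count
    scored = [
        (pair, log_dice(count, verb_counts[pair[0]], noun_counts[pair[1]]))
        for pair, count in pair_counts.items()
    ]
    kept = [item for item in scored if item[1] > threshold]
    return sorted(kept, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Toy counts, invented purely for illustration.
    counts = Counter({
        ("take", "advantage"): 8,
        ("take", "book"): 1,
        ("read", "book"): 6,
        ("read", "advantage"): 1,
    })
    for pair, score in rank_candidates(counts, threshold=12.0):
        print(pair, round(score, 2))
```

Pairs that survive the threshold would be the candidate MWEs passed to the matching stage. Which measure and cut-off work best is an empirical question that the paper's evaluation addresses.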