No Free Lunch: Brute Force vs. Locality-Sensitive Hashing for Cross-lingual Pairwise Similarity

Ferhan Ture, T. Elsayed, Jimmy J. Lin
{"title":"No Free Lunch: Brute Force vs. Locality-Sensitive Hashing for Cross-lingual Pairwise Similarity","authors":"Ferhan Ture, T. Elsayed, Jimmy J. Lin","doi":"10.1145/2009916.2010042","DOIUrl":null,"url":null,"abstract":"This work explores the problem of cross-lingual pairwise similarity, where the task is to extract similar pairs of documents across two different languages. Solutions to this problem are of general interest for text mining in the multilingual context and have specific applications in statistical machine translation. Our approach takes advantage of cross-language information retrieval (CLIR) techniques to project feature vectors from one language into another, and then uses locality-sensitive hashing (LSH) to extract similar pairs. We show that effective cross-lingual pairwise similarity requires working with similarity thresholds that are much lower than in typical monolingual applications, making the problem quite challenging. We present a parallel, scalable MapReduce implementation of the sort-based sliding window algorithm, which is compared to a brute-force approach on German and English Wikipedia collections. Our central finding can be summarized as“no free lunch”: there is no single optimal solution. Instead, we characterize effectivenessefficiency tradeoffs in the solution space, which can guide the developer to locate a desirable operating point based on applicationand resource-specific constraints.","PeriodicalId":356580,"journal":{"name":"Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval","volume":"9 2 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2011-07-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"46","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/2009916.2010042","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 46

Abstract

This work explores the problem of cross-lingual pairwise similarity, where the task is to extract similar pairs of documents across two different languages. Solutions to this problem are of general interest for text mining in the multilingual context and have specific applications in statistical machine translation. Our approach takes advantage of cross-language information retrieval (CLIR) techniques to project feature vectors from one language into another, and then uses locality-sensitive hashing (LSH) to extract similar pairs. We show that effective cross-lingual pairwise similarity requires working with similarity thresholds that are much lower than in typical monolingual applications, making the problem quite challenging. We present a parallel, scalable MapReduce implementation of the sort-based sliding window algorithm, which is compared to a brute-force approach on German and English Wikipedia collections. Our central finding can be summarized as“no free lunch”: there is no single optimal solution. Instead, we characterize effectivenessefficiency tradeoffs in the solution space, which can guide the developer to locate a desirable operating point based on applicationand resource-specific constraints.
没有免费的午餐:蛮力vs.地域敏感哈希跨语言两两相似性
这项工作探讨了跨语言配对相似的问题,其中的任务是在两种不同的语言中提取相似的文档对。这个问题的解决方案对于多语言上下文中的文本挖掘具有普遍的意义,并且在统计机器翻译中具有特定的应用。该方法利用跨语言信息检索(CLIR)技术将特征向量从一种语言投影到另一种语言,然后使用位置敏感散列(LSH)提取相似对。我们表明,有效的跨语言配对相似性需要使用远低于典型单语言应用程序的相似性阈值,这使得问题相当具有挑战性。我们提出了一个并行的、可扩展的基于排序的滑动窗口算法的MapReduce实现,并将其与德语和英语维基百科集合的暴力破解方法进行了比较。我们的核心发现可以概括为“天下没有免费的午餐”:没有单一的最优解决方案。相反,我们描述了解决方案空间中的有效性和效率权衡,这可以指导开发人员根据应用程序和资源特定的约束找到理想的操作点。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 求助全文
来源期刊
自引率
0.00%
发文量
0
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:604180095
Book学术官方微信