Improving LSH via tensorized random projection

IF 0.5 · Zone 4 Computer Science · JCR Q4 COMPUTER SCIENCE, INFORMATION SYSTEMS

Acta Informatica Pub Date : 2025-02-04 DOI:10.1007/s00236-025-00479-x

Bhisham Dev Verma, Rameshwar Pratap


Citations: 0

Abstract



Locality-sensitive hashing (LSH) is a fundamental algorithmic toolkit used by data scientists for approximate nearest-neighbour search, and it has been used extensively in large-scale data processing applications such as near-duplicate detection, nearest-neighbour search, and clustering. In this work, we propose faster and more space-efficient locality-sensitive hash functions for Euclidean distance and cosine similarity on tensor data. The naive approach to LSH for tensor data first reshapes the tensor into a vector and then applies an existing LSH method for vector data. However, this approach becomes impractical for higher-order tensors because the size of the reshaped vector grows exponentially with the order of the tensor, and so does the size of the LSH parameters. To address this problem, we suggest two LSH methods for each measure, building on CP and tensor train (TT) decomposition techniques: CP-E2LSH and TT-E2LSH for Euclidean distance, and CP-SRP and TT-SRP for cosine similarity. Our approaches are space-efficient and can be applied efficiently to low-rank CP or TT tensors. We provide a rigorous theoretical analysis of the correctness and efficacy of our proposals.
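The space argument in the abstract can be illustrated with a small sketch. The code below is not the authors' construction; it is a minimal illustration, assuming a Gaussian projection tensor stored in CP format, a 1/sqrt(rank) scaling for the Euclidean variant, and hypothetical helper names (`naive_params`, `cp_params`, `tt_params`, `random_cp`, `cp_inner`, `cp_srp_bit`, `cp_e2lsh_bucket`). It compares parameter counts per hash function and computes a sign-random-projection bit and an E2LSH-style bucket for a CP-format input without ever reshaping it into a vector.

```python
import math
import random

def naive_params(dims):
    """Entries in one dense projection vector after reshaping the tensor."""
    return math.prod(dims)

def cp_params(dims, rank):
    """Entries in one CP-format projection tensor (rank vectors per mode)."""
    return rank * sum(dims)

def tt_params(dims, rank):
    """Entries in one TT-format projection tensor with uniform TT-rank."""
    ranks = [1] + [rank] * (len(dims) - 1) + [1]
    return sum(ranks[n] * d * ranks[n + 1] for n, d in enumerate(dims))

def random_cp(dims, rank, rng):
    """CP factors with i.i.d. N(0, 1) entries: factors[mode][component][index]."""
    return [[[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(rank)]
            for d in dims]

def cp_inner(x, p):
    """Inner product of two CP tensors computed from their factors only.

    <X, P> = sum over component pairs (r, s) of prod over modes <x_n^r, p_n^s>.
    """
    total = 0.0
    for r in range(len(x[0])):
        for s in range(len(p[0])):
            term = 1.0
            for xn, pn in zip(x, p):
                term *= sum(a * b for a, b in zip(xn[r], pn[s]))
            total += term
    return total

def cp_srp_bit(x, p):
    """Sign-random-projection bit (cosine similarity) for a CP-format input."""
    return 1 if cp_inner(x, p) >= 0 else 0

def cp_e2lsh_bucket(x, p, b, w):
    """E2LSH-style bucket floor((<P, X> + b) / w) with an assumed 1/sqrt(rank) scaling."""
    scaled = cp_inner(x, p) / math.sqrt(len(p[0]))
    return math.floor((scaled + b) / w)

if __name__ == "__main__":
    order10 = (10,) * 10
    print("naive:", naive_params(order10))      # exponential in the order
    print("CP, rank 4:", cp_params(order10, 4))  # linear in the order
    print("TT, rank 4:", tt_params(order10, 4))

    rng = random.Random(0)
    dims, w = (4, 4, 4), 4.0
    x = random_cp(dims, 2, rng)                  # low-rank input tensor
    p = random_cp(dims, 3, rng)                  # structured random projection
    print("SRP bit:", cp_srp_bit(x, p))
    print("E2LSH bucket:", cp_e2lsh_bucket(x, p, rng.uniform(0.0, w), w))
```

The cost of `cp_inner` depends on the two CP ranks and the sum of the mode sizes rather than their product, which is why a low-rank input can be hashed without reshaping. A TT-format variant would follow the same pattern with core contractions in place of the per-mode dot products.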


Source journal

Acta Informatica · Engineering & Technology, Computer Science: Information Systems

CiteScore

2.40

Self-citation rate

16.70%

Articles published

Review time

>12 weeks

About the journal: Acta Informatica provides international dissemination of articles on formal methods for the design and analysis of programs, computing systems and information structures, as well as related fields of Theoretical Computer Science such as Automata Theory, Logic in Computer Science, and Algorithmics. Topics of interest include:
• semantics of programming languages
• models and modeling languages for concurrent, distributed, reactive and mobile systems
• models and modeling languages for timed, hybrid and probabilistic systems
• specification, program analysis and verification
• model checking and theorem proving
• modal, temporal, first- and higher-order logics, and their variants
• constraint logic, SAT/SMT-solving techniques
• theoretical aspects of databases, semi-structured data and finite model theory
• theoretical aspects of artificial intelligence, knowledge representation, description logic
• automata theory, formal languages, term and graph rewriting
• game-based models, synthesis
• type theory, typed calculi
• algebraic, coalgebraic and categorical methods
• formal aspects of performance, dependability and reliability analysis
• foundations of information and network security
• parallel, distributed and randomized algorithms
• design and analysis of algorithms
• foundations of network and communication protocols