Elpis: Graph-Based Similarity Search for Scalable Data Science

Proc. VLDB Endow. Pub Date: 2023-02-01 DOI: 10.14778/3583140.3583166

Ilias Azizi, Karima Echihabi, Themis Palpanas

Journal article, pp. 1548-1559. Source: Semantic Scholar.

Citations: 10

Abstract

The recent popularity of learned embeddings has fueled the growth of massive collections of high-dimensional (high-d) vectors that model complex data. Finding similar vectors in these collections is at the core of many important and practical data science applications. The data series community has developed tree-based similarity search techniques that outperform state-of-the-art methods on large collections of both data series and generic high-d vectors in all scenarios except one: approximate search with no guarantees (ng-approximate search), where graph-based approaches designed by the high-d vector community achieve the best performance. However, building graph-based indexes is extremely expensive in both time and space. In this paper, we bring these two worlds together, study the corresponding solutions and their performance behavior, and propose ELPIS, a new strong baseline that takes advantage of the best features of both to achieve superior performance in indexing and in-memory ng-approximate search. ELPIS builds the index 3x-8x faster than competitors, using 40% less memory. It also achieves a high recall of 0.99 up to 2x faster than state-of-the-art methods, and answers 1-NN queries up to one order of magnitude faster.
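To make the recall figure concrete, the following is a minimal sketch of how recall is commonly measured for an approximate k-NN answer: the answer is compared against the exact neighbors found by a brute-force scan. It is not the ELPIS algorithm, and nothing in it comes from the paper's code. The function names (`exact_knn`, `recall_at_k`), the synthetic Gaussian data, and the "scan a random half" stand-in for an approximate index are all assumptions made for illustration.

```python
import math
import random


def exact_knn(data, query, k):
    """Return indices of the k vectors in `data` closest to `query` (Euclidean)."""
    order = sorted(range(len(data)), key=lambda i: math.dist(data[i], query))
    return order[:k]


def recall_at_k(approx_ids, exact_ids):
    """Fraction of the true k nearest neighbors present in the approximate answer."""
    return len(set(approx_ids) & set(exact_ids)) / len(exact_ids)


random.seed(0)
dim, n, k = 16, 1000, 10
data = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n)]
query = [random.gauss(0, 1) for _ in range(dim)]

# Ground truth from an exhaustive scan of the whole collection.
truth = exact_knn(data, query, k)

# Hypothetical stand-in for an approximate index: scan only a random half of
# the collection. This is NOT how ELPIS or any graph-based index searches; it
# only yields an answer with no quality guarantee so recall can be measured.
sample = random.sample(range(n), n // 2)
approx = sorted(sample, key=lambda i: math.dist(data[i], query))[:k]

print(f"recall@{k} = {recall_at_k(approx, truth):.2f}")
```

Under this definition, a recall of 0.99 means that, averaged over queries, 99% of the true nearest neighbors appear in the returned answer; for 1-NN queries (k = 1) the per-query recall is either 0 or 1.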

