面向可扩展信息检索的密集向量的无训练稀疏表示

IF 3.4 2区计算机科学 Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS

Information Systems Pub Date : 2025-05-13 DOI:10.1016/j.is.2025.102567

Fabio Carrara, Lucia Vadicamo, Giuseppe Amato, Claudio Gennaro

{"title":"面向可扩展信息检索的密集向量的无训练稀疏表示","authors":"Fabio Carrara, Lucia Vadicamo, Giuseppe Amato, Claudio Gennaro","doi":"10.1016/j.is.2025.102567","DOIUrl":null,"url":null,"abstract":"<div><div>In this paper, we propose and analyze Vec2Doc, a novel training-free method to transform dense vectors into sparse integer vectors, facilitating the use of inverted indexes for information retrieval (IR). The exponential growth of deep learning and artificial intelligence has revolutionized scientific problem-solving in areas such as computer vision, natural language processing, and automatic content generation. These advances have also significantly impacted IR, with a better understanding of natural language and multimodal content analysis leading to more accurate information retrieval. Despite these developments, modern IR relies primarily on the similarity evaluation of dense vectors from the latent spaces of deep neural networks. This dependence introduces substantial challenges in performing similarity searches on large collections containing billions of vectors. Traditional IR methods, which employ inverted indexes and vector space models, are adept at handling sparse vectors but do not work well with dense ones. Vec2Doc attempts to fill this gap by converting dense vectors into a format compatible with conventional inverted index techniques. Our preliminary experimental evaluations show that Vec2Doc is a promising solution to overcome the scalability problems inherent in vector-based IR, offering an alternative method for efficient and accurate large-scale information retrieval.</div></div>","PeriodicalId":50363,"journal":{"name":"Information Systems","volume":"133 ","pages":"Article 102567"},"PeriodicalIF":3.4000,"publicationDate":"2025-05-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Training-free sparse representations of dense vectors for scalable information retrieval\",\"authors\":\"Fabio Carrara, Lucia Vadicamo, Giuseppe Amato, Claudio Gennaro\",\"doi\":\"10.1016/j.is.2025.102567\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><div>In this paper, we propose and analyze Vec2Doc, a novel training-free method to transform dense vectors into sparse integer vectors, facilitating the use of inverted indexes for information retrieval (IR). The exponential growth of deep learning and artificial intelligence has revolutionized scientific problem-solving in areas such as computer vision, natural language processing, and automatic content generation. These advances have also significantly impacted IR, with a better understanding of natural language and multimodal content analysis leading to more accurate information retrieval. Despite these developments, modern IR relies primarily on the similarity evaluation of dense vectors from the latent spaces of deep neural networks. This dependence introduces substantial challenges in performing similarity searches on large collections containing billions of vectors. Traditional IR methods, which employ inverted indexes and vector space models, are adept at handling sparse vectors but do not work well with dense ones. Vec2Doc attempts to fill this gap by converting dense vectors into a format compatible with conventional inverted index techniques. Our preliminary experimental evaluations show that Vec2Doc is a promising solution to overcome the scalability problems inherent in vector-based IR, offering an alternative method for efficient and accurate large-scale information retrieval.</div></div>\",\"PeriodicalId\":50363,\"journal\":{\"name\":\"Information Systems\",\"volume\":\"133 \",\"pages\":\"Article 102567\"},\"PeriodicalIF\":3.4000,\"publicationDate\":\"2025-05-13\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Information Systems\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S0306437925000511\",\"RegionNum\":2,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q2\",\"JCRName\":\"COMPUTER SCIENCE, INFORMATION SYSTEMS\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Information Systems","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0306437925000511","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, INFORMATION SYSTEMS","Score":null,"Total":0}

引用次数: 0

摘要

本文提出并分析了一种新的Vec2Doc方法，该方法可以将密集向量转换为稀疏整数向量，便于倒排索引在信息检索（IR）中的应用。深度学习和人工智能的指数级增长彻底改变了计算机视觉、自然语言处理和自动内容生成等领域的科学问题解决方式。这些进步也显著影响了信息检索，更好地理解自然语言和多模态内容分析导致更准确的信息检索。尽管有这些发展，现代红外主要依赖于深度神经网络潜在空间中密集向量的相似性评估。这种依赖性给在包含数十亿个向量的大型集合上执行相似性搜索带来了实质性的挑战。传统的红外方法采用倒排索引和向量空间模型，擅长处理稀疏向量，但处理密集向量效果不佳。Vec2Doc试图通过将密集向量转换为与传统倒排索引技术兼容的格式来填补这一空白。我们的初步实验评估表明，Vec2Doc是一种很有前途的解决方案，可以克服基于向量的IR固有的可扩展性问题，为高效、准确的大规模信息检索提供了一种替代方法。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

Training-free sparse representations of dense vectors for scalable information retrieval

查看原文本刊更多论文

Training-free sparse representations of dense vectors for scalable information retrieval

In this paper, we propose and analyze Vec2Doc, a novel training-free method to transform dense vectors into sparse integer vectors, facilitating the use of inverted indexes for information retrieval (IR). The exponential growth of deep learning and artificial intelligence has revolutionized scientific problem-solving in areas such as computer vision, natural language processing, and automatic content generation. These advances have also significantly impacted IR, with a better understanding of natural language and multimodal content analysis leading to more accurate information retrieval. Despite these developments, modern IR relies primarily on the similarity evaluation of dense vectors from the latent spaces of deep neural networks. This dependence introduces substantial challenges in performing similarity searches on large collections containing billions of vectors. Traditional IR methods, which employ inverted indexes and vector space models, are adept at handling sparse vectors but do not work well with dense ones. Vec2Doc attempts to fill this gap by converting dense vectors into a format compatible with conventional inverted index techniques. Our preliminary experimental evaluations show that Vec2Doc is a promising solution to overcome the scalability problems inherent in vector-based IR, offering an alternative method for efficient and accurate large-scale information retrieval.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

Information Systems 工程技术-计算机：信息系统

CiteScore

9.40

自引率

2.70%

发文量

112

审稿时长

53 days

期刊介绍： Information systems are the software and hardware systems that support data-intensive applications. The journal Information Systems publishes articles concerning the design and implementation of languages, data models, process models, algorithms, software and hardware for information systems. Subject areas include data management issues as presented in the principal international database conferences (e.g., ACM SIGMOD/PODS, VLDB, ICDE and ICDT/EDBT) as well as data-related issues from the fields of data mining/machine learning, information retrieval coordinated with structured data, internet and cloud data management, business process management, web semantics, visual and audio information systems, scientific computing, and data science. Implementation papers having to do with massively parallel data management, fault tolerance in practice, and special purpose hardware for data-intensive systems are also welcome. Manuscripts from application domains, such as urban informatics, social and natural science, and Internet of Things, are also welcome. All papers should highlight innovative solutions to data management problems such as new data models, performance enhancements, and show how those innovations contribute to the goals of the application.