Dense Text Retrieval based on Pretrained Language Models: A Survey

IF 9.1 | Zone 2, Computer Science | Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS

ACM Transactions on Information Systems Pub Date : 2023-12-18 DOI:10.1145/3637870

Wayne Xin Zhao, Jing Liu, Ruiyang Ren, Ji-Rong Wen

Citations: 0

Abstract

Text retrieval is a long-standing research topic in information seeking, where a system is required to return relevant information resources in response to users' natural-language queries. From heuristic-based retrieval methods to learning-based ranking functions, the underlying retrieval models have evolved continually with ongoing technical innovation. To design effective retrieval models, a key question is how to learn text representations and model relevance matching. The recent success of pretrained language models (PLMs) sheds light on developing more capable text retrieval approaches by leveraging their excellent modeling capacity. With powerful PLMs, we can effectively learn the semantic representations of queries and texts in a latent representation space, and further construct a semantic matching function between the dense vectors for relevance modeling. Such a retrieval approach is called dense retrieval, since it employs dense vectors to represent the texts. Considering the rapid progress on dense retrieval, this survey systematically reviews recent work on PLM-based dense retrieval. Unlike previous surveys on dense retrieval, we take a new perspective and organize the related studies by four major aspects (architecture, training, indexing, and integration), thoroughly summarizing the mainstream techniques for each aspect. We extensively collect the recent advances on this topic and include 300+ reference papers. To support our survey, we create a website providing useful resources and release a code repository for dense retrieval. This survey aims to provide a comprehensive, practical reference focused on the major progress in dense text retrieval.
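As a rough illustration of the matching step the abstract describes, the sketch below ranks texts by the similarity between a query vector and text vectors. It is not taken from the survey or its code repository: the three-dimensional vectors are hand-written stand-ins for PLM embeddings, cosine similarity is only one possible matching function, and the names `cosine`, `retrieve`, `DOC_VECS` and `QUERY_VEC` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def retrieve(query_vec, doc_vecs, k=2):
    """Return the ids of the k texts whose vectors best match the query."""
    scored = [(cosine(query_vec, vec), doc_id) for doc_id, vec in doc_vecs.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc_id for _, doc_id in scored[:k]]


# Toy vectors (assumption): a real system would obtain these from a PLM encoder.
DOC_VECS = {
    "doc_cats": [0.9, 0.1, 0.0],
    "doc_dogs": [0.7, 0.3, 0.1],
    "doc_tax": [0.0, 0.1, 0.95],
}
QUERY_VEC = [0.85, 0.15, 0.05]

if __name__ == "__main__":
    print(retrieve(QUERY_VEC, DOC_VECS, k=2))
```

With these toy values the two pet-related texts rank above the unrelated one. In practice the exhaustive scan in `retrieve` would typically be replaced by an approximate nearest-neighbor index over the text vectors.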

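Source journal

ACM Transactions on Information Systems (Engineering & Technology: Computer Science, Information Systems)

CiteScore: 9.40

Self-citation rate: 14.30%

Articles published: 165

Review time: >12 weeks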

Journal description: The ACM Transactions on Information Systems (TOIS) publishes papers on information retrieval (such as search engines, recommender systems) that contain: new principled information retrieval models or algorithms with sound empirical validation; observational, experimental and/or theoretical studies yielding new insights into information retrieval or information seeking; accounts of applications of existing information retrieval techniques that shed light on the strengths and weaknesses of the techniques; formalization of new information retrieval or information seeking tasks and of methods for evaluating the performance on those tasks; development of content (text, image, speech, video, etc.) analysis methods to support information retrieval and information seeking; development of computational models of user information preferences and interaction behaviors; creation and analysis of evaluation methodologies for information retrieval and information seeking; or surveys of existing work that propose a significant synthesis. The information retrieval scope of ACM Transactions on Information Systems (TOIS) appeals to industry practitioners for its wealth of creative ideas, and to academic researchers for its descriptions of their colleagues' work.