{"title":"重新审视词袋文档表示法,利用变换器实现高效排序","authors":"David Rau, Mostafa Dehghani, Jaap Kamps","doi":"10.1145/3640460","DOIUrl":null,"url":null,"abstract":"<p>Modern transformer-based information retrieval models achieve state-of-the-art performance across various benchmarks. The self-attention of the transformer models is a powerful mechanism to contextualize terms over the whole input but quickly becomes prohibitively expensive for long input as required in document retrieval. Instead of focusing on the model itself to improve efficiency, this paper explores different bag of words document representations that encode full documents by only a fraction of their characteristic terms, allowing us to control and reduce the input length. We experiment with various models for document retrieval on MS MARCO data, as well as zero-shot document retrieval on Robust04, and show large gains in efficiency while retaining reasonable effectiveness. Inference time efficiency gains are both lowering the time and memory complexity in a controllable way, allowing for further trading off memory footprint and query latency. More generally, this line of research connects traditional IR models with neural “NLP” models and offers novel ways to explore the space between (efficient, but less effective) traditional rankers and (effective, but less efficient) neural rankers elegantly.</p>","PeriodicalId":50936,"journal":{"name":"ACM Transactions on Information Systems","volume":"60 1","pages":""},"PeriodicalIF":5.4000,"publicationDate":"2024-02-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Revisiting Bag of Words Document Representations for Efficient Ranking with Transformers\",\"authors\":\"David Rau, Mostafa Dehghani, Jaap Kamps\",\"doi\":\"10.1145/3640460\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p>Modern transformer-based information retrieval models achieve state-of-the-art performance across various benchmarks. The self-attention of the transformer models is a powerful mechanism to contextualize terms over the whole input but quickly becomes prohibitively expensive for long input as required in document retrieval. Instead of focusing on the model itself to improve efficiency, this paper explores different bag of words document representations that encode full documents by only a fraction of their characteristic terms, allowing us to control and reduce the input length. We experiment with various models for document retrieval on MS MARCO data, as well as zero-shot document retrieval on Robust04, and show large gains in efficiency while retaining reasonable effectiveness. Inference time efficiency gains are both lowering the time and memory complexity in a controllable way, allowing for further trading off memory footprint and query latency. More generally, this line of research connects traditional IR models with neural “NLP” models and offers novel ways to explore the space between (efficient, but less effective) traditional rankers and (effective, but less efficient) neural rankers elegantly.</p>\",\"PeriodicalId\":50936,\"journal\":{\"name\":\"ACM Transactions on Information Systems\",\"volume\":\"60 1\",\"pages\":\"\"},\"PeriodicalIF\":5.4000,\"publicationDate\":\"2024-02-09\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"ACM Transactions on Information Systems\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://doi.org/10.1145/3640460\",\"RegionNum\":2,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"COMPUTER SCIENCE, INFORMATION SYSTEMS\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"ACM Transactions on Information Systems","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.1145/3640460","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, INFORMATION SYSTEMS","Score":null,"Total":0}
引用次数: 0
摘要
基于转换器的现代信息检索模型在各种基准测试中都达到了最先进的性能。转换器模型的自关注是一种强大的机制,可将整个输入中的术语上下文化,但对于文档检索中所需的长输入,这种机制很快就会变得昂贵得令人望而却步。为了提高效率,本文没有把重点放在模型本身,而是探索了不同的词袋文档表示法,这些表示法只用部分特征词对完整文档进行编码,从而使我们能够控制和减少输入长度。我们在 MS MARCO 数据的文档检索以及 Robust04 的零次文档检索中尝试了各种模型,结果表明在保持合理有效性的同时,效率也得到了大幅提高。推理时间效率的提高以一种可控的方式降低了时间和内存复杂性,从而可以进一步权衡内存占用和查询延迟。更广泛地说,这项研究将传统的红外模型与神经 "NLP "模型联系起来,为探索(高效但效率较低)传统排序器和(有效但效率较低)神经排序器之间的空间提供了新的方法。
Revisiting Bag of Words Document Representations for Efficient Ranking with Transformers
Modern transformer-based information retrieval models achieve state-of-the-art performance across various benchmarks. The self-attention of the transformer models is a powerful mechanism to contextualize terms over the whole input but quickly becomes prohibitively expensive for long input as required in document retrieval. Instead of focusing on the model itself to improve efficiency, this paper explores different bag of words document representations that encode full documents by only a fraction of their characteristic terms, allowing us to control and reduce the input length. We experiment with various models for document retrieval on MS MARCO data, as well as zero-shot document retrieval on Robust04, and show large gains in efficiency while retaining reasonable effectiveness. Inference time efficiency gains are both lowering the time and memory complexity in a controllable way, allowing for further trading off memory footprint and query latency. More generally, this line of research connects traditional IR models with neural “NLP” models and offers novel ways to explore the space between (efficient, but less effective) traditional rankers and (effective, but less efficient) neural rankers elegantly.
期刊介绍:
The ACM Transactions on Information Systems (TOIS) publishes papers on information retrieval (such as search engines, recommender systems) that contain:
new principled information retrieval models or algorithms with sound empirical validation;
observational, experimental and/or theoretical studies yielding new insights into information retrieval or information seeking;
accounts of applications of existing information retrieval techniques that shed light on the strengths and weaknesses of the techniques;
formalization of new information retrieval or information seeking tasks and of methods for evaluating the performance on those tasks;
development of content (text, image, speech, video, etc) analysis methods to support information retrieval and information seeking;
development of computational models of user information preferences and interaction behaviors;
creation and analysis of evaluation methodologies for information retrieval and information seeking; or
surveys of existing work that propose a significant synthesis.
The information retrieval scope of ACM Transactions on Information Systems (TOIS) appeals to industry practitioners for its wealth of creative ideas, and to academic researchers for its descriptions of their colleagues'' work.