Dense Passage Retrieval: Architectures and Augmentation Methods

Thilina C. Rajapakse
{"title":"Dense Passage Retrieval: Architectures and Augmentation Methods","authors":"Thilina C. Rajapakse","doi":"10.1145/3539618.3591796","DOIUrl":null,"url":null,"abstract":"The dual-encoder model is a dense retrieval architecture, consisting of two encoder models, that has surpassed traditional sparse retrieval methods for open-domain retrieval [1]. But, room exists for improvement, particularly when dense retrievers are exposed to unseen passages or queries. Considering out-of-domain queries, i.e., queries originating from domains other than the one the model was trained on, the loss in accuracy may be significant. A main factor for this is the mismatch in the information available to the context encoder and the query encoder during training. Common retireval training datasets contain an overwhelming majority of passages with one query from a passage. I hypothesize that this could lead the dual-encoder model, particularly the passage encoder, to overfit to a single potential query from a given passage to the detriment of out-of-domain performance. Based on this, I seek to answer the following research question: (RQ1.1) Does training a DPR model on data containing multiple queries per passage improve the generalizability of the model? To answer RQ1.1, I build generated datasets that have multiple queries for most passages, and compare dense passage retriever models trained on these datasets against models trained on (mostly) single query per passage datasets. I show that training on passages with multiple queries leads to models that generalize better to out-of-distribution and out-of-domain test datasets [2]. Language can be considered another domain in the context of a dense retrieval. Training a dense retrieval model is especially challenging in languages other than English due to the scarcity of training data. I propose a novel training technique, clustered training, aimed at improving the retrieval quality of dense retrievers, especially in out-of-distribution and zero-shot settings. I address the following research questions: (RQ2.1)Does clustered training improve the effectiveness of multilingual DPR models on in-distribution data? (RQ2.2) Does clustered training improve the effectiveness of multilingual DPR models on out-of-distribution data from languages that it is trained on? (RQ2.2 Does clustered training improve the effectiveness of multilingual DPR models on out-of-distribution data from languages that it is trained on? (RQ2.3) Does clustered training help multilingual DPR models to generalize to new languages (zero-shot)? I show that clustered training improves the out-of-distribution and zero-shot performance of a DPR model without a clear loss in in-distribution performance using the Mr. TyDi [3] dataset. Finally, I propose a modified dual-encoder architecture that can perform both retrieval and reranking with the same model in a single forward pass. While dual encoder models can surpass traditional sparse retrieval methods, they lag behind two stage retrieval pipelines in retrieval quality. I propose a modification to the dual encoder model where a second representation is used to rerank the passages retrieved using the first representation. Here, a second stage model is not required and both representations are generated in a single forward pass from the dual encoder. I aim to answer the following research questions in this work: (RQ3.1), Can the same model be trained to effectively generate two representations intended for two uses? 
RQ3.2 Can the retrieval quality of the model be improved by simultaneously performing retrieval and reranking? (RQ3.3 What is the tradeoff between retrieval quality vs. latency and compute resource efficiency for the proposed method vs. a two stage retriever? I expect that my proposed architecture would improve the dual encoder retrieval quality without sacrificing throughput or needing more computational resources.","PeriodicalId":425056,"journal":{"name":"Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval","volume":"38 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2023-07-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3539618.3591796","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 1

Abstract

The dual-encoder model is a dense retrieval architecture, consisting of two encoder models, that has surpassed traditional sparse retrieval methods for open-domain retrieval [1]. However, there is still room for improvement, particularly when dense retrievers are exposed to unseen passages or queries. For out-of-domain queries, i.e., queries originating from domains other than the one the model was trained on, the loss in accuracy can be significant. A major factor is the mismatch in the information available to the context encoder and the query encoder during training: common retrieval training datasets consist overwhelmingly of passages paired with a single query each. I hypothesize that this could lead the dual-encoder model, particularly the passage encoder, to overfit to a single potential query for a given passage, to the detriment of out-of-domain performance. Based on this, I seek to answer the following research question: (RQ1.1) Does training a DPR model on data containing multiple queries per passage improve the generalizability of the model? To answer RQ1.1, I build generated datasets that contain multiple queries for most passages, and compare dense passage retriever models trained on these datasets against models trained on datasets with (mostly) a single query per passage. I show that training on passages with multiple queries leads to models that generalize better to out-of-distribution and out-of-domain test datasets [2].

Language can be considered another domain in the context of dense retrieval. Training a dense retrieval model is especially challenging in languages other than English due to the scarcity of training data. I propose a novel training technique, clustered training, aimed at improving the retrieval quality of dense retrievers, especially in out-of-distribution and zero-shot settings. I address the following research questions: (RQ2.1) Does clustered training improve the effectiveness of multilingual DPR models on in-distribution data? (RQ2.2) Does clustered training improve the effectiveness of multilingual DPR models on out-of-distribution data from languages that they are trained on? (RQ2.3) Does clustered training help multilingual DPR models generalize to new languages (zero-shot)? Using the Mr. TyDi [3] dataset, I show that clustered training improves the out-of-distribution and zero-shot performance of a DPR model without a clear loss in in-distribution performance.

Finally, I propose a modified dual-encoder architecture that can perform both retrieval and reranking with the same model in a single forward pass. While dual-encoder models can surpass traditional sparse retrieval methods, they lag behind two-stage retrieval pipelines in retrieval quality. I propose a modification to the dual-encoder model in which a second representation is used to rerank the passages retrieved using the first representation. A second-stage model is not required, and both representations are generated in a single forward pass of the dual encoder. I aim to answer the following research questions in this work: (RQ3.1) Can the same model be trained to effectively generate two representations intended for two different uses? (RQ3.2) Can the retrieval quality of the model be improved by simultaneously performing retrieval and reranking? (RQ3.3) What is the tradeoff in retrieval quality, latency, and compute resource efficiency between the proposed method and a two-stage retriever? I expect that the proposed architecture will improve dual-encoder retrieval quality without sacrificing throughput or requiring additional computational resources.
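
For context, a minimal sketch of the dual-encoder setup described in the abstract: two towers encode queries and passages independently, relevance is their dot product, and training uses in-batch negatives as in DPR [1]. The tiny EmbeddingBag towers below are illustrative stand-ins for the BERT encoders used in practice.

```python
import torch
import torch.nn as nn

class DualEncoder(nn.Module):
    """Two-tower retriever: a query encoder and a passage encoder map
    text (here, pre-tokenized id tensors) into a shared embedding
    space; relevance is the dot product of the two vectors."""

    def __init__(self, vocab_size: int = 30522, dim: int = 128):
        super().__init__()
        # Stand-in encoders; in DPR each tower is a full BERT model.
        self.query_encoder = nn.EmbeddingBag(vocab_size, dim)
        self.passage_encoder = nn.EmbeddingBag(vocab_size, dim)

    def forward(self, query_ids, passage_ids):
        q = self.query_encoder(query_ids)      # (batch, dim)
        p = self.passage_encoder(passage_ids)  # (batch, dim)
        return q @ p.T                         # (batch, batch) scores

model = DualEncoder()
queries = torch.randint(0, 30522, (4, 16))   # 4 toy queries, 16 tokens each
passages = torch.randint(0, 30522, (4, 64))  # the 4 matching passages
scores = model(queries, passages)
# In-batch negatives: row i's positive is passage i, all others negatives.
loss = nn.functional.cross_entropy(scores, torch.arange(4))
```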
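
The multi-query datasets behind RQ1.1 can be pictured as a simple expansion step. The sketch below assumes some query-generation model is available; `generate_queries` is a hypothetical placeholder, since the abstract does not name the generator used.

```python
# `generate_queries` is a hypothetical stand-in for a trained
# query-generation model; a real implementation would sample n
# queries from that model rather than using string templates.
def generate_queries(passage: str, n: int) -> list[str]:
    return [f"generated query {i} about: {passage[:40]}" for i in range(n)]

def build_multi_query_dataset(passages: list[str], queries_per_passage: int = 3):
    """Expand a corpus into (query, positive passage) training pairs,
    several per passage, so the passage encoder sees more than one
    plausible query for the same text."""
    pairs = []
    for passage in passages:
        for query in generate_queries(passage, queries_per_passage):
            pairs.append({"query": query, "positive_passage": passage})
    return pairs

pairs = build_multi_query_dataset(["Dense retrieval encodes text into vectors."])
```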
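
The abstract does not spell out the mechanics of clustered training. One plausible reading, sketched here purely as an assumption, is that training examples are grouped into clusters (for multilingual training, e.g., by language) and each batch is drawn from a single cluster, which changes the pool of in-batch negatives the dual encoder is trained against.

```python
import random
from collections import defaultdict

def clustered_batches(examples, cluster_key, batch_size):
    """Yield training batches drawn from one cluster at a time;
    `cluster_key` maps an example to its cluster id (for multilingual
    training this could simply be the query language)."""
    clusters = defaultdict(list)
    for ex in examples:
        clusters[cluster_key(ex)].append(ex)
    batches = []
    for members in clusters.values():
        random.shuffle(members)
        for i in range(0, len(members), batch_size):
            batches.append(members[i:i + batch_size])
    random.shuffle(batches)  # interleave clusters across the epoch
    yield from batches

# Toy usage: batch multilingual (query, passage) pairs by language.
data = [{"lang": "en", "q": "q1"}, {"lang": "te", "q": "q2"},
        {"lang": "en", "q": "q3"}, {"lang": "te", "q": "q4"}]
for batch in clustered_batches(data, lambda ex: ex["lang"], batch_size=2):
    pass  # each batch feeds one dual-encoder training step
```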
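
The single-pass retrieve-and-rerank idea behind RQ3.1-RQ3.3 can be sketched as follows: each query and passage yields two vectors from one forward pass, where the first drives coarse retrieval and the second reranks the shortlist. The dimensions and dot-product scoring below are illustrative assumptions, not details from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two vectors per passage, both produced by one encoder forward pass:
# repr1 feeds the index, repr2 reranks the shortlist.
num_passages = 10_000
passage_repr1 = rng.standard_normal((num_passages, 128))  # retrieval vectors
passage_repr2 = rng.standard_normal((num_passages, 128))  # reranking vectors

query_repr1 = rng.standard_normal(128)  # both query vectors come from the
query_repr2 = rng.standard_normal(128)  # same single forward pass as well

# Stage 1: coarse retrieval with the first representation
# (exhaustive here; a real system would use an ANN index such as FAISS).
candidates = np.argsort(-(passage_repr1 @ query_repr1))[:100]

# Stage 2: rerank only the candidates with the second representation;
# no second model and no additional forward pass is needed.
rerank_scores = passage_repr2[candidates] @ query_repr2
top10 = candidates[np.argsort(-rerank_scores)[:10]]
```

Unlike a two-stage pipeline, the reranking stage here reuses the output of the same forward pass, which is why the abstract expects no loss in throughput and no additional compute.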