Matches Made in Heaven: Toolkit and Large-Scale Datasets for Supervised Query Reformulation

Negar Arabzadeh, A. Bigdeli, Shirin Seyedsalehi, Morteza Zihayat, E. Bagheri
{"title":"Matches Made in Heaven: Toolkit and Large-Scale Datasets for Supervised Query Reformulation","authors":"Negar Arabzadeh, A. Bigdeli, Shirin Seyedsalehi, Morteza Zihayat, E. Bagheri","doi":"10.1145/3459637.3482009","DOIUrl":null,"url":null,"abstract":"Researchers have already shown that it is possible to improve retrieval effectiveness through the systematic reformulation of users' queries. Traditionally, most query reformulation techniques relied on unsupervised approaches such as query expansion through pseudo-relevance feedback. More recently and with the increasing effectiveness of neural sequence-to-sequence architectures, the problem of query reformulation has been studied as a supervised query translation problem, which learns to rewrite a query into a more effective alternative. While quite effective in practice, such supervised query reformulation methods require a large number of training instances. In this paper, we present three large-scale query reformulation datasets, namely Diamond, Platinum and Gold datasets, based on the queries in the MS MARCO dataset. The Diamond dataset consists of over 188,000 query pairs where the original source query is matched with an alternative query that has a perfect retrieval effectiveness (an average precision of 1). To the best of our knowledge, this is the first set of datasets for supervised query reformulation that offers perfect query reformulations for a large number of queries. The implementation of our fully automated tool, which is based on a transformer architecture, and our three datasets are made publicly available. We also establish a neural query reformulation baseline performance on our datasets by reporting the performance of strong neural query reformulation baselines. It is our belief that our datasets will significantly impact the development of supervised query reformulation methods in the future.","PeriodicalId":405296,"journal":{"name":"Proceedings of the 30th ACM International Conference on Information & Knowledge Management","volume":"85 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2021-10-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"6","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 30th ACM International Conference on Information & Knowledge Management","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3459637.3482009","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 6

Abstract

Researchers have already shown that retrieval effectiveness can be improved through the systematic reformulation of users' queries. Traditionally, most query reformulation techniques relied on unsupervised approaches such as query expansion through pseudo-relevance feedback. More recently, with the increasing effectiveness of neural sequence-to-sequence architectures, query reformulation has been studied as a supervised query translation problem that learns to rewrite a query into a more effective alternative. While quite effective in practice, such supervised query reformulation methods require a large number of training instances. In this paper, we present three large-scale query reformulation datasets, namely the Diamond, Platinum, and Gold datasets, based on the queries in the MS MARCO dataset. The Diamond dataset consists of over 188,000 query pairs in which the original source query is matched with an alternative query that has perfect retrieval effectiveness (an average precision of 1). To the best of our knowledge, this is the first set of datasets for supervised query reformulation that offers perfect query reformulations for a large number of queries. The implementation of our fully automated tool, which is based on a transformer architecture, and our three datasets are made publicly available. We also establish baseline performance on our datasets by reporting the results of strong neural query reformulation methods. It is our belief that our datasets will significantly impact the development of supervised query reformulation methods in the future.
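To make the Diamond selection criterion concrete, the sketch below keeps a (source query, candidate reformulation) pair only when the reformulation retrieves every relevant document above all non-relevant ones, i.e., reaches an average precision of 1. This is a minimal illustration of the idea described in the abstract, not the paper's actual pipeline: the function names and the run/qrels data structures are hypothetical stand-ins, and the released toolkit automates the full process with a transformer-based architecture.

```python
from typing import Dict, List, Tuple


def average_precision(ranked_doc_ids: List[str], relevant_doc_ids: set) -> float:
    """Standard average precision for a single ranked list of document ids."""
    if not relevant_doc_ids:
        return 0.0
    hits, precision_sum = 0, 0.0
    for rank, doc_id in enumerate(ranked_doc_ids, start=1):
        if doc_id in relevant_doc_ids:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant_doc_ids)


def build_diamond_pairs(
    candidates: Dict[str, List[Tuple[str, List[str]]]],
    qrels: Dict[str, set],
) -> List[Tuple[str, str]]:
    """Keep (source query, reformulation) pairs whose reformulation reaches AP = 1.

    `candidates` maps a source query to a list of (candidate reformulation,
    ranked doc ids retrieved for that candidate); `qrels` maps the source query
    to its relevant doc ids. Both structures are hypothetical stand-ins.
    """
    pairs = []
    for source_query, reformulations in candidates.items():
        relevant = qrels.get(source_query, set())
        for candidate_query, ranked_doc_ids in reformulations:
            if average_precision(ranked_doc_ids, relevant) == 1.0:
                pairs.append((source_query, candidate_query))
    return pairs
```

Relaxing the strict equality to a lower AP threshold would trade label quality for coverage, yielding larger but noisier sets of training pairs.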