Generalized Weak Supervision for Neural Information Retrieval

IF 5.4 | Zone 2 (Computer Science) | Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS

ACM Transactions on Information Systems Pub Date : 2024-02-21 DOI:10.1145/3647639

Yen-Chieh Lien, Hamed Zamani, W. Bruce Croft



Abstract

Neural ranking models (NRMs) have demonstrated effective performance in several information retrieval (IR) tasks. However, training NRMs often requires large-scale training data, which is difficult and expensive to obtain. To address this issue, one can train NRMs via weak supervision, where a large dataset is automatically generated using an existing ranking model (called the weak labeler) for training NRMs. Weakly supervised NRMs can generalize from the observed data and significantly outperform the weak labeler. This paper generalizes this idea through an iterative re-labeling process, demonstrating that weakly supervised models can iteratively play the role of weak labeler and significantly improve ranking performance without using manually labeled data. The proposed Generalized Weak Supervision (GWS) solution is generic and orthogonal to the ranking model architecture. This paper offers four implementations of GWS: self-labeling, cross-labeling, joint cross- and self-labeling, and greedy multi-labeling. GWS also benefits from a query importance weighting mechanism based on query performance prediction methods to reduce noise in the generated training data. We further draw a theoretical connection between self-labeling and Expectation-Maximization. Our experiments on four retrieval benchmarks suggest that our implementations of GWS lead to substantial improvements compared to weak supervision if the weak labeler is sufficiently reliable.
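The iterative re-labeling idea can be illustrated with a toy self-labeling loop. The following is a minimal sketch under stated assumptions, not the authors' implementation: the linear ranker, the three hand-made features, the term-overlap weak labeler, and the `query_weight` proxy (a stand-in for a real query performance prediction method) are all hypothetical choices made only for illustration.

```python
import math
import random

def features(query, doc):
    """Toy query-document features: term overlap, length-normalised overlap, bias."""
    q, d = set(query.split()), doc.split()
    overlap = sum(1 for t in d if t in q)
    return [float(overlap), overlap / len(d), 1.0]

def linear_scorer(w):
    """Wrap a weight vector as a scoring function over (query, doc)."""
    return lambda query, doc: sum(wi * xi for wi, xi in zip(w, features(query, doc)))

def overlap_labeler(query, doc):
    """Assumed initial weak labeler: counts query terms that occur in the document."""
    q = set(query.split())
    return float(sum(1 for t in doc.split() if t in q))

def query_weight(scores):
    """Hypothetical QPP-style proxy: spread of the min-max normalised labeler scores.
    A flat score list carries no ranking signal, so the query gets weight 0."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return 0.0
    norm = [(s - lo) / (hi - lo) for s in scores]
    mean = sum(norm) / len(norm)
    return math.sqrt(sum((s - mean) ** 2 for s in norm) / len(norm))

def train_student(labeler, queries, docs, epochs=50, lr=0.1, seed=0):
    """Fit a linear ranker to the labeler's pairwise preferences (pairwise logistic loss)."""
    rng = random.Random(seed)
    w = [rng.uniform(-0.1, 0.1) for _ in range(3)]
    for _ in range(epochs):
        for q in queries:
            scores = [labeler(q, d) for d in docs]
            weight = query_weight(scores)
            for i in range(len(docs)):
                for j in range(len(docs)):
                    if scores[i] <= scores[j]:
                        continue  # keep only pairs where the labeler prefers docs[i]
                    diff = [a - b for a, b in zip(features(q, docs[i]), features(q, docs[j]))]
                    margin = sum(wk * dk for wk, dk in zip(w, diff))
                    # Derivative of log(1 + exp(-margin)); margin is clamped to avoid overflow.
                    grad = -1.0 / (1.0 + math.exp(min(margin, 50.0)))
                    w = [wk - lr * weight * grad * dk for wk, dk in zip(w, diff)]
    return w

def self_labeling(initial_labeler, queries, docs, rounds=3):
    """Iterative re-labeling: each trained student becomes the next round's labeler."""
    labeler, history = initial_labeler, []
    for _ in range(rounds):
        w = train_student(labeler, queries, docs)
        history.append(w)
        labeler = linear_scorer(w)
    return labeler, history

docs = [
    "neural ranking model",
    "weak supervision for ranking",
    "cooking pasta at home",
    "neural weak supervision",
]
queries = ["neural ranking", "weak supervision"]
final, history = self_labeling(overlap_labeler, queries, docs, rounds=3)
ranked = sorted(docs, key=lambda d: final("weak supervision", d), reverse=True)
```

No manual relevance labels appear anywhere in the loop: the only supervision is the score ordering produced by the current labeler, weighted per query. Whether such a loop improves over plain weak supervision depends, as the abstract notes, on the initial weak labeler being sufficiently reliable; this toy example makes no claim about that.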


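Source journal

ACM Transactions on Information Systems (Engineering & Technology, Computer Science: Information Systems)

CiteScore: 9.40

Self-citation rate: 14.30%

Articles published: 165

Review time: >12 weeks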

Journal description: The ACM Transactions on Information Systems (TOIS) publishes papers on information retrieval (such as search engines, recommender systems) that contain: new principled information retrieval models or algorithms with sound empirical validation; observational, experimental and/or theoretical studies yielding new insights into information retrieval or information seeking; accounts of applications of existing information retrieval techniques that shed light on the strengths and weaknesses of the techniques; formalization of new information retrieval or information seeking tasks and of methods for evaluating the performance on those tasks; development of content (text, image, speech, video, etc.) analysis methods to support information retrieval and information seeking; development of computational models of user information preferences and interaction behaviors; creation and analysis of evaluation methodologies for information retrieval and information seeking; or surveys of existing work that propose a significant synthesis. The information retrieval scope of ACM Transactions on Information Systems (TOIS) appeals to industry practitioners for its wealth of creative ideas, and to academic researchers for its descriptions of their colleagues' work.