Training Data Optimization for Pairwise Learning to Rank

Proceedings of the 2020 ACM SIGIR on International Conference on Theory of Information Retrieval. Pub Date: 2020-09-14. DOI: 10.1145/3409256.3409824

Hojae Han, Seung-won Hwang, Young-In Song, Siyeon Kim


Citations: 3

Abstract

This paper studies data optimization for Learning to Rank (LtR): dropping training labels to increase ranking accuracy. Our work is inspired by data dropout, which shows that some training data do not positively influence learning and are better dropped, despite the common belief that a larger training dataset is beneficial. Our main contribution is to extend this intuition to noisy- and semi-supervised LtR scenarios: some human annotations can be noisy or out-of-date, and so can machine-generated pseudo-labels in semi-supervised scenarios. Dropping such unreliable labels would benefit both scenarios. State-of-the-art methods propose the Influence Function (IF) for estimating how each training instance affects learning, and we identify and overcome two challenges specific to LtR: 1) non-convex ranking functions violate the assumptions required for the robustness of IF estimation; 2) the pairwise learning of LtR incurs quadratic estimation overhead. Our technical contributions address these challenges: first, we revise estimation and data optimization to accommodate reduced reliability; second, we devise a group-wise estimation that reduces cost while keeping accuracy high. We validate the effectiveness of our approach on a wide range of ad-hoc information retrieval benchmarks and real-life search engine datasets in both noisy- and semi-supervised scenarios.
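To make the influence-function idea concrete, the sketch below estimates how removing each training pair would change validation loss, then drops the pairs whose removal is predicted to help. It is not the authors' method. It assumes a linear scorer with a pairwise logistic loss and L2 regularization, a convex setting in which the standard IF approximation holds, whereas the paper targets non-convex rankers. The synthetic data, the names `removal_effects` and `group_effects`, the 20% label-flip rate, and the sum-per-group aggregation are all illustrative assumptions, not details taken from the paper.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def pair_grads(w, D):
    """Per-pair gradient of log(1 + exp(-w.d)); each row d = x_preferred - x_other."""
    return -sigmoid(-(D @ w))[:, None] * D


def hessian(w, D, lam):
    """Hessian of the mean pairwise logistic loss plus the L2 term (lam / 2) * |w|^2."""
    s = sigmoid(D @ w)
    return (D * (s * (1.0 - s))[:, None]).T @ D / len(D) + lam * np.eye(D.shape[1])


def fit(D, lam, steps=25):
    """Newton's method on the regularized (strictly convex) objective."""
    w = np.zeros(D.shape[1])
    for _ in range(steps):
        g = pair_grads(w, D).mean(axis=0) + lam * w
        w -= np.linalg.solve(hessian(w, D, lam), g)
    return w


def removal_effects(w, D_train, D_val, lam):
    """First-order estimate of the change in mean validation loss if each
    training pair were removed. Negative means removal is predicted to help."""
    g_val = pair_grads(w, D_val).mean(axis=0)
    v = np.linalg.solve(hessian(w, D_train, lam), g_val)  # H^{-1} g_val, solved once
    return pair_grads(w, D_train) @ v / len(D_train)


def group_effects(effects, group_ids, n_groups):
    """Hypothetical group-level score: the sum of per-pair estimates in each group."""
    return np.bincount(group_ids, weights=effects, minlength=n_groups)


# Synthetic demo (all values below are illustrative assumptions).
rng = np.random.default_rng(0)
w_true = np.array([2.0, -1.0, 0.5, 0.0, 1.0])


def make_pairs(n):
    D = rng.normal(size=(n, w_true.size))
    return D * np.sign(D @ w_true)[:, None]  # orient so the preferred item comes first


D_train, D_val = make_pairs(400), make_pairs(200)
flipped = rng.random(400) < 0.2          # simulate noisy labels by flipping preferences
D_train[flipped] *= -1.0

lam = 0.1
w = fit(D_train, lam)
effects = removal_effects(w, D_train, D_val, lam)

group_ids = rng.integers(0, 20, size=400)  # e.g. one group per query (assumed)
per_group = group_effects(effects, group_ids, 20)

keep = effects >= 0                        # drop pairs whose removal should lower loss
w_clean = fit(D_train[keep], lam)
```

Solving for `H^{-1} g_val` once and reusing it keeps the per-pair cost to a single dot product, but the number of pairs itself still grows quadratically with the number of documents per query, which is the overhead the abstract refers to.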


