On the Theory of Weak Supervision for Information Retrieval

Hamed Zamani, W. Bruce Croft
Proceedings of the 2018 ACM SIGIR International Conference on Theory of Information Retrieval (ICTIR 2018)
Published: 2018-09-10
DOI: 10.1145/3234944.3234968
Citations: 37

Abstract

Neural network approaches have recently been shown to be effective in several information retrieval (IR) tasks. However, neural approaches often require large volumes of training data to perform effectively, and such data is not always available. To mitigate the shortage of labeled data, training neural IR models with weak supervision has recently been proposed and has received considerable attention in the literature. In weak supervision, an existing model automatically generates labels for a large set of unlabeled data, and a machine learning model is then trained on the generated "weak" data. Surprisingly, prior work has shown that the trained neural model can outperform the weak labeler by a significant margin. Although these improvements have been intuitively justified in previous work, the literature still lacks a theoretical justification for the observed empirical findings. In this paper, we provide theoretical insight into weak supervision for information retrieval, focusing on learning to rank. We model the weak supervision signal as a noisy channel that introduces noise into the correct ranking. Based on the risk minimization framework, we prove that, given some sufficient constraints on the loss function, weak supervision is equivalent to supervised learning under uniform noise. We also find an upper bound on the empirical risk of weak supervision in the case of non-uniform noise. Following recent work on using multiple weak supervision signals to learn more accurate models, we derive an information-theoretic lower bound on the number of weak supervision signals required to guarantee an upper bound on the pairwise error probability. We empirically verify a set of the presented theoretical findings using synthetic and real weak supervision data.
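The abstract's central modeling idea, treating the weak labeler as a noisy channel that corrupts pairwise preferences, can be illustrated with a small simulation. The sketch below is not the authors' code; all names, dimensions, and parameters are hypothetical, and it uses a plain pairwise hinge loss rather than the symmetric loss family the paper's equivalence result actually requires. It flips each true pairwise preference with uniform probability p, trains a linear scorer on the flipped labels, and checks that the learned ranker's pairwise accuracy against the true preferences exceeds that of the weak labeler itself, mirroring the empirical finding the paper sets out to explain:

```python
import numpy as np

rng = np.random.default_rng(0)

d = 5                         # feature dimension (hypothetical)
w_true = rng.normal(size=d)   # "true" relevance weights
n_pairs = 5000
p_flip = 0.2                  # uniform pairwise noise level of the weak labeler

# Feature differences phi(d_i) - phi(d_j) for document pairs, and the
# correct preference for each pair under the true relevance function.
x_diff = rng.normal(size=(n_pairs, d))
y_true = np.sign(x_diff @ w_true)

# Noisy channel: flip each pairwise preference independently with p_flip.
flips = rng.random(n_pairs) < p_flip
y_weak = np.where(flips, -y_true, y_true)

# Train a linear scorer on the weak pairs by full-batch subgradient
# descent on a pairwise hinge loss: max(0, 1 - y * <w, x_diff>).
w = np.zeros(d)
lr = 0.01
for _ in range(200):
    margins = y_weak * (x_diff @ w)
    active = margins < 1.0            # pairs violating the margin
    if not active.any():
        break
    # Subgradient averaged over the margin-violating pairs only.
    grad = -(y_weak[active, None] * x_diff[active]).mean(axis=0)
    w -= lr * grad

# Evaluate pairwise accuracy against the TRUE preferences.
acc_model = np.mean(np.sign(x_diff @ w) == y_true)
acc_weak = np.mean(y_weak == y_true)  # weak labeler is right ~(1 - p_flip)
print(f"weak labeler: {acc_weak:.3f}  trained model: {acc_model:.3f}")
```

Because the flips are symmetric around the correct preference, the expected subgradient still points toward the true ranking direction (scaled by 1 - 2p), so the trained scorer recovers nearly the correct ranking even though one fifth of its labels are wrong.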