Addressing Non-Representative Surveys using Multiple Instance Learning

Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining. Pub Date: 2021-08-14. DOI: 10.1145/3447548.3467109

Yaniv Katz, O. Vainas

Abstract

Non-representative survey sampling and non-response bias are major obstacles to obtaining a reliable estimate of a population quantity from finite survey samples. Researchers have therefore focused on identifying methods to resolve these biases. In this paper, we look at this well-known problem from a fresh perspective and formulate it as a learning problem, which we propose to solve using a multiple instance learning (MIL) paradigm. We devise two MIL-based neural network topologies, each built on a different implementation of an attention pooling layer. These models are trained to accurately infer the population quantity of interest even when facing a biased sample. To the best of our knowledge, this is the first time MIL has been suggested as a solution to this problem. In contrast to commonly used statistical methods, this approach requires neither collecting sensitive personal data from the respondents nor accessing population-level statistics of the same sensitive data. To validate the effectiveness of our approaches, we test them on a real-world movie rating dataset, which we use to mimic a biased survey by experimentally contaminating it with different kinds of survey bias. We show that our suggested topologies outperform other MIL architectures and are able to partly counter the adverse effect of biased sampling on estimation quality. We also demonstrate how these methods can be easily adapted to perform well even when part of the survey is based on a small number of respondents.
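The abstract does not specify the two attention pooling implementations, so the following is only a minimal sketch of a generic attention pooling step as commonly used in MIL, not the authors' topologies. It treats one survey as a "bag" of respondent feature vectors and collapses it into a single vector through a softmax-weighted sum. The function name `attention_pool`, the parameters `V` and `w`, and the toy values are all assumptions made for illustration; in a trained model `V` and `w` would be learned.

```python
import math


def attention_pool(bag, V, w):
    """Pool a bag of instance vectors into one vector with attention weights.

    bag: list of n instance vectors, each of length d (hypothetical respondent features)
    V:   d x h projection matrix, given as a list of d rows (assumed, normally learned)
    w:   scoring vector of length h (assumed, normally learned)
    """
    scores = []
    for x in bag:
        # Project the instance to h hidden units and squash with tanh.
        hidden = [
            math.tanh(sum(x[i] * V[i][j] for i in range(len(x))))
            for j in range(len(w))
        ]
        # Scalar relevance score for this instance.
        scores.append(sum(wj * hj for wj, hj in zip(w, hidden)))

    # Numerically stable softmax over the instances in the bag.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]

    # Weighted sum of the instances gives the bag-level representation.
    d = len(bag[0])
    pooled = [sum(a * x[i] for a, x in zip(weights, bag)) for i in range(d)]
    return pooled, weights


# Toy bag of three two-dimensional instances with arbitrary parameters.
toy_bag = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
toy_V = [[0.1, 0.2], [0.3, 0.4]]
toy_w = [0.5, -0.5]
toy_pooled, toy_weights = attention_pool(toy_bag, toy_V, toy_w)
```

Because the weights depend on the instances themselves, such a layer can in principle learn to down-weight over-represented respondents. Whether and how the paper's models do this is not stated in the abstract.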
