Learning from untrusted data

M. Charikar, J. Steinhardt, G. Valiant
{"title":"从不可信的数据中学习","authors":"M. Charikar, J. Steinhardt, G. Valiant","doi":"10.1145/3055399.3055491","DOIUrl":null,"url":null,"abstract":"The vast majority of theoretical results in machine learning and statistics assume that the training data is a reliable reflection of the phenomena to be learned. Similarly, most learning techniques used in practice are brittle to the presence of large amounts of biased or malicious data. Motivated by this, we consider two frameworks for studying estimation, learning, and optimization in the presence of significant fractions of arbitrary data. The first framework, list-decodable learning, asks whether it is possible to return a list of answers such that at least one is accurate. For example, given a dataset of n points for which an unknown subset of αn points are drawn from a distribution of interest, and no assumptions are made about the remaining (1 - α)n points, is it possible to return a list of poly(1/α) answers? The second framework, which we term the semi-verified model, asks whether a small dataset of trusted data (drawn from the distribution in question) can be used to extract accurate information from a much larger but untrusted dataset (of which only an α-fraction is drawn from the distribution). We show strong positive results in both settings, and provide an algorithm for robust learning in a very general stochastic optimization setting. This result has immediate implications for robustly estimating the mean of distributions with bounded second moments, robustly learning mixtures of such distributions, and robustly finding planted partitions in random graphs in which significant portions of the graph have been perturbed by an adversary.","PeriodicalId":20615,"journal":{"name":"Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing","volume":"5 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2016-11-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"258","resultStr":"{\"title\":\"Learning from untrusted data\",\"authors\":\"M. Charikar, J. Steinhardt, G. Valiant\",\"doi\":\"10.1145/3055399.3055491\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"The vast majority of theoretical results in machine learning and statistics assume that the training data is a reliable reflection of the phenomena to be learned. Similarly, most learning techniques used in practice are brittle to the presence of large amounts of biased or malicious data. Motivated by this, we consider two frameworks for studying estimation, learning, and optimization in the presence of significant fractions of arbitrary data. The first framework, list-decodable learning, asks whether it is possible to return a list of answers such that at least one is accurate. For example, given a dataset of n points for which an unknown subset of αn points are drawn from a distribution of interest, and no assumptions are made about the remaining (1 - α)n points, is it possible to return a list of poly(1/α) answers? The second framework, which we term the semi-verified model, asks whether a small dataset of trusted data (drawn from the distribution in question) can be used to extract accurate information from a much larger but untrusted dataset (of which only an α-fraction is drawn from the distribution). We show strong positive results in both settings, and provide an algorithm for robust learning in a very general stochastic optimization setting. 
This result has immediate implications for robustly estimating the mean of distributions with bounded second moments, robustly learning mixtures of such distributions, and robustly finding planted partitions in random graphs in which significant portions of the graph have been perturbed by an adversary.\",\"PeriodicalId\":20615,\"journal\":{\"name\":\"Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing\",\"volume\":\"5 1\",\"pages\":\"\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2016-11-07\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"258\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/3055399.3055491\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3055399.3055491","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 258

Abstract

The vast majority of theoretical results in machine learning and statistics assume that the training data is a reliable reflection of the phenomena to be learned. Similarly, most learning techniques used in practice are brittle to the presence of large amounts of biased or malicious data. Motivated by this, we consider two frameworks for studying estimation, learning, and optimization in the presence of significant fractions of arbitrary data. The first framework, list-decodable learning, asks whether it is possible to return a list of answers such that at least one is accurate. For example, given a dataset of n points for which an unknown subset of αn points are drawn from a distribution of interest, and no assumptions are made about the remaining (1 - α)n points, is it possible to return a list of poly(1/α) answers? The second framework, which we term the semi-verified model, asks whether a small dataset of trusted data (drawn from the distribution in question) can be used to extract accurate information from a much larger but untrusted dataset (of which only an α-fraction is drawn from the distribution). We show strong positive results in both settings, and provide an algorithm for robust learning in a very general stochastic optimization setting. This result has immediate implications for robustly estimating the mean of distributions with bounded second moments, robustly learning mixtures of such distributions, and robustly finding planted partitions in random graphs in which significant portions of the graph have been perturbed by an adversary.
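To make the list-decodable setting concrete, the following is a minimal Python sketch of the problem setup only; it is not the paper's algorithm and carries none of its guarantees. All names and parameters here (n, d, alpha, the uniform outlier model, the kmeans helper) are illustrative assumptions: an alpha-fraction of points is drawn from a distribution with the true mean, the remaining (1 - alpha)-fraction is arbitrary, the plain sample mean breaks, and a naive baseline returns a short list of roughly 1/alpha candidate means, at least one of which may land near the truth.

    # Hypothetical illustration of list-decodable mean estimation (NOT the paper's method).
    import numpy as np

    rng = np.random.default_rng(0)

    n, d, alpha = 1000, 2, 0.2            # n points in d dimensions, alpha-fraction inliers
    true_mean = np.array([5.0, -3.0])

    inliers = rng.normal(true_mean, 1.0, size=(int(alpha * n), d))
    # The remaining (1 - alpha)n points are unconstrained; here they are spread
    # far from the true mean so that the plain sample mean becomes useless.
    outliers = rng.uniform(-50, 50, size=(n - int(alpha * n), d))
    data = np.vstack([inliers, outliers])

    print("plain sample mean:", data.mean(axis=0))   # badly corrupted by the outliers

    # Naive baseline: run k-means with k ~ 1/alpha clusters and return the list of
    # cluster centers. This only illustrates "return a list of poly(1/alpha) answers";
    # it offers no worst-case guarantee against adversarial points.
    def kmeans(X, k, iters=50, seed=0):
        r = np.random.default_rng(seed)
        centers = X[r.choice(len(X), k, replace=False)]
        for _ in range(iters):
            labels = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
            for j in range(k):
                if np.any(labels == j):
                    centers[j] = X[labels == j].mean(axis=0)
        return centers

    candidates = kmeans(data, k=int(np.ceil(1 / alpha)))
    errors = np.linalg.norm(candidates - true_mean, axis=1)
    print("best candidate error:", errors.min())     # ideally one center sits near true_mean

The k-means baseline can fail when the adversarial points are placed to mislead the clustering; the point of the paper's results, as stated in the abstract, is to obtain list-decoding and semi-verified guarantees that hold for arbitrary corruptions under only bounded-second-moment assumptions on the trusted portion of the data.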