Explaining the Success of Nearest Neighbor Methods in Prediction

George H. Chen, D. Shah
DOI: 10.1561/2200000064
Journal: Found. Trends Mach. Learn.
Published: 2018-05-31
Citations: 130

Abstract

Many modern methods for prediction leverage nearest neighbor search to find past training examples most similar to a test example, an idea that dates back in text to at least the 11th century and has stood the test of time. This monograph aims to explain the success of these methods, both in theory, for which we cover foundational nonasymptotic statistical guarantees on nearest-neighbor-based regression and classification, and in practice, for which we gather prominent methods for approximate nearest neighbor search that have been essential to scaling prediction systems reliant on nearest neighbor analysis to handle massive datasets. Furthermore, we discuss connections to learning distances for use with nearest neighbor methods, including how random decision trees and ensemble methods learn nearest neighbor structure, as well as recent developments in crowdsourcing and graphons.

In terms of theory, our focus is on nonasymptotic statistical guarantees, which we state in the form of how many training data and what algorithm parameters ensure that a nearest neighbor prediction method achieves a user-specified error tolerance. We begin with the most general of such results for nearest neighbor and related kernel regression and classification in general metric spaces. In such settings in which we assume very little structure, what enables successful prediction is smoothness in the function being estimated for regression, and a low probability of landing near the decision boundary for classification. In practice, these conditions could be difficult to verify empirically for a real dataset. We then cover recent theoretical guarantees on nearest neighbor prediction in the three case studies of time series forecasting, recommending products to people over time, and delineating human organs in medical images by looking at image patches. In these case studies, clustering structure, which is easier to verify in data and more readily interpretable by practitioners, enables successful prediction.
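To make the nearest neighbor prediction rules described in the abstract concrete, here is a minimal self-contained sketch of k-NN regression (average the neighbors' labels) and k-NN classification (majority vote among the neighbors). The function names and toy data are illustrative, not taken from the monograph.

```python
# Minimal sketch of k-nearest-neighbor regression and classification.
# All names and data here are hypothetical, for illustration only.
import math
from collections import Counter

def euclidean(a, b):
    # Euclidean distance between two points given as tuples of floats.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def k_nearest(train, x, k):
    # Return the k training pairs (point, label) closest to x.
    return sorted(train, key=lambda pair: euclidean(pair[0], x))[:k]

def knn_regress(train, x, k):
    # Regression: predict by averaging the labels of the k nearest neighbors.
    neighbors = k_nearest(train, x, k)
    return sum(label for _, label in neighbors) / k

def knn_classify(train, x, k):
    # Classification: predict by majority vote among the k nearest neighbors.
    neighbors = k_nearest(train, x, k)
    return Counter(label for _, label in neighbors).most_common(1)[0][0]

# Toy usage
train_reg = [((0.0,), 0.0), ((1.0,), 1.0), ((2.0,), 2.0), ((3.0,), 3.0)]
print(knn_regress(train_reg, (1.4,), k=2))  # -> 1.5 (average of labels 1.0 and 2.0)
```

The choice of k trades off variance (small k) against bias (large k); the monograph's nonasymptotic guarantees make this trade-off precise in terms of training set size and smoothness of the target function.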
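The abstract also credits approximate nearest neighbor search with making these methods scale to massive datasets. One widely used family of such methods is locality-sensitive hashing; the sketch below shows random-hyperplane hashing for cosine similarity, a standard construction. The parameters and class names are illustrative and not drawn from the monograph.

```python
# Hedged sketch of random-hyperplane locality-sensitive hashing (LSH)
# for approximate nearest neighbor search under cosine similarity.
# Illustrative only; production systems use many hash tables and probing.
import random
from collections import defaultdict

def make_hyperplanes(dim, n_bits, seed=0):
    # Draw n_bits random Gaussian hyperplanes in dim dimensions.
    rng = random.Random(seed)
    return [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_bits)]

def hash_point(planes, x):
    # Bucket key: the sign pattern of x against each random hyperplane.
    # Points with small angle between them tend to share sign patterns.
    return tuple(1 if sum(p_i * x_i for p_i, x_i in zip(p, x)) >= 0 else 0
                 for p in planes)

class LSHIndex:
    def __init__(self, dim, n_bits=8, seed=0):
        self.planes = make_hyperplanes(dim, n_bits, seed)
        self.buckets = defaultdict(list)

    def add(self, x):
        self.buckets[hash_point(self.planes, x)].append(x)

    def query(self, x):
        # Return candidate neighbors sharing x's bucket. May miss true
        # neighbors: LSH trades exactness for sublinear query time.
        return self.buckets[hash_point(self.planes, x)]

# Toy usage: identical points always share a bucket, so exact duplicates
# are always retrieved; nearby points usually are.
index = LSHIndex(dim=3, n_bits=4, seed=1)
index.add((1.0, 2.0, 3.0))
print((1.0, 2.0, 3.0) in index.query((1.0, 2.0, 3.0)))  # -> True
```

Instead of scanning all n training points per query, only the points in the matching bucket are compared, which is the essence of how approximate search keeps nearest neighbor prediction tractable at scale.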