Explaining the Success of Nearest Neighbor Methods in Prediction

George H. Chen, D. Shah
DOI: 10.1561/2200000064
Journal: Found. Trends Mach. Learn.
Published: 2018-05-31
Citations: 130

Abstract

Many modern methods for prediction leverage nearest neighbor search to find past training examples most similar to a test example, an idea that dates back in text to at least the 11th century and has stood the test of time. This monograph aims to explain the success of these methods, both in theory, for which we cover foundational nonasymptotic statistical guarantees on nearest-neighbor-based regression and classification, and in practice, for which we gather prominent methods for approximate nearest neighbor search that have been essential to scaling prediction systems reliant on nearest neighbor analysis to handle massive datasets. Furthermore, we discuss connections to learning distances for use with nearest neighbor methods, including how random decision trees and ensemble methods learn nearest neighbor structure, as well as recent developments in crowdsourcing and graphons.

In terms of theory, our focus is on nonasymptotic statistical guarantees, which we state in the form of how many training data and what algorithm parameters ensure that a nearest neighbor prediction method achieves a user-specified error tolerance. We begin with the most general of such results for nearest neighbor and related kernel regression and classification in general metric spaces. In such settings in which we assume very little structure, what enables successful prediction is smoothness in the function being estimated for regression, and a low probability of landing near the decision boundary for classification. In practice, these conditions could be difficult to verify empirically for a real dataset. We then cover recent theoretical guarantees on nearest neighbor prediction in the three case studies of time series forecasting, recommending products to people over time, and delineating human organs in medical images by looking at image patches. In these case studies, clustering structure, which is easier to verify in data and more readily interpretable by practitioners, enables successful prediction.
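To make the nearest neighbor prediction rules described in the abstract concrete, here is a minimal self-contained sketch of k-NN regression (average the neighbors' labels) and k-NN classification (majority vote among the neighbors). The function names and toy data are illustrative, not taken from the monograph.

```python
# Minimal sketch of k-nearest-neighbor regression and classification.
# All names and data here are hypothetical, for illustration only.
import math
from collections import Counter

def euclidean(a, b):
    # Euclidean distance between two points given as tuples of floats.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def k_nearest(train, x, k):
    # Return the k training pairs (point, label) closest to x.
    return sorted(train, key=lambda pair: euclidean(pair[0], x))[:k]

def knn_regress(train, x, k):
    # Regression: predict by averaging the labels of the k nearest neighbors.
    neighbors = k_nearest(train, x, k)
    return sum(label for _, label in neighbors) / k

def knn_classify(train, x, k):
    # Classification: predict by majority vote among the k nearest neighbors.
    neighbors = k_nearest(train, x, k)
    return Counter(label for _, label in neighbors).most_common(1)[0][0]

# Toy usage
train_reg = [((0.0,), 0.0), ((1.0,), 1.0), ((2.0,), 2.0), ((3.0,), 3.0)]
print(knn_regress(train_reg, (1.4,), k=2))  # -> 1.5 (average of labels 1.0 and 2.0)
```

The choice of k trades off variance (small k) against bias (large k); the monograph's nonasymptotic guarantees make this trade-off precise in terms of training set size and smoothness of the target function.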
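The abstract also credits approximate nearest neighbor search with making these methods scale to massive datasets. One widely used family of such methods is locality-sensitive hashing; the sketch below shows random-hyperplane hashing for cosine similarity, a standard construction. The parameters and class names are illustrative and not drawn from the monograph.

```python
# Hedged sketch of random-hyperplane locality-sensitive hashing (LSH)
# for approximate nearest neighbor search under cosine similarity.
# Illustrative only; production systems use many hash tables and probing.
import random
from collections import defaultdict

def make_hyperplanes(dim, n_bits, seed=0):
    # Draw n_bits random Gaussian hyperplanes in dim dimensions.
    rng = random.Random(seed)
    return [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_bits)]

def hash_point(planes, x):
    # Bucket key: the sign pattern of x against each random hyperplane.
    # Points with small angle between them tend to share sign patterns.
    return tuple(1 if sum(p_i * x_i for p_i, x_i in zip(p, x)) >= 0 else 0
                 for p in planes)

class LSHIndex:
    def __init__(self, dim, n_bits=8, seed=0):
        self.planes = make_hyperplanes(dim, n_bits, seed)
        self.buckets = defaultdict(list)

    def add(self, x):
        self.buckets[hash_point(self.planes, x)].append(x)

    def query(self, x):
        # Return candidate neighbors sharing x's bucket. May miss true
        # neighbors: LSH trades exactness for sublinear query time.
        return self.buckets[hash_point(self.planes, x)]

# Toy usage: identical points always share a bucket, so exact duplicates
# are always retrieved; nearby points usually are.
index = LSHIndex(dim=3, n_bits=4, seed=1)
index.add((1.0, 2.0, 3.0))
print((1.0, 2.0, 3.0) in index.query((1.0, 2.0, 3.0)))  # -> True
```

Instead of scanning all n training points per query, only the points in the matching bucket are compared, which is the essence of how approximate search keeps nearest neighbor prediction tractable at scale.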