The Elements of Statistical Learning: Data Mining, Inference, and Prediction

IF 3.0 · CAS Tier 1 (Mathematics) · JCR Q1 (STATISTICS & PROBABILITY)
D. Ruppert
Journal of the American Statistical Association, Vol. 99, p. 567. Published 2004-06-01. DOI: 10.1198/jasa.2004.s339
Citations: 17823

Abstract

In the words of the authors, the goal of this book was to “bring together many of the important new ideas in learning, and explain them in a statistical framework.” The authors have been quite successful in achieving this objective, and their work is a welcome addition to the statistics and learning literatures. Statistics has always been interdisciplinary, borrowing ideas from diverse fields and repaying the debt with contributions, both theoretical and practical, to the other intellectual disciplines. For statistical learning, this cross-fertilization is especially noticeable. This book is a valuable resource, both for the statistician needing an introduction to machine learning and related fields and for the computer scientist wishing to learn more about statistics. Statisticians will especially appreciate that it is written in their own language. The level of the book is roughly that of a second-year doctoral student in statistics, and it will be useful as a textbook for such students.

In a stimulating article, Breiman (2001) argued that statistics has been focused too much on a “data modeling culture,” where the model is paramount. Breiman argued instead for an “algorithmic modeling culture,” with emphasis on black-box types of prediction. Breiman’s article is controversial, and in his discussion, Efron objects that “prediction is certainly an interesting subject, but Leo’s paper overstates both its role and our profession’s lack of interest in it.” Although I mostly agree with Efron, I worry that the courses offered by most statistics departments include little, if any, treatment of statistical learning and prediction. (Stanford, where Efron and the authors of this book teach, is an exception.) Graduate students in statistics certainly need to know more than they do now about prediction, machine learning, statistical learning, and data mining (not disjoint subjects). I hope that graduate courses covering the topics of this book will become more common in statistics curricula.

Most of the book is focused on supervised learning, where one has inputs and outputs from some system and wishes to predict unknown outputs corresponding to known inputs. The methods discussed for supervised learning include linear and logistic regression; basis expansions, such as splines and wavelets; kernel techniques, such as local regression, local likelihood, and radial basis functions; neural networks; additive models; decision trees based on recursive partitioning, such as CART; and support vector machines. There is a final chapter on unsupervised learning, including association rules, cluster analysis, self-organizing maps, principal components and curves, and independent component analysis.

Many statisticians will be unfamiliar with at least some of these algorithms. Association rules are popular for mining commercial data in what is called “market basket analysis.” The aim is to discover types of products often purchased together. Such knowledge can be used to develop marketing strategies, such as store or catalog layouts. Self-organizing maps (SOMs) involve essentially constrained k-means clustering, where prototypes are mapped to a two-dimensional curved coordinate system. Independent component analysis is similar to principal component analysis and factor analysis, but it uses higher-order moments to achieve independence, not merely zero correlation between components.

A strength of the book is the attempt to organize a plethora of methods into a coherent whole. The relationships among the methods are emphasized. I know of no other book that covers so much ground. Of course, with such broad coverage, it is not possible to cover any single topic in great depth, so this book will encourage further reading. Fortunately, each chapter includes bibliographic notes surveying the recent literature. These notes and the extensive references provide a good introduction to the learning literature, including much outside of statistics. The book might be more suitable as a textbook if less material were covered in greater depth; however, such a change would compromise the book’s usefulness as a reference, and so I am happier with the book as it was written.
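The market-basket idea described in the review can be made concrete with a small sketch. The transactions, item names, and thresholds below are invented for illustration, and the brute-force pair counting merely stands in for the dedicated mining algorithms (such as Apriori) that the book covers:

```python
# Toy market-basket analysis: mine pairwise association rules by
# brute-force counting of support and confidence.
from itertools import combinations

# Hypothetical transaction data (each set is one shopping basket).
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]

def support(itemset, transactions):
    """Fraction of transactions containing every item in `itemset`."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def pairwise_rules(transactions, min_support=0.4, min_confidence=0.6):
    """Return rules (lhs, rhs, support, confidence) over single items
    that meet both thresholds."""
    items = set().union(*transactions)
    rules = []
    for a, b in combinations(sorted(items), 2):
        for lhs, rhs in ((a, b), (b, a)):
            sup = support({lhs, rhs}, transactions)
            if sup < min_support:
                continue
            conf = sup / support({lhs}, transactions)
            if conf >= min_confidence:
                rules.append((lhs, rhs, sup, conf))
    return rules

rules = pairwise_rules(transactions)
```

A rule such as beer → diapers surfaces here because the two items co-occur in most baskets, which is exactly the kind of co-purchase pattern a retailer would use when planning store or catalog layouts.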
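The reviewer's characterization of a self-organizing map as constrained k-means can likewise be sketched in a few lines. The data, grid size, learning rate, and neighborhood radius here are all invented for illustration; a 1-D prototype grid is used instead of the 2-D grids common in practice to keep the sketch short:

```python
# Minimal 1-D self-organizing map. The k-means connection: each point
# is assigned to its nearest prototype (the assignment step), but the
# update also pulls the *grid neighbors* of the winning prototype
# toward the point, so prototypes stay ordered along the grid.
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=(200, 2))           # toy 2-D observations
protos = rng.normal(size=(10, 2)) * 0.1    # 10 prototypes on a 1-D grid

def som_step(x, protos, lr=0.1, radius=1.5):
    # Winner = nearest prototype in data space.
    win = np.argmin(((protos - x) ** 2).sum(axis=1))
    # Neighborhood weights computed along the grid index, not in
    # data space; this is the "constraint" relative to k-means.
    grid = np.arange(len(protos))
    h = np.exp(-((grid - win) ** 2) / (2 * radius ** 2))
    # Move the winner and its grid neighbors toward x.
    return protos + lr * h[:, None] * (x - protos)

for _ in range(20):
    for x in data:
        protos = som_step(x, protos)
```

Setting the radius to zero makes the neighborhood weight a delta at the winner, and the update reduces to an online k-means step, which is the sense in which the review calls SOMs constrained k-means clustering.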
Source Journal
CiteScore: 7.50
Self-citation rate: 8.10%
Articles per year: 168
Review time: 12 months
About the journal: Established in 1888 and published quarterly in March, June, September, and December, the Journal of the American Statistical Association (JASA) has long been considered the premier journal of statistical science. Articles focus on statistical applications, theory, and methods in economic, social, physical, engineering, and health sciences. Important books contributing to statistical advancement are reviewed in JASA. JASA is indexed in Current Index to Statistics and MathSci Online and reviewed in Mathematical Reviews. JASA is abstracted by Access Company and is indexed and abstracted in the SRM Database of Social Research Methodology.