The 9 Pitfalls of Data Science: Latest Literature

Worshiping Computers
The 9 Pitfalls of Data Science Pub Date : 2019-07-01 DOI: 10.1093/oso/9780198844396.003.0005
Gary Smith, Jay Cordes
Abstract: Computer software, particularly deep neural networks and Monte Carlo simulations, are extremely useful for the specific tasks that they have been designed to do, and they will get even better, much better. However, we should not assume that computers are smarter than us just because they can tell us the first 2000 digits of pi or show us a street map of every city in the world. One of the paradoxical things about computers is that they can excel at things that humans consider difficult (like calculating square roots) while failing at things that humans consider easy (like recognizing stop signs). They can’t pass simple tests like the Winograd Schema Challenge because they do not understand the world the way humans do. They have neither common sense nor wisdom. They are our tools, not our masters.
Citations: 0
Doing Harm
The 9 Pitfalls of Data Science Pub Date : 2019-07-01 DOI: 10.1093/oso/9780198844396.003.0010
Gary Smith, Jason L. Cordes
Abstract: An unfortunate reality in the age of big data is Big Brother monitoring us incessantly. Big Brother is indeed watching, but it is big business as well as big government collecting detailed information about everything we do so that they can predict our actions and manipulate our behavior. Big business and big government monitor our credit cards, checking accounts, computers, and telephones, watch us on surveillance cameras, and purchase data from firms dedicated to finding out everything they can about each and every one of us. Good data scientists proceed cautiously, respectful of our rights and our privacy. The Golden Rule applies to data science: treat others as you would like to be treated.
Citations: 0
Putting Data Before Theory
The 9 Pitfalls of Data Science Pub Date : 2019-07-01 DOI: 10.1093/oso/9780198844396.003.0003
G. Smith, J. Cordes
Abstract: The traditional statistical analysis of data follows what has come to be known as the scientific method: collecting reliable data to test plausible theories. Data mining goes in the other direction, analyzing data without being motivated or encumbered by theories. The fundamental problem with data mining is simple: We think that data patterns are unusual and therefore meaningful. Patterns are, in fact, inevitable and therefore meaningless. This is why data mining is not usually knowledge discovery, but noise discovery. Finding correlations is easy. Good data scientists are not seduced by discovered patterns because they don’t put data before theory. They do not commit Texas Sharpshooter Fallacies or fall into the Feynman Trap.
Citations: 0
Confusing Correlation with Causation
The 9 Pitfalls of Data Science Pub Date : 2019-07-01 DOI: 10.1093/oso/9780198844396.003.0008
G. Smith, J. Cordes
Abstract: There is a hierarchy of predictive value that can be extracted from data. At the top of the hierarchy are causal relationships that can be confirmed with a randomized and controlled experiment or a natural experiment. Next best is to establish known or hypothesized relationships ahead of time and then test them and estimate their relative importance. One notch lower are associations found in historical data that are tested on fresh data after considering whether or not they make sense. At the bottom of the hierarchy, with little or no value, are associations found in historical data that are not confirmed by expert opinion or tested with fresh data. Data scientists who use a “correlations are enough” approach should remember that the more data and the more searches, the more likely it is that a discovered statistical relationship is coincidental and useless.
Citations: 2
Being Surprised by Regression Toward the Mean
The 9 Pitfalls of Data Science Pub Date : 2019-07-01 DOI: 10.1093/oso/9780198844396.003.0009
G. Smith, J. Cordes
Abstract: We are predisposed to discount the role of luck in our lives—to believe that successes are earned and failures deserved. We misinterpret the temporary as permanent and invent theories to explain noise. We overreact when the unexpected happens, and are too quick to make the unexpected the new expected. The key to understanding regression toward the mean is to look behind the data—to recognize that when we see something remarkable, luck was most likely involved and, so, the underlying phenomenon is not as remarkable as it seems. Not to be confused with the gambler’s fallacy where good luck is followed by bad luck, regression toward the mean states that extremely good luck is generally followed by less extreme luck. The Sports Illustrated jinx is nothing more than this. Whenever there is uncertainty, people often make flawed decisions due to an insufficient appreciation of regression toward the mean.
Citations: 0
Case Study
The 9 Pitfalls of Data Science Pub Date : 2019-07-01 DOI: 10.1093/oso/9780198844396.003.0011
G. Smith, J. Cordes
Abstract: In the 1970s banks began selling mortgages to public and private mortgage funds that sell shares to investors. In the late 1990s and early 2000s, many mortgages to “subprime” borrowers with low credit ratings and modest income were approved because banks and mortgage brokers made money by making loans and then selling them, and didn’t care if borrowers defaulted. Matters were complicated by financial engineering and compliant rating agencies. The Great Recession resulted from many people falling into several of the pitfalls of data science. They fooled themselves, they worshipped mathematics, they used bad data, they tortured data, and they did harm.
Citations: 0
Fooling Yourself
The 9 Pitfalls of Data Science Pub Date : 2019-07-01 DOI: 10.1093/oso/9780198844396.003.0007
Gary Smith, Jay Cordes
Abstract: Clowns fool themselves. Scientists don’t. Often, the easiest way to differentiate a data clown from a data scientist is to track the successes and failures of their predictions. Clowns avoid experimentation out of fear that they’re wrong, or wait until after seeing the data before revealing what they expected to find. Scientists share their theories, question their assumptions, and seek opportunities to run experiments that will verify or contradict themselves. Most new theories are not correct and will not be supported by experiments (randomized controlled trials). Scientists are comfortable with that reality and don’t try to ram a square peg in a round hole by torturing data or mangling theories. They know that science works, but only if it’s done right.
Citations: 0
Torturing Data
The 9 Pitfalls of Data Science Pub Date : 2019-07-01 DOI: 10.1093/oso/9780198844396.003.0006
Gary Smith, Jay Cordes
Abstract: Researchers seeking fame and funding may be tempted to go on fishing expeditions (p-hacking) or to torture the data to find novel, provocative results that will be picked up by the popular media. Provocative findings are provocative because they are novel and unexpected, and they are often novel and unexpected because they are simply not true. The publication effect (or the file drawer effect) keeps the failures hidden and have created a replication crisis. Research that gets reported in the popular media is often wrong—which fools people and undermines the credibility of scientific research.
Citations: 0
Worshiping Math
The 9 Pitfalls of Data Science Pub Date : 2019-07-01 DOI: 10.1093/oso/9780198844396.003.0004
Gary Smith, Jay Cordes
Abstract: Data-mining tools, in general, tend to be mathematically sophisticated, yet often make implausible assumptions. For example, analysts often assume a normal distribution and disregard the fat tails that warn of “black swans.” Too often, the assumptions are hidden in the math and the people who use the tools are more impressed by the math than curious about the assumptions. Instead of being blinded by math, good data scientists use explanatory variables that make sense. Good data scientists use math, but do not worship it. They know that math is an invaluable tool, but it is not a substitute for common sense, wisdom, or expertise.
Citations: 0
Using Bad Data
The 9 Pitfalls of Data Science Pub Date : 2019-07-01 DOI: 10.1093/oso/9780198844396.003.0002
G. Smith, J. Cordes
Abstract: Good data scientists consider the reliability of the data, while data clowns don’t. Reported data sometimes systematically misrepresent the phenomena being recorded. Data can be deformed by extremely unusual data—outliers—which can be clerical errors, measurement errors, or flukes that can mislead us if not corrected. Other times, outliers are valuable data. We should always consider if data are skewed by unusual events or distorted by unreported “silent data.” If something is surprising about top-ranked groups, look at the bottom-ranked groups. Consider the possibility of survivorship bias and self-selection bias. Incomplete, inaccurate, or unreliable data can make clowns out of anyone.
Citations: 1