Statistical Significance Testing in Theory and in Practice

Ben Carterette
{"title":"理论与实践中的统计显著性检验","authors":"Ben Carterette","doi":"10.1145/3341981.3358959","DOIUrl":null,"url":null,"abstract":"The past 25 years have seen a great improvement in the rigor of experimentation on information access problems. This is due primarily to three factors: high-quality, public, portable test collections such as those produced by TREC (the Text REtreval Conference~\\citetrecbook ), the increased ease of online A/B testing on large user populations, and the increased practice of statistical hypothesis testing to determine whether observed improvements can be ascribed to something other than random chance. Together these create a very useful standard for reviewers, program committees, and journal editors; work on information access (IA) problems such as search and recommendation increasingly cannot be published unless it has been evaluated offline using a well-constructed test collection or online on a large user base and shown to produce a statistically significant improvement over a good baseline. But, as the saying goes, any tool sharp enough to be useful is also sharp enough to be dangerous. Statistical tests of significance are widely misunderstood. Most researchers and developers treat them as a \"black box'': evaluation results go in and a p-value comes out. But because significance is such an important factor in determining what directions to explore and what is published or deployed, using p-values obtained without thought can have consequences for everyone working in IA. Ioannidis has argued that the main consequence in the biomedical sciences is that most published research findings are false; could that be the case for IA as well?","PeriodicalId":173154,"journal":{"name":"Proceedings of the 2019 ACM SIGIR International Conference on Theory of Information Retrieval","volume":"38 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2019-09-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":"{\"title\":\"Statistical Significance Testing in Theory and in Practice\",\"authors\":\"Ben Carterette\",\"doi\":\"10.1145/3341981.3358959\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"The past 25 years have seen a great improvement in the rigor of experimentation on information access problems. This is due primarily to three factors: high-quality, public, portable test collections such as those produced by TREC (the Text REtreval Conference~\\\\citetrecbook ), the increased ease of online A/B testing on large user populations, and the increased practice of statistical hypothesis testing to determine whether observed improvements can be ascribed to something other than random chance. Together these create a very useful standard for reviewers, program committees, and journal editors; work on information access (IA) problems such as search and recommendation increasingly cannot be published unless it has been evaluated offline using a well-constructed test collection or online on a large user base and shown to produce a statistically significant improvement over a good baseline. But, as the saying goes, any tool sharp enough to be useful is also sharp enough to be dangerous. Statistical tests of significance are widely misunderstood. Most researchers and developers treat them as a \\\"black box'': evaluation results go in and a p-value comes out. 
But because significance is such an important factor in determining what directions to explore and what is published or deployed, using p-values obtained without thought can have consequences for everyone working in IA. Ioannidis has argued that the main consequence in the biomedical sciences is that most published research findings are false; could that be the case for IA as well?\",\"PeriodicalId\":173154,\"journal\":{\"name\":\"Proceedings of the 2019 ACM SIGIR International Conference on Theory of Information Retrieval\",\"volume\":\"38 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2019-09-23\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"1\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the 2019 ACM SIGIR International Conference on Theory of Information Retrieval\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/3341981.3358959\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 2019 ACM SIGIR International Conference on Theory of Information Retrieval","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3341981.3358959","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 1

Abstract

The past 25 years have seen a great improvement in the rigor of experimentation on information access problems. This is due primarily to three factors: high-quality, public, portable test collections such as those produced by TREC (the Text REtrieval Conference~\cite{trecbook}), the increased ease of online A/B testing on large user populations, and the increased practice of statistical hypothesis testing to determine whether observed improvements can be ascribed to something other than random chance. Together these create a very useful standard for reviewers, program committees, and journal editors; work on information access (IA) problems such as search and recommendation increasingly cannot be published unless it has been evaluated offline using a well-constructed test collection or online on a large user base and shown to produce a statistically significant improvement over a good baseline. But, as the saying goes, any tool sharp enough to be useful is also sharp enough to be dangerous. Statistical tests of significance are widely misunderstood. Most researchers and developers treat them as a "black box": evaluation results go in and a p-value comes out. But because significance is such an important factor in determining what directions to explore and what is published or deployed, using p-values obtained without thought can have consequences for everyone working in IA. Ioannidis has argued that the main consequence in the biomedical sciences is that most published research findings are false; could that be the case for IA as well?
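
To make the "black box" concrete, here is a minimal sketch in Python of the kind of paired significance test commonly applied to per-topic IR evaluation scores, followed by a toy simulation of why unthinking use of p-values is dangerous. The per-topic scores and the noise model are invented for illustration, and NumPy and SciPy are assumed to be available; this is not material from the talk itself.

# A minimal sketch of a paired significance test over per-topic scores,
# plus a toy simulation of the multiple-comparisons problem.
# The scores below are invented for illustration.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical per-topic average precision for a baseline and a new system.
baseline = np.array([0.21, 0.35, 0.48, 0.10, 0.55, 0.30, 0.42, 0.27, 0.61, 0.33])
system   = np.array([0.25, 0.38, 0.47, 0.15, 0.60, 0.33, 0.45, 0.30, 0.66, 0.37])

# Paired t-test: is the mean per-topic difference distinguishable from zero?
t_stat, p_value = stats.ttest_rel(system, baseline)
print(f"paired t-test: t = {t_stat:.3f}, p = {p_value:.4f}")

# Toy illustration of the danger: compare 100 "systems" that differ from
# the baseline only by noise, i.e. every null hypothesis is true.
false_positives = 0
for _ in range(100):
    null_system = baseline + rng.normal(0.0, 0.05, size=baseline.size)
    if stats.ttest_rel(null_system, baseline).pvalue < 0.05:
        false_positives += 1
print(f"null comparisons reaching p < 0.05: {false_positives}/100")

With roughly 100 comparisons under the null, about five will cross p < 0.05 by chance alone; if only the "significant" ones are pursued or published, the literature fills with spurious improvements, which is the mechanism behind Ioannidis's concern that the abstract raises for IA.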