Statistical Significance Testing for Natural Language Processing

IF 5.3 | Zone 2, Computer Science | Q2 Computer Science, Artificial Intelligence

Computational Linguistics | Pub Date: 2020-10-20 | DOI: 10.1162/coli_r_00388

Edwin Simpson

Journal Article. Computational Linguistics, volume: 46 1, pages: 905-908. https://doi.org/10.1162/coli_r_00388

Citations: 36

Abstract

Like any other science, research in natural language processing (NLP) depends on the ability to draw correct conclusions from experiments. A key tool for this is statistical significance testing: We use it to judge whether a result provides meaningful, generalizable findings or should be taken with a pinch of salt. When comparing new methods against others, performance metrics often differ by only small amounts, so researchers turn to significance tests to show that improved models are genuinely better. Unfortunately, this reasoning often fails because we choose inappropriate significance tests or carry them out incorrectly, making their outcomes meaningless. Alternatively, the test we use may fail to indicate a significant result when a more appropriate test would find one. NLP researchers must avoid these pitfalls to ensure that their evaluations are sound and, ultimately, to avoid wasting time and money on incorrect conclusions.

This book guides NLP researchers through the whole process of significance testing, making it easy to select the right kind of test by matching canonical NLP tasks to specific significance testing procedures. As well as being a handbook for researchers, the book provides theoretical background on significance testing, includes new methods that solve problems with significance tests in the world of deep learning and multidataset benchmarks, and describes the open research problems of significance testing for NLP.

The book focuses on the task of comparing one algorithm with another. At the core of this is the p-value, the probability that a difference at least as extreme as the one we observed could occur by chance, that is, if there were no real difference between the algorithms. If the p-value falls below a predetermined threshold, the result is declared significant. Leaving aside the fundamental limitation of turning the validity of results into a binary question with an arbitrary threshold, a significance test is valid only if the p-value is computed in the right way.
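
To make the p-value concrete, here is a minimal sketch of a paired sign-flip permutation test that compares two systems on the same test examples. It is an illustration only: the abstract does not say which tests the book recommends, and the function name `paired_permutation_test`, its parameters, and the per-example scores are all hypothetical.

```python
import random


def paired_permutation_test(scores_a, scores_b, n_resamples=10_000, seed=0):
    """Two-sided paired sign-flip permutation test on per-example scores.

    Estimates the p-value: the probability, under the null hypothesis that
    the two systems are interchangeable, of a mean difference at least as
    extreme as the one observed.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("scores must be paired, one entry per test example")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    observed = abs(sum(diffs)) / n
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_resamples):
        # Under the null, each paired difference is equally likely to have
        # either sign, so flip each sign with probability 0.5.
        resampled = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(resampled) / n >= observed:
            hits += 1
    # Add-one correction keeps the Monte Carlo estimate strictly positive.
    return (hits + 1) / (n_resamples + 1)


# Hypothetical per-example correctness (1 = correct) for two systems.
system_a = [1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1]
system_b = [1, 0, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 1]

p_value = paired_permutation_test(system_a, system_b)
print(f"estimated p-value: {p_value:.4f}")
```

In this made-up data the systems disagree on five examples, all in favor of system A, so the exact two-sided p-value is 2/2^5 = 0.0625 and the Monte Carlo estimate should land close to it. An accuracy gap of 25 points on 20 examples therefore does not cross the conventional 0.05 threshold.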
The book describes the two crucial properties of an appropriate significance test: The test must be both valid and powerful. Validity refers to the avoidance of type 1 errors, in which the result is incorrectly declared significant. Common mistakes that lead to type 1 errors include deploying tests that make incorrect assumptions, such as independence between data points. The power of a test refers to its ability to detect a significant result and therefore to avoid type 2 errors. Here, knowledge of the data and experiment must be used to choose a test that makes the correct assumptions. There is a trade-off between validity and power, but for the most common NLP tasks (language modeling, sequence labeling, translation, etc.), there are clear choices of tests that provide a good balance.
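
Validity and power can both be estimated by simulation. The sketch below is a rough illustration under stated assumptions, not the book's procedure: it uses an exact two-sided sign test, draws independent Gaussian per-example score differences, and the names `sign_test_p_value` and `rejection_rate` are hypothetical. It counts how often the test declares significance when there is no true difference (type 1 errors) and when there is one (power).

```python
import random
from math import comb


def sign_test_p_value(diffs):
    """Exact two-sided sign test on paired differences (ties are dropped)."""
    wins = sum(d > 0 for d in diffs)
    losses = sum(d < 0 for d in diffs)
    n = wins + losses
    if n == 0:
        return 1.0
    k = min(wins, losses)
    # Probability of k or fewer successes under Binomial(n, 0.5), doubled
    # for the two-sided test and capped at 1.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def rejection_rate(true_shift, n_examples=50, n_trials=2000, alpha=0.05, seed=0):
    """Fraction of simulated experiments declared significant at level alpha."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_trials):
        # Assumption: independent differences with mean true_shift, std 1.
        diffs = [rng.gauss(true_shift, 1.0) for _ in range(n_examples)]
        if sign_test_p_value(diffs) < alpha:
            rejections += 1
    return rejections / n_trials


# No true difference: every rejection is a type 1 error.
type1_rate = rejection_rate(true_shift=0.0)
# A real difference: rejections are correct detections, so this estimates power.
power = rejection_rate(true_shift=0.5)
print(f"type 1 error rate: {type1_rate:.3f}, power: {power:.3f}")
```

Because the simulated differences really are independent, the test's assumption holds and the type 1 error rate stays at or below alpha. If the data points were correlated, as the abstract warns, the same test could reject far more often than alpha suggests.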


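Source journal

Computational Linguistics (Engineering & Technology: Computer Science, Interdisciplinary Applications)

CiteScore

15.80

Self-citation rate

0.00%

Articles published

Review time

>12 weeks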

About the journal: Computational Linguistics, the longest-running publication dedicated solely to the computational and mathematical aspects of language and the design of natural language processing systems, provides university and industry linguists, computational linguists, AI and machine learning researchers, cognitive scientists, speech specialists, and philosophers with the latest insights into the computational aspects of language research.