Grammaticality and Language Modelling

Jingcheng Niu, Gerald Penn
{"title":"语法性和语言建模","authors":"Jingcheng Niu, Gerald Penn","doi":"10.18653/v1/2020.eval4nlp-1.11","DOIUrl":null,"url":null,"abstract":"Ever since Pereira (2000) provided evidence against Chomsky’s (1957) conjecture that statistical language modelling is incommensurable with the aims of grammaticality prediction as a research enterprise, a new area of research has emerged that regards statistical language models as “psycholinguistic subjects” and probes their ability to acquire syntactic knowledge. The advent of The Corpus of Linguistic Acceptability (CoLA) (Warstadt et al., 2019) has earned a spot on the leaderboard for acceptability judgements, and the polemic between Lau et al. (2017) and Sprouse et al. (2018) has raised fundamental questions about the nature of grammaticality and how acceptability judgements should be elicited. All the while, we are told that neural language models continue to improve. That is not an easy claim to test at present, however, because there is almost no agreement on how to measure their improvement when it comes to grammaticality and acceptability judgements. The GLUE leaderboard bundles CoLA together with a Matthews correlation coefficient (MCC), although probably because CoLA’s seminal publication was using it to compute inter-rater reliabilities. Researchers working in this area have used other accuracy and correlation scores, often driven by a need to reconcile and compare various discrete and continuous variables with each other. The score that we will advocate for in this paper, the point biserial correlation, in fact compares a discrete variable (for us, acceptability judgements) to a continuous variable (for us, neural language model probabilities). The only previous work in this area to choose the PBC that we are aware of is Sprouse et al. (2018a), and that paper actually applied it backwards (with some justification) so that the language model probability was treated as the discrete binary variable by setting a threshold. With the PBC in mind, we will first reappraise some recent work in syntactically targeted linguistic evaluations (Hu et al., 2020), arguing that while their experimental design sets a new high watermark for this topic, their results may not prove what they have claimed. We then turn to the task-independent assessment of language models as grammaticality classifiers. Prior to the introduction of the GLUE leaderboard, the vast majority of this assessment was essentially anecdotal, and we find the use of the MCC in this regard to be problematic. We conduct several studies with PBCs to compare several popular language models. 
We also study the effects of several variables such as normalization and data homogeneity on PBC.","PeriodicalId":448066,"journal":{"name":"Proceedings of the First Workshop on Evaluation and Comparison of NLP Systems","volume":"338 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2020-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"2","resultStr":"{\"title\":\"Grammaticality and Language Modelling\",\"authors\":\"Jingcheng Niu, Gerald Penn\",\"doi\":\"10.18653/v1/2020.eval4nlp-1.11\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Ever since Pereira (2000) provided evidence against Chomsky’s (1957) conjecture that statistical language modelling is incommensurable with the aims of grammaticality prediction as a research enterprise, a new area of research has emerged that regards statistical language models as “psycholinguistic subjects” and probes their ability to acquire syntactic knowledge. The advent of The Corpus of Linguistic Acceptability (CoLA) (Warstadt et al., 2019) has earned a spot on the leaderboard for acceptability judgements, and the polemic between Lau et al. (2017) and Sprouse et al. (2018) has raised fundamental questions about the nature of grammaticality and how acceptability judgements should be elicited. All the while, we are told that neural language models continue to improve. That is not an easy claim to test at present, however, because there is almost no agreement on how to measure their improvement when it comes to grammaticality and acceptability judgements. The GLUE leaderboard bundles CoLA together with a Matthews correlation coefficient (MCC), although probably because CoLA’s seminal publication was using it to compute inter-rater reliabilities. Researchers working in this area have used other accuracy and correlation scores, often driven by a need to reconcile and compare various discrete and continuous variables with each other. The score that we will advocate for in this paper, the point biserial correlation, in fact compares a discrete variable (for us, acceptability judgements) to a continuous variable (for us, neural language model probabilities). The only previous work in this area to choose the PBC that we are aware of is Sprouse et al. (2018a), and that paper actually applied it backwards (with some justification) so that the language model probability was treated as the discrete binary variable by setting a threshold. With the PBC in mind, we will first reappraise some recent work in syntactically targeted linguistic evaluations (Hu et al., 2020), arguing that while their experimental design sets a new high watermark for this topic, their results may not prove what they have claimed. We then turn to the task-independent assessment of language models as grammaticality classifiers. Prior to the introduction of the GLUE leaderboard, the vast majority of this assessment was essentially anecdotal, and we find the use of the MCC in this regard to be problematic. We conduct several studies with PBCs to compare several popular language models. 
We also study the effects of several variables such as normalization and data homogeneity on PBC.\",\"PeriodicalId\":448066,\"journal\":{\"name\":\"Proceedings of the First Workshop on Evaluation and Comparison of NLP Systems\",\"volume\":\"338 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2020-11-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"2\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the First Workshop on Evaluation and Comparison of NLP Systems\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.18653/v1/2020.eval4nlp-1.11\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the First Workshop on Evaluation and Comparison of NLP Systems","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.18653/v1/2020.eval4nlp-1.11","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 2

Abstract

Ever since Pereira (2000) provided evidence against Chomsky’s (1957) conjecture that statistical language modelling is incommensurable with the aims of grammaticality prediction as a research enterprise, a new area of research has emerged that regards statistical language models as “psycholinguistic subjects” and probes their ability to acquire syntactic knowledge. The advent of The Corpus of Linguistic Acceptability (CoLA) (Warstadt et al., 2019) has earned acceptability judgements a spot on the GLUE leaderboard, and the polemic between Lau et al. (2017) and Sprouse et al. (2018) has raised fundamental questions about the nature of grammaticality and how acceptability judgements should be elicited. All the while, we are told that neural language models continue to improve. That is not an easy claim to test at present, however, because there is almost no agreement on how to measure their improvement when it comes to grammaticality and acceptability judgements. The GLUE leaderboard bundles CoLA together with a Matthews correlation coefficient (MCC), probably because CoLA’s seminal publication used it to compute inter-rater reliabilities. Researchers working in this area have used other accuracy and correlation scores, often driven by a need to reconcile and compare various discrete and continuous variables with each other. The score that we will advocate for in this paper, the point biserial correlation, in fact compares a discrete variable (for us, acceptability judgements) to a continuous variable (for us, neural language model probabilities). The only previous work in this area to choose the PBC that we are aware of is Sprouse et al. (2018a), and that paper actually applied it backwards (with some justification) so that the language model probability was treated as the discrete binary variable by setting a threshold. With the PBC in mind, we will first reappraise some recent work in syntactically targeted linguistic evaluations (Hu et al., 2020), arguing that while their experimental design sets a new high-water mark for this topic, their results may not prove what they have claimed. We then turn to the task-independent assessment of language models as grammaticality classifiers. Prior to the introduction of the GLUE leaderboard, the vast majority of this assessment was essentially anecdotal, and we find the use of the MCC in this regard to be problematic. We conduct several studies with PBCs to compare several popular language models. We also study the effects of several variables such as normalization and data homogeneity on PBC.
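As a minimal sketch of the two evaluation styles the abstract contrasts (not the authors' code, and with made-up numbers rather than results from the paper), the snippet below computes the point-biserial correlation between binary acceptability judgements and continuous language-model scores, and then the "backwards" alternative attributed to Sprouse et al. (2018a): binarize the model scores with a threshold and score the result with an MCC. The judgement and score arrays, and the median threshold, are purely illustrative assumptions.

```python
# Sketch: PBC between discrete judgements and continuous LM scores,
# versus thresholding the scores and computing an MCC.
import numpy as np
from scipy.stats import pointbiserialr
from sklearn.metrics import matthews_corrcoef

# Hypothetical acceptability judgements (1 = acceptable, 0 = unacceptable).
judgements = np.array([1, 1, 0, 1, 0, 0, 1, 0])

# Hypothetical per-sentence LM scores, e.g. mean token log-probability
# (one possible length normalization; the paper studies such choices).
lm_scores = np.array([-2.1, -2.4, -4.8, -1.9, -5.3, -4.1, -2.7, -3.9])

# Point-biserial correlation: discrete judgements against continuous scores.
pbc, p_value = pointbiserialr(judgements, lm_scores)
print(f"point-biserial correlation = {pbc:.3f} (p = {p_value:.3f})")

# The thresholding alternative: binarize the LM scores, then compute MCC.
threshold = np.median(lm_scores)          # arbitrary illustrative cut-off
predictions = (lm_scores > threshold).astype(int)
mcc = matthews_corrcoef(judgements, predictions)
print(f"MCC after thresholding     = {mcc:.3f}")
```

The difference is that the PBC uses the model's scores directly, whereas the thresholded MCC discards their magnitude and depends on the (arbitrary) choice of cut-off.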