Inadequacy of the chi-squared test to examine vocabulary differences between corpora

Lit. Linguistic Comput. | Pub Date: 2014-06-01 | DOI: 10.1093/llc/fqt020

Yves Bestgen


Citations: 24

Abstract


Pearson's chi-squared test is probably the most popular statistical test used in corpus linguistics, particularly for studying linguistic variations between corpora. Oakes and Farrow (Literary and Linguistic Computing, 2007, 22, 85-99) proposed various adaptations of this test in order to allow for the simultaneous comparison of more than two corpora, while also yielding an almost correct Type I error rate (i.e. claiming that a word is most frequently found in a variety of English, when in actuality this is not the case). By means of resampling procedures, the present study shows that when used in this context, the chi-squared test produces far too many significant results, even in its modified version. Several potential approaches to circumventing this problem are discussed in the conclusion.
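The kind of resampling check the abstract describes can be illustrated with a small sketch. It is not the study's actual procedure, and it uses the plain Pearson test rather than the Oakes and Farrow adaptations. The assumptions are all hypothetical: a pool of 300 texts of 1,000 tokens each, a target word with 600 occurrences in total, and three corpora built by randomly shuffling whole texts, so that any "significant" difference is a false positive by construction. The helper names (`chi2_stat`, `false_positive_rate`, `bursty_pool`, `even_pool`) are invented for the illustration.

```python
import math
import random

ALPHA = 0.05
N_CORPORA = 3  # a 2 x 3 table has 2 degrees of freedom


def chi2_stat(word_counts, corpus_sizes):
    """Pearson chi-squared for a 2 x k table: target word vs. all other tokens."""
    grand_word = sum(word_counts)
    grand_total = sum(corpus_sizes)
    stat = 0.0
    for w, n in zip(word_counts, corpus_sizes):
        for observed, margin in ((w, grand_word), (n - w, grand_total - grand_word)):
            expected = margin * n / grand_total
            if expected > 0:
                stat += (observed - expected) ** 2 / expected
    return stat


def p_value_df2(stat):
    """Upper-tail probability of chi-squared with 2 df, which is exactly exp(-x/2)."""
    return math.exp(-stat / 2.0)


def false_positive_rate(text_counts, text_len, runs=2000, seed=0):
    """Shuffle texts into three equal corpora and count how often p < ALPHA."""
    rng = random.Random(seed)
    texts = list(text_counts)
    per_corpus = len(texts) // N_CORPORA
    sizes = [per_corpus * text_len] * N_CORPORA
    hits = 0
    for _ in range(runs):
        rng.shuffle(texts)  # random assignment: the null hypothesis is true
        words = [sum(texts[i * per_corpus:(i + 1) * per_corpus])
                 for i in range(N_CORPORA)]
        if p_value_df2(chi2_stat(words, sizes)) < ALPHA:
            hits += 1
    return hits / runs


def bursty_pool():
    """Hypothetical pool: the word is concentrated in 30 of 300 texts."""
    return [20] * 30 + [0] * 270


def even_pool(seed=1):
    """Hypothetical pool: 600 occurrences scattered at random over 300 texts."""
    rng = random.Random(seed)
    counts = [0] * 300
    for _ in range(600):
        counts[rng.randrange(300)] += 1
    return counts


if __name__ == "__main__":
    print("evenly spread word:", false_positive_rate(even_pool(), 1000))
    print("bursty word:       ", false_positive_rate(bursty_pool(), 1000))
```

In this toy setup, the evenly spread word should be flagged at a rate close to the nominal 5%, while the bursty word should be flagged far more often. The test treats every token as an independent observation, whereas words cluster within texts, which is one plausible reading of why the test over-reports differences between corpora.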
