Author: Don Miller
Journal: Applied Corpus Linguistics, Volume 2, Issue 3, Article 100027
Published: 2022-12-01
DOI: 10.1016/j.acorp.2022.100027
URL: https://www.sciencedirect.com/science/article/pii/S2666799122000120
Replication as a means of assessing corpus representativeness and the generalizability of specialized word lists
Considerable energy has gone into designing lists of words that are salient in discourse domains of varying breadth. Over the past two decades, most efforts in designing and validating corpus-based frequency lists have focused on three areas: corpus compilation, item selection criteria, and coverage-based demonstrations of list robustness. As a result, modern corpora are now often much larger and better balanced; the application of additional dispersion statistics allows for better targeting of items with desired distributions; and contemporary lexical frequency lists are proving increasingly efficient, providing ever higher coverage of target texts or achieving such coverage with fewer words. However, despite these important advances, relatively minimal attention has been paid to word list reliability—the extent to which lists can be generalized to the wider discourse domain that has been represented by the corpora upon which they are based. This study begins to address this gap, demonstrating via two word list development case studies (one for Environmental Science and one for Applied Linguistics) that adding iterative reliability analysis—via methodological replication with corpora of increasing size and comparison of items on resulting lists—can be used to: 1) inform corpus design beyond what Biber (1991) terms “situational” parameters, allowing us to see whether corpora are adequately representative of lexical distributions in target discourse domains; and 2) provide valuable insight into the degree of generalizability of word lists we have developed.
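The replication idea described in the abstract can be illustrated with a minimal sketch. It is not the study's procedure: the abstract does not specify the item selection criteria or the comparison measure, so the frequency and range thresholds, the use of two disjoint samples at each corpus size, the Jaccard overlap measure, and all function names (`build_word_list`, `list_overlap`, `replication_curve`) are assumptions made purely for illustration.

```python
import random
from collections import Counter


def build_word_list(texts, min_freq=2, min_range=0.5):
    """Select words that meet a frequency floor and a range (dispersion) floor.

    Both thresholds are illustrative assumptions, not the study's criteria.
    Tokenization is a naive lowercase whitespace split.
    """
    freq = Counter()       # total occurrences of each word
    doc_count = Counter()  # number of texts each word appears in
    for text in texts:
        tokens = text.lower().split()
        freq.update(tokens)
        doc_count.update(set(tokens))
    n_texts = len(texts)
    return {
        word
        for word, count in freq.items()
        if count >= min_freq and doc_count[word] / n_texts >= min_range
    }


def list_overlap(list_a, list_b):
    """Jaccard overlap between two word lists (1.0 means identical lists)."""
    if not list_a and not list_b:
        return 1.0
    return len(list_a & list_b) / len(list_a | list_b)


def replication_curve(texts, sizes, seed=0):
    """For each corpus size, build lists from two disjoint samples and compare.

    Returns a list of (size, overlap) pairs. Drawing two disjoint samples per
    size is an assumed design, chosen so the two lists are independent.
    """
    rng = random.Random(seed)
    results = []
    for size in sizes:
        if 2 * size > len(texts):
            raise ValueError("not enough texts for two disjoint samples")
        sample = rng.sample(texts, 2 * size)
        list_a = build_word_list(sample[:size])
        list_b = build_word_list(sample[size:])
        results.append((size, list_overlap(list_a, list_b)))
    return results
```

Under these assumptions, an overlap that keeps rising as the corpus grows would suggest that smaller corpora do not yet capture the lexical distribution of the domain, while an overlap that levels off near a high value would suggest that the resulting list generalizes beyond the particular texts sampled.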