Do-it-yourself corpora and discipline-specific writing: A focus on the benefits of corpus building and cleaning (in the age of large language models)
Author: Maya Sfeir
Journal: English for Specific Purposes, Volume 79, Pages 70-86 (Journal Article)
Publication date: 2025-05-05
DOI: 10.1016/j.esp.2025.04.002
URL: https://www.sciencedirect.com/science/article/pii/S088949062500016X
Impact factor: 3.2 (JCR Q1, Linguistics)
Within the field of data-driven learning (DDL), an increasing number of studies have underscored the benefits of creating small specialized DIY corpora for the teaching/learning of discipline-specific writing. However, in these studies, corpus cleaning is often described as an optional step in the process of corpus creation and is frequently presented as a tedious, unnecessary, and time-consuming task, with the majority of scholars calling for the creation of “quick and dirty” corpora. In this paper, we re-examine corpus creation, namely corpus cleaning and metadata construction, for discipline-specific writing. More specifically, our paper seeks to reframe corpus cleaning and metadata construction as meaningful and purposeful activities that increase learners’ awareness of disciplinary norms and conventions, particularly in a comparative context. We base our analysis on the reflections provided by learners from various disciplines who designed, compiled, cleaned, and analyzed corpora, along with the final papers they drafted for the courses they took in the Department of English at a teaching-focused research university in the Middle East. Corpus cleaning and metadata creation, as we hope to show, not only make visible the invisible writing conventions within disciplines, including the integration of evidence and raw data, but also position language learners as data engineers, promoting their critical awareness of the role and nature of (language) data in the age of Large Language Models (LLMs).
About the journal:
English for Specific Purposes is an international peer-reviewed journal that welcomes submissions from across the world. Authors are encouraged to submit articles and research/discussion notes on topics relevant to the teaching and learning of discourse for specific communities: academic, occupational, or otherwise specialized. Topics such as the following may be treated from the perspective of English for specific purposes: second language acquisition in specialized contexts, needs assessment, curriculum development and evaluation, materials preparation, discourse analysis, and descriptions of specialized varieties of English.