Title: Aligning linguistic complexity with the difficulty of English texts for L2 learners based on CEFR levels
Authors: Xiaopeng Zhang, Xiaofei Lu
Journal: Studies in Second Language Acquisition
Publication date: 2025-09-03
DOI: 10.1017/s0272263125101125
Citation count: 0
Abstract
Selecting appropriate texts for second language (L2) learners is essential for effective education. However, current text difficulty models often inadequately classify materials for L2 learners by proficiency levels. This study addresses this deficiency by employing the Common European Framework of Reference for Languages (CEFR) as its foundational framework. A cohort of expert English-L2 educators classified 1,181 texts from the CommonLit Ease of Readability corpus into CEFR levels. A random forest model was then trained using 24 linguistic complexity features to predict the CEFR levels of English texts for L2 learners. The model achieved 62.6% exact-level accuracy across the six granular CEFR levels and 82.6% across the three overarching levels, outperforming a baseline model based on three existing readability formulas. Additionally, it identified shared and unique linguistic features across different CEFR levels, highlighting the necessity to adjust text classification models to accommodate the distinct linguistic profiles of low- and high-proficiency readers.
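The abstract reports accuracy at two granularities: the six CEFR levels (A1 to C2) and the three overarching levels. The sketch below shows one way the two figures can relate, by collapsing each six-level label to its broad band (A, B, or C) before scoring. It is an illustration only: the labels are invented, the helper names `to_band` and `accuracy` are hypothetical, and the abstract does not say whether the authors obtained the three-level figure by collapsing predictions or by training a separate three-class model.

```python
# Illustrative only: the labels below are made up, not data from the study.
CEFR_LEVELS = ["A1", "A2", "B1", "B2", "C1", "C2"]


def to_band(level: str) -> str:
    """Collapse a six-level CEFR label to its broad band: A, B, or C."""
    if level not in CEFR_LEVELS:
        raise ValueError(f"unknown CEFR level: {level!r}")
    return level[0]


def accuracy(gold: list[str], predicted: list[str]) -> float:
    """Fraction of positions where the predicted label equals the gold label."""
    if len(gold) != len(predicted) or not gold:
        raise ValueError("gold and predicted must be non-empty and equal in length")
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)


# Hypothetical expert labels and model predictions for five texts.
gold = ["A1", "A2", "B1", "B2", "C1"]
predicted = ["A2", "A2", "B2", "B2", "C2"]

# Exact-level accuracy compares the six-level labels directly.
exact_accuracy = accuracy(gold, predicted)

# Band-level accuracy compares labels after collapsing both sides to A/B/C.
band_accuracy = accuracy(
    [to_band(g) for g in gold],
    [to_band(p) for p in predicted],
)

print(f"exact-level: {exact_accuracy:.2f}, band-level: {band_accuracy:.2f}")
```

Under this collapsing assumption, every exact-level hit is also a band-level hit, so band-level accuracy can never be lower than exact-level accuracy. This is consistent with the ordering of the two reported figures (62.6% and 82.6%).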
About the journal:
Studies in Second Language Acquisition is a refereed journal of international scope devoted to the scientific discussion of acquisition or use of non-native and heritage languages. Each volume (five issues) contains research articles of either a quantitative, qualitative, or mixed-methods nature in addition to essays on current theoretical matters. Other rubrics include shorter articles such as Replication Studies, Critical Commentaries, and Research Reports.