{"title":"用机器学习方法预测英语听力文本的CEFR水平","authors":"Christopher Robert Cooper","doi":"10.1016/j.rmal.2025.100234","DOIUrl":null,"url":null,"abstract":"<div><div>Comprehension in listening texts is often judged by lexical coverage. However, this might not be easily interpretable for language teachers. The CEFR is becoming increasingly influential due to its standardized descriptors across languages. Learners are often placed into classes based on proficiency level, therefore a CEFR level is likely more interpretable than lexical coverage when judging listening text difficulty. Machine learning methods have been used to predict the CEFR level of English reading texts and learner writing, but no such studies exist for listening. The current study hopes to bridge this gap by investigating the potential to predict the CEFR level of listening texts. A corpus of CEFR-labelled listening texts (728 texts, 345,104 words) was compiled for text classification. Three types of variables were created from the corpus data to evaluate comparative predictive accuracy. The first method used linguistic and acoustic features. The others used text embeddings, which represent semantic meaning. The data was split into four classes: A1, A2, B1, and B2+. The accuracy of each method was evaluated by comparing the predicted label in the test data with the label from the original text. The most accurate method used OpenAI embeddings and Support Vector Machines. The overall accuracy was 0.81, with macro averages of precision = 0.75, recall = 0.78, and f-score = 0.76, indicating balanced classification performance across CEFR levels. This method has the potential to predict the CEFR level of listening texts, which could help practitioners and researchers match learners and participants to appropriate listening texts.</div></div>","PeriodicalId":101075,"journal":{"name":"Research Methods in Applied Linguistics","volume":"4 3","pages":"Article 100234"},"PeriodicalIF":0.0000,"publicationDate":"2025-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Predicting the CEFR level of English listening texts with machine learning methods\",\"authors\":\"Christopher Robert Cooper\",\"doi\":\"10.1016/j.rmal.2025.100234\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><div>Comprehension in listening texts is often judged by lexical coverage. However, this might not be easily interpretable for language teachers. The CEFR is becoming increasingly influential due to its standardized descriptors across languages. Learners are often placed into classes based on proficiency level, therefore a CEFR level is likely more interpretable than lexical coverage when judging listening text difficulty. Machine learning methods have been used to predict the CEFR level of English reading texts and learner writing, but no such studies exist for listening. The current study hopes to bridge this gap by investigating the potential to predict the CEFR level of listening texts. A corpus of CEFR-labelled listening texts (728 texts, 345,104 words) was compiled for text classification. Three types of variables were created from the corpus data to evaluate comparative predictive accuracy. The first method used linguistic and acoustic features. The others used text embeddings, which represent semantic meaning. The data was split into four classes: A1, A2, B1, and B2+. 
The accuracy of each method was evaluated by comparing the predicted label in the test data with the label from the original text. The most accurate method used OpenAI embeddings and Support Vector Machines. The overall accuracy was 0.81, with macro averages of precision = 0.75, recall = 0.78, and f-score = 0.76, indicating balanced classification performance across CEFR levels. This method has the potential to predict the CEFR level of listening texts, which could help practitioners and researchers match learners and participants to appropriate listening texts.</div></div>\",\"PeriodicalId\":101075,\"journal\":{\"name\":\"Research Methods in Applied Linguistics\",\"volume\":\"4 3\",\"pages\":\"Article 100234\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2025-07-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Research Methods in Applied Linguistics\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S2772766125000552\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Research Methods in Applied Linguistics","FirstCategoryId":"1085","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S2772766125000552","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Predicting the CEFR level of English listening texts with machine learning methods
Comprehension of listening texts is often judged by lexical coverage. However, lexical coverage may not be easily interpretable for language teachers. The CEFR is becoming increasingly influential due to its standardized descriptors across languages. Learners are often placed into classes based on proficiency level; therefore, a CEFR level is likely more interpretable than lexical coverage when judging the difficulty of a listening text. Machine learning methods have been used to predict the CEFR level of English reading texts and learner writing, but no such studies exist for listening. The current study aims to bridge this gap by investigating the potential to predict the CEFR level of listening texts. A corpus of CEFR-labelled listening texts (728 texts, 345,104 words) was compiled for text classification. Three types of variables were created from the corpus data to compare predictive accuracy. The first method used linguistic and acoustic features; the others used text embeddings, which represent semantic meaning. The data were grouped into four classes: A1, A2, B1, and B2+. The accuracy of each method was evaluated by comparing the predicted label for each text in the test data with its original label. The most accurate method used OpenAI embeddings and Support Vector Machines. The overall accuracy was 0.81, with macro averages of precision = 0.75, recall = 0.78, and F-score = 0.76, indicating balanced classification performance across CEFR levels. This method has the potential to predict the CEFR level of listening texts, which could help practitioners and researchers match learners and participants to appropriate listening texts.
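To make the best-performing configuration described in the abstract more concrete, the sketch below shows how an OpenAI-embeddings-plus-SVM classifier could be assembled in Python with scikit-learn and evaluated with accuracy and macro-averaged precision, recall, and F-score. This is a minimal illustration, not the author's pipeline: the embedding model name, the CSV file name and columns, the train/test split ratio, and the SVM hyperparameters are all assumptions made for the example.

```python
# Minimal sketch (assumptions, not the study's actual pipeline):
# - corpus stored in a CSV with columns "text" and "cefr_level" (hypothetical)
# - OpenAI model "text-embedding-3-small" used for embeddings (assumed)
# - 80/20 stratified train/test split and default-ish SVM settings (assumed)
import numpy as np
import pandas as pd
from openai import OpenAI
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import classification_report

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def embed_texts(texts, model="text-embedding-3-small", batch_size=100):
    """Return one embedding vector per text, requesting them in batches."""
    vectors = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        response = client.embeddings.create(model=model, input=batch)
        vectors.extend(item.embedding for item in response.data)
    return np.array(vectors)


# Load the CEFR-labelled listening transcripts (hypothetical file and columns).
df = pd.read_csv("cefr_listening_texts.csv")
X = embed_texts(df["text"].tolist())
y = df["cefr_level"]  # four classes: A1, A2, B1, B2+

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Support Vector Machine classifier trained on the embedding features.
clf = SVC(kernel="rbf", C=1.0)
clf.fit(X_train, y_train)

# Accuracy plus per-class and macro-averaged precision, recall, and F-score,
# mirroring the metrics reported in the abstract.
print(classification_report(y_test, clf.predict(X_test), digits=2))
```

In a setup like this, the macro averages printed by classification_report weight each CEFR level equally, which is why the abstract reports them alongside overall accuracy as evidence of balanced performance across levels.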