N. S. Lagutina, K. V. Lagutina, A. M. Brederman, N. N. Kasatkina
Text Classification by CEFR Levels Using Machine Learning Methods and the BERT Language Model
Automatic Control and Computer Sciences, vol. 58, no. 7, pp. 869-878. Published 2025-02-12.
DOI: 10.3103/S0146411624700329
https://link.springer.com/article/10.3103/S0146411624700329
Abstract
This paper studies the automatic classification of short coherent English texts (essays) by level on the international CEFR scale. Determining the level of a natural-language text is an important part of assessing a student's knowledge, including the checking of open-ended tasks in e-learning systems. To solve this problem, we consider vector text models built from stylometric numerical features at the character, word, and sentence-structure levels. The resulting vectors are classified with standard machine learning classifiers; the article reports results for the three most successful ones: Support Vector Classifier, Stochastic Gradient Descent Classifier, and Logistic Regression. Precision, recall, and the F-measure serve as the quality measures. The experiments use two open text corpora, CEFR Levelled English Texts and BEA-2019. For the six CEFR levels and sublevels from A1 to C2, the best result among these classifiers is achieved by the Support Vector Classifier, with an F-score of 67% on CEFR Levelled English Texts. This approach is compared with six variants of the BERT language model; the best of them, bert-base-cased, reaches an F-score of 69%. An analysis of the classification errors shows that most of them occur between neighboring levels, which is understandable from a domain perspective. In addition, classification quality depends strongly on the text corpus: the same text models yield significantly different F-scores on different corpora. Overall, the results show that automatic text level determination is effective and can be applied in practice.
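The pipeline described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the four features, the function names, the macro-averaging convention, and the toy labels below are all assumptions chosen for illustration. It shows one feature each at the character and sentence levels and two at the word level, the three quality measures, and a simple check of how many errors fall on a neighboring CEFR level.

```python
import re

# CEFR levels in ascending order; the list index is used as an ordinal rank.
CEFR_LEVELS = ["A1", "A2", "B1", "B2", "C1", "C2"]


def stylometric_features(text):
    """Return a small, hypothetical feature vector for one text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    n_chars = max(len(text), 1)
    n_words = max(len(words), 1)
    n_sents = max(len(sentences), 1)
    return [
        sum(ch in ",;:" for ch in text) / n_chars,   # character level: punctuation share
        sum(len(w) for w in words) / n_words,        # word level: mean word length
        len({w.lower() for w in words}) / n_words,   # word level: type-token ratio
        len(words) / n_sents,                        # sentence level: words per sentence
    ]


def macro_scores(y_true, y_pred, labels):
    """Macro-averaged precision, recall and F1 (the averaging mode is an assumption)."""
    pairs = list(zip(y_true, y_pred))
    totals = [0.0, 0.0, 0.0]
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        totals[0] += precision
        totals[1] += recall
        totals[2] += f1
    return tuple(total / len(labels) for total in totals)


def adjacent_error_share(y_true, y_pred):
    """Fraction of misclassifications that land on a neighboring CEFR level."""
    rank = {level: i for i, level in enumerate(CEFR_LEVELS)}
    errors = [(t, p) for t, p in zip(y_true, y_pred) if t != p]
    if not errors:
        return 0.0
    adjacent = sum(abs(rank[t] - rank[p]) == 1 for t, p in errors)
    return adjacent / len(errors)


if __name__ == "__main__":
    print(stylometric_features("I like cats. I like dogs, too!"))
    gold = ["A1", "A2", "B1", "B1"]   # toy labels, not data from the paper
    pred = ["A1", "B1", "B1", "C2"]
    print(macro_scores(gold, pred, CEFR_LEVELS))
    print(adjacent_error_share(gold, pred))
```

In a fuller experiment, vectors like these would be passed to a standard classifier such as a support vector classifier; that step is omitted here to keep the sketch dependency-free.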
About the journal:
Automatic Control and Computer Sciences is a peer-reviewed journal that publishes articles on:
• Control systems, cyber-physical systems, real-time systems, robotics, smart sensors, embedded intelligence
• Network information technologies, information security, statistical methods of data processing, distributed artificial intelligence, complex systems modeling, knowledge representation, processing and management
• Signal and image processing, machine learning, machine perception, computer vision