N. S. Lagutina, K. V. Lagutina, A. M. Brederman, N. N. Kasatkina
Text Classification by CEFR Levels Using Machine Learning Methods and the BERT Language Model
Automatic Control and Computer Sciences, vol. 58, no. 7, pp. 869-878. Published 2025-02-12. Journal Article.
DOI: 10.3103/S0146411624700329
https://link.springer.com/article/10.3103/S0146411624700329
Journal impact factor: 0.6 (JCR Q4, Automation & Control Systems)
Citations: 0
Abstract
This paper studies the automatic classification of short coherent English texts (essays) by level on the international CEFR scale. Determining the level of a natural-language text is an important part of assessing a student's knowledge, including the checking of open-ended tasks in e-learning systems. To solve this problem, the authors consider vector text models built from stylometric numerical features at the character, word, and sentence-structure levels. The resulting vectors are classified with standard machine learning classifiers, and the article reports results for the three most successful ones: Support Vector Classifier, Stochastic Gradient Descent Classifier, and Logistic Regression. Precision, recall, and the F-measure serve as the quality measures. Two open text corpora, CEFR Levelled English Texts and BEA-2019, are used in the experiments. For the six CEFR levels from A1 to C2, the best result among these classifiers is achieved by the Support Vector Classifier, with an F-score of 67% on CEFR Levelled English Texts. This approach is compared with the BERT language model in six different variants; the best of them, bert-base-cased, reaches an F-score of 69%. An analysis of the classification errors shows that most of them occur between neighboring levels, which is understandable from a domain standpoint. In addition, classification quality depends strongly on the text corpus: the same text models yield significantly different F-scores on different corpora. Overall, the results show that automatic text level determination is effective and can be applied in practice.
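The pipeline summarized in the abstract (stylometric features, quality measures, and the analysis of errors between neighboring levels) can be illustrated with a small standard-library Python sketch. The abstract does not list the actual features or tooling, so everything here is an assumption made for illustration: the four features, the function names `stylometric_features`, `macro_prf`, and `adjacent_error_share`, and the use of macro averaging are hypothetical and are not the authors' implementation.

```python
import re
from collections import Counter

# The six CEFR levels named in the abstract, ordered from lowest to highest.
CEFR_LEVELS = ["A1", "A2", "B1", "B2", "C1", "C2"]


def stylometric_features(text):
    """Return a few illustrative character-, word-, and sentence-level features.

    These four features are assumptions for demonstration only.
    """
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    n_chars, n_words = len(text), len(words)
    return {
        # Character level: share of punctuation marks among all characters.
        "punct_share": sum(c in ".,;:!?" for c in text) / n_chars if n_chars else 0.0,
        # Word level: mean word length and type-token ratio.
        "avg_word_len": sum(len(w) for w in words) / n_words if n_words else 0.0,
        "type_token_ratio": len({w.lower() for w in words}) / n_words if n_words else 0.0,
        # Sentence level: mean number of words per sentence.
        "avg_sentence_len": n_words / len(sentences) if sentences else 0.0,
    }


def macro_prf(y_true, y_pred, labels=CEFR_LEVELS):
    """Macro-averaged precision, recall, and F-measure over the given labels."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted label p wrongly
            fn[t] += 1  # missed true label t
    precisions, recalls, f_scores = [], [], []
    for label in labels:
        predicted = tp[label] + fp[label]
        actual = tp[label] + fn[label]
        prec = tp[label] / predicted if predicted else 0.0
        rec = tp[label] / actual if actual else 0.0
        f_scores.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
        precisions.append(prec)
        recalls.append(rec)
    n = len(labels)
    return sum(precisions) / n, sum(recalls) / n, sum(f_scores) / n


def adjacent_error_share(y_true, y_pred, levels=CEFR_LEVELS):
    """Share of misclassifications that land on a neighboring level."""
    index = {level: i for i, level in enumerate(levels)}
    errors = [(t, p) for t, p in zip(y_true, y_pred) if t != p]
    if not errors:
        return 0.0
    adjacent = sum(abs(index[t] - index[p]) == 1 for t, p in errors)
    return adjacent / len(errors)
```

In a full experiment, the feature dictionaries would be turned into numeric vectors and passed to a classifier; that step is omitted here because the abstract does not specify how it was done.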
About the journal:
Automatic Control and Computer Sciences is a peer-reviewed journal that publishes articles on:
• Control systems, cyber-physical systems, real-time systems, robotics, smart sensors, embedded intelligence
• Network information technologies, information security, statistical methods of data processing, distributed artificial intelligence, complex systems modeling, knowledge representation, processing and management
• Signal and image processing, machine learning, machine perception, computer vision