Text Classification by CEFR Levels Using Machine Learning Methods and the BERT Language Model

IF 0.6 Q4 AUTOMATION & CONTROL SYSTEMS
N. S. Lagutina, K. V. Lagutina, A. M. Brederman, N. N. Kasatkina
Automatic Control and Computer Sciences, vol. 58, no. 7, pp. 869-878. Published 2025-02-12. DOI: 10.3103/S0146411624700329
https://link.springer.com/article/10.3103/S0146411624700329

Abstract

This paper studies the automatic classification of short coherent English texts (essays) by level of the international CEFR scale. Determining the level of a natural-language text is an important part of assessing a student's knowledge, including the checking of open-ended tasks in e-learning systems. To solve this problem, vector text models are considered that are based on stylometric numerical features at the character, word, and sentence-structure levels. The resulting vectors are classified with standard machine learning classifiers; the article reports results for the three most successful ones: Support Vector Classifier, Stochastic Gradient Descent Classifier, and Logistic Regression. Precision, recall, and the F-measure serve as the quality measures. Two open text corpora, CEFR Levelled English Texts and BEA-2019, are used in the experiments. For the six CEFR levels and sublevels from A1 to C2, the best result is achieved by the Support Vector Classifier, with an F-score of 67% on CEFR Levelled English Texts. This approach is compared with six variants of the BERT language model; the best of them, bert-base-cased, reaches an F-score of 69%. An analysis of the classification errors shows that most of them occur between neighboring levels, which is understandable from the point of view of the domain. In addition, classification quality depends strongly on the text corpus: the same text models yield significantly different F-scores on different corpora. Overall, the results show that automatic text level determination is effective and can be applied in practice.
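The quality measures and the neighboring-level error analysis mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' evaluation code: macro-averaging over classes is an assumption (the abstract does not say how the scores are averaged), and the names `LEVELS`, `macro_scores`, and `neighbor_error_share` are hypothetical. Every label passed in is assumed to be one of the six entries in `LEVELS`.

```python
from collections import Counter

# Ordered CEFR levels; the ordering is what makes "neighboring" meaningful.
LEVELS = ["A1", "A2", "B1", "B2", "C1", "C2"]


def macro_scores(y_true, y_pred, labels=LEVELS):
    """Return macro-averaged (precision, recall, F1) over the given labels.

    Macro-averaging is an assumption of this sketch, not a detail
    confirmed by the abstract.
    """
    tp, fp, fn = Counter(), Counter(), Counter()
    for true, pred in zip(y_true, y_pred):
        if true == pred:
            tp[true] += 1
        else:
            fp[pred] += 1  # predicted level gets a false positive
            fn[true] += 1  # true level gets a false negative

    precisions, recalls, f1s = [], [], []
    for label in labels:
        p_den = tp[label] + fp[label]
        r_den = tp[label] + fn[label]
        p = tp[label] / p_den if p_den else 0.0
        r = tp[label] / r_den if r_den else 0.0
        f = 2 * p * r / (p + r) if (p + r) else 0.0
        precisions.append(p)
        recalls.append(r)
        f1s.append(f)

    n = len(labels)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n


def neighbor_error_share(y_true, y_pred, labels=LEVELS):
    """Fraction of misclassifications that fall on an adjacent level."""
    index = {label: i for i, label in enumerate(labels)}
    errors = [(t, p) for t, p in zip(y_true, y_pred) if t != p]
    if not errors:
        return 0.0
    adjacent = sum(1 for t, p in errors if abs(index[t] - index[p]) == 1)
    return adjacent / len(errors)
```

A high value of `neighbor_error_share` would correspond to the abstract's observation that most errors confuse adjacent levels (for example, B1 with B2) rather than distant ones.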

Source journal: Automatic Control and Computer Sciences (Automation & Control Systems)
CiteScore: 1.70
Self-citation rate: 22.20%
Articles published: 47
Journal description: Automatic Control and Computer Sciences is a peer-reviewed journal that publishes articles on:
• Control systems, cyber-physical systems, real-time systems, robotics, smart sensors, embedded intelligence
• Network information technologies, information security, statistical methods of data processing, distributed artificial intelligence, complex systems modeling, knowledge representation, processing and management
• Signal and image processing, machine learning, machine perception, computer vision