Text classification by CEFR levels using machine learning methods and BERT language model

Nadezhda S. Lagutina, Ksenia V. Lagutina, Anastasya M. Brederman, Natalia N. Kasatkina
{"title":"Text classification by CEFR levels using machine learning methods and BERT language model","authors":"Nadezhda S. Lagutina, Ksenia V. Lagutina, Anastasya M. Brederman, Natalia N. Kasatkina","doi":"10.18255/1818-1015-2023-3-202-213","DOIUrl":null,"url":null,"abstract":"This paper presents a study of the problem of automatic classification of short coherent texts (essays) in English according to the levels of the international CEFR scale. Determining the level of text in natural language is an important component of assessing students knowledge, including checking open tasks in e-learning systems. To solve this problem, vector text models were considered based on stylometric numerical features of the character, word, sentence structure levels. The classification of the obtained vectors was carried out by standard machine learning classifiers. The article presents the results of the three most successful ones: Support Vector Classifier, Stochastic Gradient Descent Classifier, LogisticRegression. Precision, recall and F-score served as quality measures. Two open text corpora, CEFR Levelled English Texts and BEA-2019, were chosen for the experiments. The best classification results for six CEFR levels and sublevels from A1 to C2 were shown by the Support Vector Classifier with F-score 67 % for the CEFR Levelled English Texts. This approach was compared with the application of the BERT language model (six different variants). The best model, bert-base-cased, provided the F-score value of 69 %. The analysis of classification errors showed that most of them are between neighboring levels, which is quite understandable from the point of view of the domain. In addition, the quality of classification strongly depended on the text corpus, that demonstrated a significant difference in F-scores during application of the same text models for different corpora. In general, the obtained results showed the effectiveness of automatic text level detection and the possibility of its practical application.","PeriodicalId":31017,"journal":{"name":"Modelirovanie i Analiz Informacionnyh Sistem","volume":"11 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2023-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Modelirovanie i Analiz Informacionnyh Sistem","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.18255/1818-1015-2023-3-202-213","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

This paper presents a study of the automatic classification of short coherent texts (essays) in English according to the levels of the international CEFR scale. Determining the level of a natural-language text is an important component of assessing students' knowledge, including checking open-ended tasks in e-learning systems. To solve this problem, vector text models were built from stylometric numerical features at the character, word, and sentence-structure levels. The resulting vectors were classified with standard machine learning classifiers; the article presents the results of the three most successful ones: Support Vector Classifier, Stochastic Gradient Descent Classifier, and Logistic Regression. Precision, recall, and F-score served as quality measures. Two open text corpora, CEFR Levelled English Texts and BEA-2019, were chosen for the experiments. The best classification results for the six CEFR levels and sublevels from A1 to C2 were shown by the Support Vector Classifier, with an F-score of 67% on the CEFR Levelled English Texts. This approach was compared with the application of the BERT language model (six different variants). The best model, bert-base-cased, achieved an F-score of 69%. The analysis of classification errors showed that most of them occur between neighboring levels, which is quite understandable from the point of view of the domain. In addition, classification quality strongly depended on the text corpus: applying the same text models to different corpora produced significantly different F-scores. Overall, the results demonstrate the effectiveness of automatic text level detection and the possibility of its practical application.
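The classical pipeline described in the abstract (stylometric feature vectors classified by standard machine learning models) could look roughly like the sketch below. This is a minimal illustration, not the authors' code: the particular features, the corpus loading, and the train/test split are assumptions; only the three classifier types and the precision/recall/F-score evaluation come from the abstract.

```python
# Minimal sketch (assumptions labeled): stylometric features at the character,
# word, and sentence levels, classified with the three models named in the abstract.
import numpy as np
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def stylometric_features(text: str) -> list[float]:
    """Toy character-, word-, and sentence-level numeric features (illustrative only)."""
    words = text.split()
    sentences = [s for s in text.replace("!", ".").replace("?", ".").split(".") if s.strip()]
    n_chars, n_words, n_sents = len(text), max(len(words), 1), max(len(sentences), 1)
    return [
        n_chars / n_words,                              # mean word length (character level)
        n_words / n_sents,                              # mean sentence length (sentence level)
        len({w.lower() for w in words}) / n_words,      # type-token ratio (word level)
        sum(c in ",;:" for c in text) / n_sents,        # punctuation density per sentence
    ]

def evaluate(texts, levels):
    """texts: list of essays, levels: CEFR labels 'A1'..'C2' (corpus loading not shown)."""
    X = np.array([stylometric_features(t) for t in texts])
    X_train, X_test, y_train, y_test = train_test_split(
        X, levels, test_size=0.2, stratify=levels, random_state=42)
    for clf in (SVC(), SGDClassifier(), LogisticRegression(max_iter=1000)):
        model = make_pipeline(StandardScaler(), clf)
        model.fit(X_train, y_train)
        print(type(clf).__name__)
        print(classification_report(y_test, model.predict(X_test)))  # precision, recall, F-score
```

For the BERT comparison, a hedged sketch of fine-tuning the bert-base-cased checkpoint (reported in the abstract as the best of six variants) for six-class CEFR classification might use the Hugging Face Trainer API as below; the hyperparameters and data handling are illustrative assumptions, not the paper's setup.

```python
# Minimal sketch (assumptions: Hugging Face transformers/datasets installed,
# texts and CEFR labels already loaded into Python lists).
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

LEVELS = ["A1", "A2", "B1", "B2", "C1", "C2"]

def finetune(texts, levels):
    tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
    model = AutoModelForSequenceClassification.from_pretrained(
        "bert-base-cased", num_labels=len(LEVELS))

    ds = Dataset.from_dict({"text": texts,
                            "label": [LEVELS.index(lv) for lv in levels]})
    ds = ds.map(lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
                batched=True)
    split = ds.train_test_split(test_size=0.2, seed=42)

    trainer = Trainer(
        model=model,
        args=TrainingArguments(output_dir="cefr-bert",
                               num_train_epochs=3,
                               per_device_train_batch_size=8),
        train_dataset=split["train"],
        eval_dataset=split["test"],
        tokenizer=tokenizer,  # enables default dynamic padding of batches
    )
    trainer.train()
    return trainer
```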