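Towards automated content analysis of rhetorical structure of written essays using sequential content-independent features in Portuguese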

R. F. Mello, G. Fiorentino, Hilário Oliveira, P. Miranda, Mladen Raković, D. Gašević
DOI: 10.1145/3506860.3506977 (https://doi.org/10.1145/3506860.3506977)
Published in: LAK22: 12th International Learning Analytics and Knowledge Conference
Publication date: 2022-03-21
Source: Semantic Scholar
Citations: 4

Abstract

Brazilian universities have included essay writing assignments in the entrance examination procedure to select prospective students. The essay scorers manually look for the presence of required Rhetorical Structure Theory (RST) categories and evaluate essay coherence. However, identifying RST categories is a time-consuming task. The literature has reported several attempts to automate the identification of RST categories in essays with machine learning. Still, previous studies have focused on machine learning algorithms trained on content-dependent features, which can diminish classification performance by leading to over-fitting and hindering model generalisability. Therefore, this paper proposes: (i) an analysis of state-of-the-art classifiers and content-independent features for the task of identifying RST rhetorical moves; (ii) a new approach that considers the sequence of the text when extracting features, i.e. sequential content-independent features; (iii) an empirical study of the generalisability of the machine learning models and sequential content-independent features in this context; (iv) the identification of the most predictive features for automated identification of RST categories in essays written in Portuguese. The best-performing classifier, XGBoost, based on sequential content-independent features, outperformed the classifiers used in the literature, which are based on traditional content-dependent features. The XGBoost classifier based on sequential content-independent features also reached promising accuracy when tested for generalisability.
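The abstract does not list the actual features, so the following is only a minimal sketch of what "sequential content-independent features" could look like, under stated assumptions. Each sentence is described by measures that ignore topic vocabulary (length, relative position, punctuation), and each row is then extended with the same measures from its neighbouring sentences. The function names, the four example features, and the `window` parameter are all hypothetical and are not taken from the paper.

```python
from typing import Dict, List


def sentence_features(sentence: str, index: int, total: int) -> Dict[str, float]:
    """Hypothetical content-independent features: none depends on topic words."""
    tokens = sentence.split()
    return {
        "n_tokens": float(len(tokens)),
        # 0.0 for the first sentence, 1.0 for the last one.
        "rel_position": index / max(total - 1, 1),
        "n_commas": float(sentence.count(",")),
        "ends_with_question": float(sentence.rstrip().endswith("?")),
    }


def sequential_features(sentences: List[str], window: int = 1) -> List[Dict[str, float]]:
    """Extend each sentence's features with those of its neighbours.

    Assumption: the 'sequence' is modelled as a fixed window of previous and
    next sentences; positions outside the essay are padded with zeros.
    """
    total = len(sentences)
    base = [sentence_features(s, i, total) for i, s in enumerate(sentences)]
    padding = {name: 0.0 for name in base[0]} if base else {}
    rows = []
    for i in range(total):
        row = dict(base[i])
        for offset in range(-window, window + 1):
            if offset == 0:
                continue
            j = i + offset
            neighbour = base[j] if 0 <= j < total else padding
            prefix = f"prev{-offset}" if offset < 0 else f"next{offset}"
            for name, value in neighbour.items():
                row[f"{prefix}_{name}"] = value
        rows.append(row)
    return rows


# Illustrative three-sentence essay in Portuguese (made up for this sketch).
essay = [
    "A educação transforma a sociedade.",
    "Porém, o acesso ainda é desigual, caro e limitado.",
    "Como mudar esse quadro?",
]
rows = sequential_features(essay)
```

Each row could then be turned into a numeric vector and passed to a gradient-boosted classifier such as XGBoost, the best-performing model reported in the abstract. Because no feature encodes the essay's topic, the same extractor applies unchanged to essays on other prompts, which is the generalisability argument the abstract makes.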