R. F. Mello, G. Fiorentino, Hilário Oliveira, P. Miranda, Mladen Raković, D. Gašević
LAK22: 12th International Learning Analytics and Knowledge Conference. Published 2022-03-21. DOI: 10.1145/3506860.3506977 (https://doi.org/10.1145/3506860.3506977)
Towards automated content analysis of rhetorical structure of written essays using sequential content-independent features in Portuguese
Brazilian universities have included essay writing assignments in the entrance examination procedure used to select prospective students. Essay scorers manually look for the presence of required Rhetorical Structure Theory (RST) categories and evaluate essay coherence. However, identifying RST categories is time-consuming. The literature has reported several attempts to automate the identification of RST categories in essays with machine learning. Still, previous studies have focused on machine learning algorithms trained on content-dependent features, which can diminish classification performance by leading to over-fitting and hindering model generalisability. Therefore, this paper proposes: (i) an analysis of state-of-the-art classifiers and content-independent features for the task of identifying RST rhetorical moves; (ii) a new approach that considers the sequence of the text when extracting features, i.e. sequential content-independent features; (iii) an empirical study of the generalisability of the machine learning models and sequential content-independent features in this context; (iv) the identification of the most predictive features for the automated identification of RST categories in essays written in Portuguese. The best-performing classifier, XGBoost, based on sequential content-independent features, outperformed the classifiers used in the literature that are based on traditional content-dependent features. The XGBoost classifier based on sequential content-independent features also reached promising accuracy when tested for generalisability.
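The abstract does not list the actual features, so the following is only a minimal sketch of what "sequential content-independent features" could look like in practice. It assumes each sentence is described by a few topic-agnostic measurements and is then augmented with the same measurements from its neighbouring sentences. The function names, the three features, and the `window` parameter are all hypothetical illustrations, not the authors' feature set.

```python
from typing import Dict, List


def sentence_features(sentence: str, index: int, total: int) -> Dict[str, float]:
    """Hypothetical content-independent features: none depends on topic vocabulary."""
    tokens = sentence.split()
    return {
        "n_tokens": float(len(tokens)),
        # Position of the sentence in the essay, scaled to [0, 1].
        "relative_position": index / (total - 1) if total > 1 else 0.0,
        "ends_with_question": 1.0 if sentence.rstrip().endswith("?") else 0.0,
    }


def sequential_features(sentences: List[str], window: int = 1) -> List[Dict[str, float]]:
    """Extend each sentence's features with those of its neighbours (assumed design)."""
    base = [sentence_features(s, i, len(sentences)) for i, s in enumerate(sentences)]
    rows = []
    for i, own in enumerate(base):
        row = dict(own)
        for offset in range(-window, window + 1):
            if offset == 0:
                continue
            j = i + offset
            prefix = f"prev{-offset}_" if offset < 0 else f"next{offset}_"
            for name in own:
                # Neighbours beyond the essay boundaries are padded with 0.0.
                row[prefix + name] = base[j][name] if 0 <= j < len(base) else 0.0
        rows.append(row)
    return rows


essay = [
    "A educação é importante.",
    "Por que investir nela?",
    "Portanto, o governo deve agir.",
]
rows = sequential_features(essay)
```

Each resulting row could then be vectorised and passed to a gradient-boosted classifier such as XGBoost, which the abstract names as the best performer. The sketch stops before training because the paper's data and hyperparameters are not given here.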