{"title":"应用多面 Rasch 测量法评估自动作文评分:以 ChatGPT-4.0 为例","authors":"Taichi Yamashita","doi":"10.1016/j.rmal.2024.100133","DOIUrl":null,"url":null,"abstract":"<div><p>Automated essay scoring (AES) has the great potential to reduce human raters’ workload while providing feedback for language learners. Given that scores from AES tools can impact stakeholders’ decision-making, independent researchers’ evaluation is essential. For this purpose, AES tools have been evaluated primarily in terms of their alignment with human raters by the use of correlation and agreement indices. The present study aimed to showcase the potential of many-facet Rasch measurement (MFRM) as another approach to evaluate ChatGPT-4.0 as an AES tool. Capitalizing on the International Corpus Network of Asian Learners of English (ICNALE), the study used 80 human raters’ ratings for 136 argumentative essays written by English language learners in Asian regions. Additional data were collected by asking ChatGPT-4.0 to assign scores for the 136 essays. It was found that ChatGPT-4.0 distinguished essays written by three proficiency groups on the CEFR scale recorded in the ICNALE as human raters did. Correlations between human raters’ ratings and ChatGPT-4.0′s ratings were moderate to strong (<em>r</em> = 0.67–.82), and only half of their ratings were identical. Furthermore, ChatGPT-4.0′s severity level was comparable with human raters’, and ChatGPT-4.0′s ratings were extremely consistent within itself, rendering it difficult to tease apart variance in its ratings from the measurement perspective. Neither human raters nor ChatGPT-4.0 exercised significant biases towards writers’ gender. These findings indicate the potential of ChatGPT-4.0 as an AES tool while highlighting the benefits of MFRM as an approach that complements correlation and agreement indices.</p></div>","PeriodicalId":101075,"journal":{"name":"Research Methods in Applied Linguistics","volume":"3 3","pages":"Article 100133"},"PeriodicalIF":0.0000,"publicationDate":"2024-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"An application of many-facet Rasch measurement to evaluate automated essay scoring: A case of ChatGPT-4.0\",\"authors\":\"Taichi Yamashita\",\"doi\":\"10.1016/j.rmal.2024.100133\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><p>Automated essay scoring (AES) has the great potential to reduce human raters’ workload while providing feedback for language learners. Given that scores from AES tools can impact stakeholders’ decision-making, independent researchers’ evaluation is essential. For this purpose, AES tools have been evaluated primarily in terms of their alignment with human raters by the use of correlation and agreement indices. The present study aimed to showcase the potential of many-facet Rasch measurement (MFRM) as another approach to evaluate ChatGPT-4.0 as an AES tool. Capitalizing on the International Corpus Network of Asian Learners of English (ICNALE), the study used 80 human raters’ ratings for 136 argumentative essays written by English language learners in Asian regions. Additional data were collected by asking ChatGPT-4.0 to assign scores for the 136 essays. It was found that ChatGPT-4.0 distinguished essays written by three proficiency groups on the CEFR scale recorded in the ICNALE as human raters did. 
Correlations between human raters’ ratings and ChatGPT-4.0′s ratings were moderate to strong (<em>r</em> = 0.67–.82), and only half of their ratings were identical. Furthermore, ChatGPT-4.0′s severity level was comparable with human raters’, and ChatGPT-4.0′s ratings were extremely consistent within itself, rendering it difficult to tease apart variance in its ratings from the measurement perspective. Neither human raters nor ChatGPT-4.0 exercised significant biases towards writers’ gender. These findings indicate the potential of ChatGPT-4.0 as an AES tool while highlighting the benefits of MFRM as an approach that complements correlation and agreement indices.</p></div>\",\"PeriodicalId\":101075,\"journal\":{\"name\":\"Research Methods in Applied Linguistics\",\"volume\":\"3 3\",\"pages\":\"Article 100133\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2024-06-27\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Research Methods in Applied Linguistics\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S2772766124000399\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Research Methods in Applied Linguistics","FirstCategoryId":"1085","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S2772766124000399","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
An application of many-facet Rasch measurement to evaluate automated essay scoring: A case of ChatGPT-4.0
Automated essay scoring (AES) has great potential to reduce human raters' workload while providing feedback for language learners. Given that scores from AES tools can affect stakeholders' decision-making, evaluation by independent researchers is essential. To date, AES tools have been evaluated primarily in terms of their alignment with human raters, using correlation and agreement indices. The present study aimed to showcase the potential of many-facet Rasch measurement (MFRM) as another approach, taking ChatGPT-4.0 as a case of an AES tool. Drawing on the International Corpus Network of Asian Learners of English (ICNALE), the study used 80 human raters' ratings of 136 argumentative essays written by English language learners in Asian regions. Additional data were collected by asking ChatGPT-4.0 to assign scores to the same 136 essays. ChatGPT-4.0 distinguished essays written by the three CEFR proficiency groups recorded in the ICNALE, as the human raters did. Correlations between the human raters' ratings and ChatGPT-4.0's ratings were moderate to strong (r = 0.67–0.82), yet only half of the ratings were identical. Furthermore, ChatGPT-4.0's severity was comparable to the human raters', and its ratings were extremely consistent internally, making it difficult to tease apart variance in its ratings from a measurement perspective. Neither the human raters nor ChatGPT-4.0 showed significant bias related to writers' gender. These findings indicate the potential of ChatGPT-4.0 as an AES tool while highlighting the benefits of MFRM as an approach that complements correlation and agreement indices.
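The abstract does not spell out the model behind MFRM; for orientation, the standard many-facet rating scale formulation (Linacre), in its simplest two-facet form with writers and raters, is \( \log(P_{njk}/P_{nj(k-1)}) = B_n - C_j - F_k \), where \(B_n\) is writer n's proficiency, \(C_j\) is rater j's severity, and \(F_k\) is the step difficulty of rating category k. The MFRM analysis itself is typically run in dedicated software such as Linacre's FACETS rather than computed by hand. By contrast, the correlation and exact-agreement indices that the abstract positions MFRM against are straightforward to compute; the sketch below uses hypothetical scores (not the study's data) purely to illustrate those two indices.

```python
# Illustrative sketch only: the scores below are hypothetical and are NOT the study's data.
# It shows how the correlation and exact-agreement indices mentioned in the abstract
# are typically computed for a human rater vs. an AES tool.
import numpy as np
from scipy.stats import pearsonr

# Hypothetical holistic band scores for eight essays.
human_ratings = np.array([4, 6, 7, 5, 8, 3, 6, 7])
chatgpt_ratings = np.array([4, 5, 7, 6, 8, 4, 6, 8])

r, p = pearsonr(human_ratings, chatgpt_ratings)               # correlation index
exact_agreement = np.mean(human_ratings == chatgpt_ratings)   # proportion of identical ratings

print(f"Pearson r = {r:.2f} (p = {p:.3f})")
print(f"Exact agreement = {exact_agreement:.0%}")
```

As the abstract notes, these indices summarize alignment between raters but do not separate rater severity, internal consistency, or bias, which is what the MFRM facets are designed to estimate.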