{"title":"当人工智能满足源使用:探索ChatGPT在第二语言摘要写作评估中的潜力","authors":"Haeyun Jin","doi":"10.1016/j.system.2025.103737","DOIUrl":null,"url":null,"abstract":"<div><div>Integrated writing assessments, which require students to incorporate source material into their writing, pose unique challenges for human raters. Understanding how AI tools like ChatGPT perform in assessing such tasks has become critical. This study investigates ChatGPT's ability to score L2 summary writing compared to human raters, focusing on the differences in rating results across various writing criteria, and their decision-making process in assessing source use. Using Many-Facet Rasch Measurement (MFRM) analysis, ratings of 90 student essays by GPT_original, GPT_calibrated, and two human raters were analyzed. Results indicated that GPT_original was the strictest rater overall, particularly in language-focused criteria. While GPT_calibrated aligned more closely with human raters, it still exhibited significant gaps in assessing a source-use-related criterion. Qualitative analyses of raters' think-aloud protocols revealed ChatGPT's detailed, rule-based approach to identifying source use strategies but also its lack of contextual flexibility, often misjudging legitimate paraphrasing attempts and over-relying on surface-level cues. These findings highlight ChatGPT's potential as a supplementary rating tool for L2 integrated writing while underscoring its limitations in addressing the developmental and contextual aspects of assessing source use. Implications point to the need for further refinement through L2-specific training to better align ChatGPT's judgments with human standards.</div></div>","PeriodicalId":48185,"journal":{"name":"System","volume":"133 ","pages":"Article 103737"},"PeriodicalIF":4.9000,"publicationDate":"2025-06-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"When AI meets source use: Exploring ChatGPT's potential in L2 summary writing assessment\",\"authors\":\"Haeyun Jin\",\"doi\":\"10.1016/j.system.2025.103737\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><div>Integrated writing assessments, which require students to incorporate source material into their writing, pose unique challenges for human raters. Understanding how AI tools like ChatGPT perform in assessing such tasks has become critical. This study investigates ChatGPT's ability to score L2 summary writing compared to human raters, focusing on the differences in rating results across various writing criteria, and their decision-making process in assessing source use. Using Many-Facet Rasch Measurement (MFRM) analysis, ratings of 90 student essays by GPT_original, GPT_calibrated, and two human raters were analyzed. Results indicated that GPT_original was the strictest rater overall, particularly in language-focused criteria. While GPT_calibrated aligned more closely with human raters, it still exhibited significant gaps in assessing a source-use-related criterion. Qualitative analyses of raters' think-aloud protocols revealed ChatGPT's detailed, rule-based approach to identifying source use strategies but also its lack of contextual flexibility, often misjudging legitimate paraphrasing attempts and over-relying on surface-level cues. These findings highlight ChatGPT's potential as a supplementary rating tool for L2 integrated writing while underscoring its limitations in addressing the developmental and contextual aspects of assessing source use. 
Implications point to the need for further refinement through L2-specific training to better align ChatGPT's judgments with human standards.</div></div>\",\"PeriodicalId\":48185,\"journal\":{\"name\":\"System\",\"volume\":\"133 \",\"pages\":\"Article 103737\"},\"PeriodicalIF\":4.9000,\"publicationDate\":\"2025-06-12\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"System\",\"FirstCategoryId\":\"98\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S0346251X25001472\",\"RegionNum\":1,\"RegionCategory\":\"文学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"EDUCATION & EDUCATIONAL RESEARCH\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"System","FirstCategoryId":"98","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0346251X25001472","RegionNum":1,"RegionCategory":"文学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"EDUCATION & EDUCATIONAL RESEARCH","Score":null,"Total":0}
Integrated writing assessments, which require students to incorporate source material into their writing, pose unique challenges for human raters. Understanding how AI tools like ChatGPT perform in assessing such tasks has become critical. This study compares ChatGPT's ability to score L2 summary writing with that of human raters, focusing on differences in rating results across writing criteria and on the raters' decision-making processes in assessing source use. Ratings of 90 student essays by GPT_original, GPT_calibrated, and two human raters were analyzed using Many-Facet Rasch Measurement (MFRM). Results indicated that GPT_original was the strictest rater overall, particularly on language-focused criteria. While GPT_calibrated aligned more closely with the human raters, it still exhibited significant gaps in assessing a source-use-related criterion. Qualitative analyses of raters' think-aloud protocols revealed ChatGPT's detailed, rule-based approach to identifying source use strategies, but also its lack of contextual flexibility: it often misjudged legitimate paraphrasing attempts and over-relied on surface-level cues. These findings highlight ChatGPT's potential as a supplementary rating tool for L2 integrated writing while underscoring its limitations in addressing the developmental and contextual aspects of assessing source use. Implications point to the need for further refinement through L2-specific training to better align ChatGPT's judgments with human standards.
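For readers unfamiliar with the method, the standard many-facet Rasch model (Linacre's formulation) expresses the log-odds of a rating falling in adjacent scale categories as an additive function of writer proficiency, criterion difficulty, and rater severity; the abstract does not specify the study's exact parameterization, so the version below is the conventional one:

```latex
% Standard many-facet Rasch model; the study's exact parameterization
% is not given in the abstract. P_{nijk} is the probability that
% writer n receives category k on criterion i from rater j.
\[
  \ln\!\left(\frac{P_{nijk}}{P_{nij(k-1)}}\right)
    = B_n - D_i - C_j - F_k
\]
% B_n: proficiency of writer n;  D_i: difficulty of criterion i;
% C_j: severity of rater j;      F_k: step difficulty of category k.
```

Under this model, the finding that GPT_original was the "strictest rater" corresponds to it receiving the largest estimated severity parameter C_j among the four raters. As a concrete illustration of the kind of LLM-based scoring pipeline the study evaluates, the minimal Python sketch below scores a student summary against a simple rubric via the OpenAI chat completions API. The model name, rubric criteria, and prompt wording are illustrative assumptions, not the study's actual configuration:

```python
# A minimal sketch of rubric-based summary scoring with a chat LLM.
# The rubric, prompt wording, and model name are assumptions for
# illustration; the paper's actual prompts, GPT version, and
# calibration procedure are not described in the abstract.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

RUBRIC = """Score the summary on a 1-5 scale for each criterion:
1. Content coverage of the source text
2. Source use (paraphrasing vs. verbatim copying)
3. Language use (grammar and vocabulary)
Return one line per criterion: "<criterion>: <score> - <one-sentence rationale>"."""

def score_summary(source_text: str, summary: str) -> str:
    """Ask the model to rate one summary against the rubric."""
    response = client.chat.completions.create(
        model="gpt-4o",  # assumed model name, not the study's
        temperature=0,   # reduce run-to-run variation in ratings
        messages=[
            {"role": "system",
             "content": "You are a rater of L2 summary writing."},
            {"role": "user",
             "content": (f"{RUBRIC}\n\nSource text:\n{source_text}"
                         f"\n\nStudent summary:\n{summary}")},
        ],
    )
    return response.choices[0].message.content
```

A calibrated variant along the lines of the paper's GPT_calibrated condition would presumably differ in the prompt (e.g., by including benchmark essays with agreed human scores), though the abstract does not detail how calibration was performed.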
Journal introduction:
This international journal is devoted to the applications of educational technology and applied linguistics to problems of foreign language teaching and learning. Attention is paid to all languages and to problems associated with the study and teaching of English as a second or foreign language. The journal serves as a vehicle of expression for colleagues in developing countries. System prefers contributors to provide articles that have a sound theoretical base and a visible practical application that can be generalized. The review section may take up works of a more theoretical nature to broaden the background.