Automated scoring in the era of artificial intelligence: An empirical study with Turkish essays

IF 5.6, Zone 1 (Literature), JCR Q1: Education & Educational Research

System. Pub Date: 2025-07-21. DOI: 10.1016/j.system.2025.103784

Burak Aydın, Tarık Kışla, Nursel Tan Elmas, Okan Bulut

System, volume 133, Article 103784. Full text: https://www.sciencedirect.com/science/article/pii/S0346251X25001940

Citations: 0

Abstract

Automated scoring (AS) has gained significant attention as a tool to enhance the efficiency and reliability of assessment processes. Yet, its application in under-represented languages, such as Turkish, remains limited. This study addresses this gap by empirically evaluating AS for Turkish using a zero-shot approach with a rubric powered by OpenAI's GPT-4o. A dataset of 590 essays written by learners of Turkish as a second language was scored by professional human raters and an artificial intelligence (AI) model integrated via a custom-built interface. The scoring rubric, grounded in the Common European Framework of Reference for Languages, assessed six dimensions of writing quality. Results revealed a strong alignment between human and AI scores with a Quadratic Weighted Kappa of 0.72, Pearson correlation of 0.73, and an overlap measure of 83.5 %. Analysis of rater effects showed minimal influence on score discrepancies, though factors such as experience and gender exhibited modest effects. These findings demonstrate the potential of AI-driven scoring in Turkish, offering valuable insights for broader implementation in under-represented languages, such as the possible source of disagreements between human and AI scores. Conclusions from a specific writing task with a single human rater underscore the need for future research to explore diverse inputs and multiple raters.
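The three agreement statistics named in the abstract can be illustrated with a minimal Python sketch. This is not the authors' code: the function names are hypothetical, the scores are made-up toy values rather than study data, and because the abstract does not define its "overlap measure", the sketch assumes a simple share of essays on which the two scores differ by at most a given tolerance.

```python
from collections import Counter
from math import sqrt


def quadratic_weighted_kappa(a, b, min_score, max_score):
    """Cohen's kappa with quadratic weights for two lists of integer scores."""
    n = len(a)
    k = max_score - min_score + 1          # number of score categories
    observed = Counter(zip(a, b))          # joint counts of (score_a, score_b)
    hist_a, hist_b = Counter(a), Counter(b)
    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(min_score, max_score + 1):
        for j in range(min_score, max_score + 1):
            weight = ((i - j) ** 2) / ((k - 1) ** 2)
            weighted_observed += weight * observed[(i, j)] / n
            # Expected proportion if the two raters scored independently.
            weighted_expected += weight * hist_a[i] * hist_b[j] / (n * n)
    return 1.0 - weighted_observed / weighted_expected


def pearson_r(a, b):
    """Pearson product-moment correlation between two score lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)


def agreement_within(a, b, tolerance=1):
    """ASSUMPTION: share of essays whose two scores differ by <= tolerance."""
    return sum(abs(x - y) <= tolerance for x, y in zip(a, b)) / len(a)


# Toy scores on a 1-5 scale (illustrative only, not from the study).
human_scores = [1, 2, 3, 4, 5, 3, 2, 4]
ai_scores = [1, 2, 3, 4, 4, 3, 3, 5]

print("QWK:", quadratic_weighted_kappa(human_scores, ai_scores, 1, 5))
print("Pearson r:", pearson_r(human_scores, ai_scores))
print("Agreement within 1 point:", agreement_within(human_scores, ai_scores))
```

Quadratic weights penalize a two-point disagreement four times as heavily as a one-point disagreement, which is why this form of kappa is commonly used for ordinal rubric scores.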

Source journal

System Multiple-

CiteScore: 8.80

Self-citation rate: 8.30%

Articles published: 202

Review time: 64 days

About the journal: This international journal is devoted to the applications of educational technology and applied linguistics to problems of foreign language teaching and learning. Attention is paid to all languages and to problems associated with the study and teaching of English as a second or foreign language. The journal serves as a vehicle of expression for colleagues in developing countries. System prefers its contributors to provide articles which have a sound theoretical base with a visible practical application which can be generalized. The review section may take up works of a more theoretical nature to broaden the background.