Assessing Writing, Volume 68, Article 101015 | Pub Date: 2026-04-01 (Epub 2026-01-23) | DOI: 10.1016/j.asw.2026.101015
Associations of adolescents’ argumentative writing scores and growth when evaluated by different human raters and artificial intelligence models
Deborah K. Reed, Sterett Mercer

Abstract: Having accurate and timely information on students’ argumentative writing skills is necessary to foster their improvement. Therefore, this study explored the associations among two different human rater types (researchers and teachers) and four different artificial intelligence (AI) models when using an analytic rubric to score eighth and ninth graders’ argumentative writing gathered at two time points. Results revealed that, at each time point, AI models and teacher raters generally assigned higher scores than researchers. AI model scores tended to show moderate agreement with researcher scores, generally exceeding the agreement between teacher and researcher scores. However, the AI models also tended to agree more with one another than with either human rater type. There were some differences by writing trait and by AI model, with ChatGPT-4o scores showing the strongest association with researcher scores, particularly for the task and development traits. When examining growth over time, the teacher and researcher scores demonstrated positive growth, whereas the AI scores generally suggested that students declined. Again, the AI models showed more concordance with each other than with human estimations of growth, though the growth estimates agreed slightly more when comparing researchers with AI than teachers with AI. Implications for data use and interpretation are discussed.
Assessing Writing, Volume 68, Article 101014 | Pub Date: 2026-04-01 (Epub 2026-01-21) | DOI: 10.1016/j.asw.2026.101014
From spelling to content: The influence of spelling quality on text assessment
Frederike Strahl, Jörg Kilian, Jens Möller

Abstract:
Background: Text assessments by different teachers can lack objectivity, with some text characteristics influencing the evaluation of others.
Aims: This research explored the halo effect of spelling quality on first-language text assessment and investigated whether a prompt could reduce this effect.
Samples: Study 1 involved 134 pre-service teachers and 83 in-service teachers. Study 2 included 130 in-service teachers, who were divided into three groups.
Methods: In Study 1, participants assessed six student texts of varying overall and spelling quality on holistic, content, style, and linguistic accuracy scales. Study 2 replicated Study 1 but additionally involved two prompting groups, emphasizing independent assessment of text characteristics.
Results: In both studies, texts with better spelling quality received higher ratings on both spelling-independent and spelling-dependent scales, demonstrating successful manipulation and halo effects of spelling quality on content and style assessments. Study 2 showed robust halo effects in first-language assessment, even after prompting.
Conclusion: The studies showed halo effects of spelling quality in text assessment that remained stable even when raters were prompted to reduce them. Future research should test more detailed interventions to reduce halo effects in text assessment.
Assessing Writing, Volume 68, Article 101040 | Pub Date: 2026-04-01 (Epub 2026-04-06) | DOI: 10.1016/j.asw.2026.101040
AI tools vs human in English narrative writing scoring: A comparative study of scoring agreement among human raters, DeepSeek and ChatGPT in exam context
Yan Liang, Yueh Yea Lo

Abstract: A growing body of literature is examining the agreement between artificial intelligence (AI) and humans in scoring. However, empirical findings remain inconclusive, ranging from strong correlations to large differences in the scores assigned by the two. In response, this mixed-methods study compared scoring agreement and evaluation rationales across three sources: ChatGPT-4, DeepSeek-R1, and human ratings from a city-level standardised English narrative writing examination (n = 212). Quantitatively, the findings revealed strong relative agreement between the two AI models, while only moderate alignment was observed between each AI tool and the human ratings. Qualitative analyses of a focus group interview and scoring rationales indicated that these discrepancies reflect systematic differences in evaluation orientation. Human raters demonstrated more flexibility and sensitivity to contextual and presentation-related factors in evaluating writing quality, whereas the AI tools prioritised linguistic accuracy and structural regularity and remained unaffected by surface features such as handwriting or erasures. Rather than suggesting simple scoring errors, the results reveal systematic differences between human and AI judgment, with implications for the principled integration of AI into high-stakes assessment.
Assessing Writing, Volume 68, Article 101013 | Pub Date: 2026-04-01 (Epub 2026-01-20) | DOI: 10.1016/j.asw.2026.101013
Conceptual, rhetorical and linguistic transformations: Assessing L2 literature review writing using simulated tasks
Haiying Feng, Lawrence Jun Zhang, Kexin Li

Abstract: This research addresses a critical gap in assessing literature review (LR) writing among L2 learner writers in terms of knowledge transformation. By integrating three strands of literature (genre analysis, source-based writing, and discourse synthesis), this study develops a framework that conceptualizes knowledge transformation along three dimensions: conceptual, rhetorical, and linguistic. It examines the knowledge-transforming performance of 125 Chinese students across bachelor’s, master’s, and doctoral levels through two simulated LR writing tasks, and explores the influence of factors including language proficiency and prior genre knowledge on their performance. Furthermore, it reveals the intricate interrelationships among six transformation indicators (i.e., content units included, accuracy of information representation, extent of source integration, academic critiques and niche construction, verbatim copying, and lexico-grammatical errors). The results of the textual and statistical analyses offer pedagogical implications, including enhancing compare-and-contrast practice, cultivating awareness of niche construction, and reducing L2 writers’ cognitive load in linguistic transformation.
Assessing Writing, Volume 68, Article 101041 | Pub Date: 2026-04-01 (Epub 2026-03-30) | DOI: 10.1016/j.asw.2026.101041
Generative artificial intelligence for automated writing evaluation: A systematic review of trends, efficacy, and challenges
Shadi I. Abudalfa, Jessie S. Barrot

Abstract: Generative artificial intelligence (GenAI) has been increasingly integrated into education, but its application in automated writing evaluation (AWE) remains underexplored. The absence of a comprehensive synthesis leaves unresolved questions about its effectiveness, reliability, and ethical implications in high-stakes contexts. This study addresses that gap through a systematic review of 96 empirical works on GenAI in AWE, consolidating evidence on efficacy, challenges, and pedagogical value. Following PRISMA guidelines and guided by the CIMO framework, the review combined quantitative descriptive mapping with qualitative thematic synthesis to evaluate interventions, mechanisms, and outcomes. The analysis shows that GenAI systems perform efficiently in grammar correction, coherence, and rubric-based tasks while remaining inconsistent in assessing higher-order skills such as argumentation and creativity. Findings also reveal conditional effectiveness, with outcomes shaped by context, task design, and prompting strategies, alongside mixed impacts on learner motivation and teacher workload. Although GenAI systems align with human raters on surface-level accuracy, unresolved concerns regarding validity, equity, and academic integrity limit their readiness for high-stakes assessment. The review concludes that GenAI in AWE represents both promise and risk, requiring careful integration and human oversight. These insights carry significant implications for theory, pedagogy, and methodology.
Assessing Writing, Volume 68, Article 101042 | Pub Date: 2026-04-01 (Epub 2026-03-31) | DOI: 10.1016/j.asw.2026.101042
Task complexity, collaborative writing, and learner engagement: Examining second language learners’ writing performance
Syed Muhammad Mujtaba, Tiefu Zhang, Natalia Sletova

Abstract: Collaboration and learner engagement have been found to positively influence second language (L2) writing development. However, few studies have examined how tasks of differing cognitive demands, such as narrative (simpler) versus argumentative (more complex), affect collaborative writing performance and learner engagement. To address this gap, two intact undergraduate classes in an EAP course were assigned to either the individual or the collaborative writing condition. Participants completed narrative and argumentative tasks, with writing performance measured through syntactic complexity, accuracy, lexical complexity, and fluency (CALF), as well as cognitive and behavioural engagement. The collaborative condition consistently outperformed the individual condition in accuracy and fluency across tasks. Within both conditions, the complex task was associated with higher fluency, while accuracy was higher in the simple task. Task complexity had no significant effect on syntactic or lexical complexity. Additionally, task complexity did not significantly alter cognitive or behavioural engagement in the individual condition; however, it significantly increased behavioural engagement, measured as time on task, in the collaborative condition. The findings support the pedagogical value of integrating collaborative writing tasks with different cognitive demands in L2 instruction.
Assessing Writing, Volume 67, Article 100995 | Pub Date: 2026-01-01 (Epub 2025-12-05) | DOI: 10.1016/j.asw.2025.100995
The relation between linguistic accuracy and scoring of Swedish EFL students’ writing during a high-stakes exam
Christian Holmberg Sjöling

Abstract: This paper examines the effect of linguistic accuracy (i.e., the absence of form, grammatical, and lexical errors) on scoring during the high-stakes national test of English in Swedish upper secondary school. Teachers are expected to score their own students’ texts with the help of assessment instructions containing benchmark texts (i.e., texts representing different score bands). The assessment instructions and the score bands provided to guide scoring are not explicit about how accuracy should influence scores. Two research questions were addressed: To what extent does linguistic accuracy predict rater scores, as measured by ordinal regression? Do the texts scored by teachers reflect the graded example texts in terms of how linguistic accuracy predicts scores? The results revealed, amongst other things, that the overall frequency of errors in texts significantly predicted scores, with the model explaining approximately 58% of the variance in the outcome variable according to Nagelkerke’s pseudo R-squared. Accuracy also had a similar effect on scores in the texts rated by teachers as in the benchmark texts. In relation to the findings, it was concluded that accuracy may have more of an impact on scores than constructs that are more explicit components of the score bands, such as lexical complexity.
Assessing Writing, Volume 67, Article 100992 | Pub Date: 2026-01-01 (Epub 2025-11-29) | DOI: 10.1016/j.asw.2025.100992
How reliable and valid is peer evaluation in adolescents’ L2 argumentative writing?
Albert W. Li, Steve Graham

Abstract: Peer evaluation is widely recognized for its educational benefits; however, its reliability and validity, particularly among adolescent second-language (L2) writers at the early stages of English language and literacy development, remain insufficiently explored. This explanatory sequential mixed-methods study investigated the reliability and validity of peer evaluation in English argumentative writing among 35 Grade 10 and 37 Grade 12 students from a public high school in Beijing, China. Twelve of the participating students (six at each grade) were interviewed about the validity, reliability, and value of peer evaluation. The findings indicated that peer evaluations demonstrated high levels of reliability and validity, with peer-assessed writing scores closely aligning with inter-teacher assessments. Notably, variations were observed among Grade 10 students, particularly in the evaluation of lower-order writing skills, such as grammar and vocabulary, which exhibited reduced validity. These results underscore the potential of peer evaluation in assessing higher-order content-level writing across varying levels of L2 English writing proficiency. The study also highlights areas where adolescent L2 writers may require additional support to enhance the effectiveness of peer evaluation practices in English argumentative writing. Implications for improving English argumentative writing instruction and refining peer evaluation strategies in high school L2 English classrooms are discussed.
Assessing Writing, Volume 67, Article 101011 | Pub Date: 2026-01-01 (Epub 2026-01-19) | DOI: 10.1016/j.asw.2025.101011
Extracting interpretable writing traits from a large language model
Paul Deane, Andrew Hoang

Abstract: Large language models (LLMs) are increasingly used to support automated writing evaluation (AWE), both for scoring and for feedback. However, LLMs present challenges to interpretability, making it hard to evaluate the construct validity of scoring and feedback models. BIOT (best interpretable orthogonal transformations) is a new method of analysis that makes the dimensions of an embedding interpretable by aligning them with external predictors. It was originally developed to improve the interpretability of multidimensional scaling models. This paper shows, however, that BIOT can be used to align LLM embeddings with an interpretable writing trait model developed using multidimensional analysis of classical NLP features to measure latent dimensions of writing style and writing quality. This makes it possible to determine whether an AWE model built using an LLM is aligned with known (and construct-relevant) dimensions of textual variation, supporting construct validity. Specifically, we examine the alignment between the hidden layers of DeBERTa, a small LLM that has been shown to be useful for a variety of natural language processing applications, and a writing trait model developed through factor analysis of classical features used in existing AWE models. Specific dimensions of the transformed DeBERTa layers are strongly correlated with these classical factors. When the transformation matrix derived using BIOT is applied to token vectors, it is also possible to visualize which tokens in the original text contributed to high or low scores on a specific dimension.
Assessing Writing, Volume 67, Article 100996 | Pub Date: 2026-01-01 (Epub 2025-12-03) | DOI: 10.1016/j.asw.2025.100996
Assessing EFL students’ GenAI-assisted writing: Teachers’ pains, perceptions and practices
Xiaoxiao Chen, Xiaojun Pi

Abstract: As generative artificial intelligence (GenAI) becomes increasingly embedded in EFL students’ writing practices, it has posed profound challenges to traditional assessment systems, calling into question the validity of established rubrics, the reliability of teacher judgments, and the authenticity of student performance. Despite this, limited research has explored how EFL teachers perceive and respond to GenAI-assisted writing. This qualitative case study investigates four EFL writing teachers’ assessment behaviors and reasoning across two stages of blind and informed evaluation of GenAI-assisted student texts. Drawing on think-aloud protocols and interviews, the study reveals that while the EFL writing teachers struggled to identify GenAI-generated content without prior knowledge, some exhibited noticeable adjustments in their scoring behaviors and assessment criteria once informed of GenAI involvement, whereas others remained relatively consistent. These changes were closely tied to their GenAI knowledge and experience, as well as their underlying pedagogical beliefs. The findings highlight divergent assessment strategies, ranging from product-oriented scoring to critical scrutiny of student-AI interaction, and underscore the salience of teacher agency in shaping responses to technological disruption. By uncovering the teachers’ cognitive processes, dilemmas, and reflective practices, the study contributes to a deeper understanding of assessment literacy in AI-mediated learning environments and offers practical implications for rubric adaptation, teacher professional development, and policy support.