{"title":"大型语言模型能判断孩子的陈述吗?: ChatGPT与人类专家在可信度评估中的比较分析。","authors":"Zeki Karataş","doi":"10.1080/26408066.2025.2547211","DOIUrl":null,"url":null,"abstract":"<p><strong>Purpose: </strong>This study investigates the inter-rater reliability between human experts (a forensic psychologist and a social worker) and a large language model (LLM) in the assessment of child sexual abuse statements. The research aims to explore the potential, limitations, and consistency of this class of AI as an evaluation tool within the framework of Criteria-Based Content Analysis (CBCA), a widely used method for assessing statement credibility.</p><p><strong>Materials and methods: </strong>Sixty-five anonymized transcripts of forensic interviews with child sexual abuse victims (<i>N</i> = 65) were independently evaluated by three raters: a forensic psychologist, a social worker, and a large language model (ChatGPT, GPT-4o Plus). Each statement was coded using the 19-item CBCA framework. Inter-rater reliability was analyzed using Intraclass Correlation Coefficient (ICC), Cohen's Kappa (κ), and other agreement statistics to compare the judgments between the human-human pairing and the human-AI pairings.</p><p><strong>Results: </strong>A high degree of inter-rater reliability was found between the two human experts, with the majority of criteria showing \"good\" to \"excellent\" agreement (15 of 19 criteria with ICC > .75). In stark contrast, a dramatic and significant decrease in reliability was observed when the AI model's evaluations were compared with those of the human experts. The AI demonstrated systematic disagreement on criteria requiring nuanced, contextual judgment, with reliability coefficients frequently falling into \"poor\" or negative ranges (e.g. ICC = -.106 for \"Logical structure\"), indicating its evaluation logic fundamentally differs from expert reasoning.</p><p><strong>Discussion: </strong>The findings reveal a profound gap between the nuanced, contextual reasoning of human experts and the pattern-recognition capabilities of the LLM tested. The study concludes that this type of AI, in its current, prompt-engineered form, cannot reliably replicate expert judgment in the complex task of credibility assessment. While not a viable autonomous evaluator, it may hold potential as a \"cognitive assistant\" to support expert workflows. The assessment of child testimony credibility remains a task that deeply requires professional judgment and appears far beyond the current capabilities of such generative AI models.</p>","PeriodicalId":73742,"journal":{"name":"Journal of evidence-based social work (2019)","volume":" ","pages":"1-16"},"PeriodicalIF":1.4000,"publicationDate":"2025-08-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Can a Large Language Model Judge a Child's Statement?: A Comparative Analysis of ChatGPT and Human Experts in Credibility Assessment.\",\"authors\":\"Zeki Karataş\",\"doi\":\"10.1080/26408066.2025.2547211\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p><strong>Purpose: </strong>This study investigates the inter-rater reliability between human experts (a forensic psychologist and a social worker) and a large language model (LLM) in the assessment of child sexual abuse statements. 
The research aims to explore the potential, limitations, and consistency of this class of AI as an evaluation tool within the framework of Criteria-Based Content Analysis (CBCA), a widely used method for assessing statement credibility.</p><p><strong>Materials and methods: </strong>Sixty-five anonymized transcripts of forensic interviews with child sexual abuse victims (<i>N</i> = 65) were independently evaluated by three raters: a forensic psychologist, a social worker, and a large language model (ChatGPT, GPT-4o Plus). Each statement was coded using the 19-item CBCA framework. Inter-rater reliability was analyzed using Intraclass Correlation Coefficient (ICC), Cohen's Kappa (κ), and other agreement statistics to compare the judgments between the human-human pairing and the human-AI pairings.</p><p><strong>Results: </strong>A high degree of inter-rater reliability was found between the two human experts, with the majority of criteria showing \\\"good\\\" to \\\"excellent\\\" agreement (15 of 19 criteria with ICC > .75). In stark contrast, a dramatic and significant decrease in reliability was observed when the AI model's evaluations were compared with those of the human experts. The AI demonstrated systematic disagreement on criteria requiring nuanced, contextual judgment, with reliability coefficients frequently falling into \\\"poor\\\" or negative ranges (e.g. ICC = -.106 for \\\"Logical structure\\\"), indicating its evaluation logic fundamentally differs from expert reasoning.</p><p><strong>Discussion: </strong>The findings reveal a profound gap between the nuanced, contextual reasoning of human experts and the pattern-recognition capabilities of the LLM tested. The study concludes that this type of AI, in its current, prompt-engineered form, cannot reliably replicate expert judgment in the complex task of credibility assessment. While not a viable autonomous evaluator, it may hold potential as a \\\"cognitive assistant\\\" to support expert workflows. The assessment of child testimony credibility remains a task that deeply requires professional judgment and appears far beyond the current capabilities of such generative AI models.</p>\",\"PeriodicalId\":73742,\"journal\":{\"name\":\"Journal of evidence-based social work (2019)\",\"volume\":\" \",\"pages\":\"1-16\"},\"PeriodicalIF\":1.4000,\"publicationDate\":\"2025-08-11\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Journal of evidence-based social work (2019)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1080/26408066.2025.2547211\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of evidence-based social work (2019)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1080/26408066.2025.2547211","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Can a Large Language Model Judge a Child's Statement?: A Comparative Analysis of ChatGPT and Human Experts in Credibility Assessment.
Purpose: This study investigates the inter-rater reliability between human experts (a forensic psychologist and a social worker) and a large language model (LLM) in the assessment of child sexual abuse statements. The research aims to explore the potential, limitations, and consistency of this class of AI as an evaluation tool within the framework of Criteria-Based Content Analysis (CBCA), a widely used method for assessing statement credibility.
Materials and methods: Sixty-five anonymized transcripts of forensic interviews with child sexual abuse victims (N = 65) were independently evaluated by three raters: a forensic psychologist, a social worker, and a large language model (ChatGPT, GPT-4o Plus). Each statement was coded using the 19-item CBCA framework. Inter-rater reliability was analyzed using Intraclass Correlation Coefficient (ICC), Cohen's Kappa (κ), and other agreement statistics to compare the judgments between the human-human pairing and the human-AI pairings.
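A minimal sketch of how such pairwise agreement statistics can be computed is shown below. This is not the authors' analysis code: the rating vectors are hypothetical, and the use of the pandas, pingouin, and scikit-learn libraries is an assumption made for illustration.

```python
# Illustrative only: hypothetical 0/1 codings of a single CBCA criterion for
# eight statements by three raters (two humans, one LLM). The study data are
# not public, and these values are invented for the example.
import pandas as pd
import pingouin as pg
from sklearn.metrics import cohen_kappa_score

ratings = pd.DataFrame({
    "statement": list(range(1, 9)) * 3,
    "rater": ["psychologist"] * 8 + ["social_worker"] * 8 + ["llm"] * 8,
    "score": [1, 1, 0, 1, 0, 1, 1, 0,   # forensic psychologist
              1, 1, 0, 1, 0, 1, 1, 0,   # social worker (agrees closely)
              0, 1, 1, 0, 0, 1, 0, 1],  # LLM (diverges systematically)
})

# Cohen's kappa for a single rater pair (here, one human-AI pairing)
human = ratings.loc[ratings.rater == "psychologist", "score"].to_numpy()
llm = ratings.loc[ratings.rater == "llm", "score"].to_numpy()
print("kappa, psychologist vs. LLM:", cohen_kappa_score(human, llm))

# Two-way intraclass correlation across all three raters for this criterion
icc = pg.intraclass_corr(data=ratings, targets="statement",
                         raters="rater", ratings="score")
print(icc[["Type", "ICC", "CI95%"]])
```

In a design like the one described, these statistics would be computed separately for each of the 19 CBCA criteria and for each rater pairing.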
Results: A high degree of inter-rater reliability was found between the two human experts, with the majority of criteria showing "good" to "excellent" agreement (15 of 19 criteria with ICC > .75). In stark contrast, a dramatic and significant decrease in reliability was observed when the AI model's evaluations were compared with those of the human experts. The AI demonstrated systematic disagreement on criteria requiring nuanced, contextual judgment, with reliability coefficients frequently falling into "poor" or negative ranges (e.g. ICC = -.106 for "Logical structure"), indicating its evaluation logic fundamentally differs from expert reasoning.
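The qualitative labels ("poor," "good," "excellent") correspond to the cut-offs conventionally used for ICC values (often attributed to Koo and Li, 2016); whether the paper applies exactly these thresholds is an assumption inferred from the "ICC > .75" wording. A small illustrative helper:

```python
# Illustrative mapping from an ICC value to conventional qualitative bands
# (< .50 poor, .50-.75 moderate, .75-.90 good, >= .90 excellent). The paper's
# exact cut-offs are not stated in the abstract; these are assumed.
def icc_band(icc: float) -> str:
    if icc < 0.50:
        return "poor"        # negative values such as -.106 also land here
    if icc < 0.75:
        return "moderate"
    if icc < 0.90:
        return "good"
    return "excellent"

print(icc_band(-0.106))  # "poor" -- the reported value for "Logical structure"
print(icc_band(0.80))    # "good"
```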
Discussion: The findings reveal a profound gap between the nuanced, contextual reasoning of human experts and the pattern-recognition capabilities of the LLM tested. The study concludes that this type of AI, in its current, prompt-engineered form, cannot reliably replicate expert judgment in the complex task of credibility assessment. Although it is not a viable autonomous evaluator, it may hold potential as a "cognitive assistant" supporting expert workflows. The assessment of child testimony credibility remains a task that fundamentally requires professional judgment and appears to lie well beyond the current capabilities of such generative AI models.